OptimizingVectorImageProcessing
I'll trace here the process I've followed to optimize the processing of an otb::VectorImage
.
Contents
The context
We start from a mono-band image that we resample with an itk::LinearInterpolateImageFunction
used from an otb::StreamingResampleImageFilter
.
When the image processed is considered as an otb::Image
, the execution takes around 20 seconds. When the image processed is considered as an otb::VectorImage
, the execution takes around 1 minute and 5-10 seconds. Nothing here is really unexpected as Vector Image are known to be inefficient. Aren't they?
Let's see what callgrind has to say about this situation.
As we can see, most of the time is spent in memory allocations and releases. If we zoom into itk::LinearInterpolateImageFunction<>::EvaluateOptimized()
, we can see that each call to this function is accompanied with 4 calls to itk::VariableLengthVector::AllocateElements()
, and 6 to delete[]
.
(Actually a few optimizations have already been done, but they are not enough.)
The problems
NB: In the following, VLV
may be used as an abbreviation to itk::VariableLengthVector
.
New pixel value => new pixel created
When dealing with Image Functions, ITK architecture imposes to override one function chosen among Evaluate()
, EvaluateAtContinuousIndex
, ... All functions return a OutputType
.
This means that every time those functions are called, a new pixel value is returned. The image functions looks a lot like normal functions. Most of the time (always?) they are pure functions (with no state) which take input parameters and return a new value. They are easy to write and to reason about. And when they are pure functions, calling them safely from multiple threads is trivial.
However, when the pixel type is not a scalar type (short, double, ...), these functions will create a new pixel value, and return it by copy (-construction). When the pixel type is not responsible for a resource (like allocated memory) the performances won't be impacted much -- as compilers are able to optimize returned values. Alas, when the pixel type is a itk::VariableLengthVector
, every time an Evaluatexxx
function is called a new pixel has to be created. Even when the compiler is able to elude copies (with RVO or RNVO), even with C++11 rvalue references, each call to Evaluatexxx
will produce a new pixel that'll be destroyed eventually at the end of the chain.
This means that with this architecture, for every pixel processed, a new pixel object is created and then destroyed. This is one of the sources of new[]
and delete[]
calls observed thanks to the profiler.
Temporaries elimination
LinearInterpolateImageFunction::EvaluateOptimized
overloads have a lot of linear computations on pixel values like:
val00 + ( val10 - val00 ) * distance0
. When the pixel type is not a scalar type, each operation (+, -, *, and /) will consume its operands and produce a new pixel object.
This expression will produce three new pixel objects, two being required for holding temporary results. This means, 3 more allocations and release per pixel processed.
NB: the above snapshot shows profiling results which are already clean of any temporary of this sort.
Pixel casting => new pixel created
The Input Image type, the Output Image type and Real Image type used may differ. The profiling has been conducted on input and output images manipulated as VectorImage<uint16_t>
, while the real image type is a VectorImage<double>
.
The consequences are that pixel values have to be converted along the computation chain. The image function code contains uses of static_cast<OutputType>
, and implicit conversion to RealType
to ensure the correct transformations of pixel values.
When operated on objects, such casting will trigger call to converting constructors. Here, to:
-
VLV<OutputType::ValueType>(VLV<RealType::ValueType> const&);
, - and to
VLV<RealType::ValueType>(VLV<InputType::ValueType> const&);
,
Note that the actual pixel types are not the same.
This is another sources of new[]
and delete[]
calls observed thanks to the profiler.
Pixel assignment => reallocation + copy of old values
In order to restrict the number of temporaries, or of conversions, we could make use of the assignment operators from itk::VariableLengthVector
. Alas, their default behaviour is to always reallocate, and to maintain the old values, even when the current memory buffer could be reused, and (/resp.) when the old values will be overwritten.
MT safety of cached pixel
Casting needs aside, when transforming expressions like
const RealType & valx0 = val00 + ( val10 - val00 ) * distance0; const RealType & valx1 = val01 + ( val11 - val01 ) * distance0 return valx0 + ( valx1 - valx0 ) * distance1;
into
RealType valx0 = val10; valx0 -= val00; valx0 *= distance0; valx0 += val00; RealType valx1 = val11; valx1 -= val01; valx1 *= distance0; valx1 += val01; valx1 -= valx0; valx1 *= distance1; valx1 += valx0; return valx1;
so that many temporaries could be eliminated, we can see there is still two temporary objects in the local scope: valx0
, and valx1
.
In order to also get rid of those objects, we could instead use cached objects, i.e.
RealType &valx0 = m_cachedValx0; RealType &valx1 = m_cachedValx1;
Alas! This won't be thread safe. We need one cached valx0 object per thread.
At this point, i.e. an image function, there is no way to identify the current thread. We could use TLS, except the default threading scheme of ITK does not reuse threads -- thread pool is not yet, the default way. Why this is important ? Because it may induce creating more cached variables than the number of threads used in the warp filter. This is not what we are looking for (i.e. eliminating constructions of VLV pixels). Moreover, constructing a TLS object on the fly would be tricky on pre-C++11 compilers.
Note, that cached pixel variables would also be useful to store Input pixel values into RealType pixel values before any processing. This would also be useful into other image functions like otbImageToVectorImageCastFilter, or the BCO interpolator that creates and destroy matrices for each pixel processed.
Dynamic polymorphism
The EvaluateXxxx
functions from Image Functions are all virtual. Indeed, interpolation functions (or other image functions) are a dynamic variation point. We need to be able to use one or the other.
This implies that the type of the returned value shall remain stable among all overrides. In other words, we could not return an expression template (ET) -- without erasing the ET type.
Proxy Pixels
Some VLV pixels are quite peculiar. They are proxies to memory that do not belong to the pixel objects.
This is done with a dedicated constructor. What is peculiar is that once created, they are copied. Officially, they are not proxy any more. But actually, as RVO is likely to be triggered the pixel object obtained will still be a proxy object.
There isn't any official proxy pixel type, but only one pixel type. The fact proxies are involved is completely transparent. It's just that several functions like VLV::operator=()
, VLV::SetSize()
, will strip a pixel object from its proxy nature.
The solutions
...to (2.1) new pixels induced by EvaluateXxx
signature
Two approaches can be imagined.
- the complex one that doesn't change (much) the signature
We could prepare an object that isn't the final numeric result, but an expression to be evaluated, and its evaluation would store directly the computation result into the destination variable. In other words, we could return an expression template.
This would be really complex. Local values (that store other computations) may need to see their lifetime artificially extended. Moreover, as EvaluateXxxx
functions are virtual, we need to always return a same type. As a consequence, the exact type shall be erased (as in Type Erasure), which will likely induce a little performance degradation. This is the problem evoked §OptimizingVectorImageProcessing#Dynamic_polymorphism.
This approach doesn't seem viable.
- the simple one that alters the signature
The simplest approach consists in altering EvaluateXxxx
signature from:
virtual OutputType EvaluateXxxx(params)
to
virtual void EvaluateXxxx(params, OutputType & output)
The trade off?
- All
EvaluateXxxx
functions need to be changed to profit from the optimization (1) ; - The performances may be worsened on scalar types -- the cost will be negligible compared to the cost of calling a virtual function milliard times (2).
(1) Indeed, we can preserve the old signature while the code is being converted. e.g.
/*virtual*/ OutputType itk::ImageFunction::EvaluateXxxx(params) { OutputType output; this->EvaluateXxxx(params, output); return output; }
(2) However, we can imagine duplicating all functions: the classical one for scalar and array pixels, and the new one for VLV pixels. With a slight touch of meta-programming, we could have the compiler choose one type of EvaluateXxxx
to call. This idea seems idiotic. However, it would present a very simple way to propagate the current threadId
to EvaluateXxxx
functions that will always need to know the id of the current thread in order to use cached variables. See §OptimizingVectorImageProcessing#MT_safety_of_cached_pixel.
...to (2.2) unwanted temporaries
Discl. I'll address casting issues into another section.
Again, there are several classes of solutions to temporaries.
- C++11 rvalue-references
The first approach consists in giving up on old compilers and embracing 2011 C++ standard (aren't we in 2015?) and introducing move semantics through rvalue references.
The parts of VLV to tweak are its move-constructor and its move-assignment operator. And to make the change really efficient (as C++ compiler are already exploiting copy elision as much as possible), It's necessary to overload the binary arithmetic operators to exploit detected temporaries.
In order to know which operators to overload and how, a little test program has been written. The final related code can be found on ITK gerrit.
Pros:
- The client code does not need to be modified to take advantage of the speed up.
- Maintaining Image Functions is definitively easier
- The conversion work is simplified.
Cons:
- This requires to compile in C++11, or even in C++14 mode.
- Multiple subexpressions will imply the creation of temporaries.
- In all case, a new pixel will be created before being moved to its destination.
- Incremental computation of the results
As teased in §2.5,
const RealType & valx0 = val00 + ( val10 - val00 ) * distance0; const RealType & valx1 = val01 + ( val11 - val01 ) * distance0 return valx0 + ( valx1 - valx0 ) * distance1;
Can be rewritten into:
RealType valx0 = val10; valx0 -= val00; valx0 *= distance0; valx0 += val00; RealType valx1 = val11; valx1 -= val01; valx1 *= distance0; valx1 += val01; valx1 -= valx0; valx1 *= distance1; valx1 += valx0; return valx1;
Pros:
- We have here a complete control over the computations done.
- This is likely the fastest way to proceed on VLV objects. Indeed caches misses are more unlikely to happen than with the next approach -- on big vectors, which should not be our case.
Cons:
- The new way of doing the computations isn't exactly maintainable. It's quite the opposite actually.
- With this way of proceeding, there is no other way but to use cached variables. This is a consequence of §2.3 the required pixel castings. As a consequence, this need introduces a need for §2.5 MT safety.
- This approach implies to convert almost all expressions (to something unmaintainable)...
- Expression Templates
Historically, Expression Template is one of the first C++ meta-programming technique, if not the first.
Given an expression like val00 + ( val10 - val00 ) * distance0
, it will not compute the expression, but build an object that stores all the computation required. It's when the result is assigned, that the computation is really done. When matrices are involved, this technique can become really complex. In our case, we are working with vectors.
Unlike the previous approach that will first compute val10 - vall00
and so on, this time, when the expression is evaluated, it's done element after element. i.e. First we compute, and store, val00[0] + ( val10[0] - val00[0] ) * distance0
, then we do the same on index 1, and so on. That's why it may be less cache-friendly on big vectors. Fortunately, this should not inconvenience us.
Pros:
- Maintaining Image Functions is definitively easier
- The conversion work is simplified as there is none to be done in the computation done on pixel values.
- No new temporaries are required for sub-expressions.
- A complex computation may be done and assigned to an existing VLV without implying any creation of VLV.
Cons:
- Possibly less cache-friendly on situations that should not be ours -- to be benchmarked!
- Introducing Expression Templates (ET) into
itk::VariableLengthVector
may seems a little pointless as many other FOSS like Eigen, or evenstd::valarray
already profit from intensively used and tested ET optimization technique -- that unfortunately won't permit to have proxy pixels. - The ET introduction shall also take advantage of C++11 rvalue references.
- A same casting on an element (from a pixel value) from InputPixelType to RealPixelType may be done several times.
- It still is complex to factorize some computations without caching some allocations. i.e.:
const RealType & valx10 = val010 + ( val110 - val010 ) * distance0; const RealType & valxx0 = valx00 + ( valx10 - valx00 ) * distance1; const RealType & valx01 = val001 + ( val101 - val001 ) * distance0; const RealType & valx11 = val011 + ( val111 - val011 ) * distance0; const RealType & valxx1 = valx01 + ( valx11 - valx01 ) * distance1; return valxx0 + ( valxx1 - valxx0 ) * distance2;
C++11 auto
would help, alas ITK and OTB code shall remain compatible with old pre-C++11 compilers, even in 2015.
...to (2.3) Pixels casting
(TBD)
...to (2.4) Pixels assignment
The need is to have an assignment operator that doesn't resize the VLV when this is not required. Moreover, old values don't need to be kept as they'll be overwritten immediately.
The solution adopted is to provide a new overload to SetSize
that lets the end-user choose precisely two policies:
- the Reallocating Policy among:
AlwaysReallocate
(as the name suggest),NeverReallocate
(as the name suggest),ShrintToFit
(reallocate only if the size changes),DontShrinkToFit
(reallocate only if the new size is bigger thanGetSize()
current result). - the Old Value Keeping Policy:
KeepOldValues
(that will copy old values from the old internal buffer to the new internal buffer -- in case of reallocation),DumpOldValue
(that ensures nothing).
Thanks to this new SetSize()
overload, the assignment operator can be written mostly as
const unsigned int N = rhs.GetSize(); this->SetSize(N, DontShrinkToFit(), DumpOldValues()); std::copy(&rhs.m_Data[0], &rhs.m_Data[N], &this->m_Data[0]);
Here:
- What
SetSize()
does is quite explicit and without any ambiguity. -
SetSize()
ensures the current VLV object isn't a proxy any more -- if it was. - The anti auto-assignment pessimistic test isn't done.
For cases we know the current VLV already has the right size, and isn't a proxy, a new FastAssign
member function has been provided. This alternative to the assignment operator will be even faster in cases where we can set the dimension of the various pixels (cached or not) as soon as BeforeThreadedGenerateData()
. Alas it'll be cumbersome to use.
See the related patch pushed to gerrit.
...to (2.5) MT-safety issues
We though about several possible ways to ensure MT-satefy on function images that need to store MT specific data.
TLS
Using TLS to store objects that'll need construction in a C++03 world will be definitively cumbersome. Moreover, it may prove costly with current threading policies. Using the experimental thread pool will reduce the number of objects to build, but still there'll be more threads actually started and stopped/joined than the actual number of threads running simultaneously.
TLS could be used, but for simple types, like ThreadId
s.
The solution could be:
- to call the virtual (?)
ImageFunction::NotifyNumberOfThreads(ThreadId)
fromBeforeThreadedGenerateData()
-- this will permit to construct as many cached variables as required. - and to call the virtual (?)
ImageFunction::SetCurrentThreadId(ThreadId)
at the very start ofThreadGenerateData()
(i.e. outside the loop on pixels).ImageFunction::SetCurrentThreadId(ThreadId)
would set the TLS ThreadId attribute of the Image Function object.
Pro
- The signature of
EvaluateXxxx()
won't need to receive the current threadId as a parameter. - No need to have clonable and CopyConstructible image functions.
Cons
- OS/compiler specific code will need to be defined to support TLS.
ThreadId Parameter
The simplistic solution would be:
- to call the virtual (?)
ImageFunction::NotifyNumberOfThreads(ThreadId)
fromBeforeThreadedGenerateData()
-- this need stays - and to have a currentThreadId parameter in
EvaluateXxxx()
Pro
- No OS/compiler specific code needs to be defined to support TLS.
- No need to have clonable and CopyConstructible image functions.
Cons
- The signature of
EvaluateXxxx()
does need to receive the current threadId as a parameter -- this may induce a little (hopefully negligible) slow down.
MTAware filters, and MTUnaware image functions
Another solution would have been to leave the image functions MT un-safe/un-aware. i.e. we wouldn't try to have a (set of) cached variable(s) per thread. The MT-awareness would thus need to be elsewhere: on the filters that would call the newly optimized image functions.
The typical solution would be to have one image function per thread.
Alas, a new problem arise. We can't define a SetThatImageFunction(ThreadId, ThatImageFunctionPointer)
and expect the end user to call the function once per number of threads activated on the filter.
The neater approach (regarding end user's prospects) would be to continue with the simple SetThatImageFunction(ThatImageFunctionPointer)
and have this function clone the parameter as many time as required (i.e. once per thread).
The problem? Cloning means copy-constructors are required. Unfortunately image functions cannot be assigned nor copied. This solution will require some important changes.
Pro
- No OS/compiler specific code needs to be defined to support TLS.
- The signature of
EvaluateXxxx()
won't need to receive the current threadId as a parameter.
Cons
- All
ImageFunction
s shall support copy-construction (protected would be perfect), and cloning.
The results
C++11 Move semantics
Instead of running in 120 seconds, the execution takes between 85 and 90 seconds. This is not perfect, there is still one, unneeded, allocation per iteration (i.e. par pixel)
Iterative arithmetic
The solution with:
- thread-safe cached pixels,
- iterative pixel arithmetic (i.e.
m_val01 = GetPixel(...); m_val02 = GetPixel(...); m_val02 -= m_val01; m_val02 *= d; m_val02 += mval01;
-
EvaluateOptimized()
that returns pixel values through an [out] parameter.
runs in 20seconds (with 1 thread), and we can see in callgrind profile that no allocation (nor releases) are performed on each pixel.