OptimizingVectorImageProcessing

From OTBWiki
Revision as of 15:46, 2 June 2015 by Luc.hermitte (Talk | contribs) (The problems: MT safety section filled.)

Jump to: navigation, search

I'll trace here the process I've followed to optimize the processing of an otb::VectorImage.

The context

We start from a mono-band image that we resample with an itk::LinearInterpolateImageFunction used from an otb::StreamingResampleImageFilter.

When the image processed is considered as an otb::Image, the execution take around 20 seconds. When the image processed is considered as an otb::VectorImage, the execution takes around 1 minute and 5-10 seconds. Nothing here is really unexpected as Vector Image are known be inefficient. Aren't they?

Let's see what callgrind has to say about this situation.

BeforeOptims-Warp.png

As we can see, most of the time is spent in memory allocations and releases. If we zoom into itk::LinearInterpolateImageFunction<>::EvaluateOptimized(), we can see that each call to this function is accompanied with 4 calls to itk::VariableLengthVector::AllocateElements(), and 6 to delete[].

BeforeOptims-LinerInterp.png

(Actually a few optimizations have already been done, but they are not enough)

The problems

NB: In the following, VLV may be used as an abbreviation to itk::VariableLengthVector.

New pixel value => new pixel created

When dealing with Image Functions, ITK architecture imposes to override one function chosen among Evaluate(), EvaluateAtContinuousIndex, ... All functions return a OutputType.

This means that every time those functions are called, a new pixel value is returned. The image functions looks a lot like normal functions. Most of the time (always?) they are pure functions (with no state) which take input parameters and return a new value. They are easy to write and to reason about. And when they are pure functions, calling them safely from multiple threads is trivial.

However, when the pixel type is not a scalar type (short, double, ...), these functions will create a new pixel value, and return it by copy (-construction). When the pixel type is not responsible for a resource (like allocated memory) the performances won't be impacted much -- as compilers are able to optimize returned values. Alas, when the pixel type is a itk::VariableLengthVector, every time an Evaluatexxx function is called a new pixel has to be created. Even when the compiler is able to elude copies (with RVO or RNVO), even with C++11 rvalue references, each call to Evaluatexxx will produce a new pixel that'll be destroyed eventually at the end of the chain.

This means that with this architecture, for every pixel processed, a new pixel object is created and then destroyed. This is one of the sources of new[] and delete[] calls observed thanks to the profiler.

Temporaries elimination

LinearInterpolateImageFunction::EvaluateOptimized overloads have a lot of linear computations on pixel values like: val00 + ( val10 - val00 ) * distance0. When the pixel type is not a scalar type, each operation (+, -, *, and /) will consume its operands and produce a new pixel object.

This expression will produce three new pixel objects, two being required for holding temporary results. This means, 3 more allocations and release per pixel processed.

NB: the above snapshot shows profiling results which are already clean of any temporary of this sort.

Pixel casting => new pixel created

The Input Image type, the Output Image type and Real Image type used may differ. The profiling has been conducted on input and output images manipulated as VectorImage<uint16_t>, while the real image type is a VectorImage<double>.

The consequences are that pixel values have to be converted along the computation chain. The image function code contains uses of static_cast<OutputType>, and implicit conversion to RealType to ensure the correct transformations of pixel values.

When operated on objects, such casting will trigger call to converting constructors. Here, to:

  • VLV<OutputType::ValueType>(VLV<RealType::ValueType> const&);,
  • and to VLV<RealType::ValueType>(VLV<InputType::ValueType> const&);,

Note that the actual pixel types are not the same.

This is another sources of new[] and delete[] calls observed thanks to the profiler.

Pixel assignment => reallocation + copy of old values

In order to restrict the number of temporaries, or of conversions, we could make use of the assignment operators from itk::VariableLengthVector. Alas, their default behaviour is to always reallocate, and to maintain the old values, even when the current memory buffer could be reused, and (/resp.) when the old values will be overwritten.

Dynamic polymorphism

The EvaluateXxxx functions from Image Functions are all virtual. Indeed, interpolation functions (or other image functions) are a dynamic variation point. We need to be able to use one or the other.

This implies that the type of the returned value shall remain stable among all overrides. In other words, we could not return an expression template (ET) -- without erasing the ET type.

Proxy Pixels

Some VLV pixels are quite peculiar. They are proxies to memory that do not belong to the pixel objects.

This is done with a dedicated constructor. What is peculiar is that once created, they are copied. Officially, they are not proxy any more. But actually, as RVO is likely to be triggered the pixel object obtained will still be a proxy object.

There isn't any official proxy pixel type, but only one pixel type. The fact proxies are involved is completely transparent. It's just that several functions like VLV::operator=(), VLV::SetSize(), will strip a pixel object from its proxy nature.

MT safety of cached pixel

Casting needs aside, when transforming expressions like

const RealType & valx0 = val00 + ( val10 - val00 ) * distance0;
const RealType & valx1 = val01 + ( val11 - val01 ) * distance0
return valx0 + ( valx1 - valx0 ) * distance1;

into

RealType valx0 = val10;
valx0 -= val00;
valx0 *= distance0;
valx0 += val00;
RealType valx1 = val11;
valx1 -= val01;
valx1 *= distance0;
valx1 += val01;
valx1 -= valx0;
valx1 *= distance1;
valx1 += valx0;
return valx1;

so that many temporaries could be eliminated, we can see there is still two temporary objects in the local scope: valx0, and valx1.

In order to also get rid of those objects, we could instead use cached objects, i.e.

RealType &valx0 = m_cachedValx0;
RealType &valx1 = m_cachedValx1;

Alas! This won't be thread safe. We need one cached valx0 object per thread.

At this point, i.e. an image function, there is no way to identify the current thread. We could use TLS, except the default threading scheme of ITK does not reuse threads -- thread pool is not yet, the default way. Why this is important ? Because it may induce creating more cached variables than the number of threads used in the warp filter. This is not what we are looking for (i.e. eliminating constructions of VLV pixels). Moreover, constructing a TLS object on the fly would be tricky on pre-C++11 compilers.


Note, that cached pixel variables would also be useful to store Input pixel values into RealType pixel values before any processing. This would also be useful into other image functions like otbImageToVectorImageCastFilter, or the BCO interpolator that creates and destroy matrices for each pixel processed.

The solutions

The results

Iterative arithmetic

The solution with:

  • thread-safe cached pixels,
  • iterative pixel arithmetic (i.e. m_val01 = GetPixel(...); m_val02 = GetPixel(...); m_val02 -= m_val01; m_val02 *= d; m_val02 += mval01;
  • EvaluateOptimized() that returns pixel values through an [out] parameter.

runs in 20seconds (with 1 thread), and we can see in callgrind profile that no allocation (nor releases) are performed on each pixel.

AfterOptims-Warp.png

Expression Templates

Perspectives