Difference between revisions of "OptimizingVectorImageProcessing"

From OTBWiki
Jump to: navigation, search
m (The context: English grammar)
(The problems: First sections filled)
Line 18: Line 18:
  
 
==== New pixel value => new pixel created ====
 
==== New pixel value => new pixel created ====
 +
 +
When dealing with ''Image Functions'', ITK architecture imposes to override one function chosen among <code>Evaluate()</code>, <code>EvaluateAtContinuousIndex</code>, ... All functions return a <code>OutputType</code>.
 +
 +
This means that every time those functions are called, a new pixel value is returned. The ''image functions'' looks a lot like normal functions. Most of the time (always?) they are pure functions (with no state) which take input parameters and return a new value. They are easy to write and to reason about. And when they are pure functions, calling them safely from multiple threads is trivial.
 +
 +
However, when the pixel type is not a scalar type (short, double, ...), these functions will create a new pixel value, and return it by copy (-construction). When the pixel type is not responsible for a resource (like allocated memory) the performances won't be impacted much -- as compilers are able to optimize returned values. Alas, when the pixel type is a <code>itk::VariableLengthVector</code>, every time an <code>Evaluatexxx</code> function is called a new pixel has to be created. Even when the compiler is able to elude copies (with RVO or RNVO), even with C++11 ''rvalue references'', each call to <code>Evaluatexxx</code> will produce a new pixel that'll be destroyed eventually at the end of the chain.
 +
 +
This means that with this architecture, for every pixel processed, a new pixel object is created and then destroyed. This is one of the sources of <code>new[]</code> and <code>delete[]</code> calls observed thanks to the profiler.
 +
 
==== Temporaries elimination ====
 
==== Temporaries elimination ====
 +
 +
<code>LinearInterpolateImageFunction::EvaluateOptimized</code> overloads have a lot of linear computations on pixel values like:
 +
<code>val00 + ( val10 - val00 ) * distance0</code>. When the pixel type is not a scalar type, each operation (+, -, *, and /) will consume its operands and produce a new pixel object.
 +
 +
This expression will produce three new pixel objects, two being required for holding temporary results. This means, 3 more allocations and release per pixel processed.
 +
 +
NB: the above snapshot shows profiling results which are already clean of any temporary of this sort.
 +
 
==== Pixel casting => new pixel created ====
 
==== Pixel casting => new pixel created ====
 +
 +
The ''Input Image'' type, the ''Output Image'' type and ''Real Image'' type used may differ. The profiling has been conducted on input and output images manipulated as <code>VectorImage<uint16_t></code>, while the ''real'' image type is a <code>VectorImage<double></code>.
 +
 +
The consequences are that pixel values have to be converted along the computation chain. The image function code contains uses of <code>static_cast<OutputType></code>, and implicit conversion to <code>RealType</code> to ensure the correct transformations of pixel values.
 +
 +
When operated on objects, such casting will trigger call to converting constructors. Here, to:
 +
* <code>itk::VariableLengthVector<OutputType::ValueType>(itk::VariableLengthVector<RealType::ValueType> const&);</code>,
 +
* and to <code>itk::VariableLengthVector<RealType::ValueType>(itk::VariableLengthVector<InputType::ValueType> const&);</code>,
 +
 +
Note that the actual pixel types are not the same.
 +
 +
This is another sources of <code>new[]</code> and <code>delete[]</code> calls observed thanks to the profiler.
 +
 
==== Pixel assignment => reallocation + copy of old values ====
 
==== Pixel assignment => reallocation + copy of old values ====
==== dynamic polymorphism ====
+
 
 +
In order to restrict the number of temporaries, or of conversions, we could make use of the assignment operators from <code>itk::VariableLengthVector</code>. Alas, their default behaviour is to always '''re'''allocate, and to maintain the old values, even when the current memory buffer could be reused, and (/resp.) when the old values will be overwritten.
 +
 
 +
==== Dynamic polymorphism ====
 +
 
 +
The <code>EvaluateXxxx</code> functions from ''Image Functions'' are all virtual. Indeed, ''interpolation functions'' (or other image functions) are a dynamic variation point. We need to be able to use one or the other.
 +
 
 +
This implies that the type of the returned value shall remain stable among all overrides. In other words, we could not return an ''expression template (ET)'' -- without ''erasing'' the ''ET'' type.
 +
 
 
==== Proxy Pixels ====
 
==== Proxy Pixels ====
 
==== MT safety of cached pixel ====
 
==== MT safety of cached pixel ====

Revision as of 13:55, 27 May 2015

I'll trace here the process I've followed to optimize the processing of an otb::VectorImage.

The context

We start from a mono-band image that we resample with an itk::LinearInterpolateImageFunction used from an otb::StreamingResampleImageFilter.

When the image processed is considered as an otb::Image, the execution take around 20 seconds. When the image processed is considered as an otb::VectorImage, the execution takes around 1 minute and 5-10 seconds. Nothing here is really unexpected as Vector Image are known be inefficient. Aren't they?

Let's see what callgrind has to say about this situation.

BeforeOptims-Warp.png

As we can see, most of the time is spent in memory allocations and releases. If we zoom into itk::LinearInterpolateImageFunction<>::EvaluateOptimized(), we can see that each call to this function is accompanied with 4 calls to itk::VariableLengthVector::AllocateElements(), and 6 to delete[].

BeforeOptims-LinerInterp.png

(Actually a few optimizations have already been done, but they are not enough)

The problems

New pixel value => new pixel created

When dealing with Image Functions, ITK architecture imposes to override one function chosen among Evaluate(), EvaluateAtContinuousIndex, ... All functions return a OutputType.

This means that every time those functions are called, a new pixel value is returned. The image functions looks a lot like normal functions. Most of the time (always?) they are pure functions (with no state) which take input parameters and return a new value. They are easy to write and to reason about. And when they are pure functions, calling them safely from multiple threads is trivial.

However, when the pixel type is not a scalar type (short, double, ...), these functions will create a new pixel value, and return it by copy (-construction). When the pixel type is not responsible for a resource (like allocated memory) the performances won't be impacted much -- as compilers are able to optimize returned values. Alas, when the pixel type is a itk::VariableLengthVector, every time an Evaluatexxx function is called a new pixel has to be created. Even when the compiler is able to elude copies (with RVO or RNVO), even with C++11 rvalue references, each call to Evaluatexxx will produce a new pixel that'll be destroyed eventually at the end of the chain.

This means that with this architecture, for every pixel processed, a new pixel object is created and then destroyed. This is one of the sources of new[] and delete[] calls observed thanks to the profiler.

Temporaries elimination

LinearInterpolateImageFunction::EvaluateOptimized overloads have a lot of linear computations on pixel values like: val00 + ( val10 - val00 ) * distance0. When the pixel type is not a scalar type, each operation (+, -, *, and /) will consume its operands and produce a new pixel object.

This expression will produce three new pixel objects, two being required for holding temporary results. This means, 3 more allocations and release per pixel processed.

NB: the above snapshot shows profiling results which are already clean of any temporary of this sort.

Pixel casting => new pixel created

The Input Image type, the Output Image type and Real Image type used may differ. The profiling has been conducted on input and output images manipulated as VectorImage<uint16_t>, while the real image type is a VectorImage<double>.

The consequences are that pixel values have to be converted along the computation chain. The image function code contains uses of static_cast<OutputType>, and implicit conversion to RealType to ensure the correct transformations of pixel values.

When operated on objects, such casting will trigger call to converting constructors. Here, to:

  • itk::VariableLengthVector<OutputType::ValueType>(itk::VariableLengthVector<RealType::ValueType> const&);,
  • and to itk::VariableLengthVector<RealType::ValueType>(itk::VariableLengthVector<InputType::ValueType> const&);,

Note that the actual pixel types are not the same.

This is another sources of new[] and delete[] calls observed thanks to the profiler.

Pixel assignment => reallocation + copy of old values

In order to restrict the number of temporaries, or of conversions, we could make use of the assignment operators from itk::VariableLengthVector. Alas, their default behaviour is to always reallocate, and to maintain the old values, even when the current memory buffer could be reused, and (/resp.) when the old values will be overwritten.

Dynamic polymorphism

The EvaluateXxxx functions from Image Functions are all virtual. Indeed, interpolation functions (or other image functions) are a dynamic variation point. We need to be able to use one or the other.

This implies that the type of the returned value shall remain stable among all overrides. In other words, we could not return an expression template (ET) -- without erasing the ET type.

Proxy Pixels

MT safety of cached pixel

The solutions

The results

Iterative arithmetic

The solution with:

  • thread-safe cached pixels,
  • iterative pixel arithmetic (i.e. m_val01 = GetPixel(...); m_val02 = GetPixel(...); m_val02 -= m_val01; m_val02 *= d; m_val02 += mval01;
  • EvaluateOptimized() that returns pixel values through an [out] parameter.

runs in 20seconds (with 1 thread), and we can see in callgrind profile that no allocation (nor releases) are performed on each pixel.

AfterOptims-Warp.png

Expression Templates

Perspectives