Difference between revisions of "OptimizingVectorImageProcessing"
Luc.hermitte (Talk | contribs) m (→The context: English grammar) |
Luc.hermitte (Talk | contribs) (→The problems: First sections filled) |
||
Line 18: | Line 18: | ||
==== New pixel value => new pixel created ==== | ==== New pixel value => new pixel created ==== | ||
+ | |||
+ | When dealing with ''Image Functions'', ITK architecture imposes to override one function chosen among <code>Evaluate()</code>, <code>EvaluateAtContinuousIndex</code>, ... All functions return a <code>OutputType</code>. | ||
+ | |||
+ | This means that every time those functions are called, a new pixel value is returned. The ''image functions'' looks a lot like normal functions. Most of the time (always?) they are pure functions (with no state) which take input parameters and return a new value. They are easy to write and to reason about. And when they are pure functions, calling them safely from multiple threads is trivial. | ||
+ | |||
+ | However, when the pixel type is not a scalar type (short, double, ...), these functions will create a new pixel value, and return it by copy (-construction). When the pixel type is not responsible for a resource (like allocated memory) the performances won't be impacted much -- as compilers are able to optimize returned values. Alas, when the pixel type is a <code>itk::VariableLengthVector</code>, every time an <code>Evaluatexxx</code> function is called a new pixel has to be created. Even when the compiler is able to elude copies (with RVO or RNVO), even with C++11 ''rvalue references'', each call to <code>Evaluatexxx</code> will produce a new pixel that'll be destroyed eventually at the end of the chain. | ||
+ | |||
+ | This means that with this architecture, for every pixel processed, a new pixel object is created and then destroyed. This is one of the sources of <code>new[]</code> and <code>delete[]</code> calls observed thanks to the profiler. | ||
+ | |||
==== Temporaries elimination ==== | ==== Temporaries elimination ==== | ||
+ | |||
+ | <code>LinearInterpolateImageFunction::EvaluateOptimized</code> overloads have a lot of linear computations on pixel values like: | ||
+ | <code>val00 + ( val10 - val00 ) * distance0</code>. When the pixel type is not a scalar type, each operation (+, -, *, and /) will consume its operands and produce a new pixel object. | ||
+ | |||
+ | This expression will produce three new pixel objects, two being required for holding temporary results. This means, 3 more allocations and release per pixel processed. | ||
+ | |||
+ | NB: the above snapshot shows profiling results which are already clean of any temporary of this sort. | ||
+ | |||
==== Pixel casting => new pixel created ==== | ==== Pixel casting => new pixel created ==== | ||
+ | |||
+ | The ''Input Image'' type, the ''Output Image'' type and ''Real Image'' type used may differ. The profiling has been conducted on input and output images manipulated as <code>VectorImage<uint16_t></code>, while the ''real'' image type is a <code>VectorImage<double></code>. | ||
+ | |||
+ | The consequences are that pixel values have to be converted along the computation chain. The image function code contains uses of <code>static_cast<OutputType></code>, and implicit conversion to <code>RealType</code> to ensure the correct transformations of pixel values. | ||
+ | |||
+ | When operated on objects, such casting will trigger call to converting constructors. Here, to: | ||
+ | * <code>itk::VariableLengthVector<OutputType::ValueType>(itk::VariableLengthVector<RealType::ValueType> const&);</code>, | ||
+ | * and to <code>itk::VariableLengthVector<RealType::ValueType>(itk::VariableLengthVector<InputType::ValueType> const&);</code>, | ||
+ | |||
+ | Note that the actual pixel types are not the same. | ||
+ | |||
+ | This is another sources of <code>new[]</code> and <code>delete[]</code> calls observed thanks to the profiler. | ||
+ | |||
==== Pixel assignment => reallocation + copy of old values ==== | ==== Pixel assignment => reallocation + copy of old values ==== | ||
− | ==== | + | |
+ | In order to restrict the number of temporaries, or of conversions, we could make use of the assignment operators from <code>itk::VariableLengthVector</code>. Alas, their default behaviour is to always '''re'''allocate, and to maintain the old values, even when the current memory buffer could be reused, and (/resp.) when the old values will be overwritten. | ||
+ | |||
+ | ==== Dynamic polymorphism ==== | ||
+ | |||
+ | The <code>EvaluateXxxx</code> functions from ''Image Functions'' are all virtual. Indeed, ''interpolation functions'' (or other image functions) are a dynamic variation point. We need to be able to use one or the other. | ||
+ | |||
+ | This implies that the type of the returned value shall remain stable among all overrides. In other words, we could not return an ''expression template (ET)'' -- without ''erasing'' the ''ET'' type. | ||
+ | |||
==== Proxy Pixels ==== | ==== Proxy Pixels ==== | ||
==== MT safety of cached pixel ==== | ==== MT safety of cached pixel ==== |
Revision as of 13:55, 27 May 2015
I'll trace here the process I've followed to optimize the processing of an otb::VectorImage
.
The context
We start from a mono-band image that we resample with an itk::LinearInterpolateImageFunction
used from an otb::StreamingResampleImageFilter
.
When the image processed is considered as an otb::Image
, the execution take around 20 seconds. When the image processed is considered as an otb::VectorImage
, the execution takes around 1 minute and 5-10 seconds. Nothing here is really unexpected as Vector Image are known be inefficient. Aren't they?
Let's see what callgrind has to say about this situation.
As we can see, most of the time is spent in memory allocations and releases. If we zoom into itk::LinearInterpolateImageFunction<>::EvaluateOptimized()
, we can see that each call to this function is accompanied with 4 calls to itk::VariableLengthVector::AllocateElements()
, and 6 to delete[]
.
(Actually a few optimizations have already been done, but they are not enough)
The problems
New pixel value => new pixel created
When dealing with Image Functions, ITK architecture imposes to override one function chosen among Evaluate()
, EvaluateAtContinuousIndex
, ... All functions return a OutputType
.
This means that every time those functions are called, a new pixel value is returned. The image functions looks a lot like normal functions. Most of the time (always?) they are pure functions (with no state) which take input parameters and return a new value. They are easy to write and to reason about. And when they are pure functions, calling them safely from multiple threads is trivial.
However, when the pixel type is not a scalar type (short, double, ...), these functions will create a new pixel value, and return it by copy (-construction). When the pixel type is not responsible for a resource (like allocated memory) the performances won't be impacted much -- as compilers are able to optimize returned values. Alas, when the pixel type is a itk::VariableLengthVector
, every time an Evaluatexxx
function is called a new pixel has to be created. Even when the compiler is able to elude copies (with RVO or RNVO), even with C++11 rvalue references, each call to Evaluatexxx
will produce a new pixel that'll be destroyed eventually at the end of the chain.
This means that with this architecture, for every pixel processed, a new pixel object is created and then destroyed. This is one of the sources of new[]
and delete[]
calls observed thanks to the profiler.
Temporaries elimination
LinearInterpolateImageFunction::EvaluateOptimized
overloads have a lot of linear computations on pixel values like:
val00 + ( val10 - val00 ) * distance0
. When the pixel type is not a scalar type, each operation (+, -, *, and /) will consume its operands and produce a new pixel object.
This expression will produce three new pixel objects, two being required for holding temporary results. This means, 3 more allocations and release per pixel processed.
NB: the above snapshot shows profiling results which are already clean of any temporary of this sort.
Pixel casting => new pixel created
The Input Image type, the Output Image type and Real Image type used may differ. The profiling has been conducted on input and output images manipulated as VectorImage<uint16_t>
, while the real image type is a VectorImage<double>
.
The consequences are that pixel values have to be converted along the computation chain. The image function code contains uses of static_cast<OutputType>
, and implicit conversion to RealType
to ensure the correct transformations of pixel values.
When operated on objects, such casting will trigger call to converting constructors. Here, to:
-
itk::VariableLengthVector<OutputType::ValueType>(itk::VariableLengthVector<RealType::ValueType> const&);
, - and to
itk::VariableLengthVector<RealType::ValueType>(itk::VariableLengthVector<InputType::ValueType> const&);
,
Note that the actual pixel types are not the same.
This is another sources of new[]
and delete[]
calls observed thanks to the profiler.
Pixel assignment => reallocation + copy of old values
In order to restrict the number of temporaries, or of conversions, we could make use of the assignment operators from itk::VariableLengthVector
. Alas, their default behaviour is to always reallocate, and to maintain the old values, even when the current memory buffer could be reused, and (/resp.) when the old values will be overwritten.
Dynamic polymorphism
The EvaluateXxxx
functions from Image Functions are all virtual. Indeed, interpolation functions (or other image functions) are a dynamic variation point. We need to be able to use one or the other.
This implies that the type of the returned value shall remain stable among all overrides. In other words, we could not return an expression template (ET) -- without erasing the ET type.
Proxy Pixels
MT safety of cached pixel
The solutions
The results
Iterative arithmetic
The solution with:
- thread-safe cached pixels,
- iterative pixel arithmetic (i.e.
m_val01 = GetPixel(...); m_val02 = GetPixel(...); m_val02 -= m_val01; m_val02 *= d; m_val02 += mval01;
-
EvaluateOptimized()
that returns pixel values through an [out] parameter.
runs in 20seconds (with 1 thread), and we can see in callgrind profile that no allocation (nor releases) are performed on each pixel.