
I'll trace here the process I've followed to optimize the processing of an <code>otb::VectorImage</code>.

=== The context ===

We start from a mono-band image that we resample with an <code>itk::LinearInterpolateImageFunction</code> used from an <code>otb::StreamingResampleImageFilter</code>.

When the image processed is considered as an <code>otb::Image</code>, the execution takes around 20 seconds. When the image processed is considered as an <code>otb::VectorImage</code>, the execution takes around 1 minute and 5-10 seconds. Nothing here is really unexpected, as vector images are known to be inefficient. Aren't they?

Let's see what callgrind has to say about this situation.

[[File:BeforeOptims-Warp.png]]

As we can see, most of the time is spent in memory allocations and releases. If we zoom into <code>itk::LinearInterpolateImageFunction<>::EvaluateOptimized()</code>, we can see that each call to this function is accompanied by 4 calls to <code>itk::VariableLengthVector::AllocateElements()</code>, and 6 calls to <code>delete[]</code>.

[[File:BeforeOptims-LinerInterp.png]]

(Actually, a few optimizations have already been done, but they are not enough.)

=== The problems ===

NB: In the following, VLV may be used as an abbreviation for <code>itk::VariableLengthVector</code>.

==== New pixel value => new pixel created ====

When dealing with Image Functions, the ITK architecture requires overriding one function chosen among <code>Evaluate()</code>, <code>EvaluateAtContinuousIndex()</code>, ... All these functions return an <code>OutputType</code>.

This means that every time those functions are called, a new pixel value is returned. The image functions look a lot like normal functions. Most of the time (always?) they are pure functions (with no state) which take input parameters and return a new value. They are easy to write and to reason about. And when they are pure functions, calling them safely from multiple threads is trivial.

However, when the pixel type is not a scalar type (<code>short</code>, <code>double</code>, ...), these functions will create a new pixel value and return it by copy(-construction). When the pixel type is not responsible for a resource (like allocated memory), performance won't be impacted much -- compilers are able to optimize returned values. Alas, when the pixel type is an <code>itk::VariableLengthVector</code>, every time an <code>EvaluateXxxx</code> function is called a new pixel has to be created. Even when the compiler is able to elide copies (with RVO or NRVO), even with C++11 rvalue references, each call to <code>EvaluateXxxx</code> will produce a new pixel that'll eventually be destroyed at the end of the chain.

This means that, with this architecture, for every pixel processed a new pixel object is created and then destroyed. This is one of the sources of the <code>new[]</code> and <code>delete[]</code> calls observed with the profiler.
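To make the cost concrete, here is a minimal standalone sketch -- <code>ToyVariablePixel</code> and <code>EvaluateAt</code> are hypothetical stand-ins, not ITK code -- of a heap-backed pixel in the spirit of VLV. The by-value return means one construction, hence one allocation, per evaluated pixel, even when the copy itself is elided.

<pre>#include <cstddef>
#include <iostream>

// Hypothetical stand-in for a heap-backed pixel such as itk::VariableLengthVector.
struct ToyVariablePixel
{
  explicit ToyVariablePixel(std::size_t n) : size(n), data(new double[n]()) { ++allocations; }
  ToyVariablePixel(ToyVariablePixel const& o) : size(o.size), data(new double[o.size])
  {
    ++allocations;
    for (std::size_t i = 0; i != size; ++i) { data[i] = o.data[i]; }
  }
  ~ToyVariablePixel() { delete [] data; }

  std::size_t size;
  double*     data;
  static int  allocations;

private:
  ToyVariablePixel& operator=(ToyVariablePixel const&);  // not needed for this sketch
};
int ToyVariablePixel::allocations = 0;

// Hypothetical image function: returning by value forces a fresh pixel per call.
// RVO removes the copy, but not the construction of the returned object itself.
ToyVariablePixel EvaluateAt(double x)
{
  ToyVariablePixel result(4);   // one heap allocation, unavoidable with this signature
  result.data[0] = x;
  return result;                // RVO: no extra copy, but the allocation remains
}

int main()
{
  for (int i = 0; i != 1000; ++i)
  {
    ToyVariablePixel p = EvaluateAt(i * 0.5);
    (void)p;
  }
  // Expect at least 1000 allocations: one per evaluated pixel.
  std::cout << ToyVariablePixel::allocations << " allocations\n";
  return 0;
}</pre>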

==== Temporaries elimination ====

<code>LinearInterpolateImageFunction::EvaluateOptimized</code> overloads have a lot of linear computations on pixel values, like <code>val00 + ( val10 - val00 ) * distance0</code>. When the pixel type is not a scalar type, each operation (<code>+</code>, <code>-</code>, <code>*</code>, and <code>/</code>) will consume its operands and produce a new pixel object.

This expression will produce three new pixel objects, two of which are required for holding temporary results. This means 3 more allocations and releases per pixel processed.

NB: the above snapshot shows profiling results which are already free of any temporaries of this sort.
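To illustrate where those temporaries come from, here is a self-contained sketch with a hypothetical <code>Pixel</code> type (not the ITK class): with ordinary value-returning operators, one new pixel object is constructed per operator appearing in the expression.

<pre>#include <cstddef>
#include <iostream>
#include <vector>

// Hypothetical pixel with the usual value-returning arithmetic operators
// (illustration only; not the actual itk::VariableLengthVector operators).
struct Pixel
{
  std::vector<double> v;
  static int constructed;
  explicit Pixel(std::size_t n = 0) : v(n) { ++constructed; }
  Pixel(Pixel const& o) : v(o.v) { ++constructed; }
};
int Pixel::constructed = 0;

// Each operator returns a brand-new Pixel: those are the temporaries discussed above.
Pixel operator-(Pixel const& a, Pixel const& b)
{
  Pixel r(a.v.size());
  for (std::size_t i = 0; i != r.v.size(); ++i) { r.v[i] = a.v[i] - b.v[i]; }
  return r;
}
Pixel operator+(Pixel const& a, Pixel const& b)
{
  Pixel r(a.v.size());
  for (std::size_t i = 0; i != r.v.size(); ++i) { r.v[i] = a.v[i] + b.v[i]; }
  return r;
}
Pixel operator*(Pixel const& a, double d)
{
  Pixel r(a.v.size());
  for (std::size_t i = 0; i != r.v.size(); ++i) { r.v[i] = a.v[i] * d; }
  return r;
}

int main()
{
  Pixel val00(4), val10(4);
  double const distance0 = 0.25;
  int const before = Pixel::constructed;

  Pixel valx0 = val00 + ( val10 - val00 ) * distance0;   // the expression from the text
  (void)valx0;

  // With (N)RVO, 3 Pixel objects are constructed for this single line:
  // (val10 - val00), (...) * distance0, and valx0 itself; more without elision.
  std::cout << Pixel::constructed - before << " pixel objects constructed\n";
  return 0;
}</pre>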

==== Pixel casting => new pixel created ====

The Input Image type, the Output Image type and the Real Image type used may differ. The profiling has been conducted on input and output images manipulated as <code>VectorImage<uint16_t></code>, while the real image type is a <code>VectorImage<double></code>.

The consequence is that pixel values have to be converted along the computation chain. The image function code contains uses of <code>static_cast<OutputType></code> and implicit conversions to <code>RealType</code> to ensure the correct transformations of pixel values.

When operating on objects, such casts trigger calls to converting constructors. Here, to:

* <code>VLV<OutputType::ValueType>(VLV<RealType::ValueType> const&)</code>,
* and to <code>VLV<RealType::ValueType>(VLV<InputType::ValueType> const&)</code>.

Note that the actual pixel types are not the same.

This is another source of the <code>new[]</code> and <code>delete[]</code> calls observed with the profiler.
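A toy illustration of what such converting constructors cost (the <code>VarPixel</code> type below is hypothetical, not the ITK class): each conversion between element types allocates a brand-new buffer and converts every element.

<pre>#include <cstddef>
#include <iostream>

// Hypothetical variable-length pixel, templated on its element type,
// mimicking the converting constructors discussed above.
template <typename T>
struct VarPixel
{
  std::size_t size;
  T*          data;
  static int  allocations;

  explicit VarPixel(std::size_t n) : size(n), data(new T[n]()) { ++allocations; }

  // Converting constructor: VarPixel<double> <- VarPixel<unsigned short>, etc.
  // Each conversion allocates a brand-new buffer and converts every element.
  template <typename U>
  VarPixel(VarPixel<U> const& o) : size(o.size), data(new T[o.size])
  {
    ++allocations;
    for (std::size_t i = 0; i != size; ++i) { data[i] = static_cast<T>(o.data[i]); }
  }

  ~VarPixel() { delete [] data; }

private:
  VarPixel(VarPixel const&);             // copies are irrelevant to this sketch
  VarPixel& operator=(VarPixel const&);
};
template <typename T> int VarPixel<T>::allocations = 0;

int main()
{
  VarPixel<unsigned short> input(4);      // InputPixelType-like pixel
  VarPixel<double>         real(input);   // conversion to a "RealType" pixel: 1 allocation
  VarPixel<unsigned short> output(real);  // cast back to the "OutputType": 1 more allocation
  (void)output;
  // Prints 3: one for the input pixel itself, plus one per conversion.
  std::cout << VarPixel<double>::allocations + VarPixel<unsigned short>::allocations
            << " allocations for a single pixel going through the chain\n";
  return 0;
}</pre>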

==== Pixel assignment => reallocation + copy of old values ====

In order to restrict the number of temporaries, or of conversions, we could make use of the assignment operators from <code>itk::VariableLengthVector</code>. Alas, their default behaviour is to always reallocate and to maintain the old values, even when the current memory buffer could be reused, and (resp.) when the old values will be overwritten anyway.
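The following toy sketch (not the actual <code>itk::VariableLengthVector</code> code) contrasts the always-reallocate behaviour described above with an assignment that reuses the existing buffer and skips preserving values that are about to be overwritten.

<pre>#include <algorithm>
#include <cstddef>

// Hypothetical heap-backed pixel illustrating the two assignment policies.
struct Pixel
{
  std::size_t size;
  double*     data;

  explicit Pixel(std::size_t n) : size(n), data(new double[n]()) {}
  ~Pixel() { delete [] data; }

  // "Always reallocate and keep old values" flavour: even when the sizes match,
  // a new buffer is allocated and the old elements are copied over first.
  void AssignAlwaysReallocate(Pixel const& rhs)
  {
    double* fresh = new double[rhs.size];
    std::copy(data, data + std::min(size, rhs.size), fresh);  // preserve old values
    delete [] data;
    data = fresh;
    size = rhs.size;
    std::copy(rhs.data, rhs.data + rhs.size, data);           // ...then overwrite them
  }

  // Buffer-reusing flavour: reallocate only when the destination is too small,
  // and don't bother preserving values that are about to be overwritten anyway.
  void AssignReuseBuffer(Pixel const& rhs)
  {
    if (size != rhs.size)
    {
      delete [] data;
      data = new double[rhs.size];
      size = rhs.size;
    }
    std::copy(rhs.data, rhs.data + rhs.size, data);
  }

private:
  Pixel(Pixel const&);             // copying is irrelevant to this sketch
  Pixel& operator=(Pixel const&);
};

int main()
{
  Pixel a(4), b(4);
  a.AssignAlwaysReallocate(b);   // allocates even though a's buffer is big enough
  a.AssignReuseBuffer(b);        // no allocation: same size, buffer reused
  return 0;
}</pre>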

==== Dynamic polymorphism ====

The <code>EvaluateXxxx</code> functions from Image Functions are all virtual. Indeed, interpolation functions (or other image functions) are a dynamic variation point: we need to be able to use one or the other.

This implies that the type of the returned value shall remain stable among all overrides. In other words, we cannot return an expression template (ET) -- without erasing the ET type.

==== Proxy Pixels ====

Some VLV pixels are quite peculiar: they are proxies to memory that does not belong to the pixel object.

This is done with a dedicated constructor. What is peculiar is that, once created, these pixels are copied. Officially, they are not proxies any more. But in practice, as RVO is likely to be triggered, the pixel object obtained will still be a proxy object.

There isn't any official proxy pixel type, only one pixel type. The fact that proxies are involved is completely transparent. It's just that several functions, like <code>VLV::operator=()</code> or <code>VLV::SetSize()</code>, will strip a pixel object of its proxy nature.
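A toy sketch of the proxy idea (not the actual VLV implementation): the same pixel class either owns its buffer or merely references external memory, and an operation such as <code>SetSize()</code> silently turns a proxy back into an owning pixel.

<pre>#include <cstddef>
#include <iostream>

// Toy illustration of a "proxy" pixel: the same class can either own its buffer
// or merely reference memory owned by someone else (e.g. an image buffer).
struct Pixel
{
  std::size_t size;
  double*     data;
  bool        owns;   // false => proxy: the memory belongs to somebody else

  explicit Pixel(std::size_t n) : size(n), data(new double[n]()), owns(true) {}
  Pixel(double* external, std::size_t n) : size(n), data(external), owns(false) {}  // proxy
  ~Pixel() { if (owns) { delete [] data; } }

  // Resizing (or reassigning) a proxy forces it to acquire its own buffer:
  // this is the "stripping a pixel of its proxy nature" described in the text.
  void SetSize(std::size_t n)
  {
    double* fresh = new double[n]();
    if (owns) { delete [] data; }
    data = fresh;
    size = n;
    owns = true;
  }

private:
  Pixel(Pixel const&);
  Pixel& operator=(Pixel const&);
};

int main()
{
  double imageBuffer[8] = { 0 };        // pretend this belongs to an image
  Pixel proxy(imageBuffer, 4);          // proxy: writes go straight to imageBuffer
  proxy.data[0] = 42.0;
  std::cout << imageBuffer[0] << "\n";  // 42: the pixel was only a view

  proxy.SetSize(4);                     // now detached: owns a fresh buffer
  proxy.data[0] = 7.0;
  std::cout << imageBuffer[0] << "\n";  // still 42: the proxy nature was stripped
  return 0;
}</pre>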

==== MT safety of cached pixel ====

Casting needs aside, when transforming expressions like

<pre>const RealType & valx0 = val00 + ( val10 - val00 ) * distance0;
const RealType & valx1 = val01 + ( val11 - val01 ) * distance0;
return valx0 + ( valx1 - valx0 ) * distance1;</pre>

into

<pre>RealType valx0 = val10;
valx0 -= val00;
valx0 *= distance0;
valx0 += val00;
RealType valx1 = val11;
valx1 -= val01;
valx1 *= distance0;
valx1 += val01;
valx1 -= valx0;
valx1 *= distance1;
valx1 += valx0;
return valx1;</pre>

so that many temporaries could be eliminated, we can see there are still two temporary objects in the local scope: <code>valx0</code> and <code>valx1</code>.

In order to also get rid of those objects, we could instead use cached objects, i.e.

<pre>RealType &valx0 = m_cachedValx0;
RealType &valx1 = m_cachedValx1;</pre>

Alas! This won't be thread safe: we need one cached <code>valx0</code> object per thread.

At this point, i.e. inside an image function, there is no way to identify the current thread. We could use TLS, except that the default threading scheme of ITK does not reuse threads -- thread pools are not yet the default. Why is this important? Because it may induce creating more cached variables than the number of threads used in the warp filter, which is not what we are looking for (i.e. eliminating constructions of VLV pixels). Moreover, constructing a TLS object on the fly would be tricky on pre-C++11 compilers.


Note that cached pixel variables would also be useful to store input pixel values as <code>RealType</code> pixel values before any processing. This would also be useful in other image functions like <code>otbImageToVectorImageCastFilter</code>, or the BCO interpolator, which creates and destroys matrices for each pixel processed.
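Here is a sketch of what such per-thread caches could look like; the names are hypothetical and <code>std::vector</code> stands in for a <code>RealType</code> pixel. One slot per thread, indexed by a <code>threadId</code> handed down by the filter, means no locking and no per-pixel construction.

<pre>#include <cstddef>
#include <vector>

// Hypothetical per-thread cache of working pixels, indexed by the threadId that
// a filter would pass down to its image function. Names are illustrative only.
struct PerThreadPixelCache
{
  typedef std::vector<double> RealPixel;   // stand-in for a RealType pixel

  PerThreadPixelCache(std::size_t nbThreads, std::size_t pixelSize)
    : m_valx0(nbThreads, RealPixel(pixelSize))
    , m_valx1(nbThreads, RealPixel(pixelSize))
  {}

  // Each thread only ever touches the slots at its own index,
  // so no locking is needed and no per-pixel allocation happens.
  RealPixel& Valx0(std::size_t threadId) { return m_valx0[threadId]; }
  RealPixel& Valx1(std::size_t threadId) { return m_valx1[threadId]; }

private:
  std::vector<RealPixel> m_valx0;
  std::vector<RealPixel> m_valx1;
};

// Sketch of how an evaluation function could use the cache instead of locals:
// the caller (the filter) is the one that knows both the thread count and the id.
double EvaluateWithCache(PerThreadPixelCache& cache, std::size_t threadId)
{
  PerThreadPixelCache::RealPixel& valx0 = cache.Valx0(threadId);  // no construction here
  PerThreadPixelCache::RealPixel& valx1 = cache.Valx1(threadId);
  valx0[0] = 1.0;          // ... incremental computations as shown above ...
  valx1[0] = 2.0;
  return valx0[0] + valx1[0];
}

int main()
{
  PerThreadPixelCache cache(/*nbThreads=*/4, /*pixelSize=*/3);
  return static_cast<int>(EvaluateWithCache(cache, /*threadId=*/0));
}</pre>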

=== The solutions ===

==== ...to (2.1) new pixels induced by <code>EvaluateXxx</code> signature ====

Two approaches can be imagined.

===== - the complex one that doesn't change (much) the signature =====

We could prepare an object that isn't the final numeric result, but an expression to be evaluated, and whose evaluation would store the computation result directly into the destination variable. In other words, we could return an ''expression template''.

This would be really complex. Local values (that store other computations) may need to see their lifetime artificially extended. Moreover, as <code>EvaluateXxxx</code> functions are virtual, we always need to return the same type. As a consequence, the exact type shall be erased (as in ''Type Erasure''), which will likely induce a small performance degradation. This is the problem evoked in §[[OptimizingVectorImageProcessing#Dynamic_polymorphism]].

This approach doesn't seem viable.

===== - the simple one that alters the signature =====

The simplest approach consists in altering the <code>EvaluateXxxx</code> signature from:

<pre>virtual OutputType EvaluateXxxx(params)</pre>

to

<pre>virtual void EvaluateXxxx(params, OutputType & output)</pre>
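As an illustration, here is roughly what an override could look like with the altered signature (hypothetical names, not the actual ITK/OTB classes): the caller owns the output pixel and the function fills it in place, so a single pixel object can serve for a whole image.

<pre>#include <vector>

// Hedged sketch of the out-parameter signature; "MyInterpolator", "ContinuousIndex"
// and "OutputPixel" are illustrative names only.
typedef std::vector<double> OutputPixel;
struct ContinuousIndex { double x, y; };

struct InterpolatorBase
{
  virtual ~InterpolatorBase() {}
  // New-style signature: no pixel is created inside the call.
  virtual void EvaluateAtContinuousIndex(ContinuousIndex const& idx,
                                         OutputPixel& output) const = 0;
};

struct MyInterpolator : InterpolatorBase
{
  virtual void EvaluateAtContinuousIndex(ContinuousIndex const& idx,
                                         OutputPixel& output) const
  {
    output.resize(3);             // reuses the caller's buffer when already big enough
    output[0] = idx.x;            // ... actual interpolation would go here ...
    output[1] = idx.y;
    output[2] = 0.0;
  }
};

int main()
{
  MyInterpolator  interp;
  OutputPixel     pixel;          // created once by the caller...
  ContinuousIndex idx = { 1.5, 2.5 };
  for (int i = 0; i != 1000; ++i)
  {
    interp.EvaluateAtContinuousIndex(idx, pixel);   // ...and reused for every pixel
  }
  return 0;
}</pre>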
====== The trade-off? ======
* All <code>EvaluateXxxx</code> functions need to be changed to profit from the optimization (1);
* Performance may be slightly worse on scalar types, but the cost will be negligible compared to the cost of calling a virtual function billions of times (2).


(1) Indeed, we can preserve the ''old'' signature while the code is being converted, e.g.

<pre>/*virtual*/ OutputType itk::ImageFunction::EvaluateXxxx(params) {
    OutputType output;
    this->EvaluateXxxx(params, output);
    return output;
}</pre>

(2) However, we can imagine duplicating all the functions: the classical one for scalar and array pixels, and the new one for VLV pixels. With a slight touch of meta-programming, we could have the compiler choose which kind of <code>EvaluateXxxx</code> to call. This idea may seem silly at first. However, it would provide a very simple way to propagate the current <code>threadId</code> to the <code>EvaluateXxxx</code> functions, which will always need to know the id of the current thread in order to use cached variables. See §[[OptimizingVectorImageProcessing#MT_safety_of_cached_pixel]].
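A minimal sketch of that kind of dispatch, with hypothetical names and <code>std::vector</code> standing in for a variable-length pixel: a small trait plus tag dispatching lets the compiler route scalar pixels to the classical, value-returning flavour and variable-length pixels to the in-place flavour.

<pre>#include <vector>

// Minimal sketch of the compile-time dispatch evoked above (hypothetical names;
// std::vector<T> stands in for a variable-length pixel type such as VLV).
struct VariableLengthTag {};
struct FixedLengthTag    {};

// Trait: which kind of pixel are we dealing with?
template <typename TPixel> struct PixelDispatch                  { typedef FixedLengthTag    Tag; };
template <typename T>      struct PixelDispatch<std::vector<T> > { typedef VariableLengthTag Tag; };

template <typename TPixel>
struct Interpolator
{
  // Public entry point: the compiler picks the right flavour at compile time.
  void Evaluate(double x, TPixel& output) const
  {
    EvaluateImpl(x, output, typename PixelDispatch<TPixel>::Tag());
  }

private:
  // Classical flavour: fine for scalar (and fixed-size array) pixels.
  TPixel EvaluateClassic(double x) const
  {
    return static_cast<TPixel>(x);
  }

  // New flavour: fills a caller-provided (possibly cached, per-thread) pixel.
  void EvaluateInPlace(double x, TPixel& output) const
  {
    output.resize(3);
    output[0] = output[1] = output[2] = x;
  }

  void EvaluateImpl(double x, TPixel& output, FixedLengthTag) const
  {
    output = EvaluateClassic(x);
  }
  void EvaluateImpl(double x, TPixel& output, VariableLengthTag) const
  {
    EvaluateInPlace(x, output);
  }
};

int main()
{
  Interpolator<double>               scalar;
  Interpolator<std::vector<double> > vectorial;
  double              s;
  std::vector<double> v;
  scalar.Evaluate(1.0, s);      // routed to the classical, value-returning flavour
  vectorial.Evaluate(1.0, v);   // routed to the in-place, variable-length flavour
  return 0;
}</pre>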


==== ...to (2.2) unwanted temporaries ====

''Disclaimer: I'll address casting issues in another section.''

Again, there are two classes of solutions to temporaries.

===== - Incremental computation of the results =====

As teased in [[OptimizingVectorImageProcessing#MT_safety_of_cached_pixel|§2.7]],

<pre>const RealType & valx0 = val00 + ( val10 - val00 ) * distance0;
const RealType & valx1 = val01 + ( val11 - val01 ) * distance0;
return valx0 + ( valx1 - valx0 ) * distance1;</pre>

can be rewritten into:

<pre>RealType valx0 = val10;
valx0 -= val00;
valx0 *= distance0;
valx0 += val00;
RealType valx1 = val11;
valx1 -= val01;
valx1 *= distance0;
valx1 += val01;
valx1 -= valx0;
valx1 *= distance1;
valx1 += valx0;
return valx1;</pre>
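This rewrite works because compound assignments can operate entirely in place. A toy sketch of the idea (not the actual VLV operators), assuming the destination pixel is already allocated, e.g. as a cached pixel:

<pre>#include <cstddef>
#include <vector>

// Toy pixel whose compound assignments work in place: once the buffer exists,
// the whole incremental sequence above runs without a single allocation.
struct RealPixel
{
  std::vector<double> v;

  RealPixel& operator-=(RealPixel const& o)
  { for (std::size_t i = 0; i != v.size(); ++i) { v[i] -= o.v[i]; } return *this; }
  RealPixel& operator+=(RealPixel const& o)
  { for (std::size_t i = 0; i != v.size(); ++i) { v[i] += o.v[i]; } return *this; }
  RealPixel& operator*=(double d)
  { for (std::size_t i = 0; i != v.size(); ++i) { v[i] *= d; } return *this; }
};

// The interpolation along one axis, written incrementally into 'valx0'
// (assumed to be a reusable, e.g. cached, pixel of the right size).
void LerpInto(RealPixel& valx0, RealPixel const& val00, RealPixel const& val10, double distance0)
{
  valx0 = val10;        // copies values; reuses valx0's buffer if already sized
  valx0 -= val00;
  valx0 *= distance0;
  valx0 += val00;       // valx0 == val00 + (val10 - val00) * distance0
}

int main()
{
  RealPixel val00, val10, valx0;
  val00.v.assign(4, 1.0);
  val10.v.assign(4, 3.0);
  valx0.v.resize(4);                 // sized once, e.g. as a cached pixel
  LerpInto(valx0, val00, val10, 0.25);
  return valx0.v[0] == 1.5 ? 0 : 1;  // 1.0 + (3.0 - 1.0) * 0.25 == 1.5
}</pre>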
====== Pros: ======
* We have here complete control over the computations done.
* This is likely the fastest way to proceed on VLV objects. Indeed, cache misses are less likely to happen than with the next approach -- on big vectors, which should not be our case.
====== Cons: ======
* The ''new'' way of doing the computations isn't exactly maintainable. It's quite the opposite, actually.
* With this way of proceeding, there is no other choice but to use cached variables. This is a consequence of [[OptimizingVectorImageProcessing#Pixel_casting_.3D.3E_new_pixel_created|§2.3 the required pixel castings]], which in turn introduces a need for [[OptimizingVectorImageProcessing#MT_safety_of_cached_pixel|§2.7 MT safety]].
===== - Expression Templates =====

Historically, [http://en.wikipedia.org/wiki/Expression_templates ''Expression Templates''] are one of the first C++ meta-programming techniques, if not the first.

Given an expression like <code>val00 + ( val10 - val00 ) * distance0</code>, the technique will not compute the expression immediately, but build an object that stores all the computations required. It's only when the result is assigned that the computation is really done. When matrices are involved, this technique can become really complex. In our case, we are working with vectors.

Unlike the previous approach, which first computes <code>val10 - val00</code> and so on, this time, when the expression is evaluated, it's done element after element: first we compute, and store, <code>val00[0] + ( val10[0] - val00[0] ) * distance0</code>, then we do the same on index 1, and so on. That's why it may be less cache-friendly on big vectors. Fortunately, this should not inconvenience us.
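Below is a minimal, self-contained sketch of the idea, written over <code>std::vector</code>; it is only an illustration and nothing like the machinery of Eigen, nor a design that would directly fit VLV. The operators build lightweight expression nodes, and a single element-wise loop runs at assignment time, into a reusable destination.

<pre>#include <cstddef>
#include <iostream>
#include <vector>

typedef std::vector<double> Pixel;

struct Add { static double apply(double a, double b) { return a + b; } };
struct Sub { static double apply(double a, double b) { return a - b; } };

// Node representing "left op right", evaluated lazily, element by element.
template <typename L, typename R, typename Op>
struct BinExpr
{
  L const& l;
  R const& r;
  double operator[](std::size_t i) const { return Op::apply(l[i], r[i]); }
  std::size_t size() const { return l.size(); }
};

// Node representing "left * scalar", also evaluated lazily.
template <typename L>
struct ScaleExpr
{
  L const& l;
  double   s;
  double operator[](std::size_t i) const { return l[i] * s; }
  std::size_t size() const { return l.size(); }
};

// The operators only build expression nodes: no allocation, no computation yet.
template <typename L, typename R>
BinExpr<L, R, Add> operator+(L const& l, R const& r) { BinExpr<L, R, Add> e = { l, r }; return e; }

template <typename L, typename R>
BinExpr<L, R, Sub> operator-(L const& l, R const& r) { BinExpr<L, R, Sub> e = { l, r }; return e; }

template <typename L>
ScaleExpr<L> operator*(L const& l, double s) { ScaleExpr<L> e = { l, s }; return e; }

// Assignment is where the work happens: a single pass, element after element,
// written straight into a reusable destination buffer -- no temporary pixels.
template <typename Expr>
void Assign(Pixel& dest, Expr const& e)
{
  dest.resize(e.size());
  for (std::size_t i = 0; i != e.size(); ++i) { dest[i] = e[i]; }
}

int main()
{
  Pixel val00(4, 1.0), val10(4, 3.0);
  double const distance0 = 0.25;
  Pixel valx0;   // reusable destination, e.g. a cached pixel

  Assign(valx0, val00 + ( val10 - val00 ) * distance0);
  std::cout << valx0[0] << "\n";   // prints 1.5
  return 0;
}</pre>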

====== Pros: ======
* Maintaining Image Functions is definitely easier.
* The conversion work is simplified.
====== Cons: ======
* Possibly less cache-friendly in situations that should not be ours -- to be benchmarked!
* Introducing Expression Templates (ET) into <code>itk::VariableLengthVector</code> may seem a little pointless, as many other FOSS libraries like Eigen, or even <code>std::valarray</code>, already provide intensively used and tested ET optimizations -- which unfortunately won't permit having proxy pixels.
* The ET introduction should also take advantage of C++11 rvalue references.
* The same cast of an element (from a pixel value) from <code>InputPixelType</code> to <code>RealPixelType</code> may be done several times.

==== ...to (2.3) Pixels casting ====

=== The results ===

==== Iterative arithmetic ====

The solution with:

* thread-safe cached pixels,
* iterative pixel arithmetic (i.e. <code>m_val01 = GetPixel(...); m_val02 = GetPixel(...); m_val02 -= m_val01; m_val02 *= d; m_val02 += m_val01;</code>),
* <code>EvaluateOptimized()</code> overloads that return pixel values through an [out] parameter

runs in 20 seconds (with 1 thread), and we can see in the callgrind profile that no allocations (nor releases) are performed per pixel.

[[File:AfterOptims-Warp.png]]

==== Expression Templates ====

=== Perspectives ===