Difference between revisions of "OptimizingVectorImageProcessing"
Luc.hermitte (Talk | contribs) (First draft) |
Luc.hermitte (Talk | contribs) m (→The context: English grammar) |
[[File:BeforeOptims-Warp.png|800px|center]]
Revision as of 17:44, 26 May 2015
I'll trace here the process I've followed to optimize the processing of an otb::VectorImage.
The context
We start from a mono-band image that we resample with an itk::LinearInterpolateImageFunction used from an otb::StreamingResampleImageFilter.
When the image is processed as an otb::Image, the execution takes around 20 seconds. When the same image is processed as an otb::VectorImage, the execution takes around 1 minute and 5-10 seconds. Nothing here is really unexpected, as Vector Images are known to be inefficient. Aren't they?
Let's see what callgrind has to say about this situation.
As we can see, most of the time is spent in memory allocations and releases. If we zoom into itk::LinearInterpolateImageFunction<>::EvaluateOptimized(), we can see that each call to this function is accompanied by 4 calls to itk::VariableLengthVector::AllocateElements(), and 6 to delete[].
(Actually, a few optimizations have already been done, but they are not enough.)
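To make the allocation pattern concrete, here is a minimal sketch of why expression-style arithmetic on a heap-backed pixel allocates on every call. It does not use ITK: `Pixel`, `g_allocations` and `interpolate()` are hypothetical names introduced only for this illustration, and the counts it produces are those of this toy type, not the 4 allocations and 6 releases measured above.

```cpp
#include <cstddef>

// Counts every heap allocation done by the toy pixel type below.
std::size_t g_allocations = 0;

// Hypothetical stand-in for a variable-length pixel (NOT itk::VariableLengthVector).
struct Pixel {
    std::size_t size;
    double*     data;

    explicit Pixel(std::size_t n, double v = 0.0) : size(n), data(new double[n]) {
        ++g_allocations;
        for (std::size_t i = 0; i < n; ++i) data[i] = v;
    }
    // Copying a pixel allocates a new buffer.
    Pixel(const Pixel& o) : size(o.size), data(new double[o.size]) {
        ++g_allocations;
        for (std::size_t i = 0; i < size; ++i) data[i] = o.data[i];
    }
    // Moving steals the buffer, so it never allocates.
    Pixel(Pixel&& o) noexcept : size(o.size), data(o.data) { o.size = 0; o.data = nullptr; }
    Pixel& operator=(const Pixel&) = delete;
    ~Pixel() { delete[] data; }
};

// Each binary operator has to build a brand new pixel for its result.
Pixel operator-(const Pixel& a, const Pixel& b) {
    Pixel r(a.size);
    for (std::size_t i = 0; i < a.size; ++i) r.data[i] = a.data[i] - b.data[i];
    return r;
}
Pixel operator+(const Pixel& a, const Pixel& b) {
    Pixel r(a.size);
    for (std::size_t i = 0; i < a.size; ++i) r.data[i] = a.data[i] + b.data[i];
    return r;
}
Pixel operator*(const Pixel& a, double d) {
    Pixel r(a.size);
    for (std::size_t i = 0; i < a.size; ++i) r.data[i] = a.data[i] * d;
    return r;
}

// Linear interpolation written as a single expression:
// one temporary per operator, hence three allocations per call.
Pixel interpolate(const Pixel& a, const Pixel& b, double d) {
    return a + (b - a) * d;
}
```

With this toy type, each call to `interpolate()` allocates three buffers and releases two temporaries before returning, whatever the number of bands. Repeated for every output pixel, this is the kind of overhead that shows up in the profile.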
The problems
New pixel value => new pixel created
Temporaries elimination
Pixel casting => new pixel created
Pixel assignment => reallocation + copy of old values
Dynamic polymorphism
Proxy Pixels
MT safety of cached pixel
The solutions
The results
Iterative arithmetic
The solution with:
- thread-safe cached pixels,
- iterative pixel arithmetic (i.e. m_val01 = GetPixel(...); m_val02 = GetPixel(...); m_val02 -= m_val01; m_val02 *= d; m_val02 += m_val01;),
- an EvaluateOptimized() that returns pixel values through an [out] parameter,
runs in 20 seconds (with 1 thread), and we can see in the callgrind profile that no allocations (nor releases) are performed on each pixel.
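The three ingredients listed above can be sketched as follows. This is only an illustration under assumptions, not the code actually used in OTB/ITK: `BufferPixel`, `Workspace`, `interpolateInto()` and `g_buffer_allocations` are hypothetical names, and thread safety is obtained here by simply giving each thread its own `Workspace`, which may differ from the real implementation.

```cpp
#include <cstddef>

// Counts every heap allocation done by the toy pixel type below.
std::size_t g_buffer_allocations = 0;

// Hypothetical pixel that owns a fixed-size buffer and only offers in-place operations.
struct BufferPixel {
    std::size_t size;
    double*     data;

    explicit BufferPixel(std::size_t n, double v = 0.0) : size(n), data(new double[n]) {
        ++g_buffer_allocations;
        for (std::size_t i = 0; i < n; ++i) data[i] = v;
    }
    BufferPixel(const BufferPixel&) = delete;
    BufferPixel& operator=(const BufferPixel&) = delete;
    ~BufferPixel() { delete[] data; }

    // Copies the values into the existing buffer (sizes are assumed equal).
    void assign(const BufferPixel& o) {
        for (std::size_t i = 0; i < size; ++i) data[i] = o.data[i];
    }
    BufferPixel& operator-=(const BufferPixel& o) {
        for (std::size_t i = 0; i < size; ++i) data[i] -= o.data[i];
        return *this;
    }
    BufferPixel& operator+=(const BufferPixel& o) {
        for (std::size_t i = 0; i < size; ++i) data[i] += o.data[i];
        return *this;
    }
    BufferPixel& operator*=(double d) {
        for (std::size_t i = 0; i < size; ++i) data[i] *= d;
        return *this;
    }
};

// Cached working pixels, allocated once. Assumption: one Workspace per thread.
struct Workspace {
    BufferPixel val01;
    BufferPixel val02;
    explicit Workspace(std::size_t n) : val01(n), val02(n) {}
};

// Iterative arithmetic; the result is written through the [out] parameter.
void interpolateInto(const BufferPixel& p0, const BufferPixel& p1, double d,
                     Workspace& ws, BufferPixel& out) {
    ws.val01.assign(p0);   // m_val01 = GetPixel(...)
    ws.val02.assign(p1);   // m_val02 = GetPixel(...)
    ws.val02 -= ws.val01;  // m_val02 -= m_val01
    ws.val02 *= d;         // m_val02 *= d
    ws.val02 += ws.val01;  // m_val02 += m_val01
    out.assign(ws.val02);
}
```

Once the `Workspace` and the output pixel exist, calling `interpolateInto()` in a loop never touches the allocator. The price is a less natural call syntax than `a + (b - a) * d`.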