MPI Support
This page contains documentation about the MPI support available since the OTB 5.6 release.
How does it work?
This section describes how the magic operates.
Message Passing Interface (MPI) in OTB
MPI is a communication standard for nodes composing a cluster. Basically, instead of using one node (that is, a computer) to write an image, an entire cluster can be used.
Writing data in parallel
OTB currently uses MPI to write an image from multiple nodes, handling both data writing and workload balancing, for every pipeline that supports streaming. Depending on how OTB has been built, there are two options to write one image using multiple nodes (see the example after this list):
- Writing multiple tiles and a VRT index file (using output image file extension .vrt; OTB_USE_MPI must be set to ON),
- Writing one GeoTIFF in parallel using the SPTW library (using output image file extension .tif; OTB_USE_SPTW must be set to ON).
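A minimal sketch of the two modes, assuming an MPI-enabled OTB build; the application (otbcli_ExtractROI), process count and file names below are placeholders:

# Build-time options (CMake flags), one per mode:
cmake -DOTB_USE_MPI=ON -DOTB_USE_SPTW=ON ...

# VRT mode: each MPI process writes its own tiles, indexed by a .vrt file
mpirun -np 4 otbcli_ExtractROI -in input.tif -out output.vrt

# SPTW mode: all MPI processes write into a single GeoTIFF
mpirun -np 4 otbcli_ExtractROI -in input.tif -out output.tif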
Performance and limits
The speedup is defined as the ratio between the elapsed time of a process on one processing node and the elapsed time of the same process parallelized over N processing nodes. For one MPI writer, the speedup provided by the MPI support is maximized if the entire upstream process supports streaming. Theoretically, the speedup can be linear. However, the I/O burden considerably limits the approach when the processing time is negligible compared to the writing time. But, hey, we are not using the MPI support to compute NDVI here... See this paper for more details.
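Written as a formula, with T(1) the elapsed time on a single node and T(N) the elapsed time over N nodes:

speedup(N) = T(1) / T(N)

The ideal (linear) case is speedup(N) = N, which in practice is only approached when I/O is not the bottleneck.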
Which OTB applications benefit from it?
This section lists the OTB 6.0 applications that natively support the parallel approach:
otbcli_BandMath otbcli_BinaryMorphologicalOperation otbcli_BlockMatching otbcli_BundleToPerfectSensor otbcli_ComputeModulusAndPhase otbcli_ConcatenateImages otbcli_Convert otbcli_Despeckle otbcli_DisparityMapToElevationMap otbcli_EdgeExtraction otbcli_ExtractROI otbcli_FineRegistration otbcli_FusionOfClassifications otbcli_GrayScaleMorphologicalOperation otbcli_GridBasedImageResampling otbcli_HaralickTextureExtraction otbcli_ImageClassifier otbcli_LocalStatisticExtraction otbcli_MeanShiftSmoothing otbcli_Mosaic otbcli_MultiResolutionPyramid otbcli_MultivariateAlterationDetector otbcli_OpticalCalibration otbcli_OrthoRectification otbcli_Pansharpening otbcli_PredictRegression otbcli_Quicklook otbcli_RadiometricIndices otbcli_Rescale otbcli_RigidTransformResample otbcli_SARCalibration otbcli_SARDecompositions otbcli_SARPolarMatrixConvert otbcli_SARPolarSynth otbcli_SFSTextureExtraction otbcli_SarRadiometricCalibration otbcli_Smoothing otbcli_SplitImage otbcli_StereoFramework otbcli_StereoRectificationGridGenerator otbcli_Superimpose otbcli_TileFusion
Use case: SLURM
Soon...
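In the meantime, here is a minimal sketch of what a SLURM batch script could look like, assuming an OpenMPI-based build and the socket-binding advice given in the Known issues section below; the job name, node and CPU counts, and the application call are placeholder assumptions, to be adapted to your cluster.

#!/bin/bash
#SBATCH --job-name=otb-mpi        # placeholder job name
#SBATCH --nodes=2                 # placeholder: 2 compute nodes
#SBATCH --ntasks-per-node=4       # one MPI process per socket (4 sockets per node assumed)
#SBATCH --cpus-per-task=12        # 12 CPUs per socket assumed

# one ITK thread per CPU of the socket (see Socket Binding below)
export ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=12

mpirun --bind-to socket otbcli_MyApplication ... -out output.vrt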
Known issues
Socket Binding
Socket binding can sometimes avoid caching issues, especially when CPUs from different sockets have to exchange a lot of data. The ITK multithreading (pthread) framework works best in a shared-memory context, which is guaranteed when all threads run on the same socket. To enforce this thread placement, it is advised to (1) bind each MPI process to only one socket, and (2) set ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS to the number of CPUs per socket. Of course, the number of MPI processes should be adapted accordingly. For instance, on a 4-socket node with 48 CPUs (12 CPUs per socket), it is wise to run 4 MPI processes in parallel, each one using the 12 CPUs of its socket. For instance, with mpirun:
export ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=12
mpirun --bind-to socket -np 4 otbcli_MyApplication ...
MCA Leave Pinned
OpenMPI overloads some memory-related functions like malloc, free, etc. The OpenMPI implementation offers several optimizations and options. For example, hooks are implemented for malloc to pin (lock) memory, in order to perform RDMA (Remote Direct Memory Access: a way to access the memory of another computer directly over the network). This might not be optimal in some cases. This memory management can be configured in OpenMPI, for instance at build time with the --without-memory-manager option. Besides, the memory management implementation in OpenMPI seems to have changed since version 1.3 (more info here: https://www.open-mpi.org/faq/?category=openfabrics#leave-pinned-memory-management). In recent versions of OpenMPI, the runtime parameter mpi_leave_pinned defaults to -1, meaning OpenMPI automatically chooses the "good" value (0 or 1), and this is generally 1 when InfiniBand is used (as is the case on the cluster I use). The workaround to fall back to the OS memory management is to use mpirun --mca mpi_leave_pinned 0. (see this topic)
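For instance, the workaround can be combined with the socket-binding advice above; the process count and application call are placeholders:

# fall back to the OS memory management by disabling leave-pinned behaviour
mpirun --mca mpi_leave_pinned 0 --bind-to socket -np 4 otbcli_MyApplication ...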