MPI Support

From OTBWiki
Revision as of 10:22, 30 May 2017 by Remicress (Talk | contribs) (Socket Binding)


This page contains documentation on the MPI support, available since the OTB 5.6 release.


How does it work?

This section describes how the magic operates.

Message Passing Interface (MPI) in OTB

MPI is a communication standard for nodes composing a cluster. Basically, instead of using one node (that is, a computer) to write an image, an entire cluster can be used.

Writing data in parallel

Currently, OTB uses MPI to write an image using multiple nodes, handling both the data writing and the workload balancing, for every pipeline that supports streaming. Depending on how OTB has been built, there are two options to write one image using multiple nodes:

  • Writing multiple tiles plus a VRT index file (using the output image file extension .vrt; OTB_USE_MPI must be set to ON),
  • Writing one GeoTiff in parallel (using the output image file extension .tif; OTB_USE_SPTW must be set to ON) using the SPTW library.
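As a sketch (assuming OTB was built with the relevant flags and an MPI launcher is available), the two modes above are selected purely by the output file extension. The input/output file names here are hypothetical:

```shell
# Hypothetical run: 4 MPI processes writing the result of a smoothing.
# With OTB_USE_MPI=ON, a .vrt extension produces one tile per process
# plus a VRT file referencing them:
mpirun -np 4 otbcli_Smoothing -in input.tif -out output.vrt

# With OTB_USE_SPTW=ON, a .tif extension makes all processes write
# into a single GeoTiff in parallel through the SPTW library:
mpirun -np 4 otbcli_Smoothing -in input.tif -out output.tif
```

otbcli_Smoothing is one of the applications listed below; any streaming-capable application can be substituted.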

Performances and limits

The speedup is defined as the ratio between the elapsed time of a process on one processing node and the elapsed time of the same process parallelized over N processing nodes. For one MPI writer, the speedup provided by the MPI support is maximized when the entire upstream process supports streaming. Theoretically, the speedup can be linear. However, the I/O burden considerably limits the approach when the processing time is negligible compared to the writing time. But, hey, we are not using the MPI support to compute NDVI here... See this paper for more details.
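A back-of-the-envelope model (an Amdahl-style sketch, with symbols T_p and T_w introduced here, not from the original page) makes the I/O limit explicit:

```latex
% Split the single-node elapsed time into a parallelizable processing
% part T_p and an I/O-bound writing part T_w. Idealizing (perfect
% streaming, writing not sped up by parallelism), the time on N nodes is
%   T(N) = T_p / N + T_w
% so the speedup is
\[
  S(N) \;=\; \frac{T(1)}{T(N)} \;=\; \frac{T_p + T_w}{\dfrac{T_p}{N} + T_w}
\]
% If T_p >> T_w, then S(N) -> N: near-linear speedup.
% If T_p is epsilon compared to T_w, then S(N) -> 1: no gain, which is
% why a cheap pixel-wise operation such as NDVI does not benefit.
```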

Which OTB applications benefit from it?

This section lists the applications that currently benefit from the MPI support. Here are the OTB 6.0 applications that natively support the parallel approach:

otbcli_BandMath
otbcli_BinaryMorphologicalOperation
otbcli_BlockMatching
otbcli_BundleToPerfectSensor
otbcli_ComputeModulusAndPhase
otbcli_ConcatenateImages
otbcli_Convert
otbcli_Despeckle
otbcli_DisparityMapToElevationMap
otbcli_EdgeExtraction
otbcli_ExtractROI
otbcli_FineRegistration
otbcli_FusionOfClassifications
otbcli_GrayScaleMorphologicalOperation
otbcli_GridBasedImageResampling
otbcli_HaralickTextureExtraction
otbcli_ImageClassifier
otbcli_LocalStatisticExtraction
otbcli_MeanShiftSmoothing
otbcli_Mosaic
otbcli_MultiResolutionPyramid
otbcli_MultivariateAlterationDetector
otbcli_OpticalCalibration
otbcli_OrthoRectification
otbcli_Pansharpening
otbcli_PredictRegression
otbcli_Quicklook
otbcli_RadiometricIndices
otbcli_Rescale
otbcli_RigidTransformResample
otbcli_SARCalibration
otbcli_SARDecompositions
otbcli_SARPolarMatrixConvert
otbcli_SARPolarSynth
otbcli_SFSTextureExtraction
otbcli_SarRadiometricCalibration
otbcli_Smoothing
otbcli_SplitImage
otbcli_StereoFramework
otbcli_StereoRectificationGridGenerator
otbcli_Superimpose
otbcli_TileFusion

Use case : SLURM

Soon...

Known issues

Socket Binding

Socket binding can sometimes avoid caching issues, especially when CPUs from different sockets have to exchange a lot of data. The ITK multithreading (pthread) framework works best in a shared-memory context, which is guaranteed when all threads are on the same socket. To enforce this thread placement, it is advised to (1) bind each MPI process to only one socket, and (2) set ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS to the number of CPUs per socket. Of course, the number of MPI processes should be adapted accordingly. For instance, on a 4-socket machine with 48 CPUs (12 CPUs per socket), it is wise to execute 4 MPI processes in parallel, each one using the 12 CPUs of its socket. For instance, with mpirun:

export ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=12
mpirun --bind-to socket -np 4 otbcli_MyApplication ...

OpenMPI MCA Leave Pinned

OpenMPI overloads some memory-related functions like malloc, free, etc. The OpenMPI implementation offers various optimizations and options; for example, some hooks are implemented for malloc to lock (pin) memory, in order to perform RDMA (Remote Direct Memory Access: a kind of access to the memory of another computer via the network). This might not be optimal in some cases. This memory management can be configured in OpenMPI, for instance at build time with the --without-memory-manager option. Besides, the memory management implementation in OpenMPI seems to have changed since version 1.3 (more info here: https://www.open-mpi.org/faq/?category=openfabrics#leave-pinned-memory-management). In recent versions of OpenMPI, the runtime parameter mpi_leave_pinned defaults to -1, meaning OpenMPI automatically chooses the "good" value (0 or 1), and this is generally 1 when InfiniBand is used (as is the case on the cluster I use). The workaround to fall back to the OS memory management is to use mpirun --mca mpi_leave_pinned 0. (see this topic)
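As a sketch, the workaround described above can be combined with the socket-binding advice from the previous section (the application name is a placeholder, as in the earlier example):

```shell
# Disable OpenMPI's "leave pinned" memory management so malloc/free
# fall back to the regular OS memory manager:
export ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=12
mpirun --mca mpi_leave_pinned 0 --bind-to socket -np 4 otbcli_MyApplication ...

# The effective value of the parameter can be inspected with ompi_info
# (output format varies across OpenMPI versions):
ompi_info --all | grep leave_pinned
```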