Request for Changes-36: Samples selection filter
Status
- Authors: Christophe Palmann, Guillaume Pasero
- Submitted on 01.04.2016
- Updated on 17.06.2016
- Proposed target release : 5.6
- Adopted : +3 from Jordi, Victor, Guillaume
- Merged : 53f58478b79da7525e111d824517f867f5750f41
Summary
This RFC is part of the work to introduce a new sampling framework for machine learning (see also [1]). It covers step 2, sample selection (step 1 is Polygon Analysis).
The filter/application selects a set of samples from geometries intended for training (each geometry should have a field giving its class). The geometries must first be analyzed by the PolygonClassStatistics application, which computes statistics about them. These statistics must then be passed to the filter/application.
There are four different strategies to select samples (parameter strat in the related application) :
- fit to the smallest class (default) : select the same number of samples in each class, equal to the size of the smallest one.
- same size for all classes : select the same number of samples in each class (this number should be less than or equal to the size of the smallest class)
- number of samples by class : set the required number for each class manually
- take all samples
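As an illustration, the four strategies can be sketched as a small Python function. This is not the OTB C++ implementation: the function name, its arguments and the "all" key are hypothetical, while the keys smallest/constant/byclass mirror the strat values of the application. Raising an error when a request exceeds the available size is an assumption of this sketch.

```python
def samples_per_class(class_sizes, strategy="smallest", constant=None, by_class=None):
    """Return {class name: number of samples to select} (illustrative sketch).

    class_sizes maps each class name to its number of available samples,
    as would be reported by the polygon analysis step.
    """
    smallest = min(class_sizes.values())
    if strategy == "smallest":
        # Fit to the smallest class: same count everywhere.
        return {name: smallest for name in class_sizes}
    if strategy == "constant":
        # Same user-chosen count everywhere, bounded by the smallest class.
        if constant is None or constant > smallest:
            raise ValueError("constant must be set and <= smallest class size")
        return {name: constant for name in class_sizes}
    if strategy == "byclass":
        # One manually chosen count per class.
        for name, needed in by_class.items():
            if needed > class_sizes[name]:
                raise ValueError("too many samples requested for " + str(name))
        return dict(by_class)
    if strategy == "all":  # hypothetical key for "take all samples"
        return dict(class_sizes)
    raise ValueError("unknown strategy: " + str(strategy))


sizes = {"water": 120, "forest": 300, "urban": 45}
print(samples_per_class(sizes))
print(samples_per_class(sizes, "constant", constant=30))
```

With these hypothetical class sizes, the default strategy requests 45 samples in every class, since "urban" is the smallest one.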
There are also different sampling types to select N samples among P :
- Periodic : the sampler selects samples periodically. Example for N=4 and P=13 : 1-0-0-1-0-0-1-0-0-1-0-0-0. A setting also adds jitter during sample selection, so that each sample can move inside its period. For instance, a 1-among-3 sampling can give 1-0-0, 0-1-0 or 0-0-1.
- Pattern : the sampler uses custom patterns to select samples. They can be randomly generated, or given by the user. Example with the pattern 1-0-0-1 : 1-0-0-1-1-0-0-1-1-0-0-1-1-0-0-1-1-0-0-1-...
- Random : the sampler uses a random shuffle to choose the indices of the samples to take.
Note : the Pattern sampler is not exposed in the application.
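The three sampling types can be sketched in Python as functions that return a 0/1 mask over P candidates. This is a minimal sketch meant to reproduce the examples above, not the logic of the OTB samplers: the function names are hypothetical, and using an integer period P // N with one sample per period is an assumption.

```python
import random


def periodic_mask(n, p, jitter=False, rng=None):
    """Select n samples among p, one per period of length p // n."""
    if not 0 < n <= p:
        raise ValueError("need 0 < n <= p")
    rng = rng or random.Random()
    period = p // n
    mask = [0] * p
    for k in range(n):
        # With jitter, the sample may move anywhere inside its own period.
        offset = rng.randrange(period) if jitter else 0
        mask[k * period + offset] = 1
    return mask


def pattern_mask(pattern, p):
    """Repeat a user-given 0/1 pattern over p candidates."""
    return [pattern[i % len(pattern)] for i in range(p)]


def random_mask(n, p, rng=None):
    """Shuffle the candidate indices and keep the first n of them."""
    rng = rng or random.Random()
    indices = list(range(p))
    rng.shuffle(indices)
    mask = [0] * p
    for i in indices[:n]:
        mask[i] = 1
    return mask


print("-".join(map(str, periodic_mask(4, 13))))
print("-".join(map(str, pattern_mask([1, 0, 0, 1], 12))))
```

Without jitter, periodic_mask(4, 13) gives the sequence 1-0-0-1-0-0-1-0-0-1-0-0-0 quoted in the Periodic example.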
Once the strategy and sampler type are selected, the filter/application writes the selected samples to an OGR file.
Rationale
See request for comments #20 : [2]
Implementation details
Classes and files
M Modules/Filtering/Statistics/CMakeLists.txt
A Modules/Filtering/Statistics/include/otbPatternSampler.h
A Modules/Filtering/Statistics/include/otbPeriodicSampler.h
A Modules/Filtering/Statistics/include/otbRandomSampler.h
A Modules/Filtering/Statistics/include/otbSamplerBase.h
A Modules/Filtering/Statistics/src/CMakeLists.txt
A Modules/Filtering/Statistics/src/otbPatternSampler.cxx
A Modules/Filtering/Statistics/src/otbPeriodicSampler.cxx
A Modules/Filtering/Statistics/src/otbRandomSampler.cxx
A Modules/Filtering/Statistics/src/otbSamplerBase.cxx
In module Statistics, the sampler types have been implemented as independent classes, so they can be reused in any iteration loop. They have a Reset() method and a TakeSample() method that returns a boolean (to be called inside the iteration loop).
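The Reset()/TakeSample() protocol can be illustrated with a stateful Python class used inside an ordinary loop. The class below is a hypothetical stand-in, not a binding of the OTB C++ samplers; its constructor arguments and its periodic rule are assumptions made for the sketch.

```python
class PeriodicSamplerSketch:
    """Stateful sampler: call take_sample() once per candidate, in order."""

    def __init__(self, needed, total):
        if not 0 < needed <= total:
            raise ValueError("need 0 < needed <= total")
        self.needed = needed
        self.period = total // needed
        self.reset()

    def reset(self):
        # Forget all progress so the sampler can serve a new loop.
        self.processed = 0
        self.chosen = 0

    def take_sample(self):
        # Take the first candidate of each period until enough are chosen.
        take = self.chosen < self.needed and self.processed % self.period == 0
        self.processed += 1
        if take:
            self.chosen += 1
        return take


pixels = ["a", "b", "c", "d", "e", "f"]
sampler = PeriodicSamplerSketch(needed=2, total=len(pixels))
selected = [p for p in pixels if sampler.take_sample()]
print(selected)

sampler.reset()  # the same object can now drive another iteration loop
```

Because the sampler only keeps counters, it does not need to know what is being iterated over: pixels, points or features all work the same way.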
M Modules/Learning/Sampling/include/otbOGRDataToClassStatisticsFilter.h
M Modules/Learning/Sampling/include/otbOGRDataToClassStatisticsFilter.txx
A Modules/Learning/Sampling/include/otbOGRDataToSamplePositionFilter.h
A Modules/Learning/Sampling/include/otbOGRDataToSamplePositionFilter.txx
A Modules/Learning/Sampling/include/otbPersistentSamplingFilterBase.h
A Modules/Learning/Sampling/include/otbPersistentSamplingFilterBase.txx
D Modules/Learning/Sampling/include/otbPolygonClassStatisticsAccumulator.h
D Modules/Learning/Sampling/include/otbPolygonClassStatisticsAccumulator.txx
A Modules/Learning/Sampling/include/otbSamplingRateCalculator.h
M Modules/Learning/Sampling/otb-module.cmake
M Modules/Learning/Sampling/src/CMakeLists.txt
D Modules/Learning/Sampling/src/otbPolygonClassStatisticsAccumulator.cxx
A Modules/Learning/Sampling/src/otbSamplingRateCalculator.cxx
In module Sampling, a new base filter has been added to be used as a common base between otb::OGRDataToClassStatisticsFilter, otb::OGRDataToSamplePositionFilter, and the future sample extraction filter. The class otb::PersistentSamplingFilterBase handles:
- common parameters such as input layer index, field name for class value, OGRLayer creation options,...
- preparation of in-memory vector buffers to handle multi-threading
- streaming processing based on image tiles
- optional mask handling
- iteration over geometries
- progress report
This base class has been used to derive otb::OGRDataToClassStatisticsFilter and otb::OGRDataToSamplePositionFilter, which removes a lot of duplicated code. The code from otb::PolygonClassStatisticsAccumulator has been split between the base class and otb::OGRDataToClassStatisticsFilter. The PolygonAnalysis and SampleSelection steps can now run multi-threaded.
The new filter otb::OGRDataToSamplePositionFilter persistently streams an image (with an optional mask) and selects samples from it. The samples are output as points in an OGRDataSource. The filter is templated over the sampler type (periodic/pattern/random). All the fields present in the input geometry are copied to the output points, along with the original FID.
The filter also supports several layers of samplers : when a sample is tested, if the current layer does not take it, the next layer is asked (and so on). This has been added in preparation for integration into the existing application TrainImagesClassifier, where training and validation samples are chosen from the same input geometries. In the future, it may also be used for stratified sampling.
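The layered behaviour can be sketched in Python as follows. The helper names and the toy EveryKth sampler are hypothetical, and the sketch assumes that a deeper layer is only consulted for samples rejected by all previous layers; the actual filter may differ in its details.

```python
class EveryKth:
    """Toy sampler: takes the first of every k candidates it is asked about."""

    def __init__(self, k):
        self.k = k
        self.seen = 0

    def take_sample(self):
        take = self.seen % self.k == 0
        self.seen += 1
        return take


def assign_layer(samplers):
    """Return the index of the first layer that takes the sample, else None."""
    for index, sampler in enumerate(samplers):
        if sampler.take_sample():
            return index
    return None


# Layer 0 could hold training samples and layer 1 validation samples.
layers = [EveryKth(2), EveryKth(2)]
assignment = [assign_layer(layers) for _ in range(8)]
print(assignment)
```

A sample taken by the first layer is never offered to the second one, so the two output sets are disjoint by construction.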
M Modules/Adapters/GdalAdapters/include/otbOGRFieldWrapper.txx
M Modules/Adapters/GdalAdapters/src/otbOGRFieldWrapper.cxx
The modifications to OGRFieldWrapper only fix compilation issues. The 'using namespace boost::mpl' directive was causing trouble on Windows (the problem was not seen on develop). It has been replaced by an alias : "namespace mpl = boost::mpl".
Applications
M Modules/Applications/AppClassification/app/CMakeLists.txt
M Modules/Applications/AppClassification/app/otbPolygonClassStatistics.cxx
A Modules/Applications/AppClassification/app/otbSampleSelection.cxx
The application exposes the main filter otb::OGRDataToSamplePositionFilter with 3 flavours depending on the sampler type (sampler -> periodic/pattern/random). It also offers the 3 strategies to compute the number of samples to select in each class (strat -> smallest/constant/byclass).
Note on the "byclass" strategy : the application can read the required number of samples per class from a CSV file (first column = class name, second column = number of required samples). There is also an output CSV file that contains all the sampling rates (number of required samples, total number of samples, corresponding rate).
A common feature has been added to PolygonClassStatistics and SampleSelection : the reprojection of input vectors. When their projection differs from that of the input image, the vectors are reprojected.
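The CSV exchange can be sketched in Python with the standard csv module. The comma delimiter, the absence of a header row in the input, the header and column order of the output, and all function names are assumptions of this sketch; only the column meanings come from the description above.

```python
import csv
import io


def read_required(text):
    """Parse 'class name, required samples' rows into a dict."""
    required = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 2:
            continue  # skip blank or incomplete lines
        required[row[0].strip()] = int(row[1])
    return required


def rate_rows(required, totals):
    """Build (class, required, total, rate) tuples, sorted by class name."""
    rows = []
    for name in sorted(required):
        total = totals[name]
        rate = required[name] / total if total else 0.0
        rows.append((name, required[name], total, rate))
    return rows


def write_rates(rows):
    """Serialize the rate rows as CSV text with a header line."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["class", "required", "total", "rate"])
    writer.writerows(rows)
    return out.getvalue()


required = read_required("water,50\nforest,100\n")
rows = rate_rows(required, {"water": 200, "forest": 400})
print(write_rates(rows))
```

The totals dictionary stands for the per-class sample counts produced by the polygon analysis step; here its values are made up for the example.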
Tests
M Modules/Filtering/Statistics/test/CMakeLists.txt
A Modules/Filtering/Statistics/test/otbSamplerTest.cxx
M Modules/Filtering/Statistics/test/otbStatisticsTestDriver.cxx
M Modules/Learning/Sampling/test/CMakeLists.txt
A Modules/Learning/Sampling/test/otbOGRDataToSamplePositionFilterTest.cxx
A Modules/Learning/Sampling/test/otbSamplingRateCalculatorTest.cxx
M Modules/Learning/Sampling/test/otbSamplingTestDriver.cxx
M Modules/Applications/AppClassification/test/CMakeLists.txt
The implemented tests only involve the classes mentioned in this RFC.
Documentation
Cookbook : see the modifications to the pbclassif.tex file.