Request for Comments-27: Large scale Random Forest


Status

  • Author: Manuel Grizonnet
  • Submitted on 2016/02/10
  • Open for comments

Content

What changes will be made and why would they make a better Orfeo ToolBox?

This RFC introduces a new OTB module that provides a framework for a memory-aware Random Forest (RF) algorithm. It combines several strategies for building an ensemble of decision trees and makes it possible to train an RF on a training dataset of arbitrary size. The algorithm is designed to use memory efficiently while minimizing I/O operations. It continues the contribution made by Pierre Lassalle during his PhD thesis.

The thesis manuscript (in French) is available here (see p. 115):

http://jordiinglada.net/stok/manuscrit-lassalle-final.pdf

Jordi has started working on the code and pushed some patches to a fork:

http://tully.ups-tlse.fr/jordi/otbsmart/

We should start from this version.

The objective is to develop modular and reusable filters and applications that support the different learning strategies proposed by P. Lassalle:

  • InMemory strategy: basic strategy, which efficiently implements the decision tree and RF methods.
  • InPartitionMemory strategy: selected when only the distribution of the samples in the decision tree (the partition table) can be stored in main memory. It relies on the third-party library STXXL.
  • OutMemory strategy: used when the samples (with their labels) and the partition table cannot be stored in memory. It relies on the third-party library Kyoto Cabinet.
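The choice between the three strategies can be pictured as a comparison between estimated data sizes and a memory budget. The sketch below is only an illustration of that idea, not the selection logic of otbsmart: the names (Strategy, SamplesBytes, PartitionBytes, SelectStrategy), the storage layout (one float per feature, one 32-bit label and one 32-bit partition entry per sample) and the thresholds are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names: the enumerators mirror the three strategies listed
// above, but none of these identifiers come from the otbsmart code base.
enum class Strategy { InMemory, InPartitionMemory, OutMemory };

// Estimated size in bytes of the sample set, assuming one float per feature
// plus one 32-bit label per sample (an assumed layout).
inline std::uint64_t SamplesBytes(std::uint64_t nbSamples, std::uint64_t nbFeatures)
{
  return nbSamples * (nbFeatures * sizeof(float) + sizeof(std::uint32_t));
}

// Estimated size in bytes of the partition table, assuming one 32-bit tree
// node index per sample (an assumed layout).
inline std::uint64_t PartitionBytes(std::uint64_t nbSamples)
{
  return nbSamples * sizeof(std::uint32_t);
}

// Picks a strategy from the estimated sizes and a memory budget in bytes.
inline Strategy SelectStrategy(std::uint64_t nbSamples,
                               std::uint64_t nbFeatures,
                               std::uint64_t budgetBytes)
{
  assert(nbFeatures > 0);
  const std::uint64_t samples = SamplesBytes(nbSamples, nbFeatures);
  const std::uint64_t partition = PartitionBytes(nbSamples);

  // Samples and partition table both fit in main memory.
  if (samples + partition <= budgetBytes)
    return Strategy::InMemory;

  // Only the partition table fits; the samples stay on disk.
  if (partition <= budgetBytes)
    return Strategy::InPartitionMemory;

  // Neither the samples nor the partition table fit.
  return Strategy::OutMemory;
}
```

With these assumptions, one million samples of 100 features need about 404 MB for the samples and 4 MB for the partition table, so a budget of 8 MB would lead to InPartitionMemory and a budget of 1 MB to OutMemory.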

Strategy for code integration:

Make otbsmart a remote module

  • Move the code into the proper directories.
  • Turn the otbsmart main executable into an application.
  • Allow the strategies and the application to be tested on CDash.
  • Note that this module will require C++11.

Make InMemory and InPartitionMemory strategies compatible with otb::MachineLearningModel class

Encapsulate the RF InMemory and InPartitionMemory strategies in otb::MachineLearningModel classes.
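The encapsulation idea can be sketched as follows. This is a simplified illustration only: ModelInterface is a stand-in written for this example and is not the real otb::MachineLearningModel API, and MajorityLabelModel is a trivial placeholder learner, not a Random Forest. The point is that calling code depends only on the abstract Train/Predict interface, so a large scale RF strategy could be plugged in behind it.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

using Sample = std::vector<float>;
using Label = unsigned int;

// Stand-in for the role played by otb::MachineLearningModel.
// Hypothetical and simplified: this is NOT the real OTB class.
class ModelInterface
{
public:
  virtual ~ModelInterface() = default;
  virtual void Train(const std::vector<Sample>& samples,
                     const std::vector<Label>& labels) = 0;
  virtual Label Predict(const Sample& sample) const = 0;
};

// Placeholder learner that always predicts the most frequent training label.
// A real implementation would delegate to an InMemory or InPartitionMemory
// RF strategy instead.
class MajorityLabelModel : public ModelInterface
{
public:
  void Train(const std::vector<Sample>& samples,
             const std::vector<Label>& labels) override
  {
    assert(samples.size() == labels.size() && !labels.empty());

    // Count the occurrences of each label.
    std::map<Label, std::size_t> counts;
    for (Label label : labels)
      ++counts[label];

    // Keep the label with the highest count.
    std::size_t best = 0;
    for (const auto& entry : counts)
    {
      if (entry.second > best)
      {
        best = entry.second;
        m_Majority = entry.first;
      }
    }
  }

  Label Predict(const Sample&) const override { return m_Majority; }

private:
  Label m_Majority = 0;
};
```

Code written against ModelInterface (for example a classification application) would not need to change when the placeholder is replaced by a class wrapping one of the large scale strategies.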

Export to OpenCV format

Allow the learning model produced by the large scale RF strategies to be exported in a format compatible with OpenCV.

Expose large scale RF strategies in OTB Classification framework

The large scale random forest strategies will then be available in OTB applications.

What can be done (not clear to me for now, TBD):

  • Integrate it into the current pixel-based classification framework (TrainImagesClassifier, ImageClassifier applications...).
  • Add a set of specific applications dedicated to large scale random forest classification.

This should be done in coordination with Request for Comments-20: New sampling module for the classification framework.

Perspectives

  • Extend the methodology in the context of distributed computing (especially multi-threading of the learning process).
  • Integrate the OutMemory strategy in OTB.

When will those changes be available (target release or date)?

Target release is 5.4.

Who will be developing the proposed changes?

TBD

Community

Comments

Support

Corresponding Requests for Changes