Request for Comments-38: Streaming for OGR data

Status

  • Author: Guillaume
  • Submitted on 04.08.2017
  • Open for comments / development

Content

What changes will be made and why they would make a better Orfeo ToolBox?

This is a proposal to provide a streaming mechanism for OGR data structures, which contain geometries (points, lines, polygons, ...).

Here is the outline of the development plan.

A DataObject for geometries

In order to use the ITK pipeline streaming mechanism, the OGRFeatures need to be encapsulated in an itk::DataObject. Two existing classes are already compatible:

  • otb::ogr::DataSource: wrapper class around GDALDataset/OGRDataSource.
  • otb::GeometriesSet: wrapper class to manipulate transparently otb::ogr::DataSource and otb::ogr::Layer

Using otb::ogr::DataSource is more straightforward, but otb::GeometriesSet comes with a set of base filters that already handle the creation and processing of Layers.

Some prototyping work has been done based on otb::GeometriesSet, see streaming_ogr.

The streaming mechanism has been partially implemented in this class, by overriding some generic functions:

  • virtual void SetRequestedRegionToLargestPossibleRegion()
  • virtual void CopyInformation(const DataObject *data)
  • virtual void SetRequestedRegion( const DataObject *data )
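
As an illustration of what these three overrides are responsible for, here is a minimal standalone sketch. It does not use ITK or OTB: DataObjectModel, GeometriesSetModel and FeatureRange are hypothetical stand-ins, and the region is assumed to be a plain range of features. Only the three method signatures mirror itk::DataObject.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical region type: a contiguous run of features (not an OTB class).
struct FeatureRange
{
  long        first; // FID of the first feature
  std::size_t count; // number of features in the range
};

// Stand-in for itk::DataObject, reduced to the three methods listed above.
class DataObjectModel
{
public:
  virtual ~DataObjectModel() = default;
  virtual void SetRequestedRegionToLargestPossibleRegion() = 0;
  virtual void CopyInformation(const DataObjectModel* data) = 0;
  virtual void SetRequestedRegion(const DataObjectModel* data) = 0;
};

// Stand-in for otb::GeometriesSet: it only tracks two regions.
class GeometriesSetModel : public DataObjectModel
{
public:
  void SetLargestPossibleRegion(const FeatureRange& r) { m_Largest = r; }
  void SetRequestedRegion(const FeatureRange& r) { m_Requested = r; }
  const FeatureRange& GetLargestPossibleRegion() const { return m_Largest; }
  const FeatureRange& GetRequestedRegion() const { return m_Requested; }

  // Request everything that the data object can provide.
  void SetRequestedRegionToLargestPossibleRegion() override { m_Requested = m_Largest; }

  // Copy the meta-information (here: the largest region) from another object.
  void CopyInformation(const DataObjectModel* data) override
  {
    assert(data != nullptr);
    if (const auto* other = dynamic_cast<const GeometriesSetModel*>(data))
      m_Largest = other->m_Largest;
  }

  // Copy the requested region from another object of the same kind.
  void SetRequestedRegion(const DataObjectModel* data) override
  {
    assert(data != nullptr);
    if (const auto* other = dynamic_cast<const GeometriesSetModel*>(data))
      m_Requested = other->m_Requested;
  }

private:
  FeatureRange m_Largest{0, 0};
  FeatureRange m_Requested{0, 0};
};
```

In a real pipeline these calls are issued by the ITK process objects during the update phases; the sketch only shows which piece of state each override touches.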

A region for geometries

First step is to define a region that can apply to a set of geometries. This region can be used, like with image regions, to define:

  • the total size of the set (i.e. largest region)
  • the subset to process (i.e. requested region)
  • the subset stored (i.e. buffered region)

The geometries are stored in an OGRLayer, which behaves as a simple list of geometries. A simple type of region can be defined from a contiguous range of features:

  • the FId of the first OGRFeature
  • the number of features in the range

Let's call this region a Range region.

Since the features also have a spatial extent, a Spatial region can also be defined using rectangular boxes:

  • the coordinates of the top-left corner of the region
  • the coordinates of the lower-right corner of the region

The features "inside" this region can be defined as the features intersecting the rectangular box. The only difference from a Range region is that a geometry can be inside several disjoint regions.
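
The two region types described above could be sketched as follows. This is only a standalone model under stated assumptions: RangeRegion, SpatialRegion and Envelope are hypothetical names, a feature is reduced to its FID and its bounding box, and the intersection test works on bounding boxes rather than on exact geometries.

```cpp
#include <algorithm>
#include <cstddef>

// Bounding box of a feature (hypothetical helper type).
struct Envelope
{
  double minX, minY, maxX, maxY;
};

// Range region: FID of the first feature + number of features.
struct RangeRegion
{
  long        first;
  std::size_t count;

  bool Contains(long fid) const
  {
    return fid >= first && fid < first + static_cast<long>(count);
  }
};

// Spatial region: top-left and lower-right corners of a rectangular box.
struct SpatialRegion
{
  double ulx, uly; // top-left corner
  double lrx, lry; // lower-right corner

  // A feature is "inside" when its envelope intersects the box.
  // min/max are used so the test does not depend on the Y axis orientation.
  bool Intersects(const Envelope& e) const
  {
    const double minX = std::min(ulx, lrx);
    const double maxX = std::max(ulx, lrx);
    const double minY = std::min(uly, lry);
    const double maxY = std::max(uly, lry);
    return e.minX <= maxX && e.maxX >= minX && e.minY <= maxY && e.maxY >= minY;
  }
};
```

With two disjoint ranges, a given FID belongs to at most one of them. With two disjoint boxes, a feature whose envelope spans the gap between them intersects both, so it would be visited once per box during streaming.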

These two types of region are compatible with streaming. But there could be more ways to partition the set of geometries (by geometric shape, by field value, ...). This raises the question: what is the best implementation for these regions? Obviously, we will need a common interface for these different types of region. I can think of two implementations:

  • A tagged union: we create a single class that has several modes (Range/Spatial/...). All the functions using this region class will have switch/case sections to handle the different modes. In this case, the region class is not a template, so it can be added as a member of GeometriesSet. This is the solution used in the prototype.
  • A base class + N derived classes, one for each type of region: in this case, we have to define virtual methods that will be overridden to adapt the behavior of each type of region. With this solution, the storage in GeometriesSet is less straightforward: either store a pointer to the base class, or use a template for the region type (GeometriesSet would become GeometriesSet<class TRegion>).

As the two proposed region types (range and spatial) should cover most of the use cases, I would go with the first solution.
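
To make the first solution more concrete, here is a minimal sketch of a non-template region class with two modes and a switch/case on the mode. It is not the prototype code: GeometriesRegionModel, Envelope, MakeRange, MakeSpatial and IsInside are hypothetical names chosen for the illustration.

```cpp
#include <cstddef>

// Bounding box of a feature (hypothetical helper type).
struct Envelope
{
  double minX, minY, maxX, maxY;
};

// One class, several modes: can be stored by value as a member.
class GeometriesRegionModel
{
public:
  enum class Mode { Range, Spatial };

  static GeometriesRegionModel MakeRange(long first, std::size_t count)
  {
    GeometriesRegionModel r;
    r.m_Mode  = Mode::Range;
    r.m_First = first;
    r.m_Count = count;
    return r;
  }

  static GeometriesRegionModel MakeSpatial(const Envelope& box)
  {
    GeometriesRegionModel r;
    r.m_Mode = Mode::Spatial;
    r.m_Box  = box;
    return r;
  }

  Mode GetMode() const { return m_Mode; }

  // Every function using the region dispatches on the current mode.
  bool IsInside(long fid, const Envelope& env) const
  {
    switch (m_Mode)
    {
      case Mode::Range:
        return fid >= m_First && fid < m_First + static_cast<long>(m_Count);
      case Mode::Spatial:
        return env.minX <= m_Box.maxX && env.maxX >= m_Box.minX &&
               env.minY <= m_Box.maxY && env.maxY >= m_Box.minY;
    }
    return false; // unreachable for valid modes
  }

private:
  Mode        m_Mode  = Mode::Range;
  long        m_First = 0;
  std::size_t m_Count = 0;
  Envelope    m_Box   = {0.0, 0.0, 0.0, 0.0};
};
```

The trade-off is visible here: adding a third mode means touching every switch, whereas the base class approach would add one derived class but complicate the storage in GeometriesSet.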

Common filter for geometries

OGRDataSource objects differ from images because they don't have Reader/Writer classes. When an OGRDataSource is created on a file, it is itself a Reader/Writer. When created without a file, it is an in-memory DataObject. This means that the wrapper class otb::GeometriesSet will have to do the job of both reader and writer. The future pipeline will look like this:

                 +---------+                    +---------+
GeometriesSet1-->| Filter1 |-->GeometriesSet2-->| Filter2 |-->GeometriesSet3
                 +---------+                    +---------+

In this example:

  • GeometriesSet1 ("the reader") will have to determine the largest region. It can be detected as a "reader" because it has no m_Source.
  • GeometriesSet2 is an in-memory dataset.
  • GeometriesSet3 ("the writer") will be in charge of the streaming execution: split the largest region, propagate each block and call UpdateOutputData()
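
The job of "the writer" can be sketched with a Range region as follows. This is a standalone model under assumptions: SplitRange and StreamAll are hypothetical helpers, the block size is a number of features, and the update callback stands in for propagating the requested region upstream and calling UpdateOutputData().

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Range region: FID of the first feature + number of features.
struct RangeRegion
{
  long        first;
  std::size_t count;
};

// Split the largest region into consecutive blocks of at most blockSize features.
std::vector<RangeRegion> SplitRange(const RangeRegion& largest, std::size_t blockSize)
{
  assert(blockSize > 0);
  std::vector<RangeRegion> blocks;
  for (std::size_t done = 0; done < largest.count; done += blockSize)
  {
    RangeRegion block;
    block.first = largest.first + static_cast<long>(done);
    block.count = std::min(blockSize, largest.count - done);
    blocks.push_back(block);
  }
  return blocks;
}

// Streaming loop: one update per block. Returns the number of features covered.
template <typename TUpdate>
std::size_t StreamAll(const RangeRegion& largest, std::size_t blockSize, TUpdate update)
{
  std::size_t processed = 0;
  for (const RangeRegion& block : SplitRange(largest, blockSize))
  {
    update(block); // stand-in for: set requested region, then UpdateOutputData()
    processed += block.count;
  }
  return processed;
}
```

For example, a largest region of 25 features starting at FID 10, split with a block size of 10, gives three blocks starting at FIDs 10, 20 and 30, the last one holding 5 features.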

The base class from which to derive the filters can be otb::GeometriesToGeometriesFilter. This class contains helper functions to explore the geometries in an otb::GeometriesSet. They use the available iterators over a Layer (with begin() and end()). In order to process a subset of the dataset, we may need different iterators. I can see 2 options:

  • Create a new iterator that will take a RangeRegion as parameter. Then we may have to adapt existing code in otb::GeometriesToGeometriesFilter.
  • Use a strategy similar to otb::ogr::Layer::SetSpatialFilter(): when the spatial filter is set on a Layer, only the Features in that filter are accessible through the standard begin()/end() iterators. We could define a SetRangeFilter() in the class Layer, so that begin()/end() only run through the desired range. This would make the handling of requested regions transparent to the filter, and the behavior would be the same with SpatialRegions.

My preference goes to the second option.
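
A minimal sketch of the second option, assuming a simplified layer that keeps its features in memory, sorted by increasing FID. LayerModel, FeatureModel, ClearRangeFilter() and CountFeatures() are hypothetical; SetRangeFilter() is the method proposed above and does not exist in otb::ogr::Layer today.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

struct FeatureModel
{
  long        fid;
  std::string name;
};

class LayerModel
{
public:
  using const_iterator = std::vector<FeatureModel>::const_iterator;

  // Assumption of this sketch: features are added in increasing FID order.
  void Add(long fid, const std::string& name)
  {
    assert(m_Features.empty() || m_Features.back().fid < fid);
    m_Features.push_back(FeatureModel{fid, name});
  }

  void SetRangeFilter(long first, std::size_t count)
  {
    m_HasFilter = true;
    m_First     = first;
    m_Count     = count;
  }
  void ClearRangeFilter() { m_HasFilter = false; }

  // begin()/end() silently honor the range filter when one is set.
  const_iterator begin() const
  {
    return m_HasFilter ? LowerBound(m_First) : m_Features.begin();
  }
  const_iterator end() const
  {
    return m_HasFilter ? LowerBound(m_First + static_cast<long>(m_Count)) : m_Features.end();
  }

private:
  // First feature whose FID is not less than the given value.
  const_iterator LowerBound(long fid) const
  {
    return std::lower_bound(m_Features.begin(), m_Features.end(), fid,
                            [](const FeatureModel& f, long value) { return f.fid < value; });
  }

  std::vector<FeatureModel> m_Features;
  bool                      m_HasFilter = false;
  long                      m_First     = 0;
  std::size_t               m_Count     = 0;
};

// Filter code written against begin()/end() needs no change to be streamed.
std::size_t CountFeatures(const LayerModel& layer)
{
  return static_cast<std::size_t>(std::distance(layer.begin(), layer.end()));
}
```

The point of the design is in CountFeatures(): it never mentions a region, yet it processes only the requested range once the filter is set on the layer.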

To test and validate the pipeline, there is also a kind of "UnaryFunctorImageFilter" for geometries: otb::DefaultGeometriesToGeometriesFilter.
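
The idea of a functor-driven filter for geometries can be illustrated with the standalone sketch below. It is not the interface of otb::DefaultGeometriesToGeometriesFilter: UnaryGeometryFilterModel, PointModel and Translate are hypothetical, and a geometry is reduced to a single 2D point.

```cpp
#include <vector>

// A geometry reduced to one 2D point (hypothetical simplification).
struct PointModel
{
  double x, y;
};

// Applies a functor to every input geometry and collects the results.
template <typename TFunctor>
class UnaryGeometryFilterModel
{
public:
  explicit UnaryGeometryFilterModel(TFunctor functor) : m_Functor(functor) {}

  std::vector<PointModel> Process(const std::vector<PointModel>& input) const
  {
    std::vector<PointModel> output;
    output.reserve(input.size());
    for (const PointModel& p : input)
    {
      output.push_back(m_Functor(p));
    }
    return output;
  }

private:
  TFunctor m_Functor;
};

// Example functor: shift a point by a fixed offset.
struct Translate
{
  double dx, dy;

  PointModel operator()(const PointModel& p) const
  {
    return PointModel{p.x + dx, p.y + dy};
  }
};
```

Because the functor sees one geometry at a time, such a filter works unchanged on any subset of features, which makes it a convenient candidate for validating the streaming pipeline.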

There isn't a lot to implement in the filter base class:

Integration into existing pipelines

One side target would be to integrate this framework into existing pipelines, such as:

  • the sampling framework, which already performs streaming using a support image and spatial regions. This is difficult because the pipeline is a mix of images and vectors.
  • the segmentation framework

This streaming framework could also allow a better implementation of rasterization/vectorization applications.

When will those changes be available (target release or date)?

Probably not before release 6.4.

Who will be developing the proposed changes?

Community

Comments

List here important comments (possibly links) received.

Support

List here community members that support this RFC.

Corresponding Requests for Changes

No request for change yet.