GDAL

The Geographic Data Abstraction Library (GDAL) is, has, and will continue to be one of the most important geospatial software libraries, especially for open source software development. At its core, GDAL is a translation library which abstracts 155 raster and 95 vector formats to a single abstract raster and abstract vector data model. The library provides a variety of geoprocessing functions, bindings in several languages, and an extremely handy virtual filesystem for supporting a wide variety of data stores. For the purposes of this project I’m mostly interested in the raster data model, which looks like this:

gdal-raster-data-model

Figure 1: GDAL raster data model.

VRT

One of the more interesting formats provided by GDAL is the VRT format which is a direct XML encoding of the GDAL data model. From the documentation, the VRT format “allows a virtual GDAL dataset to be composed from other GDAL datasets with repositioning, and algorithms potentially applied as well as various kinds of metadata altered or added”. A simple VRT looks like this:

<VRTDataset rasterXSize="512" rasterYSize="512">
  <GeoTransform>440720.0, 60.0, 0.0, 3751320.0, 0.0, -60.0</GeoTransform>
  <VRTRasterBand dataType="Byte" band="1">
    <ColorInterp>Gray</ColorInterp>
    <SimpleSource>
      <SourceFilename relativeToVRT="1">utm.tif</SourceFilename>
      <SourceBand>1</SourceBand>
      <SrcRect xOff="0" yOff="0" xSize="512" ySize="512"/>
      <DstRect xOff="0" yOff="0" xSize="512" ySize="512"/>
    </SimpleSource>
  </VRTRasterBand>
</VRTDataset>

We can see that all the essential parts of the raster data model are present. The VRT must contain a reference to the file it represents, in this case utm.tif. When the VRT is examined in a GIS (such as QGIS), the reference is used to display the array values from the source file. The reference can utilize any one of GDAL’s numerous virtual file systems which makes VRT compatible with many different data stores such as cloud storage services (S3, Azure Blob Storage, Google Cloud Storage), HTTP servers, HDFS, and many more.

VRT enables “remote” geospatial operations which don’t require explicitly downloading the file from the server. A legacy workflow to reproject a remote GeoTiff to a different spatial reference requires downloading the entire file before performing the reprojection. With the VRT format, I can make a simple VRT representing the remote file and perform the reprojection on the VRT, skipping the download entirely. When the VRT is viewed by a GIS, the reprojection is performed on the fly and my raster is displayed in its updated coordinate system. The VRT is becoming even more attractive with the rise of cloud computing technologies such as Cloud Optimize GeoTiffs (COG) which encourage users to leave their data in the cloud. COG optimizes the pattern of access to GeoTiff files stored on any HTTP file server which means more efficient access to the underlying file by the VRT.

GDAL-JSON

GDAL-JSON is a python library which exposes a simple interface for manipulating the VRT spec (and thus the GDAL data model). Under the hood, the library uses xmljson to parse the standard VRT format into a JSON representation for manipulation. I personally find JSON’s dict interface much easier to work with than XML. Once parsed, the utm.tif VRT looks like:

{
 "VRTDataset": {
  "@rasterXSize": 512,
  "@rasterYSize": 512,
  "GeoTransform": {
   "$": "440720.0, 60.0, 0.0, 3751320.0, 0.0, -60.0"
  },
  "VRTRasterBand": {
   "@dataType": "Byte",
   "@band": 1,
   "ColorInterp": {
    "$": "Gray"
   },
   "SimpleSource": {
    "SourceFilename": {
     "@relativeToVRT": 1,
     "$": "utm.tif"
    },
    "SourceBand": {
     "$": 1
    },
    "SrcRect": {
     "@xOff": 0,
     "@yOff": 0,
     "@xSize": 512,
     "@ySize": 512
    },
    "DstRect": {
     "@xOff": 0,
     "@yOff": 0,
     "@xSize": 512,
     "@ySize": 512
    }
   }
  }
 }
}

The goal of the library is to JSON-serialize common GDAL operations such as gdal.Warp and gdal.Translate. Warping and translation are currently exposed through the gdaljson.VRTWarpedDataset.warp and gdaljson.VRTDataset.translate methods, respectively, and support the standard gdal.Warp and gdal.Translate inputs (see the GDAL Python docs). Although not all possible input parameters are supported, the ones that are return outputs equivalent to those produced with GDAL without actually requiring or calling any GDAL functions. The entire geoprocessing operation is performed through dictionary manipulation of the underlying JSON data structure!

import gdaljson

# Load VRT from xml string
vrt = gdaljson.VRTWarpedDataset(xml_string)

# Reproject to WGS-84
vrt.warp(outSRS=4326)

# Save to output VRT
with open('vrt_projected_4326.vrt', 'wb') as out_vrt:
  vrt.to_xml(out_vrt)

There is also a utility library containing GDAL methods for translating VRTs to any GDAL supported data format such as GeoTiff. The library, by default, is not built with the utilities to isolate the GDAL dependency.

import gdaljson_utils as utils

# Load VRT object to GDAL.Dataset
out_ds = util.to_gdal(gdaljson.VRTDataset)

# Save VRT object to GTiff
util.to_file(gdaljson.VRTDataset, 'output.tif')

Usage

My primary use of the library is to enable messaging systems in event-driven, containerized geospatial architectures. Optimized messaging requires simple and well-structured messages to function properly. Geospatial data, on the other hand, is often stored in one large file or distributed across many smaller files, both of which are harder to broadcast between containers. The VRT format solves this problem by providing a simple and well-structured message which represents the underlying data. The VRT can easily be encoded to a string or serialized and passed between containers. This ultimately enables the development of efficient cloud-native geoprocessing pipelines. An example might look like:

gdal-raster-data-model

Figure 2: Example geoprocessing pipeline built with GDAL-JSON.

The above example defines a pipeline containing Reproject, Resample, and Clip microservices. Each container receives the incoming VRT, modifies it in place, and sends it to the next container in the pipeline. Once the VRT is processed, we can either save it to a VRT file or use GDAL to convert it to a standard raster format. With GDAL-JSON, we can build an entire plug-and-play ecosystem of geoprocessing functions, allowing the user to string together different combinations of functions to satisfy their workflow. There are several benefits to this approach. Firstly, the pipeline does not create any intermediary data. Secondly, provisioning compute is easier and more efficient because the heaviest operation (saving VRT to file) is isolated to a single container. Thirdly, the entire architecture is quick to deploy because only one container has a GDAL dependency, which is notoriously hard to build and install in containerized solutions.

View the project here!