
Training Data AWS Step Function

The Step Function machine-learning-build-training-data builds the training dataset required for training crop detection models. The training data is generated from the annotated treecount data and the overlapping RGB tiles.

Lambdas

The Step Function orchestrates the following lambdas:

  • The image finder lambda GetImageTiles creates an iterator of the available tiles for training data generation.
  • The RunSimpleTile lambda records tiles into DynamoDB using the simple_handler.
  • The RunAnnotationTask lambda generates the bounding boxes of the annotations; the size of each box is determined by buffer_size (a sketch of the box construction follows this list). Results are written to DynamoDB, and the lambda is retried if it times out. Each completed iteration consumes one tile from the tile iterator. Some records are written as payloads; these are of little importance when converting the results into geometries.
  • The CollectAnnotations lambda collects the information from DynamoDB and generates a file that can be loaded into ML processing. This file contains the bounding box coordinates and the S3 keys of the tiles.
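The geometry logic lives inside RunAnnotationTask; as an illustration only, the sketch below shows how a square box can be built around an annotation point with shapely and filtered against a tile footprint. The helper name, the made-up coordinates, and the interpretation of buffer_size as a half-side are assumptions, not taken from the lambda code.

from shapely.geometry import Point, box

def annotation_bbox(x, y, buffer_size):
    # Hypothetical helper: square box of side 2 * buffer_size centred on (x, y).
    # Coordinates and buffer_size are assumed to be in meters (UTM).
    return box(x - buffer_size, y - buffer_size, x + buffer_size, y + buffer_size)

# Example: a 0.4 m buffer gives a 0.8 m x 0.8 m box around a tree point.
tree = Point(500123.4, 5701234.5)            # made-up UTM coordinates
bbox = annotation_bbox(tree.x, tree.y, buffer_size=0.4)

# edge_cases switches the spatial predicate (see the invocation keys below):
# contains keeps only boxes fully inside the tile, intersects also keeps
# boxes that cross the tile edge.
tile_footprint = box(500000.0, 5701000.0, 500500.0, 5701500.0)
edge_cases = True
keep = tile_footprint.intersects(bbox) if edge_cases else tile_footprint.contains(bbox)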

Invocation description

The following keys can be used to invoke the Step Function (an invocation sketch follows the list):

  • job_id str - job id reference.
  • prefix str - prefix to the RGB data.
  • buffer_size float - size of the square box in meters; for example, for a 25 cm² box use a 2.5 cm buffer (0.025).
  • simple bool - run the RunSimpleTile lambda to record the tiles in DynamoDB; a simplified mode that keeps the original state of the tile.
  • culture str - specify the name of the label. Defaults to Tree.
  • edge_cases bool - use an intersects check instead of a contains check.
  • output_format str - set the output format.
  • output_key str - output directory to write the annotations to. Preferably this is the same folder as the RGB tiles. Defaults to writing to the metadata folder.
  • points str - prefix to the treecount data; use the full path to the shapefile to access larger datasets. Can also be defined as an AuroraDB table using workspace:layername:date.
  • resolution float | str - resampling resolution in meters.
  • tile_size int - output tile dimensions in pixels.
  • upload bool - store the new tiles in a bucket.
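The Step Function can be started with any of these keys as the execution input. A minimal sketch using boto3 is shown below; the region and state machine ARN are placeholders and must be replaced with the real values for the account.

import json

import boto3

sfn = boto3.client("stepfunctions", region_name="eu-west-1")  # placeholder region

payload = {
    "job_id": "20240919130536-1337-fb3b1c868c4f4fb4bce96fbcc3fe3e9d",
    "prefix": "path/to/rgb/20200202/",
    "points": "workspace:treecount:20240902",
    "buffer_size": 0.4,
    "resolution": 0.1,
    "tile_size": 500,
    "culture": "Tree",
    "edge_cases": True,
    "simple": False,
    "output_key": "test/live_test/",
    "output_format": "GTiff",
    "upload": True,
}

response = sfn.start_execution(
    # Placeholder ARN: substitute the real account id and region.
    stateMachineArn="arn:aws:states:eu-west-1:123456789012:stateMachine:machine-learning-build-training-data",
    name=payload["job_id"],
    input=json.dumps(payload),
)
print(response["executionArn"])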

Example

A file example:

{
  "simple": false,
  "culture": "Tree",
  "edge_cases": true,
  "job_id": "20240919130536-1337-fb3b1c868c4f4fb4bce96fbcc3fe3e9d",
  "prefix": "path/to/rgb/20200202/",
  "buffer_size": 0.4,
  "points": "workspace:treecount:20240902",
  "resolution": 0.1,
  "tile_size": 500,
  "output_key": "test/live_test/",
  "output_format": "GTiff",
  "upload": true
}

An AuroraDB example:

{
  "simple": true,
  "culture": "Tree",
  "edge_cases": true,
  "job_id": "20240919130536-1337-fb3b1c868c4f4fb4bce96fbcc3fe3e9d",
  "prefix": "path/to/rgb/20200202/",
  "buffer_size": 0.3,
  "points": "workspace:treecount:20240902",
  "resolution": "source",
  "tile_size": null,
  "output_key": "test/live_test/",
  "output_format": "GTiff",
  "upload": true
}

WARNING: Use a UTM projection! Resolution and buffer size are expressed in meters, so do not ingest data in a spherical (geographic) coordinate system; processing will fail.
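A simple sanity check before invoking the Step Function is to confirm that a sample tile uses a projected (metric) CRS, for example with rasterio; the tile path below is a placeholder.

import rasterio

# Placeholder path to one of the RGB tiles under the "prefix" key.
sample_tile = "path/to/rgb/20200202/tile_0001.tif"

with rasterio.open(sample_tile) as src:
    crs = src.crs
    # Projected CRSs such as UTM zones use linear units (meters); geographic
    # CRSs such as EPSG:4326 use degrees and must not be ingested.
    if crs is None or not crs.is_projected:
        raise ValueError(f"Tile is not in a projected CRS: {crs}")
    print(f"OK, projected CRS: EPSG:{crs.to_epsg()}")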