Riverscapes Metadata Standards
Riverscapes Projects and the Riverscapes Data Exchange have several ways of describing their data, i.e. of providing metadata. See, for example, the Riverscapes documentation ecosystem.
To provide a structured way of sharing metadata between the tools that produce data and the consumers of that data, we have developed a new Riverscapes Metadata schema and protocol. Consumers may be other tools, or users working with the data in one of the Riverscapes Viewers, a desktop GIS or our reports interface. The schema is simple and flexible, and can describe the wide variety of data used in Riverscapes projects and reports.
Source of Truth - layer_definitions.json
We take a 'metadata as code' approach: tools that generate or manipulate data are expected to define the metadata for that data in a text file stored in the source code repository. This file, usually named layer_definitions.json, follows a prescribed format, described below. Because every tool follows the same format, we can combine and work with all definitions in the same way.
The Metadata Schema Rules (json-schema)
The format of the layer definitions is published as a JSON Schema, publicly available at xml.riverscapes.net/riverscapes_metadata/schema/layer_definitions.schema.json.
Hierarchical Organization
The Riverscapes metadata has the following namespace organization:
- The Schema Authority - the code repository in which the tool's code lives
- The tool (usually a Python package). We refer to this as the Tool schema name
- The Tool schema version. If the structure of the data a tool produces changes significantly - and this would include a change in the units of measure reported - then a new version can be published.
- The data layer. See What is a layer? below.
- Columns within the layer.
Within each level all names must be unique. For example, a tool can't have two layers with the same id; a layer can't have two columns with the same name.
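The uniqueness rule can be checked mechanically. The following is a minimal sketch only: the key names layers, layer_id, columns and name are assumptions made for illustration and are not taken from the published schema, and the helper functions are hypothetical.

```python
from collections import Counter

# Hypothetical definitions document. The key names ("layers", "layer_id",
# "columns", "name") are assumptions for illustration, not the real schema.
definitions = {
    "layers": [
        {"layer_id": "dgo", "columns": [{"name": "length"}, {"name": "length"}]},
        {"layer_id": "igo", "columns": [{"name": "slope"}]},
        {"layer_id": "dgo", "columns": []},
    ]
}


def find_duplicates(names):
    """Return the sorted names that occur more than once."""
    return sorted(n for n, c in Counter(names).items() if c > 1)


def check_unique_names(defs):
    """Collect uniqueness violations at the layer and column levels."""
    problems = []
    for dup in find_duplicates(layer["layer_id"] for layer in defs["layers"]):
        problems.append(f"duplicate layer id: {dup}")
    for layer in defs["layers"]:
        column_names = (c["name"] for c in layer.get("columns", []))
        for dup in find_duplicates(column_names):
            problems.append(f"duplicate column in {layer['layer_id']}: {dup}")
    return problems


problems = check_unique_names(definitions)
```

A check like this could run in continuous integration next to JSON Schema validation, since uniqueness across nested levels is awkward to express in a schema alone.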
What is a 'layer'?
A layer can be essentially any data container that fits the rules, but it will typically be a database table or spatial layer. Examples include a GeoPackage spatial table or spatial view, a Shapefile, a table or view in any database, or a raster file such as a GeoTIFF.
For backwards compatibility with layer lists in riverscapes project.rs.xml, we allow 'containers' such as GeoPackages or Databases to be considered layers. This allows a reference to the container to exist absent any reference to its contents.
Except for container types, a layer contains structured data (columns).
Column-level metadata
At the column level, there are several valuable metadata items available.
For a raster layer, individual bands can be described as columns.
Data Units
A critical piece of information for data consumers is the unit of measure or data unit used.
We are using the python Pint library for unit handling. Units in the data_unit field must validate using Pint and the comprehensive list of units distributed with that library.
Additionally, we accept NA as a unit meaning "do not parse this as a Quantity". This is appropriate for most text fields, or numeric fields such as ID fields, a telephone number, etc.
This is separate from dimensionless, which is a valid Pint Unit, and should be used for Quantities that have no unit. A ratio such as "metres per metre" or "m/m" will evaluate to dimensionless.
Another valid Pint unit is count.
Caution: units are case-sensitive
Units are case-sensitive. For example, mm, Mm, mM and MM are four different, valid units! Likewise, na (nanoyear) is not equivalent to NA (Not Applicable).
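A minimal sketch of how a data_unit value might be validated under these rules. The function name is_valid_data_unit and the helper _registry are hypothetical; pint.UnitRegistry and the registry's Unit constructor are the Pint calls assumed here, and Pint must be installed for anything other than the NA shortcut.

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def _registry():
    """Build the Pint registry once; imported lazily so NA needs no Pint."""
    import pint
    return pint.UnitRegistry()


def is_valid_data_unit(unit: str) -> bool:
    """Return True if `unit` is the NA sentinel or parses as a Pint unit."""
    if unit == "NA":  # exact, case-sensitive match: "do not parse as a Quantity"
        return True
    try:
        _registry().Unit(unit)  # e.g. "m^2", "dimensionless", "count"
    except Exception:  # Pint raises several error types for bad unit strings
        return False
    return True
```

Because the comparison with NA is an exact string match, a lower-case na falls through to Pint and is parsed as a unit rather than treated as Not Applicable.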
Conventions and commonly used units in Riverscapes
The Pint parsing engine is powerful and will accept many variations for units, including abbreviations, plurals, commonly used alternate spellings and modifiers. For example, all of the following are valid and equivalent units!
m^2, m*m, square meter, m ** 2, m², metre * meter, metres^2
However, we recommend sticking with the following commonly used format conventions:
- For non-quantities (Not Applicable): NA
- For ratios and other quantities without a unit: dimensionless
- For counts: count
- For standard SI units, use the standard abbreviation, e.g. m or km
- For non-SI units, use the full name, e.g. mile
- For area (square) units: m^2
- For volume (cubic) units: m^3
- For reciprocal units (e.g. beaver dams expressed as 'per mile'): mile^-1 (or count/mile if appropriate)
Unit Systems
The metric (SI) system is preferred for recording and storing scientific and commercial data globally, including in the United States. However, data collected in U.S. customary units should be stored in those units. There is no need to store data in multiple unit systems or to convert between units, because our system converts on output or reporting to accommodate user preferences.
Data Types (dtype)
The purpose of this element is to declare a data type. When data are transmitted using a format that does not have data types, such as comma separated values (CSV) files, this allows proper 'rehydration' to the correct data type when loaded into a system that does (e.g. Python, Database/SQL systems including GeoPackages).
The dtype element value must be one of the enumerated list of logical data types:
- INTEGER - whole numbers, i.e. numbers without a decimal place. Typically counts. e.g. 410
- FLOAT - floating point numbers, i.e. numbers that may have a decimal place, e.g. 12.75
- STRING - text or character values of any length, e.g. RED RIVER
- BOOLEAN - 0 or 1, true or false values
- DECIMAL - specialized datatype for fixed precision values, often used in accounting
- GEOMETRY - geo-spatial values
- STRUCTURED - value is a structured combination of other individual data types. Could be a list, array, map, json, etc.
- BINARY - e.g. images stored in a database column
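As an illustration of 'rehydration', the sketch below casts CSV text back to typed values using declared dtypes. The CONVERTERS table, the rehydrate function and the sample column names are hypothetical, and only a few of the logical types are covered.

```python
import csv
import io

# Converters for a few logical dtypes; a sketch, not a complete mapping.
CONVERTERS = {
    "INTEGER": int,
    "FLOAT": float,
    "STRING": str,
    "BOOLEAN": lambda s: s.strip().lower() in ("1", "true"),
}


def rehydrate(csv_text, column_dtypes):
    """Parse CSV text and cast each column according to its declared dtype."""
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        rows.append({k: CONVERTERS[column_dtypes[k]](v) for k, v in raw.items()})
    return rows


# Hypothetical sample data and column definitions.
csv_text = "name,dam_count,slope,perennial\nRED RIVER,410,12.75,1\n"
dtypes = {
    "name": "STRING",
    "dam_count": "INTEGER",
    "slope": "FLOAT",
    "perennial": "BOOLEAN",
}
rows = rehydrate(csv_text, dtypes)
```

Without the declared dtypes every value would arrive as a string, so 410 and "410" would be indistinguishable to the consumer.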
dtype_parameters
To provide additional detail about the data type, the element dtype_parameters can be used. The expected implementation is a JSON dictionary (object) of key:value pairs. Examples:
- for INTEGER: {"bit_depth":8}
- for GEOMETRY: {"srid":4326, "geometry_type":"POINT"}
- for DECIMAL: {"precision":10, "scale":2}
- for STRUCTURED: {"container":"list", "item_dtype":"STRING"}
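As one possible reading of the DECIMAL parameters, the sketch below treats scale as the number of digits after the decimal point and precision as the maximum total number of digits, which is the usual SQL convention but is an assumption here. The function to_decimal is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP


def to_decimal(text, dtype_parameters):
    """Cast text to a Decimal with the declared scale, checking precision."""
    scale = dtype_parameters["scale"]          # digits after the decimal point
    precision = dtype_parameters["precision"]  # maximum total digits (assumed)
    quantum = Decimal(1).scaleb(-scale)        # scale=2 gives Decimal("0.01")
    value = Decimal(text).quantize(quantum, rounding=ROUND_HALF_UP)
    if len(value.as_tuple().digits) > precision:
        raise ValueError(f"{text} exceeds precision {precision}")
    return value


amount = to_decimal("1234.567", {"precision": 10, "scale": 2})
```

Here amount holds 1234.57 as an exact decimal value rather than a binary floating point approximation.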
Documenting Nested & Structured Data
We sometimes encounter data that isn't purely "flat". Examples include JSON-based data-sources, nested structures in Parquet, or specialized types in Athena (Trino) like STRUCT, ARRAY, and MAP.
To ensure these are accurately described for data consumers, we can use the STRUCTURED logical type combined with a clear naming convention. This provides a human-readable description of the data that covers both its raw, semi-structured form in the source JSON and its strictly typed form in a database system.
Mapping Nested Elements (Dot Notation)
To document nested sub-elements, use dot notation in the name field. This allows consumers to understand the hierarchy of the data without requiring the metadata repository itself to be nested.
Example: Digital Elevation Model (DEM) Bins
Consider the following structure from the rs_context_huc10 dataset.
Source JSON Data:
"dem_bins": {
"min": 282.80,
"bin_size": 100,
"bins": [
{"bin": 200, "cell_count": 1512621},
{"bin": 300, "cell_count": 3746684}
]
}
Athena DDL Representation:
dem_bins struct<
min:double,
bin_size:int,
bins:array<struct<bin:int, cell_count:bigint>>
>
Riverscapes Metadata Mapping:
| name | friendly_name | data_unit | dtype | dtype_parameters |
|---|---|---|---|---|
| dem_bins | Data from DEM | NA | STRUCTURED | {"container":"record"} |
| dem_bins.min | Lowest Elevation | m | FLOAT | |
| dem_bins.bin_size | Bin Size | m | INTEGER | |
| dem_bins.bins | Elevation bins | NA | STRUCTURED | {"container":"list", "item_dtype": "STRUCTURED"} |
| dem_bins.bins.bin | Elevation Bin Value | m | INTEGER | |
| dem_bins.bins.cell_count | Cell Count | count | INTEGER | {"bit_depth":64} |
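The names in the mapping can be derived from the source JSON by walking the structure. This is a minimal sketch with a hypothetical helper, dotted_names; it describes a list once, using its first item and no index, which matches the table above but is an assumption about how other data should be handled.

```python
def dotted_names(name, value):
    """Return the dot-notation name of `value` and of every nested member."""
    names = [name]
    # A list is described once, via its first item, without an index.
    sample = value[0] if isinstance(value, list) and value else value
    if isinstance(sample, dict):
        for key, child in sample.items():
            names.extend(dotted_names(f"{name}.{key}", child))
    return names


dem_bins = {
    "min": 282.80,
    "bin_size": 100,
    "bins": [
        {"bin": 200, "cell_count": 1512621},
        {"bin": 300, "cell_count": 3746684},
    ],
}
names = dotted_names("dem_bins", dem_bins)
```

The resulting names give one metadata row per element, so the metadata itself stays flat even though the data is nested.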
Handling Specialized Containers
Use the dtype_parameters field to provide the "recipe" for rehydrating the structure in consumer tools.
The ARRAY (List)
An ARRAY type contains multiple items of the same data type.
- dtype: STRUCTURED
- dtype_parameters: {"container": "list", "item_dtype": "STRING"}
The MAP (Key-Value Pairs)
A MAP is a collection of unique keys associated with values.
- dtype: STRUCTURED
- dtype_parameters: {"container": "map", "key_dtype": "STRING", "val_dtype": "FLOAT"}
- Athena Mapping: MAP<VARCHAR, DOUBLE>
Documentation Hint: Use column[key] and column[value] to describe the components of the map specifically.
The STRUCT (Record)
A STRUCT is a fixed set of named fields (essentially a row within a cell).
- dtype: STRUCTURED
- dtype_parameters: {"container": "record"}
- Athena Mapping: ROW(...) or STRUCT<...>
Documentation Hint: Use Dot Notation (e.g., parent.child) for every member of the record.
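To show how a consumer tool might use this "recipe", the sketch below turns a dtype and its dtype_parameters into an Athena-style type string. The function athena_type and the SCALARS table are hypothetical; the scalar mappings are assumptions inferred from the examples in this document (STRING to varchar, FLOAT to double, 64-bit INTEGER to bigint). Records are deliberately not expanded, because their fields come from the dot-notation child columns.

```python
# Assumed scalar mappings, inferred from the examples in this document.
SCALARS = {"STRING": "varchar", "FLOAT": "double", "INTEGER": "int", "BOOLEAN": "boolean"}


def athena_type(dtype, params=None):
    """Build an Athena-style type string from dtype and dtype_parameters."""
    params = params or {}
    if dtype != "STRUCTURED":
        if dtype == "INTEGER" and params.get("bit_depth") == 64:
            return "bigint"
        return SCALARS[dtype]
    container = params.get("container")
    if container == "list":
        return f"array<{athena_type(params['item_dtype'])}>"
    if container == "map":
        key = athena_type(params["key_dtype"])
        val = athena_type(params["val_dtype"])
        return f"map<{key},{val}>"
    # Records need their child columns, which this sketch does not receive.
    raise ValueError(f"cannot build a type for container {container!r}")


list_type = athena_type("STRUCTURED", {"container": "list", "item_dtype": "STRING"})
map_type = athena_type(
    "STRUCTURED", {"container": "map", "key_dtype": "STRING", "val_dtype": "FLOAT"}
)
```

A fuller version would accept the child column definitions as well, so that records and lists of records could be expanded recursively.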
Guidelines for Data Producers
Flatten where possible: If a nested value is high-priority for filtering (like huc_id), consider promoting it to a top-level column.
Be Specific with Units: Even if a parent column is STRUCTURED with a unit of NA, the nested children (like dem_bins.min) should have their specific data_unit defined.
Recursive Definitions: If an array contains records, define the array with item_dtype: STRUCTURED, and then define the record's children using the dot notation.
If a VARCHAR column contains structured data (such as XML or JSON) it could be described using XPath (for XML) or JSONPath (for JSON) syntax.
Providing a Preferred Format
The preferred_format attribute is intended for the data producer to provide a suggested formatting for the values. We use Python's Format Specification Mini-Language.
Format String Reference Table
| Format String | Interpretation | Example Quantity | Output (No Units) | Output (With Units) |
|---|---|---|---|---|
| {:,.1f} | Fixed-point with 1 decimal digit and thousands separator | 248.745 kilometer | 248.7 | 248.7 km |
| {:,.1f} | Fixed-point with 1 decimal digit and thousands separator | 1234567.89 meter | 1,234,567.9 | 1,234,567.9 m |
| {:.2f} | Standard fixed-point, 2 decimals | 3.14159 meter | 3.14 | 3.14 m |
| {:.2f} | Standard fixed-point, 2 decimals | 0.009 kilometer | 0.01 | 0.01 km |
| {:.2f} | Standard fixed-point, 2 decimals | 12.321 count / kilometer ** 2 | 12.32 | 12.32 /km² |
| {:.3g} | General format (3 significant digits, drops trailing zeros) | 12345 meter | 1.23e+04 | 1.23e+04 m |
| {:.3g} | General format (3 significant digits, drops trailing zeros) | 0.00012345 meter | 0.000123 | 0.000123 m |
| {value:.0f} projects | Integer with injected static text suffix | 19 | 19 projects | 19 projects |
| {value:.0f} projects | Integer with injected static text suffix | 42 | 42 projects | 42 projects |
| {value:.0f} projects | Integer with injected static text suffix | 1322 count | 1322 projects | 1322 projects |
| {value:.1f} | Explicit value placeholder (same as {:.1f}) | 87.31 percent | 87.3 | 87.3 % |
| {value:.1f} | Explicit value placeholder (same as {:.1f}) | 12.55 degree | 12.6 | 12.6 deg |
| None | No format (Uses format_scalar defaults, e.g. decimals=0) | 28.64 meter | 29 | 29 m |
| None | No format (Uses format_scalar defaults, e.g. decimals=0) | 100.123 foot | 31 | 31 m |
| {:.3~#P} | Pint specific: Compact notation (auto-scaling units) | 12000000 meter ** 2 | 12.0 km² | 12.0 km² |
| {:.3~#P} | Pint specific: Compact notation (auto-scaling units) | 0.005 kilometer | 5.0 m | 5.0 m |
| {:.3~P} | Pint specific: Pretty notation w/o autoscale | 12000000 meter ** 2 | 12,000,000 | 12,000,000 m² |
| {:.3~P} | Pint specific: Pretty notation w/o autoscale | 0.005 kilometer | 0.005 km | 0.005 km |
| {:.3~P} | Pint specific: Pretty notation w/o autoscale | 0.5515761984704346 / kilometer | 0.552 1/km | 0.552 1/km |
| {:.3~P} | Pint specific: Pretty notation w/o autoscale | 0.5375628311370737 / kilometer | 0.538 1/km | 0.538 1/km |
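The plain (non-Pint) format strings can be tried with Python's built-in str.format. This is a minimal sketch: the helper apply_preferred_format is hypothetical and is not the format_scalar function mentioned in the table. It passes the number both positionally and as the keyword value, so both placeholder styles work; the Pint-specific codes (those containing ~) need a Pint Quantity and are not handled here.

```python
def apply_preferred_format(fmt, value):
    """Format a bare number with a preferred_format string.

    The number is supplied both positionally and as `value`, so "{:.2f}"
    and "{value:.2f}" style strings are both accepted. str.format ignores
    whichever argument the format string does not use.
    """
    return fmt.format(value, value=value)


examples = [
    apply_preferred_format("{:,.1f}", 1234567.89),
    apply_preferred_format("{:.2f}", 3.14159),
    apply_preferred_format("{:.3g}", 12345),
    apply_preferred_format("{value:.0f} projects", 19),
]
```

These calls reproduce the "Output (No Units)" column for the corresponding rows; appending a unit symbol would be a separate step in a real formatter.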