Riverscapes Metadata Standards

Riverscapes Projects and the Riverscapes Data Exchange have several ways of describing the data, i.e. of providing metadata. See, for example, the Riverscapes documentation ecosystem.

To provide a structured way of sharing metadata between the tools that produce data and the consumers of that data, we have developed a new Riverscapes Metadata schema and protocol. Consumers may be other tools, or users working with the data in one of the Riverscapes Viewers, a desktop GIS system or our reports interface. The schema is simple and flexible, and can describe the wide variety of data used in Riverscapes projects and reports.

Source of Truth - layer_definitions.json

We take a 'metadata as code' approach and expect tools that generate or manipulate data to define the metadata for that data in a text file stored in the source code repository. This file, usually named layer_definitions.json, follows a prescribed format, described below. By following that format we can combine and work with all definitions in the same way.

The Metadata Schema Rules (json-schema)

The format of the layer definitions is published as a json schema, publicly available at xml.riverscapes.net/riverscapes_metadata/schema/layer_definitions.schema.json.

Hierarchical Organization

The Riverscapes metadata has the following namespace organization:

  1. The Schema Authority - the code repository within which the code lives
  2. The tool (usually a python package). We refer to this as the Tool schema name
  3. The Tool schema version. If the structure of the data a tool produces changes significantly - and this would include a change in the units of measure reported - then a new version can be published.
  4. The data layer. See What is a layer? below.
  5. Columns within the layer.

Within each level all names must be unique. For example, a tool can't have two layers with the same id; a layer can't have two columns with the same name.
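
The uniqueness rule can be checked mechanically. The following is a minimal sketch only: the dictionary layout and the key names `layers`, `layer_id`, `columns` and `name` are assumptions made for illustration, and the published JSON schema remains the authority on the real structure.

```python
from collections import Counter


def find_duplicates(names):
    """Return the sorted list of names that occur more than once."""
    counts = Counter(names)
    return sorted(name for name, n in counts.items() if n > 1)


# Hypothetical, simplified tool definition (key names are illustrative only).
tool = {
    "layers": [
        {"layer_id": "dgos", "columns": [{"name": "length"}, {"name": "length"}]},
        {"layer_id": "igos", "columns": [{"name": "slope"}]},
    ]
}

# Layer ids must be unique within the tool.
duplicate_layers = find_duplicates(layer["layer_id"] for layer in tool["layers"])

# Column names must be unique within each layer.
duplicate_columns = {
    layer["layer_id"]: find_duplicates(col["name"] for col in layer["columns"])
    for layer in tool["layers"]
}

print(duplicate_layers)
print(duplicate_columns)
```

Note that the same column name may appear in two different layers; only repeats within one level of the hierarchy are flagged.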

What is a 'layer'?

A layer can be essentially any data container that fits the rules, but it will typically be a database table or spatial layer. Examples include a GeoPackage spatial table or spatial view, a shapefile, a table or view in any database, or a raster file such as a GeoTIFF.

For backwards compatibility with layer lists in riverscapes project.rs.xml, we allow 'containers' such as GeoPackages or Databases to be considered layers. This allows a reference to the container to exist absent any reference to its contents.

Except for container types, a layer contains structured data (columns).

Column-level metadata

At the column level, there are several valuable metadata items available.

For a raster layer, individual bands can be described as columns.

Data Units

A critical piece of information for data consumers is the unit of measure or data unit used.

We use the Python Pint library for unit handling. Units in the data_unit field must validate using Pint and the comprehensive list of units distributed with that library.

Additionally, we accept NA as a unit meaning "do not parse this as a Quantity". This is appropriate for most text fields, or numeric fields such as ID fields, a telephone number, etc.

This is separate from dimensionless, which is a valid Pint Unit, and should be used for Quantities that have no unit. A ratio such as "metres per metre" or "m/m" will evaluate to dimensionless.

Another valid Pint unit is count.

Caution: units are case-sensitive

Units are case-sensitive. For example, mm, Mm, mM and MM are four different, valid units! Likewise, na (nanoyear) is not equivalent to NA (Not Applicable).

Conventions and commonly used units in Riverscapes

The Pint parsing engine is powerful and will accept many variations for units, including abbreviations, plurals, commonly used alternate spellings and modifiers. For example, all of the following are valid and equivalent units!

  • m^2, m*m, square meter, m ** 2, metre * meter, metres^2

However, we recommend sticking with the following commonly used format conventions:

  • For non-quantities (Not Applicable): NA
  • For ratios and other quantities without a unit: dimensionless
  • For counts: count
  • For standard SI units - use the standard abbreviation: e.g. m or km
  • For non-SI units - use the full name e.g. mile
  • For area (square) units: m^2
  • For volume (cubic) units: m^3
  • For reciprocal units (e.g. beaver dams expressed as 'per mile') : mile^-1 (or count/mile if appropriate)

Unit Systems

While the metric (SI) system is preferred for recording and storing scientific and commercial data globally, including in the United States, data collected in U.S. customary units should be stored in those units. There is no need to store data in multiple unit systems or to convert between units: our system converts on output or reporting to accommodate user preferences.

Data Types (dtype)

The purpose of this element is to declare a data type. When data are transmitted using a format that does not have data types, such as comma separated values (CSV) files, this allows proper 'rehydration' to the correct data type when loaded into a system that does (e.g. Python, Database/SQL systems including GeoPackages).

The dtype element value must be one of the enumerated list of logical data types:

  • INTEGER - i.e. numbers without a decimal place. Typically counts. e.g. 410
  • FLOAT - floating point numbers, i.e. numbers that may have a decimal place e.g. 12.75
  • STRING - text or character values of any length, e.g. RED RIVER
  • BOOLEAN - 0 or 1, true or false values
  • DECIMAL - specialized datatype for fixed precision values, often used in accounting
  • GEOMETRY - geo-spatial values
  • STRUCTURED - value is a structured combination of other individual data types. Could be a list, array, map, json, etc.
  • BINARY - e.g. images stored in a database column

dtype_parameters

To provide additional detail about the data type, the element dtype_parameters can be used. The expected implementation is a JSON dictionary of key:value pairs. Examples:

  • for INTEGER: {"bit_depth":8}
  • for GEOMETRY: {"srid":4326, "geometry_type":"POINT"}
  • for DECIMAL: {"precision":10, "scale":2}
  • for STRUCTURED: {"container":"list", "item_dtype":"STRING"}

Documenting Nested & Structured Data

We sometimes encounter data that isn't purely "flat". Examples include JSON-based data-sources, nested structures in Parquet, or specialized types in Athena (Trino) like STRUCT, ARRAY, and MAP.

To ensure these are accurately described for data consumers, we can use the STRUCTURED logical type combined with a clear naming convention. This provides a human-readable description of the data that bridges both its raw, semi-structured form in the source JSON and its strictly-typed form in a database system.

Mapping Nested Elements (Dot Notation)

To document nested sub-elements, use dot notation in the name field. This allows consumers to understand the hierarchy of the data without requiring the metadata repository itself to be nested.

Example: Digital Elevation Model (DEM) Bins

Given the following structure from the rs_context_huc10 dataset:

Source JSON Data:

    "dem_bins": {
      "min": 282.80,
      "bin_size": 100,
      "bins": [
        {"bin": 200, "cell_count": 1512621},
        {"bin": 300, "cell_count": 3746684}
      ]
    }

Athena DDL Representation:

    dem_bins struct<
      min:double,
      bin_size:int,
      bins:array<struct<bin:int, cell_count:bigint>>
    >

Riverscapes Metadata Mapping:

| name | friendly_name | data_unit | dtype | dtype_parameters |
| --- | --- | --- | --- | --- |
| dem_bins | Data from DEM | NA | STRUCTURED | {"container":"record"} |
| dem_bins.min | Lowest Elevation | m | FLOAT | |
| dem_bins.bin_size | Bin Size | m | INTEGER | |
| dem_bins.bins | Elevation bins | NA | STRUCTURED | {"container":"list", "item_dtype": "STRUCTURED"} |
| dem_bins.bins.bin | Elevation Bin Value | m | INTEGER | |
| dem_bins.bins.cell_count | Cell Count | count | INTEGER | {"bit_depth":64} |
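
As a rough illustration of the dot notation, the sketch below walks a nested value and lists the names that would need a metadata entry. The helper `dotted_names` is hypothetical, and it assumes that all items of a list share the shape of the first item, so list indexes are not part of the name.

```python
def dotted_names(value, prefix):
    """List the dot-notation names needed to document a nested value."""
    names = [prefix]
    if isinstance(value, dict):
        # A record: every member becomes parent.child.
        for key, child in value.items():
            names.extend(dotted_names(child, f"{prefix}.{key}"))
    elif isinstance(value, list) and value and isinstance(value[0], dict):
        # A list of records: describe the members once, using the first item.
        for key, child in value[0].items():
            names.extend(dotted_names(child, f"{prefix}.{key}"))
    return names


dem_bins = {
    "min": 282.80,
    "bin_size": 100,
    "bins": [
        {"bin": 200, "cell_count": 1512621},
        {"bin": 300, "cell_count": 3746684},
    ],
}

for name in dotted_names(dem_bins, "dem_bins"):
    print(name)
```

For this example the output lists the same six names, in the same order, as the name column of the mapping.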

Handling Specialized Containers

Use the dtype_parameters field to provide the "recipe" for rehydrating the structure in consumer tools.

The ARRAY (List)

An ARRAY type contains multiple items of the same data type.

  • dtype: STRUCTURED
  • dtype_parameters: {"container":"list", "item_dtype":"STRING"}

The MAP (Key-Value Pairs)

A MAP is a collection of unique keys associated with values.

  • dtype: STRUCTURED
  • dtype_parameters: {"container": "map", "key_dtype": "STRING", "val_dtype": "FLOAT"}
  • Athena Mapping: MAP<VARCHAR, DOUBLE>

Documentation Hint: Use column[key] and column[value] to describe the components of the map specifically.

The STRUCT (Record)

A STRUCT is a fixed set of named fields (essentially a row within a cell).

  • dtype: STRUCTURED
  • dtype_parameters: {"container": "record"}
  • Athena Mapping: ROW(...) or STRUCT<...>

Documentation Hint: Use Dot Notation (e.g., parent.child) for every member of the record.
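
To show how dtype_parameters can act as a "recipe", the following sketch rebuilds an Athena-style type string from flat, dot-notation column definitions. It is an assumption-laden illustration: the helper names, the `columns` dictionary layout, and the mapping of logical dtypes to Athena type names (for example FLOAT to double) are all hypothetical choices, not part of the schema.

```python
# Assumed mapping from logical dtypes to Athena type names (illustrative only).
SCALAR_TYPES = {"INTEGER": "int", "FLOAT": "double", "STRING": "varchar", "BOOLEAN": "boolean"}


def scalar_type(dtype, params):
    """Map a non-structured dtype to a type name, honouring a 64-bit hint."""
    if dtype == "INTEGER" and params.get("bit_depth") == 64:
        return "bigint"
    return SCALAR_TYPES[dtype]


def direct_children(name, columns):
    """Names exactly one dot-level below the given name."""
    prefix = name + "."
    return [c for c in columns if c.startswith(prefix) and "." not in c[len(prefix):]]


def record_type(name, columns):
    """Build struct<...> from the children documented with dot notation."""
    fields = ",".join(
        f"{child.rsplit('.', 1)[1]}:{athena_type(child, columns)}"
        for child in direct_children(name, columns)
    )
    return f"struct<{fields}>"


def athena_type(name, columns):
    """Recursively build the type string for one documented column."""
    dtype, params = columns[name]
    if dtype != "STRUCTURED":
        return scalar_type(dtype, params)
    container = params["container"]
    if container == "record":
        return record_type(name, columns)
    if container == "list":
        item = params["item_dtype"]
        inner = record_type(name, columns) if item == "STRUCTURED" else scalar_type(item, {})
        return f"array<{inner}>"
    if container == "map":
        key = scalar_type(params["key_dtype"], {})
        val = scalar_type(params["val_dtype"], {})
        return f"map<{key},{val}>"
    raise ValueError(f"unknown container: {container!r}")


# name -> (dtype, dtype_parameters), taken from the dem_bins example.
columns = {
    "dem_bins": ("STRUCTURED", {"container": "record"}),
    "dem_bins.min": ("FLOAT", {}),
    "dem_bins.bin_size": ("INTEGER", {}),
    "dem_bins.bins": ("STRUCTURED", {"container": "list", "item_dtype": "STRUCTURED"}),
    "dem_bins.bins.bin": ("INTEGER", {}),
    "dem_bins.bins.cell_count": ("INTEGER", {"bit_depth": 64}),
}
print(athena_type("dem_bins", columns))
```

Apart from whitespace, the result for dem_bins matches the Athena DDL shown in the DEM bins example.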

Guidelines for Data Producers

Flatten where possible: If a nested value is high-priority for filtering (like huc_id), consider promoting it to a top-level column.

Be Specific with Units: Even if a parent column is STRUCTURED with a unit of NA, the nested children (like dem_bins.min) should have their specific data_unit defined.

Recursive Definitions: If an array contains records, define the array with item_dtype: STRUCTURED, and then define the record's children using the dot notation.

If a VARCHAR column contains structured data (such as XML or JSON) it could be described using XPath (for XML) or JSONPath (for JSON) syntax.

Providing a Preferred Format

The preferred_format attribute is intended for the data producer to provide a suggested formatting for the values. We use Python's Format Specification Mini-Language.

Format String Reference Table

| Format String | Interpretation | Example Quantity | Output (No Units) | Output (With Units) |
| --- | --- | --- | --- | --- |
| {:,.1f} | Fixed-point with 1 decimal digit and thousands separator | 248.745 kilometer | 248.7 | 248.7 km |
| {:,.1f} | Fixed-point with 1 decimal digit and thousands separator | 1234567.89 meter | 1,234,567.9 | 1,234,567.9 m |
| {:.2f} | Standard fixed-point, 2 decimals | 3.14159 meter | 3.14 | 3.14 m |
| {:.2f} | Standard fixed-point, 2 decimals | 0.009 kilometer | 0.01 | 0.01 km |
| {:.2f} | Standard fixed-point, 2 decimals | 12.321 count / kilometer ** 2 | 12.32 | 12.32 /km² |
| {:.3g} | General format (3 significant digits, drops trailing zeros) | 12345 meter | 1.23e+04 | 1.23e+04 m |
| {:.3g} | General format (3 significant digits, drops trailing zeros) | 0.00012345 meter | 0.000123 | 0.000123 m |
| {value:.0f} projects | Integer with injected static text suffix | 19 | 19 projects | 19 projects |
| {value:.0f} projects | Integer with injected static text suffix | 42 | 42 projects | 42 projects |
| {value:.0f} projects | Integer with injected static text suffix | 1322 count | 1322 projects | 1322 projects |
| {value:.1f} | Explicit value placeholder (same as {:.1f}) | 87.31 percent | 87.3 | 87.3 % |
| {value:.1f} | Explicit value placeholder (same as {:.1f}) | 12.55 degree | 12.6 | 12.6 deg |
| None | No format (Uses format_scalar defaults, e.g. decimals=0) | 28.64 meter | 29 | 29 m |
| None | No format (Uses format_scalar defaults, e.g. decimals=0) | 100.123 foot | 31 | 31 m |
| {:.3~#P} | Pint specific: Compact notation (auto-scaling units) | 12000000 meter ** 2 | 12.0 km² | 12.0 km² |
| {:.3~#P} | Pint specific: Compact notation (auto-scaling units) | 0.005 kilometer | 5.0 m | 5.0 m |
| {:.3~P} | Pint specific: Pretty notation w/o autoscale | 12000000 meter ** 2 | 12,000,000 | 12,000,000 m² |
| {:.3~P} | Pint specific: Pretty notation w/o autoscale | 0.005 kilometer | 0.005 km | 0.005 km |
| {:.3~P} | Pint specific: Pretty notation w/o autoscale | 0.5515761984704346 / kilometer | 0.552 1/km | 0.552 1/km |
| {:.3~P} | Pint specific: Pretty notation w/o autoscale | 0.5375628311370737 / kilometer | 0.538 1/km | 0.538 1/km |
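
The standard (non-Pint) format strings can be tried with plain Python numbers. This is a minimal sketch: the helper `apply_preferred_format` is hypothetical and covers only the "no units" case. The Pint-specific codes (~, # and P) require a Pint Quantity and are not handled here, nor is the None case.

```python
def apply_preferred_format(value, preferred_format):
    """Format a bare number with a preferred_format string.

    The value is passed both positionally and by keyword so that the
    anonymous placeholder ({:.1f}) and the named one ({value:.1f}) both work.
    """
    return preferred_format.format(value, value=value)


examples = [
    ("{:,.1f}", 1234567.89),
    ("{:.2f}", 3.14159),
    ("{:.3g}", 12345),
    ("{value:.0f} projects", 19),
    ("{value:.1f}", 87.31),
]
for fmt, value in examples:
    print(fmt, "->", apply_preferred_format(value, fmt))
```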