TSERT – Time Series ERT file format

This is the description of an example file format, highlighting the proposed process of describing and implementing new file formats.

Warning

This is an experimental file format subject to constant change. For now, no backward compatibility between file format versions is provided!

Name and aims of the new file format

  • Name: TSERT (name in REDA: tsert)

  • Provide a one-file format for monitoring-ERT data, including electrode positions, topography, and metadata.

  • Investigate how to apply the features of the HDF5 container format to the storage of geoelectrical data

  • Investigate how to best store metadata along with the data

  • Investigate using compression techniques to reduce the file size

Features

The TSERT file format

  • stores multiple time steps of ERT data

  • stores multiple versions (e.g., filtered/unfiltered) of each data set

  • stores electrode positions

  • stores topography information

  • stores metadata

within one HDF5 file for easy storage and distribution.

Required external python libraries

  • h5py

Structure

  • Data is stored in one HDF5 file and organized in a hierarchical tree structure

  • Each timestep is stored in its own HDF group

  • (option) We could store metadata directly in the corresponding attributes of each group, or we could write json-codified metadata into one attribute

  • The resulting layout of the file is:

    /
    /.attrs['file_format'] = 'tsert'
    /.attrs['format_version'] = 0.1
    INDEX/index <- pandas.DataFrame which holds TS_KEY<->datetime/timestep
                   info; one column: value
    ELECTRODES/
        ELECTRODE_POSITIONS <- pandas.DataFrame
    TOPOGRAPHY/
        [...]
    ERT_DATA/
        [TS_KEY].attrs['metadata'] <- metadata for this data set
    ERT_DATA/[TS_KEY]/base <- original data set
    ERT_DATA/[TS_KEY]/v1 <- filtered data set, version 1
    ERT_DATA/[TS_KEY]/v1.attrs['filters'] <- metadata for filtered data set (not implemented)
    ERT_DATA/[TS_KEY]/v2 <- filtered data set, version 2
    ERT_DATA/[TS_KEY]/v2.attrs['filters'] <- metadata for filtered data set (not implemented)
    
  • TS_KEY is an integer index; the actual timestep information (i.e., datetimes) is stored in the timestep column of the INDEX/index dataframe, with the TS_KEY values corresponding to the dataframe index.
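
As a minimal sketch of the naming scheme above, the path of a data set can be derived from its TS_KEY and version name, and the JSON option for metadata amounts to a round trip through one string attribute. The helper names ert_data_path, encode_metadata and decode_metadata are assumptions made for illustration; they are not part of REDA.

```python
import json


def ert_data_path(ts_key, version="base"):
    """Return the HDF5 path of one data set, e.g. 'ERT_DATA/3/v1'.

    Hypothetical helper for illustration, not part of REDA.
    """
    # accept only the version names used in the layout: base, v1, v2, ...
    is_filtered = version.startswith("v") and version[1:].isdigit()
    if version != "base" and not is_filtered:
        raise ValueError(f"unknown version name: {version!r}")
    # TS_KEY is an integer index into the INDEX/index dataframe
    return f"ERT_DATA/{int(ts_key)}/{version}"


def encode_metadata(metadata):
    """Serialize a (nested) metadata dict into one JSON string.

    The string could be stored in a single attribute such as
    ERT_DATA/[TS_KEY].attrs['metadata'] (the JSON option; an assumption).
    """
    return json.dumps(metadata, sort_keys=True)


def decode_metadata(text):
    """Inverse of encode_metadata: JSON string back to a dict."""
    return json.loads(text)
```

Keeping the path construction in one place would make it harder for exporter and importer to drift apart while the layout is still changing.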

Metadata

Metadata in REDA is implemented using nested dictionaries. This structure can also be saved in the HDF5 container, in the group METADATA: nested dicts are stored as subgroups, and key-value pairs are stored using the .attrs functionality of the HDF5 container.
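
The mapping from nested dictionaries to subgroups and attributes can be sketched as a short recursive function. This is an illustration under stated assumptions, not the code used in REDA: the name write_metadata is hypothetical, and the group argument is assumed to behave like an h5py.Group (or h5py.File), i.e., to offer require_group() and a dict-like attrs.

```python
def write_metadata(group, metadata):
    """Recursively store a nested dict below an HDF5 group.

    Hypothetical helper. `group` is assumed to behave like an
    h5py.Group: it provides require_group(name) and a dict-like attrs.
    Keys are assumed to be strings; leaf values need to be of a type
    that HDF5 attributes can hold, such as strings or numbers.
    """
    for key, value in metadata.items():
        if isinstance(value, dict):
            # nested dict -> subgroup of the same name
            write_metadata(group.require_group(key), value)
        else:
            # leaf entry -> attribute on the current group
            group.attrs[key] = value
```

With h5py, this could be called as write_metadata(f.require_group('METADATA'), metadata), where f is a file opened in write or append mode.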

Implementation

The TSERT file format is implemented in the following reda source files:

  • lib/reda/exporters/tsert_export.py

  • lib/reda/importers/tsert_import.py

Shortcomings

  • It is unclear whether this file format is usable for long-term storage (5+ years).

Future enhancements

  • After the format stabilizes it will be easy to extend it to complex electrical data and even to spectral induced polarization data.

  • It would be nice to extend the versioning to include the journal so one can see how a given version was created from the base version.

TODO

  • Set up a rudimentary set of tests for tsert

  • Investigate how to open the file only once and then do a full export of data, electrodes, topography, metadata, etc. At the moment we always open/close the file to accommodate different handling strategies (e.g., pandas uses PyTables, I think, and therefore cannot work with h5py…)

  • Use compression to reduce the file size

  • Add check summing to detect data corruption
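
One simple way to approach the check-summing item, independent of HDF5 itself, would be a digest of the whole file that is stored alongside it and compared before import. The sketch below uses only the Python standard library; the function names file_sha256 and is_unchanged are hypothetical, and this is an assumption about how the item could be addressed, not an implemented REDA feature.

```python
import hashlib


def file_sha256(filename, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(filename, "rb") as fid:
        # read fixed-size chunks so large files need not fit in memory
        for chunk in iter(lambda: fid.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(filename, expected_hexdigest):
    """Compare the current file digest with a previously stored one."""
    return file_sha256(filename) == expected_hexdigest
```

A file-level digest can only report that something changed; it cannot tell which time step or data set is affected.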