Metadata is ‘data about data’: it defines and describes the data. Good metadata is an intrinsic element of the FAIR principles because it makes data discoverable and allows others to interpret, validate, re-use and cite it correctly. Without adequate metadata, reuse and reproducibility become difficult or impossible. Unlike general documentation, metadata should be machine-readable.

Metadata serves three main purposes:

  1. It explains the provenance of the data, or how, when and where the data was created and by whom. This is necessary for others to know where the data came from.
  2. It helps other users to understand (the context of) your data. It summarises basic information about the data, which facilitates its reuse.
  3. It increases the findability of the data, for example by including a unique persistent identifier such as a DOI that is assigned to the dataset. It may also contain keywords that can be indexed by search engines.

Metadata will need to be created if the researcher plans to publish or share the data or to archive the data in a repository such as Research Data JCU.

The following is an example of the information in a generic metadata file. However, there may be metadata standards that are specific to your domain. Enriching your data with additional domain-specific metadata will make it more useful and findable for others.

Metadata field name Example value
Dataset DOI  
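
Because metadata should be machine-readable, a generic record like the one above is often stored in a structured format such as JSON. The following is a minimal sketch only: every field name and value is a hypothetical placeholder rather than a prescribed schema, and the helper names `to_json` and `missing_fields` are invented for illustration.

```python
import json

# Hypothetical metadata record. Field names and values are illustrative
# placeholders, not a required schema and not a real dataset.
metadata = {
    "title": "Example water temperature survey",
    "creator": "A. Researcher",
    "date_created": "2023-01-15",
    "description": "Hourly water temperature readings from three example sites.",
    "keywords": ["water temperature", "survey", "example"],
    "identifier": None,  # a DOI would go here once a repository assigns one
    "licence": "CC BY 4.0",
}


def to_json(record):
    """Serialise a metadata record to indented, machine-readable JSON text."""
    return json.dumps(record, indent=2, sort_keys=True)


def missing_fields(record, required=("title", "creator", "date_created", "description")):
    """Return the required fields that are absent or empty in the record."""
    return [name for name in required if not record.get(name)]


metadata_json = to_json(metadata)
print(metadata_json)
print("Missing required fields:", missing_fields(metadata))
```

The list of required fields here is an assumption; in practice it would come from the repository or from a domain-specific metadata standard.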

Metadata includes data-level documentation as well as study-level documentation; it should not just describe the project or a publication.

Study-level documentation

Study-level documentation for data is often included in Research Data Management Plans (RDMPs) and provides a high-level overview and context for the data. It is an important component of the metadata and is key to enabling secondary users to make informed use of shared data. Some systems (like Research Data JCU) integrate RDMPs and metadata collection so that researchers don't have to re-enter this information.

Data-level documentation

While it may be tempting to stop at the study level, metadata also needs to include data-level documentation, as this is critical for validating, reproducing and re-using data. It could include:

  • Names, labels and descriptions for variables
  • Definitions of codes and classification schemes
  • Definitions of specialised terminology or acronyms
  • Codes and reasons for missing values (refer to Data Wrangling)
  • Code and scripts used to derive data after collection (simple derivations such as grouping by age levels can be explained in variable and value labels)
  • Weighting and grossing variables created

Data-level documentation may also be embedded in the data itself.
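
The items listed above can be captured in a simple codebook that software can check against the data. The following is a rough sketch under stated assumptions: the variable names, codes, missing-value code `-99` and the sample rows are all hypothetical, and `check_rows` and `missing_reason` are illustrative helper names rather than part of any standard tool.

```python
import csv
import io

# Hypothetical codebook: names, labels, units and codes are illustrative only.
codebook = {
    "site_id": {
        "label": "Survey site identifier",
        "codes": {"A": "Site A", "B": "Site B"},
    },
    "temp_c": {
        "label": "Water temperature",
        "unit": "degrees Celsius",
        "missing": {"-99": "sensor failure"},
    },
}

# Hypothetical sample data; the third row uses a site code the codebook lacks.
sample_csv = "site_id,temp_c\nA,26.4\nB,-99\nC,27.1\n"


def check_rows(text, codebook):
    """Return (row_number, variable, problem) tuples for undocumented variables or codes."""
    problems = []
    reader = csv.DictReader(io.StringIO(text))
    for row_number, row in enumerate(reader, start=1):
        for variable, value in row.items():
            entry = codebook.get(variable)
            if entry is None:
                problems.append((row_number, variable, "variable not in codebook"))
            elif "codes" in entry and value not in entry["codes"]:
                problems.append((row_number, variable, f"undocumented code {value!r}"))
    return problems


def missing_reason(variable, value, codebook):
    """Return the documented reason if value is a missing-value code, else None."""
    return codebook[variable].get("missing", {}).get(value)


print(check_rows(sample_csv, codebook))
print(missing_reason("temp_c", "-99", codebook))
```

A check like this flags values that a secondary user would not be able to interpret from the documentation alone.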

The terms ‘data documentation’, ‘data provenance’ and ‘data lineage’ are often confused. Definitions vary, but they can be considered as a continuum, with data documentation at the broadest level. Provenance is concerned with the origins of the data, the maintenance of its identity through the data lifecycle, and how modifications to the data are accounted for. This can be likened to the chain of custody in criminal investigations: previous owners have to be identified and held accountable for the processing and cleaning operations they have performed on the data. Technical data lineage relies on metadata that tracks data flows at the lowest level, such as tables, scripts and statements.

Storing Metadata

Metadata can be stored in local systems alongside the related data, or, once it is complete, in data or metadata stores. Research Data JCU is an example of an institutional metadata store and contains records (Data Records and Data Publications) for datasets generated by JCU researchers and HDR candidates.

Data Publications in Research Data JCU are harvested regularly and published by Research Data Australia. The Research Data JCU platform also provides secure storage for datasets, which (unless restricted) can be accessed directly via the catalogue or by negotiation with the data manager.

Data-level documentation/metadata such as workflows, detailed methodologies, variable descriptions, codes and units is often stored with the data (embedded) or included in its own file as supporting documentation (e.g., a codebook or README text file).

Embedded documentation can be as simple as a key in an MS Excel spreadsheet (an additional worksheet), or it may be more complex (e.g., in software packages that include facilities for data annotation such as variable attributes, table relationships, etc.). If possible, export this as a plain text file and include it with your supporting documentation, as this facilitates FAIR data.
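
As a minimal sketch of the export step, the following turns a small key of variables and codes into plain text that could be saved as a README file. The variable names, labels and codes are hypothetical, `key_to_text` is an illustrative helper name, and reading the key out of the original spreadsheet or software package is assumed to have been done already.

```python
from pathlib import Path

# Hypothetical key, as it might appear on an additional worksheet.
variables = {
    "sex": {
        "label": "Sex of participant",
        "codes": {"1": "Female", "2": "Male", "9": "Not stated"},
    },
    "age_group": {"label": "Age in 10-year bands"},
}


def key_to_text(title, variables):
    """Render a key of variables and codes as plain text."""
    lines = [title, "=" * len(title), ""]
    for name, info in variables.items():
        lines.append(f"{name}: {info['label']}")
        for code, meaning in info.get("codes", {}).items():
            lines.append(f"  {code} = {meaning}")
    return "\n".join(lines) + "\n"


def write_readme(path, text):
    """Write the text to a UTF-8 plain text file and return its path."""
    target = Path(path)
    target.write_text(text, encoding="utf-8")
    return target


readme_text = key_to_text("Key to variables", variables)
print(readme_text)
```

Calling `write_readme("README.txt", readme_text)` would then save the key next to the data files; the file name is only a suggestion.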