RDIM Terminology

This section provides an alphabetical listing of some of the terminology used in managing research data and information along with the meaning and/or application of these terms.

FAIR Data Principles

Under Australia’s FAIR Access Policy Statement, all publicly funded research outputs must follow the FAIR principles.

The FAIR Principles have been developed to make research more visible and to allow researchers to more easily collaborate and maximise the return on investment in research and innovation. The acronym stands for:

Findable
Data can be more findable by: properly describing what the data is; putting it in a permanent and easily searchable place; and making it easy for humans and computers to search for it.
Accessible
Data can be more accessible by: using non-proprietary, standardised and automated methods to supply the data to those who want or need it; letting others know how they can get the data; and letting others know if the data is no longer available.
Interoperable
Data can be more interoperable by: storing and providing the data in widely-used and accessible file formats; describing the data using standard terms (vocabularies) that are relevant and widely known; and describing if it relates to other data and what exactly that relationship is.
Reusable
Data can be more reusable by: making it clear how the data was collected and whether there are validity concerns; stating any conditions of reuse in a licence that is readable by both humans and machines; and meeting the standards used within the relevant research community.
File Formats

A file format is the structure of a file that tells a program how to display its contents e.g., Microsoft Word documents are saved in the .docx format.

Researchers may need to use different file formats at different stages in the Research (Data and Management) Asset Lifecycle, but for long-term preservation files need to be stored in a durable format. This ensures they can be opened by future users, perhaps long after the research project has concluded. Where possible, it is recommended to:

USE

  • Formats endorsed by standards agencies such as Standards Australia, ISO
  • Open formats developed and maintained by communities of interest such as OpenDocument Format
  • Lossless formats
  • Formats widely used within a given discipline

AVOID

  • Proprietary formats
  • Formats and software at risk of obsolescence

Researchers may need to use software that does not save data in a durable format due to discipline-specific or other requirements e.g. specialised programs to capture or generate data. In these circumstances, data needs to be exported to a more durable format such as plain text (if this can be done without losing data integrity) and included alongside the original files when archiving e.g. export .csv files from SPSS (with value labels) and archive them alongside the .sav files.

Some examples of preferred formats for data archiving are:

  • CSV OR Excel spreadsheet (.xlsx) AND OpenDocument Spreadsheet (.ods)
  • Plain text (.txt) OR Word document (.docx) AND Rich text (.rtf), PDF/A or OpenDocument Text (.odt)
  • Geospatial data: ESRI shapefile (.shp, .shx, .dbf), Geo-referenced TIFF (.tif) and ESRI ASCII Grid (.asc)
  • Image files: lossless formats (.tif or .raw) preferred
  • Video: MPEG-4 (.mp4)
  • Audio: Free Lossless Audio Codec (.flac)
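As a rough illustration of how the list above could be applied, the sketch below scans a folder and reports files whose extension is not among the example formats. The function name `find_files_to_review` and the extension set are assumptions made for this example, not an official list. An extension alone cannot confirm compliance (a .pdf file is not necessarily PDF/A), so the result is only a prompt for manual review.

```python
from pathlib import Path

# Extensions drawn from the example formats listed above (an illustrative set,
# not an authoritative one). Adjust it for your own discipline.
PREFERRED_EXTENSIONS = {
    ".csv", ".xlsx", ".ods", ".txt", ".docx", ".rtf", ".pdf", ".odt",
    ".shp", ".shx", ".dbf", ".tif", ".asc", ".raw", ".mp4", ".flac",
}


def find_files_to_review(root):
    """Return relative paths of files whose extension is not in the preferred set."""
    root = Path(root)
    return sorted(
        path.relative_to(root)
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() not in PREFERRED_EXTENSIONS
    )
```

Files reported by this check (for example SPSS .sav files) are candidates for export to a more durable format, archived alongside the originals.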

The UK Data Service maintains a list of recommended and acceptable formats for agencies, researchers and others depositing social, economic and population data in their collection.


Software and file formats:

The choice of software directly influences the resulting file formats. Documenting the software and equipment used in your data creation, collection, and analysis is essential for transparency and enabling the reproducibility of your research workflows. This information can be added to the Research Data Management Plan (RDMP) in Research Data JCU, and automatically populates the metadata in associated archival Data Records and Data Publications.
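For script-based workflows, part of this documentation can be captured automatically. The following is a minimal sketch, assuming a Python-based analysis: it records the interpreter, operating system and selected package versions in a JSON file that can be kept with the data. The function names and output layout are hypothetical, and the details would still need to be entered into the RDMP by hand.

```python
import json
import platform
from importlib import metadata


def describe_environment(packages=()):
    """Collect Python, operating system and package versions as a dictionary."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return {
        "python": platform.python_version(),
        "operating_system": platform.platform(),
        "packages": versions,
    }


def write_environment_file(path, packages=()):
    """Save the environment description as indented JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(describe_environment(packages), handle, indent=2)
```

For example, `write_environment_file("software_environment.json", ["numpy", "pandas"])` would write a small file listing the versions of those two packages, if they are installed.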


Packaged files can be used for archiving large collections of heterogeneous datasets, with some provisos:

  • Use archives with extensions .zip or .tar
  • Zip the data without any data compression
  • If possible, avoid encrypting the files
  • Be aware that very large packages may be difficult to open from a browser - ETH-Bibliothek recommends packages of less than 2GB
  • Avoid long path lengths in your folder structure. Long file names combined with a detailed folder hierarchy may lead to path lengths exceeding 256 characters. This hampers further processing in Windows and WinZip cannot unpack such containers.
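The provisos above can be sketched in a short script. This example, using Python's standard `zipfile` module, stores files without compression and refuses to build the package if any internal path exceeds the length mentioned above. The function name `package_folder` and the two constants are assumptions made for illustration.

```python
import zipfile
from pathlib import Path

MAX_PATH_LENGTH = 256              # path length mentioned in the list above
MAX_PACKAGE_BYTES = 2 * 1024 ** 3  # 2GB guideline for browser downloads


def package_folder(source_dir, archive_path):
    """Zip source_dir without compression and return the archive size in bytes."""
    source_dir = Path(source_dir)
    files = sorted(p for p in source_dir.rglob("*") if p.is_file())
    names = [p.relative_to(source_dir).as_posix() for p in files]

    too_long = [name for name in names if len(name) > MAX_PATH_LENGTH]
    if too_long:
        raise ValueError(f"{len(too_long)} path(s) exceed {MAX_PATH_LENGTH} characters")

    # ZIP_STORED writes the files as-is, with no data compression.
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_STORED) as archive:
        for path, name in zip(files, names):
            archive.write(path, arcname=name)

    return Path(archive_path).stat().st_size
```

The returned size can be compared with `MAX_PACKAGE_BYTES` to decide whether a collection should be split into several smaller packages.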
File Names

File names are frequently overlooked, but are key to locating and retrieving files efficiently, especially for complex or collaborative projects.

Adopting a consistent, logical and descriptive file naming convention is good practice and will assist with data analyses and re-use.

Abbreviations and codes can be used, providing they are clear and uniformly applied. If necessary include a README.txt file in the directory (folder) that explains the naming format and any abbreviations or codes used.

File names can include information such as:

  • Project or experiment name or acronym
  • Researcher name/initials
  • Year or date of experiment
  • Location/spatial coordinates
  • Data type
  • File version number
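One way to keep such a convention consistent is to generate names from their components rather than typing them. The sketch below is illustrative only: the function `build_file_name`, the order of the components and the underscore separator are all assumptions, and any real convention should be recorded in the project's README.txt.

```python
from datetime import date


def build_file_name(project, initials, collected_on, data_type, version, extension):
    """Join the components into a single name, e.g. Project_AB_20230201_Type_v01.ext."""
    parts = [
        project,
        initials,
        collected_on.strftime("%Y%m%d"),  # YYYYMMDD keeps files in date order
        data_type,
        f"v{version:02d}",                # zero-padded version number
    ]
    return "_".join(parts) + "." + extension.lstrip(".")


print(build_file_name("ReefSurvey", "AS", date(2023, 2, 1), "Transcript", 2, ".docx"))
# prints: ReefSurvey_AS_20230201_Transcript_v02.docx
```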

The formatting of file names, file paths and field names (in databases) is very important. Poorly formatted names affect readability and can cause compatibility and processing issues, e.g. when sharing data files across platforms, migrating and backing up data, or working with command-line interfaces, scripting languages, web servers or URLs.

You should avoid:

  • special characters such as ~ ! @ # $ % ^ & * ( ) ` ; < > ? , [ ] { } ' ‘ |. While there are differences between the Windows and macOS operating systems (e.g. colons cause problems in Windows but not on Macs), it is advisable to steer clear of special characters;
  • spaces in file names. Modern systems and applications have become more lenient regarding spaces but best practice is to use underscores ( _ ), dashes ( - ), or camel case (e.g. FileName) instead, and to apply them consistently;
  • lengthy file names. For example, Windows has a default limit of 260 characters for file paths. This includes the local drive prefix, e.g. C:\Users\jc*****\OneDrive - James Cook University, so lengthy file names and/or a deep folder structure can cause issues.
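These checks are easy to automate. The sketch below is a simplified illustration, not a complete validator: it treats anything other than letters, digits, underscores, dashes and dots as a special character, and takes the path limit as a parameter rather than hard-coding it. The name `check_file_path` is hypothetical.

```python
import re

# Letters, digits, underscores, dashes and dots are treated as safe here.
SAFE_NAME = re.compile(r"^[A-Za-z0-9._-]+$")


def check_file_path(path_text, max_path_length):
    """Return a list of problems found in the file name or overall path length."""
    problems = []
    # Take the last component, accepting both Windows and POSIX separators.
    name = path_text.replace("\\", "/").rsplit("/", 1)[-1]
    if " " in name:
        problems.append("contains spaces")
    if not SAFE_NAME.match(name.replace(" ", "")):
        problems.append("contains special characters")
    if len(path_text) > max_path_length:
        problems.append("path too long")
    return problems
```

A name such as FG1_GP_20230201.docx yields an empty list, while `my file#1.docx` is flagged for both spaces and special characters.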

Some examples:

| File name | Details |
| --- | --- |
| FG1_GP_20230201.docx | Transcript for the first of several focus groups with general practitioners, conducted on February 1 2023 |
| Assessment_A024_2023-06-05.mp4 | Clinical assessment (video) for adult patient ID 024, recorded June 5 2023 |
| Assessment_A024_Scores_AS.xlsx | Evaluation of the clinical assessment by multiple researchers, including Aditya Sharma (AS) |
| Syllabus_Chemistry_TextAnalysis_v2.pdf | Descriptive file name. Includes version number |
| LifestyleSurvey-Singapore-202309-Shared.csv | Survey results with post codes and occupations removed to prevent re-identification (actions recorded in README.txt) |
| 20230812-175923-03.tif | Raw data (image) from instrument ID 03 with date and timestamp (HHMMSS) |

Renaming multiple files is onerous, but bulk renaming utilities can help.
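Where a dedicated utility is not available, a short script can do a similar job. The following is a minimal sketch, assuming the only change needed is replacing spaces with underscores. The function name `replace_spaces` is hypothetical. It defaults to a dry run so the planned changes can be checked before anything is renamed, and it skips any file whose new name already exists.

```python
from pathlib import Path


def replace_spaces(folder, dry_run=True):
    """Replace spaces with underscores in file names; return (old, new) name pairs."""
    planned = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file() or " " not in path.name:
            continue
        target = path.with_name(path.name.replace(" ", "_"))
        if target.exists():
            continue  # never overwrite an existing file
        planned.append((path.name, target.name))
        if not dry_run:
            path.rename(target)
    return planned
```

Run it once with the default `dry_run=True` to review the returned pairs, then again with `dry_run=False` to apply them. Working on a copy of the data first is a sensible precaution.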

Folder Structures

Planning for and maintaining a consistent and logical folder structure has many benefits. It can:

  • save you time searching for files
  • enhance collaboration, as everyone in the team can locate and understand shared materials
  • help ensure data integrity by reducing the risk of accidental deletion or misplacement
  • allow you to more easily revisit and share your work.

The optimal folder structure will depend on the nature and complexity of your research project and your disciplinary area.

Below is a hypothetical example illustrating the folder structure for a research project based on experiments:

Project
.../Admin
....../Budget
....../EthicsApprovals
....../Funding
....../Meetings
....../Proposal
.../Data
....../Experiment01
........./Analyses
........./DerivedData
........./Inputs
........./RawData
........./Scripts
....../Experiment02
....../Experiment03
.../Outputs
....../Presentations
....../Publications
........./2022-Mohammed-JCB-CytonemeSignalling *
............/Drafts
............/FiguresTables
........./Nguyen-Cell-ImmuneEvasion
....../Thesis
*A date (year) in the folder name indicates the paper is published.
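A structure like this can be created in one step at the start of a project, so that every team member begins with the same layout. The sketch below reproduces part of the hypothetical example above using Python's `pathlib`. The function name `create_structure` and the choice of folders are illustrative and should be adapted to the project.

```python
from pathlib import Path

# Part of the example structure shown above; extend or trim as needed.
FOLDERS = [
    "Admin/Budget",
    "Admin/EthicsApprovals",
    "Admin/Funding",
    "Admin/Meetings",
    "Admin/Proposal",
    "Data/Experiment01/Analyses",
    "Data/Experiment01/DerivedData",
    "Data/Experiment01/Inputs",
    "Data/Experiment01/RawData",
    "Data/Experiment01/Scripts",
    "Outputs/Presentations",
    "Outputs/Publications",
    "Outputs/Thesis",
]


def create_structure(project_root, folders=FOLDERS):
    """Create each folder under project_root, leaving existing folders untouched."""
    root = Path(project_root)
    for relative in folders:
        (root / relative).mkdir(parents=True, exist_ok=True)
    return root
```

Because `exist_ok=True` is set, the function can be re-run safely, for example after adding a new experiment to the list.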

Note: only data required to validate your findings (in a publication or thesis) needs to be archived in Research Data JCU via a Data Record.

The UK Data Service also provides general advice on folder structures and notes that it can help to restrict the level of folders to three or four deep, and not to have more than ten items in each list.
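As an illustration of that advice, the sketch below walks a project folder and reports any folder nested deeper than a chosen depth or holding more than a chosen number of items. The function name `review_structure` and the default thresholds (four levels, ten items) are assumptions based on the guidance quoted above, not fixed rules.

```python
from pathlib import Path


def review_structure(root, max_depth=4, max_items=10):
    """Return (folder, issue) pairs for folders that are too deep or too crowded."""
    root = Path(root)
    findings = []
    subfolders = sorted(p for p in root.rglob("*") if p.is_dir())
    for folder in [root, *subfolders]:
        relative = folder.relative_to(root)
        depth = len(relative.parts)  # 0 for the project root itself
        if depth > max_depth:
            findings.append((relative.as_posix(), f"depth {depth}"))
        item_count = sum(1 for _ in folder.iterdir())
        if item_count > max_items:
            findings.append((relative.as_posix(), f"{item_count} items"))
    return findings
```

The project root is reported as "." in the results. Findings are suggestions for tidying up rather than errors, since some instrument or raw-data folders will legitimately hold many files.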