File Formats

A file format is the structure of a file that tells a program how to display its contents. For example, Microsoft Word documents are saved in the .docx format.

Researchers may need to use different file formats at different stages in the Research (Data and Management) Asset Lifecycle, but for long-term preservation files need to be stored in a durable format. This ensures they can be opened by future users, perhaps long after the research project has concluded. Where possible, it is recommended to:

USE

  • Formats endorsed by standards agencies such as Standards Australia, ISO
  • Open formats developed and maintained by communities of interest such as OpenDocument Format
  • Lossless formats
  • Formats widely used within a given discipline

AVOID

  • Proprietary formats
  • Formats and software at risk of obsolescence

Researchers may need to use software that does not save data in a durable format because of discipline-specific or other requirements, e.g. specialised programs to capture or generate data. In these circumstances, the data should be exported to a more durable format such as plain text (if this can be done without losing data integrity) and included alongside the original files when archiving. For example, export .csv files from SPSS (with value labels) and archive them alongside the .sav files.
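The export step described above can be sketched in Python using only the standard library. This is a minimal illustration, not a prescribed workflow: it assumes the coded data and its value labels have already been read into memory (reading a .sav file itself would need a third-party reader, which is outside this sketch), and the function name export_with_labels and the codebook layout are hypothetical.

```python
import csv


def export_with_labels(rows, value_labels, data_path, codebook_path):
    """Write labelled data and a separate codebook as plain-text CSV files.

    rows          -- list of dicts, one per record, holding coded values
    value_labels  -- {variable name: {code: label}} (hypothetical structure)
    """
    rows = list(rows)
    fieldnames = list(rows[0].keys()) if rows else []

    with open(data_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for row in rows:
            # Replace each coded value with its label where one is defined;
            # values without a label are written unchanged.
            writer.writerow({
                name: value_labels.get(name, {}).get(value, value)
                for name, value in row.items()
            })

    # Keep the code-to-label mapping as well, so the original codes
    # remain recoverable from the plain-text files alone.
    with open(codebook_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["variable", "code", "label"])
        for name, labels in value_labels.items():
            for code, label in labels.items():
                writer.writerow([name, code, label])
```

Both CSV files would then be archived alongside the original software-specific files rather than replacing them.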

Some examples of preferred formats for data archiving are:

  • CSV OR Excel spreadsheet (.xlsx) AND OpenDocument Spreadsheet (.ods)
  • Plain text (.txt) OR Word document (.docx) AND Rich text (.rtf), PDF/A or OpenDocument Text (.odt)
  • Geospatial data: ESRI shapefile (.shp, .shx, .dbf), Geo-referenced TIFF (.tif) and ESRI ASCII Grid (.asc)
  • Image files: lossless formats (.tif or .raw) preferred
  • Video: MPEG-4 (.mp4)
  • Audio: Free Lossless Audio Codec (.flac)

The UK Data Service maintains a list of recommended and acceptable formats for agencies, researchers and others depositing social, economic and population data in its collection.


Software and file formats:

The choice of software directly influences the resulting file formats. Documenting the software and equipment used in your data creation, collection, and analysis is essential for transparency and for enabling the reproducibility of your research workflows. This information can be added to the Research Data Management Plan (RDMP) in Research Data JCU, where it automatically populates the metadata in associated archival Data Records and Data Publications.
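Where part of a workflow runs in a scripting language, some of this documentation can be captured automatically and stored with the data. The following Python sketch is one possible approach under stated assumptions: the function names, the JSON layout and the example software entry are all hypothetical, and the output is a supplement to the RDMP entry, not a substitute for it.

```python
import json
import platform
from datetime import datetime, timezone


def describe_environment(extra_software=None):
    """Return a small record of the software environment in use.

    extra_software -- optional {name: version} mapping for tools that
    cannot be detected from Python (entered by hand; hypothetical field).
    """
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "operating_system": platform.platform(),
        "python_version": platform.python_version(),
        "software": dict(extra_software or {}),
    }


def write_environment(path, extra_software=None):
    """Save the environment record as a plain-text JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(describe_environment(extra_software), f, indent=2)
```

For instance, write_environment("software_used.json", {"ExampleStats": "1.0"}) would save a record that can sit next to the data files; "ExampleStats" is a placeholder name, not a real product.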


Packaged files can be used for archiving large collections of heterogeneous datasets, with some provisos:

  • Use archives with extensions .zip or .tar
  • Zip the data without any data compression
  • If possible, avoid encrypting the files
  • Be aware that very large packages may be difficult to open from a browser; ETH-Bibliothek recommends packages of less than 2 GB
  • Avoid long path lengths in your folder structure. Long file names combined with a detailed folder hierarchy may lead to path lengths exceeding 256 characters. This hampers further processing in Windows, and WinZip cannot unpack such containers.
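The provisos above can be checked while a package is being built. The Python sketch below uses the standard zipfile module to write an uncompressed, unencrypted .zip and refuses to proceed if a path is too long or the contents reach the size guideline. The function name build_package is hypothetical, and treating 2 GB as 2 × 1024³ bytes is an assumption.

```python
import zipfile
from pathlib import Path

MAX_PATH_LENGTH = 256              # path-length limit mentioned in the provisos
MAX_PACKAGE_BYTES = 2 * 1024**3    # 2 GB guideline (assumed to mean 2 x 1024^3 bytes)


def build_package(source_dir, package_path):
    """Package every file under source_dir into an uncompressed .zip."""
    source_dir = Path(source_dir)
    files = sorted(p for p in source_dir.rglob("*") if p.is_file())
    names = [p.relative_to(source_dir).as_posix() for p in files]

    # Only the path inside the archive is checked here; the full path
    # after unpacking will be longer, depending on where it is extracted.
    too_long = [n for n in names if len(n) > MAX_PATH_LENGTH]
    if too_long:
        raise ValueError(f"Paths exceed {MAX_PATH_LENGTH} characters: {too_long}")

    total = sum(p.stat().st_size for p in files)
    if total >= MAX_PACKAGE_BYTES:
        raise ValueError("Contents are 2 GB or larger; consider splitting the package")

    # ZIP_STORED writes the files as-is, without any data compression.
    with zipfile.ZipFile(package_path, "w", compression=zipfile.ZIP_STORED) as zf:
        for path, name in zip(files, names):
            zf.write(path, arcname=name)
    return names
```

A call such as build_package("project_data", "project_data.zip") returns the list of archived paths, which can double as a simple manifest. The package should be written outside the source folder so that it is not picked up on a later run.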