Organise Data

Organising your data is an important part of any research project. Setting up Folder Structures, File Formats, File Names and Version Control early can save an enormous amount of time later on. It is therefore wise to consider your approach to data documentation and metadata at all three phases of the research project: before, during and after the project.

Keeping your active or working data safe and secure is also an important aspect of this step. Backing up data helps to protect your research data against accidental or malicious loss and damage.

Active data and working data are the same thing; see the definition of Active Data.

Appropriate data storage is one of the most critical aspects of good research data management. Many circumstances can lead to data loss, with potentially devastating consequences for your research project and your research career, so safeguarding against them should be a top priority.

Researchers may need different storage and collaboration solutions at different stages of the Research (Data and Information) Asset Lifecycle. The options listed under ‘Active Storage and Collaboration Options’ are suitable for storing active (working) data, collaborating with other researchers, and/or creating backups.

For archiving completed data see Data Storage - Completed Data.

The basic rules for storing data and safeguarding against data loss are:

DO make three copies of the data and keep them in separate places:

  1. A hard drive, external or on your computer;
  2. An external hard drive or cloud drive (JCU supports MS OneDrive and Teams);
  3. A network or other storage solution supported by JCU, e.g. HPRC.

DON'T keep the only copy of your research data on a hard drive, laptop, external drive or USB key; these devices do fail.
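The three-copy rule above can be partly automated. The following is a minimal sketch, not a JCU-endorsed tool: it copies one file into several destination folders and verifies each copy against a SHA-256 checksum. The function names and the idea of passing destination folders as a list are assumptions made for illustration; the destinations would be whichever storage locations you have chosen.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source: Path, destinations: list[Path]) -> list[Path]:
    """Copy `source` into each destination folder and verify every copy."""
    expected = sha256_of(source)
    copies = []
    for folder in destinations:
        folder.mkdir(parents=True, exist_ok=True)
        copy = Path(shutil.copy2(source, folder / source.name))
        # A mismatch means the copy is corrupt and must not be trusted.
        if sha256_of(copy) != expected:
            raise IOError(f"Checksum mismatch for {copy}")
        copies.append(copy)
    return copies
```

Verifying the checksum matters because a backup that was silently corrupted during copying offers no protection when the original is lost.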

Compliant Data Storage Options include:

Non-Compliant Data Storage Options include:

  • Shared university network drive (e.g. G, H, etc.)
  • Personal equipment (e.g. external drive/s, own laptop, etc.)
  • External cloud storage/collaboration space (e.g. Dropbox, Google Drive).

Non-compliant options should only be used for backups, never for primary storage.

You will also need to be familiar with the:

For information on active storage and collaboration options, refer to During Project Phase - Step 2: Manage | Organise Data | Data Storage - Active or Working Data.

Active Storage and Collaboration Options

| Feature | JCU Microsoft OneDrive | JCU Microsoft Teams | JCU HPRC |
| --- | --- | --- | --- |
| Storage Capacity | 5 TB | 25 TB per team | Up to 5 TB and 250 KB files |
| Collaborating with JCU researchers | ✔ | ✔ | ✔ |
| Collaborating with external researchers | ✔ | ✔ (see note 1) | ✔ (see note 2) |
| Sensitive Data | ✔ | ✔ | ✘ |
| Remote Access | ✔ | ✔ | ✔ |
| Data Stored in Australia | ✔ | ✔ | ✔ |
| Deleted File Recovery | ✔ (see note 3) via Service Now request | ✘ | ✘ |
| Suitable for Backups | ✔ | ✔ | ✘ |
| Best Feature | Quick to setup, easy to use, access from day one. | Includes other methods for collaboration along with file sharing. | Excellent for use with HPC compute for data analysis. |
| Getting Setup | JCU Microsoft OneDrive | Make a collaboration tool request (JCU Service Now). | Accessing the HPC |

Notes

  1. Microsoft Teams team owners can add guest or external accounts.
  2. Researchers can apply (via ServiceNow) for external collaborators to have access to their JCU HPC account.
  3. Within 30 days.

De-identifying data is the process used to prevent someone’s personal identity from being revealed. Data that has been de-identified no longer triggers the Privacy Act.

Here is an example of sensitive data that has been published as open data. In this example, the risk of re-identification via triangulation has been considered and managed and the de-identified dataset can be downloaded from Research Data Australia.

Although the study contains highly sensitive data, several techniques have been used to de-identify the dataset: identifiers and dates of birth have been removed, ages have been aggregated into bands, and postcodes have been excluded. Without these measures it would be possible to re-identify (triangulate) participants by combining, for example, a rural postcode with a rare occupation.

Think about de-identifying your data early, as it can be time-consuming and difficult later. Consult the relevant ANDS guides and seek discipline-specific advice as required.
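The techniques just described (removing identifiers and dates of birth, banding ages, excluding postcodes) can be sketched in a few lines. This is an illustrative example only, not the method used for the published dataset: the field names (`name`, `participant_id`, `date_of_birth`, `postcode`, `age`) and the ten-year band width are assumptions.

```python
def age_band(age: int, width: int = 10) -> str:
    """Aggregate an exact age into a band such as '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def de_identify(record: dict) -> dict:
    """Return a copy with direct identifiers removed and the age banded."""
    # Hypothetical field names; adjust to match your own dataset.
    dropped = {"name", "participant_id", "date_of_birth", "postcode"}
    cleaned = {
        field: value
        for field, value in record.items()
        if field not in dropped and field != "age"
    }
    if "age" in record:
        cleaned["age_band"] = age_band(record["age"])
    return cleaned


record = {
    "name": "Jane Citizen",
    "date_of_birth": "1987-03-14",
    "age": 37,
    "postcode": "4811",
    "occupation": "teacher",
}
print(de_identify(record))  # {'occupation': 'teacher', 'age_band': '30-39'}
```

A script like this handles only the mechanical part; deciding which fields are identifying, and whether the remaining fields can still be triangulated, needs human judgement.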

File names are often taken for granted, but when working on complex research projects it’s important to be able to retrieve files quickly and effectively. It’s good practice to adopt a file naming convention which is consistent, logical and descriptive. Abbreviations and codes can be used as long as they are clear and uniformly applied.

File names could include information such as:

  • Project or experiment name or acronym
  • Researcher name/initials
  • Year or date of experiment
  • Location/spatial coordinates
  • Data type
  • File version number

It's also a good idea to include a readme.txt file in the directory that explains the naming format and any abbreviations or codes used.

Avoid very long file names, and avoid special characters such as ~ ! @ # $ % ^ & * ( ) ` ; < > ? , [ ] { } ' " | in file names, directory paths and field names. Spaces in file names can also cause problems for some software or web applications, so use underscores ( _ ), dashes ( - ) or camel case (e.g. FileName) instead.
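One way to keep a naming convention consistent is to generate file names rather than type them. The sketch below assumes a hypothetical convention of project, researcher initials, date, data type and version joined by underscores; the function name, the component order and the two-digit version are illustrative choices, not a prescribed standard.

```python
import re
from datetime import date

# Each component may contain only letters, digits and dashes.
SAFE_PART = re.compile(r"^[A-Za-z0-9-]+$")


def build_file_name(project: str, researcher: str, when: date,
                    data_type: str, version: int, extension: str) -> str:
    """Join the components with underscores after checking each one."""
    parts = [project, researcher, when.strftime("%Y%m%d"),
             data_type, f"v{version:02d}"]
    for part in parts:
        if not SAFE_PART.match(part):
            raise ValueError(f"Unsafe characters in file name part: {part!r}")
    return "_".join(parts) + "." + extension.lstrip(".")


print(build_file_name("ReefSurvey", "JS", date(2024, 3, 5), "transects", 2, "csv"))
# ReefSurvey_JS_20240305_transects_v02.csv
```

Writing the date as YYYYMMDD means that sorting files by name also sorts them chronologically.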

Renaming multiple files by hand is onerous, but bulk renaming utilities can help.
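Where no dedicated utility is to hand, a short script can do a similar job. The following is a minimal sketch that replaces spaces with underscores for every file in one folder; the function name and the `dry_run` safeguard are assumptions for illustration. It defaults to a dry run so the planned changes can be reviewed before anything is renamed.

```python
from pathlib import Path


def replace_spaces(folder: Path, dry_run: bool = True) -> list[tuple[str, str]]:
    """Plan (and optionally perform) renames that swap spaces for underscores."""
    changes = []
    # sorted() takes a snapshot, so renaming during the loop is safe.
    for path in sorted(folder.iterdir()):
        if path.is_file() and " " in path.name:
            target = path.with_name(path.name.replace(" ", "_"))
            if target.exists():
                raise FileExistsError(f"{target} already exists")
            changes.append((path.name, target.name))
            if not dry_run:
                path.rename(target)
    return changes
```

Calling `replace_spaces(Path("my_data"))` returns the list of planned renames; passing `dry_run=False` applies them. Work on a backed-up copy the first time you run any bulk rename.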

A file format is the structure of a file that tells a program how to display its contents, e.g. Microsoft Word documents are saved in the .docx format.

Researchers may need to use different file formats at different stages in the Research (Data and Information) Asset Lifecycle, but for long-term preservation data will need to be stored in a durable format. This ensures files can be opened by future users, perhaps long after the research project has concluded. Where possible, it is recommended to:

USE

  • Formats endorsed by standards agencies such as Standards Australia, ISO
  • Open formats developed and maintained by communities of interest such as OpenDocument Format
  • Lossless formats
  • Formats widely used within a given discipline

AVOID

  • Proprietary formats
  • File formats and software at risk of obsolescence

Researchers may need to use software that does not save data in a durable format because of discipline-specific or other requirements, e.g. specialised programs to capture or generate data. In these circumstances, export the data to a more durable format such as plain text (if this can be done without losing data integrity) and include the export alongside the original files when archiving, e.g. export .csv files from SPSS (with value labels) and archive them alongside the .sav files.
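To illustrate what "exporting with value labels" means, here is a minimal sketch using only the Python standard library. It assumes the coded rows and the code-to-label mapping have already been read out of the original software (reading .sav files is not shown and would need a third-party package); the function name and sample data are hypothetical.

```python
import csv
import io


def export_with_labels(rows, value_labels, handle):
    """Write rows to CSV, replacing coded values with their labels."""
    fieldnames = list(rows[0].keys())
    writer = csv.DictWriter(handle, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # Fields without a label mapping are written unchanged.
        labelled = {
            field: value_labels.get(field, {}).get(value, value)
            for field, value in row.items()
        }
        writer.writerow(labelled)


rows = [{"id": 1, "sex": 1}, {"id": 2, "sex": 2}]
labels = {"sex": {1: "female", 2: "male"}}
buffer = io.StringIO()
export_with_labels(rows, labels, buffer)
print(buffer.getvalue())
```

Without the labels, a future user opening the CSV would see only the numeric codes and would need the original software, or separate documentation, to interpret them.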

Some examples of preferred formats for data archiving are:

  • CSV OR Excel spreadsheet (.xlsx) AND OpenDocument Spreadsheet (.ods)
  • Plain text (.txt) OR Word document (.docx) AND Rich text (.rtf), PDF/A or OpenDocument Text (.odt)
  • Geospatial data: ESRI shapefile (.shp, .shx, .dbf), Geo-referenced TIFF (.tif) and ESRI ASCII Grid (.asc)
  • Image files: lossless formats (.tif or .raw) preferred
  • Video: MPEG-4 (.mp4)
  • Audio: Free Lossless Audio Codec (.flac)

It is also important to document 'data capture' and 'storage formats' as well as 'software' used and their versions – refer to Metadata for further information.

The UK Data Service maintains a list of recommended and acceptable formats for agencies, researchers and others depositing social, economic and population data in their collection.

Packaged files can be used for archiving large collections of heterogeneous datasets, with some provisos:

  • Use archives with extensions .zip or .tar
  • Zip the data without any data compression
  • If possible, avoid encrypting the files
  • Be aware that very large packages may be difficult to open from a browser; ETH-Bibliothek recommends packages of less than 2 GB
  • Avoid long path lengths in your folder structure. Long file names combined with a detailed folder hierarchy may lead to path lengths exceeding 256 characters. This hampers further processing in Windows and WinZip cannot unpack such containers.
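The provisos above can be sketched with Python's standard `zipfile` module, which supports storing files without compression via `ZIP_STORED`. The function name and the decision to check only the path inside the archive are assumptions; the 256-character limit is the figure quoted in the list.

```python
import zipfile
from pathlib import Path

MAX_PATH_LENGTH = 256  # limit quoted in the guidance above


def package_folder(folder: Path, archive: Path) -> list[str]:
    """Zip every file under `folder` without compression or encryption."""
    files = sorted(p for p in folder.rglob("*") if p.is_file())
    names = [p.relative_to(folder).as_posix() for p in files]
    # Checks the path inside the archive only; the full path after
    # unpacking will be longer by the length of the target directory.
    too_long = [name for name in names if len(name) > MAX_PATH_LENGTH]
    if too_long:
        raise ValueError(f"Paths exceed {MAX_PATH_LENGTH} characters: {too_long}")
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_STORED) as bundle:
        for path, name in zip(files, names):
            bundle.write(path, arcname=name)
    return names
```

Storing without compression keeps each file's bytes intact inside the package, so a partly damaged archive is more likely to be recoverable.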

Choose a folder structure for your research project that is uniform and logical in its organisation. If you are working on a collaborative project, it becomes all the more important to use a structure that is well-organised and clear to all parties involved. The kind of folder and file directory structure may ultimately depend on the nature of your research project, the disciplinary area you are working within, and the technical complexity involved. The UK Data Service provides some general advice on folder structures and notes that it helps to restrict the level of folders to three or four deep, and not to have more than ten items in each list.
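As a rough illustration, a folder structure can be created from a template and then checked against the depth and item limits mentioned above. The layout below is purely hypothetical, as are the function names; choose folders that suit your own project and discipline.

```python
from pathlib import Path

# Hypothetical layout; adapt the folder names to your project.
LAYOUT = ["data/raw", "data/processed", "docs",
          "analysis/scripts", "analysis/outputs"]


def create_layout(root: Path, layout=LAYOUT) -> None:
    """Create each folder in the layout beneath `root`."""
    for relative in layout:
        (root / relative).mkdir(parents=True, exist_ok=True)


def check_layout(root: Path, max_depth: int = 4, max_items: int = 10) -> list[str]:
    """Report folders nested too deeply or holding too many items."""
    problems = []
    folders = [root, *sorted(p for p in root.rglob("*") if p.is_dir())]
    for folder in folders:
        depth = len(folder.relative_to(root).parts)
        if depth > max_depth:
            problems.append(f"{folder}: nested {depth} levels deep")
        items = sum(1 for _ in folder.iterdir())
        if items > max_items:
            problems.append(f"{folder}: contains {items} items")
    return problems
```

Running `check_layout` periodically on a shared project folder gives an early warning when the structure starts to drift from what the team agreed.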

You may also like to take a look at this tweet from @micahgallen for an effective directory structure (for research projects) and notice how many researchers this resonated with!

Broadly, sensitive data is information that could potentially impact on the rights of others. The Australian National Data Service includes the following definition of sensitive data in their Publishing and Sharing Sensitive Data guide: ‘Sensitive data identifies individuals, species, objects or locations, and carries a risk of causing discrimination, harm or unwanted attention.’

Sensitive data is often about people (i.e. personal information), but ecological data can also be sensitive if it reveals, for example, the location of rare or endangered species. Under law and the research ethics governance of most institutions, sensitive data typically cannot be shared in its original form. The Legal and Ethical Framework section outlines the applicable legislation and guidelines.

Personal information is sensitive if it directly identifies a person and includes one or more pieces of information from Table 1 (Part II, Division 1, Section 6) of the Privacy Act 1988. This information includes:

  • Racial or ethnic origin
  • Political opinions
  • Membership of a political association
  • Religious beliefs or affiliations
  • Philosophical beliefs
  • Membership of a professional or trade association
  • Membership of a trade union
  • Sexual orientation or practices
  • Criminal record
  • Health information (see section 6FA for definition)
  • Genetic information
  • Biometric information.

While sensitive data cannot be published in its original form, in almost all cases, it can be shared using a combination of:

Refer also to Access Conditions (Open, Conditional, Restricted).

Researchers should be aware that data that is not obviously sensitive (with no names or dates of birth, for example), or that has been de-identified, can become sensitive through triangulation or data linkage.

Triangulation in this context is the process of combining several pieces of non-sensitive information (in the same dataset) to determine the identity or sensitivity of a participant or subject.
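A simple way to gauge triangulation risk is to count how many records share each combination of indirectly identifying values; combinations held by very few records are the ones most easily re-identified. The sketch below is a k-anonymity-style check offered as an illustration, not a method prescribed by this guide. The function name, the field names and the threshold `k=2` are assumptions.

```python
from collections import Counter


def rare_combinations(records, quasi_identifiers, k=2):
    """Return value combinations shared by fewer than k records."""
    counts = Counter(
        tuple(record[field] for field in quasi_identifiers)
        for record in records
    )
    return {combo: n for combo, n in counts.items() if n < k}


records = [
    {"postcode": "4811", "occupation": "teacher"},
    {"postcode": "4811", "occupation": "teacher"},
    {"postcode": "4871", "occupation": "lighthouse keeper"},
]
print(rare_combinations(records, ["postcode", "occupation"]))
# {('4871', 'lighthouse keeper'): 1}
```

Here neither field is sensitive on its own, yet together they single out one participant, which is exactly the triangulation risk described above.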

Data linkage combines two or more datasets that include the same participant or subject, an activity that carries the risk of re-identification and may place subjects at risk. Data linkage is highly useful (it increases understanding without having to collect new data and derives greater value from existing datasets) and is increasingly common in epidemiology and in the medical, social and ecological sciences. Researchers should treat the new, linked dataset as an identifiable dataset and assess the risks involved.
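The following toy example shows why a linked dataset deserves a fresh risk assessment: each source reveals little on its own, but the linked record carries every attribute from both. The datasets, field names and `link` function are invented for illustration and do not represent how an integrating authority performs linkage.

```python
def link(left, right, key):
    """Join two lists of records on a shared key (inner join)."""
    index = {row[key]: row for row in right}
    return [
        {**row, **index[row[key]]}
        for row in left
        if row[key] in index
    ]


# Hypothetical datasets that share a study identifier.
health = [{"study_id": 7, "diagnosis": "asthma"}]
survey = [{"study_id": 7, "age_band": "30-39", "occupation": "teacher"}]

linked = link(health, survey, "study_id")
print(linked)
# [{'study_id': 7, 'diagnosis': 'asthma', 'age_band': '30-39', 'occupation': 'teacher'}]
```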

High-risk data integration projects involving information from Australian, state or territory governments will need to be managed by an accredited integrating authority, such as the Australian Institute of Health and Welfare (AIHW), the Australian Institute of Family Studies (AIFS) or the Australian Bureau of Statistics (ABS), to ensure security. Once data is linked, researchers will access it through a secure data lab in Canberra, a mobile data lab, a remote access computing environment or another secure arrangement, and the output and use of data will be monitored. The AIHW has useful information on data linkage on its website.

Version control is the process of managing file (or record, or dataset) revisions. It is particularly important for files that undergo numerous revisions, where there are multiple members of a research team, or when files are shared across multiple locations. Version control ensures you’re working with current versions and that you’re not wasting valuable research time or putting data at risk.

Basic version control can be achieved by assigning unique file names and keeping a version control table to record changes. Other strategies include using version control facilities within the software you are using (see the UK Data Archive's guide to applying versioning in MS Word), or using file sharing services such as Dropbox or Google Drive, and controlling rights to file editing and manually merging edits by multiple users. You might also consider using specific versioning software such as Git or Mercurial (see also the list below).

Best practice is to:

  • Decide how many versions of a file to keep, which versions to keep, for how long and how to organise versions.
  • Identify milestone versions to keep, e.g., major versions rather than minor versions (keep version 02-00 but not 02-01).
  • Uniquely identify different versions of files using a systematic naming convention.
  • Record changes made to a file when a new version is created.
  • Record relationships between items where needed, for example between code and the data file it is run against; between data file and related documentation or metadata; or between multiple files.
  • Track the location of files if they are stored in a variety of locations.
  • Regularly synchronise files in different locations.
  • Identify a single location for the storage of milestone and master versions.

(Source: the UK Data Archive guide)

A simple version control table format is available in the Version Control Table Template.
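The milestone rule in the list above (keep version 02-00 but not 02-01) is easy to apply programmatically once versions follow a systematic naming convention. The sketch below assumes a hypothetical convention in which file names end in `_vMM-NN` before the extension, where MM is the major version and NN the minor version; the pattern and function names are illustrative only.

```python
import re

# Matches a suffix such as "_v02-01" immediately before the extension.
VERSION = re.compile(r"_v(\d{2})-(\d{2})(?=\.[^.]+$)")


def parse_version(file_name: str) -> tuple[int, int]:
    """Return (major, minor) parsed from a versioned file name."""
    match = VERSION.search(file_name)
    if match is None:
        raise ValueError(f"No version found in {file_name!r}")
    return int(match.group(1)), int(match.group(2))


def milestones(file_names: list[str]) -> list[str]:
    """Keep only major (milestone) versions, i.e. minor version 00."""
    return [name for name in file_names if parse_version(name)[1] == 0]


names = ["survey_v01-00.csv", "survey_v02-00.csv", "survey_v02-01.csv"]
print(milestones(names))  # ['survey_v01-00.csv', 'survey_v02-00.csv']
```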