Organise Data

Organising your data is an important part of any research project. Setting up Folder Structures, File Formats, File Names and Version Control early can save an enormous amount of time later on. It's wise, therefore, to consider your approach to data organisation, documentation and metadata at all three phases of the research project, i.e. before, during and after the project.

Keeping your active or working data safe and secure is also an important aspect of this step. Backing up data helps to safeguard against accidental or malicious loss of, or damage to, your research data.

Active data and working data are the same thing; see the definition of Active Data.

Appropriate data storage is a critical aspect of good research data management.

Many factors can lead to data loss or misuse with devastating consequences for your research and research career. Safeguarding against these should be a priority.

Researchers may need different storage and collaboration solutions at different stages of the Research (Data and Information) Asset Lifecycle. The options, listed below under 'Active Storage and Collaboration Options', are suitable for storing active (working) data, collaborating with other researchers, and/or creating backups.

JCU approved data storage options include those listed under 'Active Storage and Collaboration Options' below and are marked with an asterisk (*) throughout this page, e.g. JCU Microsoft OneDrive, JCU Microsoft Teams, JCU HPRC and JCU QCIF Research Data Storage (QRISCloud).

Non-approved data storage options include:

  • Shared university network drive (e.g. G, H etc)
  • Personal equipment (e.g. external drive/s, own laptop, etc)
  • External cloud storage/collaboration space (e.g., Dropbox, Google Drive).

Non-approved options should only be used for backups, never for primary storage.

For archiving completed data see Data Storage - Completed Data.

The basic rules for storing data and safeguarding against data loss are:


DO keep three copies in separate places, i.e. on at least two different types of media (physical device or cloud), with at least one copy in another location (different physical location or cloud).
Ensure at least one copy is stored on a JCU approved solution.*


DON'T keep the only copy of your research data on a physical device e.g., hard drive (PC, laptop or external HDD) or USB key.
These can easily be lost, damaged or fail.

The optimal combination of storage solutions will depend on your specific workflow, the volume and sensitivity of your data, and your preferences for file access and collaboration. For instance, you may prefer to work on your PC's hard drive and sync to JCU Microsoft OneDrive if factors such as internet access, performance or application compatibility are important. On the other hand, syncing from OneDrive back to your PC (ensure you have sufficient space) facilitates collaboration and provides access to version history and a cloud backup if local storage fails. In practice, a combination of these approaches is likely to be helpful.

IMPORTANT: The following hypothetical research projects and storage options are provided for guidance only and are not prescriptive. To discuss a specific project and storage requirements in more detail please contact researchdata@jcu.edu.au

General

  1. JCU Microsoft OneDrive*;
  2. Synchronised with the hard drive on a personal computer or laptop; and
  3. Backed up to an external hard drive or cloud service.

Field-work based (no internet access):

  1. Mobile device (tablet or laptop) for offline data collection in the field;
  2. Copied to an external hard drive to create local backups; and
  3. Synchronised with JCU Microsoft OneDrive* on return from the field

Computational analysis:

  1. Hard drive on personal computer or laptop for day-to-day work and analysis (ideally synchronised with JCU Microsoft OneDrive*)
  2. JCU HPRC* for large-scale processing and simulations; and/or JCU QCIF Research Data Storage (QRISCloud)* for large datasets (>~50 GB) and collaboration; and
  3. External hard drive(s) for local backup and portability

Sensitive data:

  1. Dedicated JCU “R share” drive* for highly sensitive data and collaboration within the research team;
  2. JCU Microsoft OneDrive* for remote access and external collaboration via link (non-identifiable data)
  3. Encrypted external hard drive stored onsite for offline backup

Active Storage and Collaboration Options

|  | JCU Microsoft OneDrive | JCU Microsoft Teams | JCU HPRC | JCU QCIF Research Data Storage (QRISCloud) |
|---|---|---|---|---|
| Storage Capacity | 5 TB | 25 TB per team | Up to 5 TB and 250,000 files | >50 GB up to many TB and 1,000,000 files |
| Collaborating with JCU researchers | ✔ | ✔ | ✔ | ✔ |
| Collaborating with external researchers | ✔ | ✔ ¹ | ✔ ² | ✔ |
| Sensitive Data | ✔ | ✔ | ✘ | ✔ |
| Remote Access | ✔ | ✔ | ✔ | ✔ |
| Data Stored in Australia | ✔ | ✔ | ✔ | ✔ |
| Deleted File Recovery | ✔ ³ (via ServiceNow request) | ✘ | ✘ | ✘ |
| Suitable for Backups | ✔ | ✔ | ✘ | ✔ |
| Best Feature | Quick to set up, easy to use, access from day one | Includes other methods for collaboration along with file sharing | Excellent for use with HPC compute for data analysis | Large-scale data storage |
| Getting Set Up | JCU Microsoft OneDrive | Make a collaboration tool request via JCU ServiceNow | Accessing the HPC | QRIScloud |

Notes

  1. Microsoft Teams team owners can add guest or external accounts.
  2. Researchers can apply (via ServiceNow) for external collaborators to have access to their JCU HPC account.
  3. Within 30 days.

De-identifying data is the process used to prevent someone's personal identity from being revealed. Data that has been de-identified is no longer subject to the Privacy Act.

For example, data from the PALS (Pregnancy and Lifestyle Study) has been de-identified and is available for download. The risk of re-identification via triangulation has also been considered and managed.

Although the study contains highly sensitive data, several techniques have been used to de-identify the dataset, e.g. identifiers and dates of birth have been removed, ages have been aggregated into bands, and postcodes have been excluded. Postcodes were excluded because it would otherwise have been possible to re-identify (triangulate) participants by combining, for example, a rural postcode with an occupation.

Think about de-identifying your data early as it can be time consuming and difficult later. The Australian Research Data Commons (ARDC) has some tips on de-identification, listed below and in their Identifiable Data guide. You should also seek discipline-specific advice as required.

  • plan de-identification early in the research as part of your data management planning
  • make sure the consent process includes the accepted level of anonymity required and clearly states what may and may not be recorded, transcribed, or shared
  • retain original unedited versions of data for use within the research team and for preservation
  • create a de-identification log of all replacements, aggregations or removals made (see the sketch after this list)
  • store the log separately from the de-identified data files
  • identify replacements in text in a meaningful way, e.g. in transcribed interviews indicate replaced text with [brackets] or use XML markup tags
  • for qualitative data (such as transcribed interviews or survey textual answers), use pseudonyms or generic descriptors rather than blanking out information
  • digitally manipulate audio and image files to remove identifying information
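
To illustrate the removal, aggregation and logging steps above, here is a minimal Python sketch. The column names, age bands and file names are hypothetical, and real projects should follow discipline-specific guidance rather than rely on a script like this as-is:

```python
import csv

SOURCE = "pals_raw.csv"           # original, kept only within the research team
OUTPUT = "pals_deidentified.csv"  # shareable version
LOG = "deidentification_log.csv"  # stored separately from the data files

AGE_BANDS = [(0, 24, "under 25"), (25, 44, "25-44"), (45, 64, "45-64"), (65, 200, "65+")]
DROP = ("name", "date_of_birth", "postcode", "age")  # hypothetical identifiers

def band(age: int) -> str:
    """Aggregate an exact age into a band."""
    for lo, hi, label in AGE_BANDS:
        if lo <= age <= hi:
            return label
    return "unknown"

with open(SOURCE, newline="") as src, \
     open(OUTPUT, "w", newline="") as out, \
     open(LOG, "w", newline="") as log:
    reader = csv.DictReader(src)
    # Keep everything except the direct identifiers and exact age.
    kept = [c for c in reader.fieldnames if c not in DROP]
    writer = csv.DictWriter(out, fieldnames=kept + ["age_band"])
    writer.writeheader()
    logger = csv.writer(log)
    logger.writerow(["row", "action", "detail"])
    for i, row in enumerate(reader, start=1):
        age_band = band(int(row["age"])) if row.get("age") else "unknown"
        # Record every removal and aggregation in the separate log.
        logger.writerow([i, "removed", "name, date_of_birth, postcode"])
        logger.writerow([i, "aggregated", f"age {row.get('age')} -> {age_band}"])
        new_row = {c: row[c] for c in kept}
        new_row["age_band"] = age_band
        writer.writerow(new_row)
```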

File names are frequently overlooked, but are key to locating and retrieving files efficiently, especially for complex or collaborative projects.

Adopting a consistent, logical and descriptive file naming convention is good practice and will assist with data analyses and re-use.

Abbreviations and codes can be used, provided they are clear and uniformly applied. If necessary, include a README.txt file in the directory (folder) that explains the naming format and any abbreviations or codes used.

File names can include information such as:

  • Project or experiment name or acronym
  • Researcher name/initials
  • Year or date of experiment
  • Location/spatial coordinates
  • Data type
  • File version number

The formatting of file names, file paths and field names (in databases) is very important. Poorly formatted names affect readability and can cause compatibility and processing issues, e.g. when sharing data files across platforms, migrating and backing up data, or working with command-line interfaces, scripting languages, web servers or URLs.

You should avoid:

  • special characters such as ~ ! @ # $ % ^ & * ( ) ` ; < > ? , [ ] { } ' | . While there are differences between the Windows and macOS operating systems (e.g. colons cause problems on Windows but not on Macs), it is advisable to steer clear of special characters;
  • spaces in file names. Modern systems and applications have become more lenient regarding spaces but best practice is to use underscores ( _ ), dashes ( - ), or camel case (e.g. FileName) instead, and to apply them consistently;
  • lengthy file names. For example, Windows limits file paths to 260 characters. This includes the local drive prefix, e.g. C:\Users\jc*****\OneDrive - James Cook University, so lengthy file names and/or a deep folder structure can cause issues.

Some examples:

| File Name | Details |
|---|---|
| FG1_GP_20230201.docx | Transcript for the first of several focus groups with general practitioners, conducted on 1 February 2023 |
| Assessment_A024_2023-06-05.mp4 | Clinical assessment (video) for adult patient ID 024, recorded 5 June 2023 |
| Assessment_A024_Scores_AS.xlsx | Evaluation of the clinical assessment by multiple researchers, including Aditya Sharma (AS) |
| Syllabus_Chemistry_TextAnalysis_v2.pdf | Descriptive file name; includes version number |
| LifestyleSurvey-Singapore-202309-Shared.csv | Survey results with postcodes and occupations removed to prevent re-identification (actions recorded in README.txt) |
| 20230812-175923-03.tif | Raw data (image) from instrument ID 03 with date (YYYYMMDD) and timestamp (HHMMSS) |

Renaming multiple files by hand is onerous, but bulk renaming utilities can help.
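
A short script can also do the job. The following Python sketch applies two of the rules above, replacing spaces with underscores and stripping the special characters listed earlier; the folder path is hypothetical, and it is wise to test on a copy of your files first:

```python
import re
from pathlib import Path

FOLDER = Path("./data")  # hypothetical folder of files to rename

for path in FOLDER.iterdir():
    if not path.is_file():
        continue
    # Replace spaces with underscores, then drop the special
    # characters listed in the guidance above.
    cleaned = path.name.replace(" ", "_")
    cleaned = re.sub(r"[~!@#$%^&*()`;<>?,\[\]{}'|]", "", cleaned)
    if cleaned != path.name:
        print(f"{path.name} -> {cleaned}")
        path.rename(path.with_name(cleaned))
```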

A file format is the structure of a file that tells a program how to display its contents e.g., Microsoft Word documents are saved in the .docx format.

Researchers may need to use different file formats at different stages in the Research (Data and Information) Asset Lifecycle, but for long-term preservation, files will need to be stored in a durable format. This ensures they can be opened by future users, perhaps long after the research project has concluded. Where possible, it is recommended to:

USE

  • Formats endorsed by standards agencies such as Standards Australia, ISO
  • Open formats developed and maintained by communities of interest such as OpenDocument Format
  • Lossless formats
  • Formats widely used within a given discipline

AVOID

  • Proprietary formats
  • File format and software obsolescence

Researchers may need to use software that does not save data in a durable format due to discipline-specific or other requirements e.g. specialised programs to capture or generate data. In these circumstances, data needs to be exported to a more durable format such as plain text (if this can be done without losing data integrity) and included alongside the original files when archiving e.g. export .csv files from SPSS (with value labels) and archive them alongside the .sav files.
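
The SPSS-to-CSV export described above can be scripted. A minimal sketch using the pyreadstat library (the file names are hypothetical); `apply_value_formats=True` writes value labels rather than numeric codes, so the CSV remains interpretable without SPSS:

```python
import pyreadstat

# Read the proprietary SPSS file, applying value labels
# (e.g. 1 -> "Female") so the exported text is self-describing.
df, meta = pyreadstat.read_sav("survey.sav", apply_value_formats=True)

# Archive a durable plain-text copy alongside the original .sav file.
df.to_csv("survey.csv", index=False)
```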

Some examples of preferred formats for data archiving are:

  • Tabular data: CSV (.csv), or Excel spreadsheet (.xlsx) together with OpenDocument Spreadsheet (.ods)
  • Text: plain text (.txt), or Word document (.docx) together with Rich Text (.rtf), PDF/A or OpenDocument Text (.odt)
  • Geospatial data: ESRI shapefile (.shp, .shx, .dbf), Geo-referenced TIFF (.tif) and ESRI ASCII Grid (.asc)
  • Image files: lossless formats (.tif or .raw) preferred
  • Video: MPEG-4 (.mp4)
  • Audio: Free Lossless Audio Codec (.flac)

The UK Data Service maintains a list of recommended and acceptable formats for agencies, researchers and others depositing social, economic and population data in their collection.


Software and file formats:

The choice of software directly influences the resulting file formats. Documenting the software and equipment used in your data creation, collection, and analysis is essential for transparency and enabling the reproducibility of your research workflows. This information can be added to the Research Data Management Plan (RDMP) in Research Data JCU, and automatically populates the metadata in associated archival Data Records and Data Publications.


Packaged files can be used for archiving large collections of heterogeneous datasets with some provisos:

  • Use archives with extensions .zip or .tar
  • Zip the data without any data compression (see the sketch after this list)
  • If possible, avoid encrypting the files
  • Be aware that very large packages may be difficult to open from a browser - ETH-Bibliothek recommends packages of less than 2 GB
  • Avoid long path lengths in your folder structure. Long file names combined with a detailed folder hierarchy may lead to path lengths exceeding 256 characters. This hampers further processing in Windows, and WinZip cannot unpack such containers.
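
The compression proviso is easy to get wrong with default tools. A minimal Python sketch (the folder and archive names are hypothetical) that packages a dataset without compression using `ZIP_STORED`:

```python
import zipfile
from pathlib import Path

DATASET = Path("Experiment01")      # hypothetical folder to package
ARCHIVE = Path("Experiment01.zip")

# ZIP_STORED stores files without any data compression.
with zipfile.ZipFile(ARCHIVE, "w", compression=zipfile.ZIP_STORED) as zf:
    for path in sorted(DATASET.rglob("*")):
        if path.is_file():
            # Keep archive paths relative and short to avoid the
            # path-length issues noted above.
            zf.write(path, arcname=path.relative_to(DATASET.parent))
```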

Planning for and maintaining a consistent and logical folder structure has many benefits. It can:

  • save you time searching for files
  • enhance collaboration, as everyone in the team can locate and understand shared materials
  • help ensure data integrity by reducing the risk of accidental deletion or misplacement
  • allow you to more easily revisit and share your work.

The optimal folder structure will depend on the nature and complexity of your research project and your disciplinary area.

Below is a hypothetical example illustrating the folder structure for a research project based on experiments:

Project
.../Admin
....../Budget
....../EthicsApprovals
....../Funding
....../Meetings
....../Proposal
.../Data
....../Experiment01
........./Analyses
........./DerivedData
........./Inputs
........./RawData
........./Scripts
....../Experiment02
....../Experiment03
.../Outputs
....../Presentations
....../Publications
........./2022-Mohammed-JCB-CytonemeSignalling *
............/Drafts
............/FiguresTables
........./Nguyen-Cell-ImmuneEvasion
....../Thesis
*A date (year) in the folder name indicates the paper is published.

Note: only data required to validate your findings (in a publication or thesis) needs to be archived in Research Data JCU via a Data Record.

The UK Data Service also provides general advice on folder structures and notes that it can help to restrict the level of folders to three or four deep, and not to have more than ten items in each list.
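
A structure like the example above can also be scripted so that every new project starts the same way. A minimal sketch using Python's pathlib (the folder names follow the hypothetical example):

```python
from pathlib import Path

# Folder scaffold based on the hypothetical structure above.
FOLDERS = [
    "Admin/Budget", "Admin/EthicsApprovals", "Admin/Funding",
    "Admin/Meetings", "Admin/Proposal",
    "Data/Experiment01/Analyses", "Data/Experiment01/DerivedData",
    "Data/Experiment01/Inputs", "Data/Experiment01/RawData",
    "Data/Experiment01/Scripts",
    "Outputs/Presentations", "Outputs/Publications", "Outputs/Thesis",
]

root = Path("Project")  # hypothetical project root
for folder in FOLDERS:
    (root / folder).mkdir(parents=True, exist_ok=True)
```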

Data is considered sensitive if it can be used to identify an individual, species, object, or location in a way that introduces a risk of discrimination, harm, or unwanted attention.

Examples of sensitive data include identifiable or re-identifiable personal and health/medical data, Indigenous data, ecological data (e.g. the location of rare or endangered species), and commercial-in-confidence data.

Sensitive data is commonly subject to legal, ethical and/or regulatory requirements that restrict how it can be accessed, handled and shared.

Personal information is sensitive if it directly identifies a person and includes one or more pieces of information from Table 1 (Part I, Division I, Section 6) of the Privacy Act 1988. This information includes:

  • Racial or ethnic origin
  • Political opinions
  • Membership of a political association
  • Religious beliefs or affiliations
  • Philosophical beliefs
  • Membership of a professional or trade association
  • Membership of a trade union
  • Sexual orientation or practices
  • Criminal record
  • Health information (see section 6FA for definition)
  • Genetic information
  • Biometric information.

While sensitive data cannot be published in its original form, in the majority of cases it can be shared using a combination of informed consent, de-identification and controlled access arrangements.

It's important that researchers are aware that data that is not obviously sensitive (no names or dates of birth, for example), or that has been de-identified, can become sensitive through triangulation or data linkage.

Triangulation in this context is the process of combining several pieces of non-sensitive information (in the same dataset) to determine the identity or sensitivity of a participant or subject.

Data linkage combines one or more datasets that include the same participant or subject, an activity that carries the risk of re-identification and may place subjects at risk. Data linkage is highly useful (it increases understanding without having to collect new data and derives greater value from existing datasets) and is increasingly common in epidemiology, medical, social and ecological sciences. Researchers should treat the new, linked dataset as an identifiable dataset and assess the risks involved.
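
One simple way to gauge triangulation risk before sharing is to count how many records share each combination of quasi-identifiers; any combination held by a single participant is potentially re-identifying. A toy sketch using pandas (the file and column names are hypothetical):

```python
import pandas as pd

df = pd.read_csv("survey_deidentified.csv")  # hypothetical dataset

# Quasi-identifiers: fields that are harmless alone but potentially
# identifying in combination (triangulation).
quasi = ["postcode", "occupation", "age_band"]

group_sizes = df.groupby(quasi).size()
risky = group_sizes[group_sizes == 1]
print(f"{len(risky)} combination(s) correspond to a single participant")
```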

High-risk data integration projects involving information from Australian, state or territory governments will need to be managed by an accredited integrating authority, such as the Australian Institute of Health and Welfare (AIHW), the Australian Institute of Family Studies (AIFS) or the Australian Bureau of Statistics (ABS), to ensure security. Once data is linked, researchers will access it through a secure data lab in Canberra, a mobile data lab, a remote access computing environment or another secure arrangement, and output and use of the data will be monitored. The AIHW has useful information on data linkage on their website.

Version control is the process of managing revisions for files, records or datasets. It is particularly important for files that undergo numerous revisions, where there are multiple members of a research team, or when files are shared across multiple locations. Version control ensures you are working with current versions and that you are not wasting valuable research time or putting data at risk.

Basic version control can be achieved by assigning unique File Names and keeping a version control table to record changes. While including details such as initials, date modified, and status (e.g. draft, revised, and final) in file names alongside version numbers aids identification, it can become unwieldy. Much of this information is better captured in a version control table like this one:

Title:
Description:
Created By:
Date Created:
Maintained By:

| Version Number | Modified By | Modifications Made | Date Modified | Status |
|---|---|---|---|---|
|  |  |  |  |  |
|  |  |  |  |  |

Best practice:

Planning for best practice involves recognising that there is no one-size-fits-all solution. Instead, it is essential to make thoughtful decisions regarding:

  • Retention Policy: Determine how many versions of a file to keep, which versions to retain, the duration of retention, and the folder structures for organising versions.
  • Milestone Identification: Identify significant milestone versions, prioritising major versions over minor ones. For instance, consider keeping version 02-00 but not 02-01.
  • Naming Conventions: Establish a systematic naming convention to uniquely identify different versions of files (see the sketch after this list).
  • Documentation: Record changes made to a file when creating a new version and establish clear documentation for tracking those changes.
  • Relationship Mapping: Record relationships between items as needed, such as between code and the data file it operates on, the data file and related documentation or metadata, or multiple files.
  • Location Tracking: Track the location of files, especially if stored in various locations.
  • Synchronisation: Regularly synchronise files in different locations to maintain consistency.
  • Centralised Storage: Identify a single location for storing milestone and master versions.

(Source: Adapted from the UK Data Archive Guide)
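
As an example of the naming convention and milestone points above, here is a minimal Python sketch; the `name_vMM-mm.ext` pattern is a hypothetical convention, so adjust it to your own scheme:

```python
import re

# Hypothetical convention: name_vMM-mm.ext, e.g. Report_v02-01.docx
PATTERN = re.compile(r"^(?P<stem>.+)_v(?P<major>\d{2})-(?P<minor>\d{2})\.(?P<ext>\w+)$")

def bump(filename: str, milestone: bool = False) -> str:
    """Return the next version of a file name.

    Minor bumps mark routine edits; a milestone bump (e.g. 02-03 -> 03-00)
    marks a major version worth retaining long term.
    """
    m = PATTERN.match(filename)
    if m is None:
        raise ValueError(f"{filename} does not follow the _vMM-mm convention")
    major, minor = int(m["major"]), int(m["minor"])
    if milestone:
        major, minor = major + 1, 0
    else:
        minor += 1
    return f"{m['stem']}_v{major:02d}-{minor:02d}.{m['ext']}"

print(bump("Report_v02-01.docx"))                  # Report_v02-02.docx
print(bump("Report_v02-03.docx", milestone=True))  # Report_v03-00.docx
```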


Version control systems:

While platforms like OneDrive, Google Docs and Dropbox offer built-in version history and the ability to restore previous versions, this does not substitute for a planned and systematic approach to version control as outlined above. It is also critical to understand how long previous versions are retained when using these services. Always consult the documentation or support resources to ensure alignment with your project's needs.

For more complex research projects, especially those involving extensive collaboration or code development, a dedicated version control system may offer a more robust solution. These systems include sophisticated branching, collaboration, and tracking capabilities.

Git (used with the GitHub or GitLab platforms) and Mercurial are well-known solutions. The Bitbucket platform is also a popular choice and can host repositories that use Git or Mercurial. Subversion (SVN) with TortoiseSVN (Windows-based client for Subversion) is a user-friendly option, especially for those less familiar with command-line interfaces.
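
As a brief taste of what such systems automate, here is a minimal sketch using the GitPython package (the repository and file names are hypothetical, and the files must exist before they are added). Each commit permanently records who changed what, and when:

```python
from git import Repo  # provided by the GitPython package

# Initialise a repository in a (hypothetical) analysis folder.
repo = Repo.init("my_analysis")

# Stage and commit files that already exist in that folder.
repo.index.add(["clean_data.py", "README.md"])
commit = repo.index.commit("Add initial data-cleaning script")
print(commit.hexsha[:8])  # short identifier for this version
```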