Organise Data

Organising your data is an important part of any research project. Setting up Folder Structures, File Formats, File Names and Version Control early can save an enormous amount of time later on. It's wise, therefore, to consider your approach to data organisation, documentation and metadata at all three phases of the research project, i.e. before, during and after the project.

Keeping your active or working data safe and secure is also an important aspect of this step. Backing up data helps to safeguard against accidental or malicious loss of, or damage to, your research data.

Active data and working data are the same thing; see the definition of Active Data.

Appropriate data storage is a critical aspect of good research data management.

Many factors can lead to data loss or misuse with devastating consequences for your research and research career. Safeguarding against these should be a priority.

Researchers may need different storage and collaboration solutions at different stages of the Research (Data and Information) Asset Lifecycle. The options, listed below under 'Active Storage and Collaboration Options', are suitable for storing active (working) data, collaborating with other researchers, and/or creating backups.

JCU approved data storage options include those listed under 'Active Storage and Collaboration Options' below and are marked with an asterisk (*) throughout this page, e.g. JCU Microsoft OneDrive, JCU Microsoft Teams, JCU HPRC and JCU QCIF Research Data Storage (QRISCloud).

Non-approved data storage options include:

  • Shared university network drive (e.g. G, H etc)
  • Personal equipment (e.g. external drive/s, own laptop, etc)
  • External cloud storage/collaboration space (e.g., Dropbox, Google Drive).

Non-approved options should only be used for backups, never for primary storage.

For archiving completed data see Data Storage - Completed Data.

The basic rules for storing data and safeguarding against data loss are:


DO keep three copies in separate places, i.e. on at least two different types of media (physical device or cloud), with at least one copy in another location (different physical location or cloud).
Ensure at least one copy is stored on a JCU approved solution.*


DON'T keep the only copy of your research data on a physical device e.g., hard drive (PC, laptop or external HDD) or USB key.
These can easily be lost, damaged or fail.

The optimal combination of storage solutions will depend on your specific workflow, the volume and sensitivity of your data, and your preferences for file access and collaboration. For instance, you may prefer to work on your PC's hard drive and sync to JCU Microsoft OneDrive if factors such as internet access, performance or application compatibility are important. On the other hand, syncing from OneDrive back to your PC (ensure you have sufficient space) facilitates collaboration and provides access to version history and a cloud backup if local storage fails. In practice, a combination of these approaches is likely to be helpful.

IMPORTANT: The following hypothetical research projects and storage options are provided for guidance only and are not prescriptive. To discuss a specific project and storage requirements in more detail please contact researchdata@jcu.edu.au

General

  1. JCU Microsoft OneDrive*;
  2. Synchronised with the hard drive on a personal computer or laptop; and
  3. Backed up to an external hard drive or cloud service.

Field-work based (no internet access):

  1. Mobile device (tablet or laptop) for offline data collection in the field;
  2. Copied to an external hard drive to create local backups; and
  3. Synchronised with JCU Microsoft OneDrive* on return from the field

Computational analysis:

  1. Hard drive on personal computer or laptop for day-to-day work and analysis (ideally synchronised with JCU Microsoft OneDrive*)
  2. JCU HPRC* for large-scale processing and simulations; and/or JCU QCIF Research Data Storage (QRISCloud)* for large datasets (>~50 GB) and collaboration; and
  3. External hard drive(s) for local backup and portability

Sensitive data:

  1. Dedicated JCU “R share” drive* for highly sensitive data and collaboration within the research team;
  2. JCU Microsoft OneDrive* for remote access and external collaboration via link (non-identifiable data)
  3. Encrypted external hard drive stored onsite for offline backup

Active Storage and Collaboration Options

|  | JCU Microsoft OneDrive | JCU Microsoft Teams | JCU HPRC | JCU QCIF Research Data Storage (QRISCloud) |
|---|---|---|---|---|
| Storage Capacity | 5 TB | 25 TB per team | Up to 5 TB and 250,000 files | >50 GB up to many TB and 1,000,000 files |
| Collaborating with JCU researchers | ✔ | ✔ | ✔ | ✔ |
| Collaborating with external researchers | ✔ | ✔ ¹ | ✔ ² | ✔ |
| Sensitive Data | ✔ | ✔ | ✘ | ✔ |
| Remote Access | ✔ | ✔ | ✔ | ✔ |
| Data Stored in Australia | ✔ | ✔ | ✔ | ✔ |
| Deleted File Recovery | ✔ ³ (via ServiceNow request) | ✘ | ✘ | ✘ |
| Suitable for Backups | ✔ | ✔ | ✘ | ✔ |
| Best Feature | Quick to set up, easy to use, access from day one | Includes other methods for collaboration along with file sharing | Excellent for use with HPC compute for data analysis | Large-scale data storage |
| Getting Set Up | JCU Microsoft OneDrive | Make a collaboration tool request via JCU ServiceNow | Accessing the HPC | QRIScloud |

Notes

  1. Microsoft Teams team owners can add guest or external accounts.
  2. Researchers can apply (via ServiceNow) for external collaborators to have access to their JCU HPC account.
  3. Within 30 days.

De-identifying data is the process used to prevent someone's personal identity from being revealed. Data that has been de-identified is no longer subject to the Privacy Act.

For example, data from the PALS (Pregnancy and Lifestyle Study) has been de-identified and is available for download. The risk of re-identification via triangulation has also been considered and managed.

Although the study contains highly sensitive data, several techniques have been used to de-identify the dataset, e.g. identifiers and dates of birth have been removed, ages have been aggregated into bands, and postcodes have been excluded. Postcodes were excluded because it would otherwise have been possible to re-identify (triangulate) participants by combining, for example, a rural postcode with an occupation.

Think about de-identifying your data early as it can be time consuming and difficult later. The Australian Research Data Commons (ARDC) has some tips on de-identification, listed below and in their Identifiable Data guide. You should also seek discipline-specific advice as required.

  • plan de-identification early in the research as part of your data management planning
  • make sure the consent process includes the accepted level of anonymity required and clearly states what may and may not be recorded, transcribed, or shared
  • retain original unedited versions of data for use within the research team and for preservation
  • create a de-identification log of all replacements, aggregations or removals made (see the sketch after this list)
  • store the log separately from the de-identified data files
  • identify replacements in text in a meaningful way, e.g. in transcribed interviews indicate replaced text with [brackets] or use XML markup tags
  • for qualitative data (such as transcribed interviews or survey textual answers), use pseudonyms or generic descriptors rather than blanking out information
  • digitally manipulate audio and image files to remove identifying information
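
To illustrate the removal, aggregation and logging steps above, here is a minimal Python sketch. The column names, age bands and file names are hypothetical, and real projects should follow discipline-specific guidance rather than rely on a script like this as-is:

```python
import csv

SOURCE = "pals_raw.csv"           # original, kept only within the research team
OUTPUT = "pals_deidentified.csv"  # shareable version
LOG = "deidentification_log.csv"  # stored separately from the data files

AGE_BANDS = [(0, 24, "under 25"), (25, 44, "25-44"), (45, 64, "45-64"), (65, 200, "65+")]
DROP = ("name", "date_of_birth", "postcode", "age")  # hypothetical identifiers

def band(age: int) -> str:
    """Aggregate an exact age into a band."""
    for lo, hi, label in AGE_BANDS:
        if lo <= age <= hi:
            return label
    return "unknown"

with open(SOURCE, newline="") as src, \
     open(OUTPUT, "w", newline="") as out, \
     open(LOG, "w", newline="") as log:
    reader = csv.DictReader(src)
    # Keep everything except the direct identifiers and exact age.
    kept = [c for c in reader.fieldnames if c not in DROP]
    writer = csv.DictWriter(out, fieldnames=kept + ["age_band"])
    writer.writeheader()
    logger = csv.writer(log)
    logger.writerow(["row", "action", "detail"])
    for i, row in enumerate(reader, start=1):
        age_band = band(int(row["age"])) if row.get("age") else "unknown"
        # Record every removal and aggregation in the separate log.
        logger.writerow([i, "removed", "name, date_of_birth, postcode"])
        logger.writerow([i, "aggregated", f"age {row.get('age')} -> {age_band}"])
        new_row = {c: row[c] for c in kept}
        new_row["age_band"] = age_band
        writer.writerow(new_row)
```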

File names are frequently overlooked, but are key to locating and retrieving files efficiently, especially for complex or collaborative projects.

Adopting a consistent, logical and descriptive file naming convention is good practice and will assist with data analyses and re-use.

Abbreviations and codes can be used, provided they are clear and uniformly applied. If necessary, include a README.txt file in the directory (folder) that explains the naming format and any abbreviations or codes used.

File names can include information such as:

  • Project or experiment name or acronym
  • Researcher name/initials
  • Year or date of experiment
  • Location/spatial coordinates
  • Data type
  • File version number

The formatting of file names, file paths and field names (in databases) is very important. Poorly formatted names affect readability and can cause compatibility and processing issues, e.g. when sharing data files across platforms, migrating and backing up data, or working with command-line interfaces, scripting languages, web servers or URLs.

You should avoid:

  • special characters such as ~ ! @ # $ % ^ & * ( ) ` ; < > ? , [ ] { } ' | . While there are differences between the Windows and macOS operating systems (e.g. colons cause problems on Windows but not on Macs), it is advisable to steer clear of special characters;
  • spaces in file names. Modern systems and applications have become more lenient regarding spaces but best practice is to use underscores ( _ ), dashes ( - ), or camel case (e.g. FileName) instead, and to apply them consistently;
  • lengthy file names. For example, Windows limits file paths to 260 characters. This includes the local drive prefix, e.g. C:\Users\jc*****\OneDrive - James Cook University, so lengthy file names and/or a deep folder structure can cause issues.

Some examples:

| File Name | Details |
|---|---|
| FG1_GP_20230201.docx | Transcript for the first of several focus groups with general practitioners, conducted on 1 February 2023 |
| Assessment_A024_2023-06-05.mp4 | Clinical assessment (video) for adult patient ID 024, recorded 5 June 2023 |
| Assessment_A024_Scores_AS.xlsx | Evaluation of the clinical assessment by multiple researchers, including Aditya Sharma (AS) |
| Syllabus_Chemistry_TextAnalysis_v2.pdf | Descriptive file name; includes version number |
| LifestyleSurvey-Singapore-202309-Shared.csv | Survey results with postcodes and occupations removed to prevent re-identification (actions recorded in README.txt) |
| 20230812-175923-03.tif | Raw data (image) from instrument ID 03 with date (YYYYMMDD) and timestamp (HHMMSS) |

Renaming multiple files by hand is onerous, but bulk renaming utilities can help.
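
A short script can also do the job. The following Python sketch applies two of the rules above, replacing spaces with underscores and stripping the special characters listed earlier; the folder path is hypothetical, and it is wise to test on a copy of your files first:

```python
import re
from pathlib import Path

FOLDER = Path("./data")  # hypothetical folder of files to rename

for path in FOLDER.iterdir():
    if not path.is_file():
        continue
    # Replace spaces with underscores, then drop the special
    # characters listed in the guidance above.
    cleaned = path.name.replace(" ", "_")
    cleaned = re.sub(r"[~!@#$%^&*()`;<>?,\[\]{}'|]", "", cleaned)
    if cleaned != path.name:
        print(f"{path.name} -> {cleaned}")
        path.rename(path.with_name(cleaned))
```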

A file format is the structure of a file that tells a program how to display its contents e.g., Microsoft Word documents are saved in the .docx format.

Researchers may need to use different file formats at different stages in the Research (Data and Information) Asset Lifecycle, but for long-term preservation, files will need to be stored in a durable format. This ensures they can be opened by future users, perhaps long after the research project has concluded. Where possible, it is recommended to:

USE

  • Formats endorsed by standards agencies such as Standards Australia, ISO
  • Open formats developed and maintained by communities of interest such as OpenDocument Format
  • Lossless formats
  • Formats widely used within a given discipline

AVOID

  • Proprietary formats
  • File format and software obsolescence

Researchers may need to use software that does not save data in a durable format due to discipline-specific or other requirements e.g. specialised programs to capture or generate data. In these circumstances, data needs to be exported to a more durable format such as plain text (if this can be done without losing data integrity) and included alongside the original files when archiving e.g. export .csv files from SPSS (with value labels) and archive them alongside the .sav files.
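
The SPSS-to-CSV export described above can be scripted. A minimal sketch using the pyreadstat library (the file names are hypothetical); `apply_value_formats=True` writes value labels rather than numeric codes, so the CSV remains interpretable without SPSS:

```python
import pyreadstat

# Read the proprietary SPSS file, applying value labels
# (e.g. 1 -> "Female") so the exported text is self-describing.
df, meta = pyreadstat.read_sav("survey.sav", apply_value_formats=True)

# Archive a durable plain-text copy alongside the original .sav file.
df.to_csv("survey.csv", index=False)
```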

Some examples of preferred formats for data archiving are:

  • Tabular data: CSV (.csv), or Excel spreadsheet (.xlsx) together with OpenDocument Spreadsheet (.ods)
  • Text: plain text (.txt), or Word document (.docx) together with Rich Text (.rtf), PDF/A or OpenDocument Text (.odt)
  • Geospatial data: ESRI shapefile (.shp, .shx, .dbf), Geo-referenced TIFF (.tif) and ESRI ASCII Grid (.asc)
  • Image files: lossless formats (.tif or .raw) preferred
  • Video: MPEG-4 (.mp4)
  • Audio: Free Lossless Audio Codec (.flac)

The UK Data Service maintains a list of recommended and acceptable formats for agencies, researchers and others depositing social, economic and population data in their collection.


Software and file formats:

The choice of software directly influences the resulting file formats. Documenting the software and equipment used in your data creation, collection, and analysis is essential for transparency and enabling the reproducibility of your research workflows. This information can be added to the Research Data Management Plan (RDMP) in Research Data JCU, and automatically populates the metadata in associated archival Data Records and Data Publications.


Packaged files can be used for archiving large collections of heterogeneous datasets with some provisos:

  • Use archives with extensions .zip or .tar
  • Zip the data without any data compression (see the sketch after this list)
  • If possible, avoid encrypting the files
  • Be aware that very large packages may be difficult to open from a browser - ETH-Bibliothek recommends packages of less than 2 GB
  • Avoid long path lengths in your folder structure. Long file names combined with a detailed folder hierarchy may lead to path lengths exceeding 256 characters. This hampers further processing in Windows, and WinZip cannot unpack such containers.
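
The compression proviso is easy to get wrong with default tools. A minimal Python sketch (the folder and archive names are hypothetical) that packages a dataset without compression using `ZIP_STORED`:

```python
import zipfile
from pathlib import Path

DATASET = Path("Experiment01")      # hypothetical folder to package
ARCHIVE = Path("Experiment01.zip")

# ZIP_STORED stores files without any data compression.
with zipfile.ZipFile(ARCHIVE, "w", compression=zipfile.ZIP_STORED) as zf:
    for path in sorted(DATASET.rglob("*")):
        if path.is_file():
            # Keep archive paths relative and short to avoid the
            # path-length issues noted above.
            zf.write(path, arcname=path.relative_to(DATASET.parent))
```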

Planning for and maintaining a consistent and logical folder structure has many benefits. It can:

  • save you time searching for files
  • enhance collaboration, as everyone in the team can locate and understand shared materials
  • help ensure data integrity by reducing the risk of accidental deletion or misplacement
  • allow you to more easily revisit and share your work.

The optimal folder structure will depend on the nature and complexity of your research project and your disciplinary area.

Below is a hypothetical example illustrating the folder structure for a research project based on experiments:

Project
.../Admin
....../Budget
....../EthicsApprovals
....../Funding
....../Meetings
....../Proposal
.../Data
....../Experiment01
........./Analyses
........./DerivedData
........./Inputs
........./RawData
........./Scripts
....../Experiment02
....../Experiment03
.../Outputs
....../Presentations
....../Publications
........./2022-Mohammed-JCB-CytonemeSignalling *
............/Drafts
............/FiguresTables
........./Nguyen-Cell-ImmuneEvasion
....../Thesis
*A date (year) in the folder name indicates the paper is published.

Note: only data required to validate your findings (in a publication or thesis) needs to be archived in Research Data JCU via a Data Record.

The UK Data Service also provides general advice on folder structures and notes that it can help to restrict the level of folders to three or four deep, and not to have more than ten items in each list.
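
A structure like the example above can also be scripted so that every new project starts the same way. A minimal sketch using Python's pathlib (the folder names follow the hypothetical example):

```python
from pathlib import Path

# Folder scaffold based on the hypothetical structure above.
FOLDERS = [
    "Admin/Budget", "Admin/EthicsApprovals", "Admin/Funding",
    "Admin/Meetings", "Admin/Proposal",
    "Data/Experiment01/Analyses", "Data/Experiment01/DerivedData",
    "Data/Experiment01/Inputs", "Data/Experiment01/RawData",
    "Data/Experiment01/Scripts",
    "Outputs/Presentations", "Outputs/Publications", "Outputs/Thesis",
]

root = Path("Project")  # hypothetical project root
for folder in FOLDERS:
    (root / folder).mkdir(parents=True, exist_ok=True)
```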

Data is considered sensitive if it can be used to identify an individual, species, object, or location in a way that introduces a risk of discrimination, harm, or unwanted attention.

Examples of sensitive data include identifiable or re-identifiable personal and health/medical data, Indigenous data, ecological data (e.g. the location of rare or endangered species), and commercial-in-confidence data.

Sensitive data is commonly subject to legal, ethical and/or regulatory requirements that restrict how it can be accessed, handled and shared.

Personal information is sensitive if it directly identifies a person and includes one or more pieces of information from Table 1 (Part I, Division I, Section 6) of the Privacy Act 1988. This information includes:

  • Racial or ethnic origin
  • Political opinions
  • Membership of a political association
  • Religious beliefs or affiliations
  • Philosophical beliefs
  • Membership of a professional or trade association
  • Membership of a trade union
  • Sexual orientation or practices
  • Criminal record
  • Health information (see section 6FA for definition)
  • Genetic information
  • Biometric information.

While sensitive data cannot be published in its original form, in the majority of cases it can be shared using a combination of informed consent, de-identification and controlled access arrangements.

It's important that researchers are aware that data that is not obviously sensitive (no names or dates of birth, for example), or that has been de-identified, can become sensitive through triangulation or data linkage.

Triangulation in this context is the process of combining several pieces of non-sensitive information (in the same dataset) to determine the identity or sensitivity of a participant or subject.

Data linkage combines one or more datasets that include the same participant or subject, an activity that carries the risk of re-identification and may place subjects at risk. Data linkage is highly useful (it increases understanding without having to collect new data and derives greater value from existing datasets) and is increasingly common in epidemiology, medical, social and ecological sciences. Researchers should treat the new, linked dataset as an identifiable dataset and assess the risks involved.
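
One simple way to gauge triangulation risk before sharing is to count how many records share each combination of quasi-identifiers; any combination held by a single participant is potentially re-identifying. A toy sketch using pandas (the file and column names are hypothetical):

```python
import pandas as pd

df = pd.read_csv("survey_deidentified.csv")  # hypothetical dataset

# Quasi-identifiers: fields that are harmless alone but potentially
# identifying in combination (triangulation).
quasi = ["postcode", "occupation", "age_band"]

group_sizes = df.groupby(quasi).size()
risky = group_sizes[group_sizes == 1]
print(f"{len(risky)} combination(s) correspond to a single participant")
```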

High-risk data integration projects involving information from Australian, state or territory governments will need to be managed by an accredited integrating authority, such as the Australian Institute of Health and Welfare (AIHW), the Australian Institute of Family Studies (AIFS) or the Australian Bureau of Statistics (ABS), to ensure security. Once data is linked, researchers will access it through a secure data lab in Canberra, a mobile data lab, a remote access computing environment or another secure arrangement, and output and use of the data will be monitored. The AIHW has useful information on data linkage on their website.

Version control is the process of managing revisions for files, records or datasets. It is particularly important for files that undergo numerous revisions, where there are multiple members of a research team, or when files are shared across multiple locations. Version control ensures you are working with current versions and that you are not wasting valuable research time or putting data at risk.

Basic version control can be achieved by assigning unique File Names and keeping a version control table to record changes. While including details such as initials, date modified, and status (e.g. draft, revised, and final) in file names alongside version numbers aids identification, it can become unwieldy. Much of this information is better captured in a version control table like this one:

Title:
Description:
Created By:
Date Created:
Maintained By:

| Version Number | Modified By | Modifications Made | Date Modified | Status |
|---|---|---|---|---|
|  |  |  |  |  |
|  |  |  |  |  |

Best practice:

Planning for best practice involves recognising that there is no one-size-fits-all solution. Instead, it is essential to make thoughtful decisions regarding:

  • Retention Policy: Determine how many versions of a file to keep, which versions to retain, the duration of retention, and the folder structures for organising versions.
  • Milestone Identification: Identify significant milestone versions, prioritising major versions over minor ones. For instance, consider keeping version 02-00 but not 02-01.
  • Naming Conventions: Establish a systematic naming convention to uniquely identify different versions of files (see the sketch after this list).
  • Documentation: Record changes made to a file when creating a new version and establish clear documentation for tracking those changes.
  • Relationship Mapping: Record relationships between items as needed, such as between code and the data file it operates on, the data file and related documentation or metadata, or multiple files.
  • Location Tracking: Track the location of files, especially if stored in various locations.
  • Synchronisation: Regularly synchronise files in different locations to maintain consistency.
  • Centralised Storage: Identify a single location for storing milestone and master versions.

(Source: Adapted from the UK Data Archive Guide)
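
As an example of the naming convention and milestone points above, here is a minimal Python sketch; the `name_vMM-mm.ext` pattern is a hypothetical convention, so adjust it to your own scheme:

```python
import re

# Hypothetical convention: name_vMM-mm.ext, e.g. Report_v02-01.docx
PATTERN = re.compile(r"^(?P<stem>.+)_v(?P<major>\d{2})-(?P<minor>\d{2})\.(?P<ext>\w+)$")

def bump(filename: str, milestone: bool = False) -> str:
    """Return the next version of a file name.

    Minor bumps mark routine edits; a milestone bump (e.g. 02-03 -> 03-00)
    marks a major version worth retaining long term.
    """
    m = PATTERN.match(filename)
    if m is None:
        raise ValueError(f"{filename} does not follow the _vMM-mm convention")
    major, minor = int(m["major"]), int(m["minor"])
    if milestone:
        major, minor = major + 1, 0
    else:
        minor += 1
    return f"{m['stem']}_v{major:02d}-{minor:02d}.{m['ext']}"

print(bump("Report_v02-01.docx"))                  # Report_v02-02.docx
print(bump("Report_v02-03.docx", milestone=True))  # Report_v03-00.docx
```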


Version control systems:

While platforms like OneDrive, Google Docs and Dropbox offer built-in version history and the ability to restore previous versions, this does not substitute for a planned and systematic approach to version control as outlined above. It is also critical to understand how long previous versions are retained when using these services. Always consult the documentation or support resources to ensure alignment with your project's needs.

For more complex research projects, especially those involving extensive collaboration or code development, a dedicated version control system may offer a more robust solution. These systems include sophisticated branching, collaboration, and tracking capabilities.

Git (used with the GitHub or GitLab platforms) and Mercurial are well-known solutions. The Bitbucket platform is also a popular choice and can host repositories that use Git or Mercurial. Subversion (SVN) with TortoiseSVN (Windows-based client for Subversion) is a user-friendly option, especially for those less familiar with command-line interfaces.
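
As a brief taste of what such systems automate, here is a minimal sketch using the GitPython package (the repository and file names are hypothetical, and the files must exist before they are added). Each commit permanently records who changed what, and when:

```python
from git import Repo  # provided by the GitPython package

# Initialise a repository in a (hypothetical) analysis folder.
repo = Repo.init("my_analysis")

# Stage and commit files that already exist in that folder.
repo.index.add(["clean_data.py", "README.md"])
commit = repo.index.commit("Add initial data-cleaning script")
print(commit.hexsha[:8])  # short identifier for this version
```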