De-identifying Data

De-identifying data is the process used to prevent someone’s personal identity from being revealed. Data that has been de-identified no longer triggers the Privacy Act.

For example, data from the PALS (Pregnancy and Lifestyle Study) has been de-identified and is available for download. The risk of re-identification via triangulation has also been considered and managed.

Although the study contains highly sensitive data, several techniques have been used to de-identify the dataset: for example, identifiers and dates of birth have been removed, ages have been aggregated into bands, and postcodes have been excluded. Without these measures, it would be possible to re-identify (triangulate) participants by combining, for example, a rural postcode with an occupation.
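
The techniques mentioned above (removing identifiers and dates of birth, aggregating ages into bands, and excluding postcodes) can be sketched in Python as follows. This is a minimal illustration only, not the method used for the PALS dataset; the field names, the sample record and the 10-year band width are all assumptions made for the example.

```python
# Hypothetical field names; adjust these to match the dataset at hand.
DIRECT_IDENTIFIERS = {"name", "participant_id", "date_of_birth", "postcode"}


def age_band(age, width=10):
    """Return a band label such as '30-39' for an exact age in years."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def deidentify(record):
    """Return a copy of the record with identifiers dropped and age banded."""
    # Drop direct identifiers, date of birth and postcode.
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    # Replace the exact age with an aggregated band.
    if "age" in cleaned:
        cleaned["age_band"] = age_band(cleaned.pop("age"))
    return cleaned


# Invented sample record, for illustration only.
record = {
    "name": "Jane Citizen",
    "participant_id": "P0042",
    "date_of_birth": "1988-03-14",
    "postcode": "2350",
    "age": 34,
    "occupation": "teacher",
}

print(deidentify(record))
# {'occupation': 'teacher', 'age_band': '30-39'}
```

Removing fields in this way does not by itself rule out re-identification: the remaining combination of variables (here, occupation and age band) still needs to be assessed for triangulation risk.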

Think about de-identifying your data early, as it can be time-consuming and difficult to do later. The Australian Research Data Commons (ARDC) has some tips on de-identification, listed below and in their Identifiable Data guide. You should also seek discipline-specific advice as required.

  • plan de-identification early in the research as part of your data management planning
  • make sure the consent process includes the accepted level of anonymity required and clearly states what may and may not be recorded, transcribed, or shared
  • retain original unedited versions of data for use within the research team and for preservation
  • create a de-identification log of all replacements, aggregations or removals made
  • store the log separately from the de-identified data files
  • identify replacements in text in a meaningful way, e.g. in transcribed interviews indicate replaced text with [brackets] or use XML markup tags
  • for qualitative data (such as transcribed interviews or survey textual answers), use pseudonyms or generic descriptors rather than blanking out information
  • digitally manipulate audio and image files to remove identifying information
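
For transcribed text, several of the tips above (bracketed replacements, generic descriptors instead of blanked-out text, and a de-identification log) can be combined in a short script. The sketch below is one possible approach rather than an ARDC-endorsed tool; the function name `pseudonymise`, the sample transcript and the replacement mapping are hypothetical.

```python
import re


def pseudonymise(text, replacements):
    """Replace identifying terms with [bracketed] descriptors.

    Returns the edited text and a log of the replacements made.
    """
    log = []
    for original, descriptor in replacements.items():
        marker = f"[{descriptor}]"
        # Match whole words only, so "Mary" does not match inside "Maryborough".
        pattern = re.compile(r"\b" + re.escape(original) + r"\b")
        text, count = pattern.subn(lambda match: marker, text)
        if count:
            log.append(
                {"original": original, "replacement": marker, "occurrences": count}
            )
    return text, log


# Invented transcript and mapping, for illustration only.
transcript = "Mary said she moved to Armidale after Mary's mother retired."
replacements = {"Mary": "Participant 1", "Armidale": "regional town"}

edited, log = pseudonymise(transcript, replacements)
print(edited)
# [Participant 1] said she moved to [regional town] after [Participant 1]'s mother retired.
for entry in log:
    print(entry)
```

Because the log contains the original identifying terms, it should be saved separately from the de-identified files, as the tips above recommend. A simple word-matching approach like this will miss indirect identifiers and misspellings, so a manual review of the transcript is still needed.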