Data Wrangling (Cleaning)

Data wrangling, or data cleaning, is the process of identifying and correcting errors and/or making formatting more consistent. It’s often required to prepare data for analysis and/or visualisation and, where appropriate, for publishing and sharing. Data also needs to be cleaned before archiving. This helps ensure that it’s preserved correctly, is not misinterpreted by other users, and is interoperable with other data (interoperability is one of the FAIR Principles).

White et al. (2013) published an excellent paper, ‘Nine simple ways to make it easier to (re)use your data’, in Ideas in Ecology and Evolution. The authors noted that much of the shared data in ecology and evolutionary biology is not easily reused because it doesn't follow best practices in terms of data structure, metadata and licences.

Their nine specific recommendations are:

  • Share your data.
  • Provide metadata.
  • Provide an unprocessed form of the data.
  • Use standard data formats.
  • Use good null values.
  • Make it easy to combine your data with other datasets.
  • Perform basic quality control.
  • Use an established repository.
  • Use an established and liberal license.
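
Two of these recommendations, using good null values and performing basic quality control, lend themselves to a short illustration. The following is a minimal Python sketch, not a method taken from the paper: the list of ad hoc null markers, the choice of an empty string as the single null value, the column name mass_g and the plausible range of 0 to 100 are all assumptions made for the example.

```python
import csv
import io

# Assumed ad hoc null markers; adjust to whatever your own data contains.
NULL_MARKERS = {"", "na", "n/a", "null", "none", "-999", "-9999"}
# Assumed single, consistent null value used after cleaning.
NULL = ""


def clean_rows(csv_text):
    """Strip stray whitespace and map ad hoc null markers to one null value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    cleaned = []
    for row in reader:
        new_row = {}
        for key, value in row.items():
            value = (value or "").strip()
            if value.lower() in NULL_MARKERS:
                value = NULL
            new_row[key.strip()] = value
        cleaned.append(new_row)
    return cleaned


def quality_report(rows, field, low, high):
    """Count missing values and list rows that are non-numeric or out of range."""
    missing = 0
    suspect_rows = []
    for index, row in enumerate(rows):
        value = row[field]
        if value == NULL:
            missing += 1
            continue
        try:
            number = float(value)
        except ValueError:
            suspect_rows.append(index)
            continue
        if not low <= number <= high:
            suspect_rows.append(index)
    return {"missing": missing, "suspect_rows": suspect_rows}


# Hypothetical example data with inconsistent nulls and one implausible value.
raw = """site, mass_g
A, 12.5
B, -999
C, N/A
D, 1250
"""

rows = clean_rows(raw)
report = quality_report(rows, "mass_g", low=0, high=100)
print(report)  # {'missing': 2, 'suspect_rows': [3]}
```

The cleaning step works on a copy of the data, which fits the recommendation to keep an unprocessed form alongside the cleaned version. Suspect rows are only flagged, not changed, so that a person can decide whether each value is an error or a genuine observation.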