8 Data Management

8.1 Data discipline begins before analysis

Many research failures are not computational failures. They are data handling failures: unclear sources, missing metadata, overwritten files, silent transformations, or confusion between raw and processed material.

8.2 Keep raw data immutable

As a default rule:

  • never edit raw data in place
  • store a clear original copy
  • document source, date, and acquisition method
  • transform data into new files or tables

This one rule prevents many downstream problems: the original is always available for comparison, and every transformation can be rerun from a known starting point.
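
The rule above can be partly enforced by the file system. The following is a minimal sketch, not a prescribed workflow: the function names store_raw_copy and derive, and the idea of separate raw and processed directories, are assumptions made for illustration.

```python
import shutil
import stat
from pathlib import Path


def store_raw_copy(source: Path, raw_dir: Path) -> Path:
    """Copy a newly acquired file into raw_dir and mark the copy read-only."""
    raw_dir.mkdir(parents=True, exist_ok=True)
    target = raw_dir / source.name
    if target.exists():
        # Never replace an existing original, even with a "better" one.
        raise FileExistsError(f"refusing to overwrite raw file: {target}")
    shutil.copy2(source, target)
    # Drop the write permission bits so accidental in-place edits fail.
    target.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return target


def derive(raw_file: Path, processed_dir: Path, transform) -> Path:
    """Apply transform to the raw text and write the result to a new file."""
    processed_dir.mkdir(parents=True, exist_ok=True)
    out = processed_dir / raw_file.name
    text = raw_file.read_text(encoding="utf-8")
    out.write_text(transform(text), encoding="utf-8")
    return out
```

Read-only permissions are a guard against accidents, not a security control; anyone with sufficient rights can still change them. The source, date, and acquisition method still need to be recorded separately.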

8.3 Make file names work for you

A file name is a tiny interface. Good file names make sorting, searching, and collaboration easier.

Common qualities of robust file names:

  • consistent date format such as YYYY-MM-DD
  • descriptive but compact wording
  • no ambiguous version labels like final-final
  • machine-friendly separators, such as hyphens and underscores instead of spaces

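One way to keep these qualities consistent is to generate names rather than type them. The sketch below assumes one possible convention (ISO date, underscore, hyphenated lowercase description); the function name build_file_name and the exact layout are illustrative choices, not a standard.

```python
import re
from datetime import date


def build_file_name(day: date, description: str, extension: str) -> str:
    """Return a name of the form YYYY-MM-DD_short-description.ext."""
    # Lowercase, then collapse every run of other characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    if not slug:
        raise ValueError("description must contain letters or digits")
    return f"{day.isoformat()}_{slug}.{extension.lstrip('.')}"


name = build_file_name(date(2024, 3, 5), "Site Survey: Results", ".csv")
print(name)  # 2024-03-05_site-survey-results.csv
```

Because the date comes first and is zero-padded, a plain alphabetical sort of such names is also a chronological sort.
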
8.4 Metadata is part of the dataset

At minimum, preserve:

  • source and licensing
  • collection method
  • unit definitions
  • codebooks or schema notes
  • processing assumptions
  • quality concerns

If a dataset matters, its context matters too.
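
A simple way to keep that context attached is a sidecar file stored next to the data. The following sketch assumes a JSON sidecar with one key per item in the list above; the field names, the ".meta.json" suffix, and the function write_sidecar are hypothetical choices for illustration.

```python
import json
from pathlib import Path

# One key per minimum item listed above (names are illustrative).
REQUIRED_FIELDS = (
    "source",
    "license",
    "collection_method",
    "units",
    "schema_notes",
    "processing_assumptions",
    "quality_concerns",
)


def write_sidecar(data_file: Path, metadata: dict) -> Path:
    """Write metadata to <data file name>.meta.json beside the data file."""
    missing = [field for field in REQUIRED_FIELDS if field not in metadata]
    if missing:
        raise ValueError(f"missing metadata fields: {', '.join(missing)}")
    sidecar = data_file.with_name(data_file.name + ".meta.json")
    text = json.dumps(metadata, indent=2, sort_keys=True) + "\n"
    sidecar.write_text(text, encoding="utf-8")
    return sidecar
```

Refusing to write an incomplete record is a deliberate choice here: an explicit "none known" under quality_concerns is more useful than a silently absent field.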

8.5 Storage tiers

As data volumes grow, think in tiers:

  • active local storage for current work
  • synced storage for portability and collaboration
  • archival storage for retention and recovery

Not every dataset belongs in every tier. Large raw inputs, confidential records, and intermediates that can be regenerated may each need different treatment.

8.6 Versioning rules

Git works very well for text and small structured files. It is not a universal data versioning system: large or binary files inflate repository history and do not diff meaningfully. For larger datasets, you may need:

  • release snapshots
  • checksums
  • cloud object storage versioning
  • data package manifests
  • external registries

Choose the lightest approach that still preserves trust.
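
Of the options listed, checksums are often the lightest. The sketch below records a SHA-256 digest for every file under a directory and later reports which files differ from that record. The function names and the manifest layout (relative path mapped to hex digest) are assumptions for illustration, not an established manifest format.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir: Path) -> dict[str, str]:
    """Map each file's path (relative to data_dir) to its SHA-256 digest."""
    return {
        path.relative_to(data_dir).as_posix(): sha256_of(path)
        for path in sorted(data_dir.rglob("*"))
        if path.is_file()
    }


def changed_files(data_dir: Path, manifest: dict[str, str]) -> list[str]:
    """List files that were added, removed, or modified since the manifest."""
    current = build_manifest(data_dir)
    names = manifest.keys() | current.keys()
    return sorted(n for n in names if manifest.get(n) != current.get(n))
```

A manifest like this is small enough to keep in Git even when the data itself is not, which lets the repository record exactly which data a result was computed from.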