8 Data Management

8.1 Data discipline begins before analysis

Many research failures are not computational failures. They are data handling failures: unclear sources, missing metadata, overwritten files, silent transformations, or confusion between raw and processed material.

8.2 Keep raw data immutable

As a default rule:

never edit raw data in place
store a clear original copy
document source, date, and acquisition method
transform data into new files or tables

This one rule prevents a remarkable number of downstream problems.

8.3 Make file names work for you

A file name is a tiny interface. Good file names make sorting, searching, and collaboration easier.

Common qualities of robust file names:

consistent date format such as YYYY-MM-DD
descriptive but compact wording
no ambiguous version labels like final-final
machine-friendly separators

8.4 Metadata is part of the dataset

At minimum, preserve:

source and licensing
collection method
unit definitions
codebooks or schema notes
processing assumptions
quality concerns

If a dataset matters, its context matters too.

8.5 Storage tiers

As data volumes grow, think in tiers:

active local storage for current work
synced storage for portability and collaboration
archival storage for retention and recovery

Not every dataset belongs in every tier. Large raw inputs, confidential records, and regenerated intermediates may need different treatment.

8.6 Versioning rules

Git works very well for text and small structured files. It is not a universal data versioning system. For larger datasets, you may need:

release snapshots
checksums
cloud object storage versioning
data package manifests
external registries

Choose the lightest approach that still preserves trust.