8 Data Management
8.1 Data discipline begins before analysis
Many research failures are not computational failures. They are data handling failures: unclear sources, missing metadata, overwritten files, silent transformations, or confusion between raw and processed material.
8.2 Keep raw data immutable
As a default rule:
- never edit raw data in place
- store a clear original copy
- document source, date, and acquisition method
- transform data into new files or tables
This one rule prevents a remarkable number of downstream problems.
8.3 Make file names work for you
A file name is a tiny interface. Good file names make sorting, searching, and collaboration easier.
Common qualities of robust file names:
- consistent date format such as
YYYY-MM-DD - descriptive but compact wording
- no ambiguous version labels like
final-final - machine-friendly separators
8.4 Metadata is part of the dataset
At minimum, preserve:
- source and licensing
- collection method
- unit definitions
- codebooks or schema notes
- processing assumptions
- quality concerns
If a dataset matters, its context matters too.
8.5 Storage tiers
As data volumes grow, think in tiers:
- active local storage for current work
- synced storage for portability and collaboration
- archival storage for retention and recovery
Not every dataset belongs in every tier. Large raw inputs, confidential records, and regenerated intermediates may need different treatment.
8.6 Versioning rules
Git works very well for text and small structured files. It is not a universal data versioning system. For larger datasets, you may need:
- release snapshots
- checksums
- cloud object storage versioning
- data package manifests
- external registries
Choose the lightest approach that still preserves trust.