The case against data versioning


A technique recently promoted to support reproducibility in data analysis is data versioning: some (or all) of the intermediate files generated by the pipeline are saved and tagged so we can come back to them at any time. But I think data versioning is actually harmful for reproducibility.

Reproducibility is defined as the “ability of a researcher to duplicate the results of a prior study using the same materials as were used by the original investigator. Reproducibility is a minimum necessary condition for a finding to be believable and informative” (Source). The key term here is materials. The only materials in a data pipeline are the raw data and the code; all other artifacts are byproducts and should not count as materials.

We can test for reproducibility by answering the following question: given the same raw data and code, do we get the same results? Using intermediate results to claim reproducibility is cheating, since it overlooks all the previous computations that produced those interim results.

Reproducibility can only be achieved by construction; it is not a feature you add to your pipeline. The bad news is that you cannot do pip install reproducibility; the only way to achieve it is through better software engineering practices. The good news is that verifying reproducibility is trivial.

Verifying reproducibility

Verifying that a data pipeline is reproducible is as simple as passing in the raw data and comparing the pipeline’s output against the claimed final output:

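# `pipeline` stands for the module that exposes your pipeline's entry point
import pipeline

# run end-to-end, starting from the raw data only
result_final = pipeline.run('/path/to/raw/data.csv')
# the final output claimed by the original analysis
result_expected = 42

assert result_final == result_expected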

Saving intermediate results is useful (for other purposes)

Data pipelines are built from steps that run one after the other. When the final output is unexpected, stored intermediate results make the pipeline more transparent: we can inspect them and identify which step went wrong.

Intermediate results are also useful for avoiding redundant computation. Pipelines usually take a long time to run, so doing an end-to-end run every time we make a small change is wasteful. Selectively running only the steps affected by a change should lead to the same result as executing the pipeline end-to-end.
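
To make the idea of selective execution concrete, here is a minimal sketch. It assumes a toy two-step pipeline and an in-memory cache; the names run_step, run_pipeline and the version strings (which stand in for a hash of each step's source code) are hypothetical and not part of any particular tool.

```python
import hashlib
import json

# Hypothetical in-memory store: fingerprint -> output of a step.
cache = {}
# Names of the steps that actually executed, in order.
executed = []


def fingerprint(step_name, code_version, upstream):
    """Hash everything that determines a step's output: its name,
    its code version and the data it receives."""
    payload = json.dumps([step_name, code_version, upstream], sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def run_step(step_name, code_version, func, upstream):
    """Run func(upstream) unless an identical run is already cached."""
    key = fingerprint(step_name, code_version, upstream)
    if key not in cache:
        executed.append(step_name)
        cache[key] = func(upstream)
    return cache[key]


def drop_missing(rows):
    """Toy cleaning step: remove missing values."""
    return [r for r in rows if r is not None]


def run_pipeline(raw, clean_version="v1", total_version="v1"):
    """Toy pipeline: clean the raw rows, then sum them."""
    cleaned = run_step("clean", clean_version, drop_missing, raw)
    return run_step("total", total_version, sum, cleaned)


raw = [1, None, 2, 3]
first = run_pipeline(raw)                       # executes both steps
second = run_pipeline(raw, total_version="v2")  # executes only "total" again
```

After the second call, executed shows that "clean" ran once while "total" ran twice: the cached intermediate result saved work, yet the final output still derives entirely from the raw data and the code.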

Versioning your final output

There is one piece of your pipeline that you can version: the final output. If you want to automate reproducibility verification, then you might want to store the final output to compare with the pipeline’s output whenever a change is introduced. If results do not match, your pipeline is no longer reproducible.
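
One possible way to automate this check is to store a fingerprint of the final output and compare against it after every change. The sketch below assumes the final output is JSON-serializable; the function names and the reference file name are made up for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def digest(result):
    """Fingerprint a JSON-serializable final output."""
    encoded = json.dumps(result, sort_keys=True).encode()
    return hashlib.sha256(encoded).hexdigest()


def verify(result, reference_path):
    """Return True if result matches the stored reference.
    On the first run there is no reference yet, so it is stored."""
    reference_path = Path(reference_path)
    current = digest(result)
    if not reference_path.exists():
        reference_path.write_text(current)
        return True
    return reference_path.read_text() == current


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical location of the versioned final output.
    reference = Path(tmp) / "final_output.sha256"
    first_run = verify(42, reference)  # stores the reference
    unchanged = verify(42, reference)  # same output as the reference
    changed = verify(43, reference)    # output differs from the reference
```

An exact digest suits integer or text outputs. For floating-point results, a comparison with a tolerance may be a better fit.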
