Reproducible analyses

We’d like to be able to reproduce analysis results easily. While that’s not always fully achievable, the strategies below help significantly. Note that this is only about analyses run on production data, not the development and testing of new code.

  1. Only run code that has been committed to a repository.

    If locally changed code is used to produce analysis results, those results can’t be reproduced if the local changes get lost. To avoid this problem, the analysis runner runs code from GitHub repositories only.

  2. Link the output data to the exact program invocation that generated it.

    Even if the code is committed to a repository, it’s usually not obvious how a derived dataset was generated. For every invocation, the analysis runner stores a record in metamist/analysis-runner that includes an ar-guid, the parameters used, and the suggested output directory. NB: An analysis can still write files outside this directory.

    Note that, as of early 2023, metamist also stores an audit log of metadata changes, tagged with the ar-guid when one is available.

  3. Pin all dependencies of your code to a specific version.

    Docker images, pip, and many other systems let you specify version dependencies. Avoid using “latest”, minimum version requirements or version ranges. Instead, always specify an exact version.

    We currently rely on code reviews to spot these issues. Ideally, we’d have a linter that checks the common cases.

  4. Treat output data as immutable.

    If you overwrite (or delete) outputs, any data that has previously been derived from those outputs gets implicitly invalidated. By default, our storage policies therefore strongly encourage only creating new data rather than changing existing outputs.
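
Point 3 mentions that a linter for unpinned dependencies would be useful. As a rough illustration only, the sketch below shows what a check for the common cases might look like. It is not part of the analysis runner; the function names `unpinned_requirements` and `unpinned_image` are hypothetical, and the check deliberately covers only simple `requirements.txt` lines and Docker image references.

```python
import re

# Assumption: a requirement counts as "pinned" only if it has the form
# name==version, optionally with extras (name[extra]==version) and an
# environment marker after ";". Wildcards such as 1.2.* are rejected.
PINNED = re.compile(
    r"^[A-Za-z0-9._-]+(\[[A-Za-z0-9._,\s-]+\])?\s*==\s*[^\s*=<>!~,;]+(\s*;.*)?$"
)


def unpinned_requirements(text):
    """Return the requirement lines that are not pinned to one exact version."""
    problems = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not PINNED.match(line):
            problems.append(line)
    return problems


def unpinned_image(image):
    """Return True if a Docker image reference is untagged or tagged 'latest'."""
    if "@sha256:" in image:  # pinned by digest
        return False
    last = image.rsplit("/", 1)[-1]  # ignore a registry host:port prefix
    if ":" not in last:  # no tag means an implicit 'latest'
        return True
    return last.rsplit(":", 1)[1] == "latest"


if __name__ == "__main__":
    example = "numpy==1.26.4\npandas>=2.0\nrequests\n"
    for line in unpinned_requirements(example):
        print(f"not pinned: {line}")
    for image in ("python", "python:3.10.12"):
        if unpinned_image(image):
            print(f"image not pinned: {image}")
```

A check like this could run in CI alongside code review. It only flags a missing or loose version specifier, so it does not replace reviewing which versions are actually pinned.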