Configuration#

Note

Scroll to the bottom of this document for the API reference.

The cpg-utils library (populationgenomics/cpg-utils, also on PyPI) contains a streamlined config management tool. Most production CPG workflows use it, but it is useful in projects or scripts at any scale.

This allows you to run the same code across multiple datasets, namespaces and even clouds without changing it. Because configuration like this can make it tricky to work out exactly where parameters come from, we recommend:

  • Putting the parameter on the CLI if the value is unique for each run

  • Putting the parameter in a config if it’s useful for many runs to have this value, and it changes predictably with the dataset.

  • Avoiding environment variables as a way to pass information around.

This configuration tool reads one or more TOML files and creates a dictionary of key-value attributes which can be accessed at any point, without explicitly passing a configuration object. If jobs are set up using analysis-runner, config will be set up automatically within each job environment. See the "Config outside Analysis-Runner" section of this document for details on how to set up config outside analysis-runner.

Configs and the analysis-runner#

The analysis-runner is the entry point to analysis at the CPG, but its secondary role is to combine a set of configs for your analysis.

This includes:

The analysis-runner server combines all of these configs together.

You can generate an example config using the analysis-runner config command:

analysis-runner config --help

# usage: config subparser [-h] --dataset DATASET -o OUTPUT_DIR [--access-level {test,standard,full}] [--image IMAGE] [--config CONFIG] [--config-output CONFIG_OUTPUT]
#
# options:
#   -h, --help            show this help message and exit
#   --dataset DATASET     The dataset name, which determines which analysis-runner server to send the request to.
#   -o OUTPUT_DIR, --output-dir OUTPUT_DIR
#                         The output directory within the bucket. This should not contain a prefix like "gs://cpg-fewgenomes-main/".
#   --access-level {test,standard,full}
#                         Which permissions to grant when running the job.
#   --image IMAGE         Image name, if using standard / full access levels, this must start with australia-southeast1-docker.pkg.dev/cpg-common/
#   --config CONFIG       Paths to a configurations in TOML format, which will be merged from left to right order (cloudpathlib.AnyPath-compatible paths are supported). The analysis-runner will add the default
#                         environment-related options to this dictionary and make it available to the batch.
#   --config-output CONFIG_OUTPUT
#                         Output path to write the generated config to (in YAML)

TOML#

Tom’s Obvious, Minimal Language (TOML) is a config file format designed to be easy for humans to read and write, with clear data structures. Sections are delineated using bracketed headings, and key-value pairs are defined using = syntax, e.g.:

global_key = "value"

[heading_1]
name = "Luke Skywalker"
age = 53

[heading_1.subheading]
occupation = ["Jedi", "Hermit", "Force Ghost"]

will be digested into the dictionary:

{
    'global_key': 'value',
    'heading_1': {
        'name': 'Luke Skywalker',
        'age': 53,
        'subheading': {
            'occupation': ["Jedi", "Hermit", "Force Ghost"]
        }
    }
}

Config in Analysis-Runner jobs#

Analysis-runner incorporates a simple interface for config setting. When setting off a job, the flag --config can be used, pointing to a config file (local, or within GCP and accessible with current logged-in credentials).

The --config flag can be used multiple times, in which case the given files are aggregated in the order they are defined. When --config is set in this way, analysis-runner performs the following actions:

  1. Locally (where analysis-runner is invoked), a merged configuration file is generated, creating a single dictionary

  2. This dictionary is sent with the job definition to the execution server

  3. The merged data is saved in TOML format to a GCP path

  4. The environment variable CPG_CONFIG_PATH is set to this new TOML location

  5. Within the driver image get_config() can be called safely with no further config setting

If batch jobs are run in containers, passing the environment variable to those containers will allow the same configuration file to be used throughout the Hail Batch. The cpg_utils.hail_batch.copy_common_env method facilitates this environment duplication. Container authentication is required to make the file path in GCP accessible.

Even without additional configurations, analysis-runner will insert infrastructure and run-specific attributes, e.g.

  • config_retrieve(['workflow', 'access_level']), e.g. test or standard (get_access_level() gives the same result)

  • config_retrieve(['workflow', 'dataset']) e.g. tob-wgs, or acute-care

Config aggregation#

When multiple configs are passed to analysis-runner, the configs defined earlier are used as a base that is updated with values from configs defined later. New content is added, and content with the exact same key is replaced, e.g.

Base file:

[file]
name = "first.toml"
[content]
square = 4

Second file:

[file]
name = "second.toml"
[content]
triangle = 3

Result:

[file]
name = "second.toml"
[content]
square = 4
triangle = 3

It’s important to note that the config files are loaded ‘left-to-right’, so when multiple configuration files are loaded, only the right-most value for any overlapping keys will be retained.
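The merge behaviour described above can be sketched in a few lines of plain Python. This is an illustrative re-implementation under the stated rules (later values win, nested sections are merged key by key), not the library's own code; `merge_configs` is a hypothetical name.

```python
from copy import deepcopy


def merge_configs(base: dict, override: dict) -> dict:
    """Return a new dict where values from `override` replace or extend `base`."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides hold a section: merge it key by key.
            merged[key] = merge_configs(merged[key], value)
        else:
            # New key, or a plain value: the later config wins.
            merged[key] = deepcopy(value)
    return merged


# The two files from the example above, already parsed into dictionaries.
first = {"file": {"name": "first.toml"}, "content": {"square": 4}}
second = {"file": {"name": "second.toml"}, "content": {"triangle": 3}}

merged = merge_configs(first, second)
print(merged)
# {'file': {'name': 'second.toml'}, 'content': {'square': 4, 'triangle': 3}}
```

Swapping the argument order would keep `first.toml` as the file name, which is why the order of the --config flags matters.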

Reading config#

To use the cpg_utils.config functions, import get_config into any code:

from cpg_utils.config import get_config

The first call to get_config loads the global config dictionary and returns its content. Subsequent calls return the same config dictionary without reloading it.

assert get_config()['file']['name'] == 'second.toml'

Because configuration is loaded lazily, start-up overhead is minimal, but this can result in late failures if files with invalid content are specified.

Config outside Analysis-Runner#

The config utility can be used outside analysis-runner and CPG infrastructure, requiring the user to manually set the config file(s) to be read. Configuration files can be set in two ways:

  1. Set the CPG_CONFIG_PATH environment variable

  2. Use set_config_paths to point to one or more config TOMLs:

    from cpg_utils.config import set_config_paths
    set_config_paths(['base.toml', 'override.toml'])  # hypothetical paths to your own TOML files

You can refer to the example configuration TOML in this repository and use it as a template.


API Reference#

Provides access to config variables.

exception cpg_utils.config.ConfigError[source]#

Error retrieving keys from config.

class cpg_utils.config.UnsuppliedDefault[source]#

cpg_utils.config.append_config_paths(config_paths: list[str]) None[source]#

Append to the list of config paths. Any values in the new configs take precedence over those from the existing CPG_CONFIG_PATH when the configs are merged.

cpg_utils.config.config_retrieve(key: list[str] | str, default: Any | None = <class 'cpg_utils.config.UnsuppliedDefault'>, config: frozendict[str, Any] | dict[str, Any] | None = None) Any[source]#

Retrieve key from config, assuming nested key specified as a list of strings.

>>> config_retrieve(['workflow', 'access_level'], config={'workflow': {'access_level': 'test'}})
'test'

>>> config_retrieve(['workflow', 'access_level'], config={}, default='default')
'default'

>>> config_retrieve('workflow', config={})
ConfigError("Key 'workflow' not found in {}")

>>> config_retrieve(['key1', 'key2', 'key3'], config={'key1': {'key2': {}}})
ConfigError('Key "key3" not found in {} (path: key1 -> key2)')

Allow None as default value:

>>> config_retrieve(['key1', 'key2', 'key3'], config={}, default=None) is None
True
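The reason the default is a dedicated class rather than None is that None has to remain usable as a real default value. The following standalone sketch shows that sentinel pattern with a nested lookup. The names `_Unset` and `retrieve` are hypothetical, and the error type and message are simplified compared with the library's ConfigError.

```python
from typing import Any


class _Unset:
    """Sentinel type: means 'no default was given', which differs from default=None."""


def retrieve(key: "list[str] | str", config: dict, default: Any = _Unset) -> Any:
    """Walk a nested dict along `key`; fall back to `default` only if one was given."""
    keys = [key] if isinstance(key, str) else list(key)
    current: Any = config
    for depth, part in enumerate(keys):
        if not isinstance(current, dict) or part not in current:
            if default is _Unset:
                path = " -> ".join(keys[:depth])
                raise KeyError(f"Key {part!r} not found (path: {path})")
            return default
        current = current[part]
    return current


nested = {"workflow": {"access_level": "test"}}
print(retrieve(["workflow", "access_level"], nested))
print(retrieve(["workflow", "dataset"], nested, default=None))
```

With this pattern, a missing key raises an error unless the caller explicitly opts into a fallback, including a fallback of None.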

cpg_utils.config.cpg_test_dataset_path(suffix: str, category: str | None = None, dataset: str | None = None) str[source]#

CPG-specific method to get corresponding test paths when running from the main namespace.

cpg_utils.config.dataset_path(suffix: str, category: str | None = None, dataset: str | None = None, test: bool = False) str[source]#

Returns a full path for the current dataset, given a category and a path suffix.

This is useful for specifying input files, as in contrast to the output_path function, dataset_path does _not_ take the workflow/output_prefix config variable into account.

Assumes the config structure like below, which is auto-generated by the analysis-runner:

```toml
[workflow]
access_level = "standard"

[storage.default]
default = "gs://thousand-genomes-main"
web = "gs://cpg-thousand-genomes-main-web"
analysis = "gs://cpg-thousand-genomes-main-analysis"
tmp = "gs://cpg-thousand-genomes-main-tmp"
web_url = "https://main-web.populationgenomics.org.au/thousand-genomes"

[storage.thousand-genomes]
default = "gs://cpg-thousand-genomes-main"
web = "gs://cpg-thousand-genomes-main-web"
analysis = "gs://cpg-thousand-genomes-main-analysis"
tmp = "gs://cpg-thousand-genomes-main-tmp"
web_url = "https://main-web.populationgenomics.org.au/thousand-genomes"
```

Examples

Assuming that the analysis-runner has been invoked with --dataset fewgenomes --access-level test:

>>> from cpg_utils.hail_batch import dataset_path
>>> dataset_path('1kg_densified/combined.mt')
'gs://cpg-fewgenomes-test/1kg_densified/combined.mt'
>>> dataset_path('1kg_densified/report.html', 'web')
'gs://cpg-fewgenomes-test-web/1kg_densified/report.html'
>>> dataset_path('1kg_densified/report.html', 'web', test=True)
'gs://cpg-fewgenomes-test-web/1kg_densified/report.html'
>>> dataset_path('1kg_densified/report.html', 'web_url')
'https://main-web.populationgenomics.org.au/fewgenomes/1kg_densified/report.html'

Notes

  • If you specify test=True, the workflow/access_level config variable is required

Parameters:
  • suffix (str) – A path suffix to append to the bucket.

  • category (str, optional) – A category like “tmp”, “web”, etc., defaults to “default” if omitted.

  • dataset (str, optional) – Dataset name, takes precedence over the workflow/dataset config variable

  • test (bool) – Return “test” namespace version of the path

Return type:

str

cpg_utils.config.genome_build() str[source]#

Return the default genome build name

cpg_utils.config.get_access_level() str[source]#

Get access level from the config.

cpg_utils.config.get_config(print_config: bool = False) frozendict[str, Any][source]#

Returns the configuration dictionary. Consider using config_retrieve(keys) instead.

Call set_config_paths beforehand to override the default path. See read_configs for the path value semantics.

Notes

Caches the result based on the config paths alone.

Return type:

dict

cpg_utils.config.get_config_paths() list[str][source]#

Returns the config paths that are used by subsequent calls to get_config.

If set_config_paths has not been called, the value of the CPG_CONFIG_PATH environment variable is used instead.

Return type:

list[str]

cpg_utils.config.get_cpg_namespace(access_level: str | None = None) str[source]#

Get storage namespace from the access level.

cpg_utils.config.get_driver_image() str[source]#

Get the driver image from the config.

cpg_utils.config.get_gcloud_set_project(gcp_project: str | None = None) str[source]#

Get the gcloud command to set the project.

cpg_utils.config.get_gcp_project() str[source]#

cpg_utils.config.image_path(key: str, version: str | list[str] | None = None, repository: str | None = None) str[source]#

Returns a path to a container image for the given key (i.e., image name) and version.

Examples

>>> image_path('bcftools', '1.16-1')
'australia-southeast1-docker.pkg.dev/cpg-common/images/bcftools:1.16-1'

Parameters:
  • key (str) – Specifies the image name. When version is not specified: Describes the key within the images config section. Can list sections separated with ‘/’.

  • version (str or list[str], optional) – Specifies the desired image version, e.g., ‘1.18-1’, either directly as a version number string or indirectly via a config key list which will be used to retrieve a version number string via config_retrieve.

  • repository (str, optional) – The suffix (e.g., ‘dev’ for images-dev) of an artifact registry repository to be used instead of the default production images repository.

Notes

Using image_path(key) without giving version is deprecated. In future, specifying it will be required.

Return type:

str

cpg_utils.config.output_path(suffix: str, category: str | None = None, dataset: str | None = None, test: bool = False) str[source]#

Returns a full path for the given category and path suffix.

In contrast to the dataset_path function, output_path takes the workflow/output_prefix config variable into account.

Examples

If using the analysis-runner, the workflow/output_prefix would be set to the value provided using the --output argument, e.g.:

analysis-runner --dataset fewgenomes --access-level test --output 1kg_pca/v42 ...

will use '1kg_pca/v42' as the base path to build upon in this method:

>>> from cpg_utils.hail_batch import output_path
>>> output_path('loadings.ht')
'gs://cpg-fewgenomes-test/1kg_pca/v42/loadings.ht'
>>> output_path('report.html', 'web')
'gs://cpg-fewgenomes-test-web/1kg_pca/v42/report.html'

Notes

Requires the workflow/output_prefix config variable to be set, in addition to the requirements for dataset_path.

Parameters:
  • suffix (str) – A path suffix to append to the bucket + output directory.

  • category (str, optional) – A category like “tmp”, “web”, etc., defaults to “default” if omitted.

  • dataset (str, optional) – Dataset name, takes precedence over the workflow/dataset config variable

  • test (bool, optional) – If True, generate a test bucket path. Defaults to False.

Return type:

str
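The difference between dataset_path and output_path can be illustrated with a simplified sketch. `build_path` is a hypothetical helper, the config values are made up to match the examples above, and the real functions handle more cases (the dataset and test arguments in particular), so treat this only as a picture of where workflow/output_prefix ends up in the path.

```python
def build_path(config: dict, suffix: str, category: str = "default",
               with_output_prefix: bool = False) -> str:
    """Join bucket, optional output prefix and suffix into one path."""
    # Look up the bucket (or base URL) for the requested category.
    bucket = config["storage"]["default"][category]
    parts = [bucket.rstrip("/")]
    if with_output_prefix:
        # Only the output_path-style variant inserts the output prefix.
        parts.append(config["workflow"]["output_prefix"].strip("/"))
    parts.append(suffix.lstrip("/"))
    return "/".join(parts)


# Assumed config, shaped like the auto-generated one described above.
config = {
    "workflow": {"output_prefix": "1kg_pca/v42"},
    "storage": {
        "default": {
            "default": "gs://cpg-fewgenomes-test",
            "web": "gs://cpg-fewgenomes-test-web",
        }
    },
}

# dataset_path-like: bucket + suffix.
print(build_path(config, "1kg_densified/combined.mt"))
# output_path-like: bucket + output prefix + suffix.
print(build_path(config, "loadings.ht", with_output_prefix=True))
print(build_path(config, "report.html", "web", with_output_prefix=True))
```

Inputs that already exist are usually addressed without the prefix, while results of the current run are written under it, so that separate runs do not overwrite each other.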

cpg_utils.config.prepend_config_paths(config_paths: list[str]) None[source]#

Prepend to the list of config paths. Analogous to dict.setdefault: any values in the current CPG_CONFIG_PATH take precedence over those in the provided config_paths when the configs are merged.

cpg_utils.config.read_configs(config_paths: list[str]) frozendict[str, Any][source]#

Creates a merged configuration from the given config paths. This does not affect the state used by get_config.

For a list of configurations (e.g. [‘base.toml’, ‘override.toml’]), the configurations get applied from left to right. I.e. the first config gets updated by values of the second config, etc.

Return type:

dict

cpg_utils.config.reference_path(key: str) str[source]#

Returns a path to a reference resource using key in config’s “references” section.

Examples

>>> reference_path('vep_mount')
'gs://cpg-common-main/references/vep/105.0/mount'
>>> reference_path('broad/genome_calling_interval_lists')
'gs://cpg-common-main/references/hg38/v0/wgs_calling_regions.hg38.interval_list'

Assuming config structure as follows:

[references]
vep_mount = 'gs://cpg-common-main/references/vep/105.0/mount'

[references.broad]
genome_calling_interval_lists = 'gs://cpg-common-main/references/hg38/v0/wgs_calling_regions.hg38.interval_list'

Parameters:

key (str) – Describes the key within the references config section. Can list sections separated with ‘/’.

Return type:

str
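How a '/'-separated key maps onto nested sections can be shown with a short standalone sketch. `resolve_reference` is a hypothetical helper written for illustration; it is not the library's implementation and skips its error handling.

```python
def resolve_reference(section: dict, key: str) -> str:
    """Follow a '/'-separated key through nested sections."""
    current = section
    for part in key.split("/"):
        current = current[part]
    return current


# The [references] section from the config above, as a parsed dictionary.
references = {
    "vep_mount": "gs://cpg-common-main/references/vep/105.0/mount",
    "broad": {
        "genome_calling_interval_lists": (
            "gs://cpg-common-main/references/hg38/v0/wgs_calling_regions.hg38.interval_list"
        ),
    },
}

print(resolve_reference(references, "vep_mount"))
print(resolve_reference(references, "broad/genome_calling_interval_lists"))
```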

cpg_utils.config.set_config_paths(config_paths: list[str]) None[source]#

Sets the config paths that are used by subsequent calls to get_config.

If this isn’t called, the value of the CPG_CONFIG_PATH environment variable is used instead.

Parameters:

config_paths (list[str]) – A list of cloudpathlib-compatible paths to TOML files containing configurations.

cpg_utils.config.try_get_ar_guid()[source]#

Attempts to get the AR GUID from the environment.

This is a fallback for when the AR GUID is not available in the config.

cpg_utils.config.update_dict(d1: dict, d2: dict) dict[source]#

Recursively updates the d1 dict in place with the values from the d2 dict. Returns a reference to d1.

>>> update_dict({'a': 1, 'b': {'c': 1}}, {'b': {'c': 2, 'd': 2}})
{'a': 1, 'b': {'c': 2, 'd': 2}}

cpg_utils.config.web_url(suffix: str = '', dataset: str | None = None, test: bool = False) str[source]#

Web URL to match the dataset_path of category ‘web_url’.