Configuration#
Note
Scroll to the bottom of this document for the API reference.
The cpg-utils library (populationgenomics/cpg-utils, also published on PyPI)
contains a streamlined config management tool. It is used by most
production CPG workflows, but is useful in projects or scripts at any
scale.
This allows you to run the same code across multiple datasets, namespaces and even clouds without any change to your code. Configurations like this can make it tricky to work out exactly where parameters come from, so we recommend:
Putting the parameter on the CLI if the value is unique for each run
Putting the parameter in a config if it’s useful for many runs to have this value, and it changes predictably with the dataset.
And we discourage using environment variables to pass around information.
This configuration tool uses one or more TOML files, and creates a
dictionary of key-value attributes which can be accessed at any point,
without explicitly passing a configuration object. If jobs are set up
using analysis-runner, config will be set up automatically within
each job environment. Please see the end section of this document for
extra details on how to set up config outside analysis-runner.
Configs and the analysis-runner#
The analysis-runner is the entry point to analysis at the CPG, but its secondary role is to combine several configs together for your analysis.
This includes:
Storage configuration generated by the cpg-infrastructure
Selected configuration attributes (also from cpg-infrastructure)
The analysis-runner server combines all of these configs together.
You can generate an example config using the analysis-runner config
command:
analysis-runner config --help
# usage: config subparser [-h] --dataset DATASET -o OUTPUT_DIR [--access-level {test,standard,full}] [--image IMAGE] [--config CONFIG] [--config-output CONFIG_OUTPUT]
#
# options:
# -h, --help show this help message and exit
# --dataset DATASET The dataset name, which determines which analysis-runner server to send the request to.
# -o OUTPUT_DIR, --output-dir OUTPUT_DIR
# The output directory within the bucket. This should not contain a prefix like "gs://cpg-fewgenomes-main/".
# --access-level {test,standard,full}
# Which permissions to grant when running the job.
# --image IMAGE Image name, if using standard / full access levels, this must start with australia-southeast1-docker.pkg.dev/cpg-common/
# --config CONFIG Paths to a configurations in TOML format, which will be merged from left to right order (cloudpathlib.AnyPath-compatible paths are supported). The analysis-runner will add the default
# environment-related options to this dictionary and make it available to the batch.
# --config-output CONFIG_OUTPUT
# Output path to write the generated config to (in YAML)
TOML#
Tom’s Obvious, Minimal Language is a config
file format designed to be easily human readable and writeable, with
clear data structures. Sections are delineated using bracketed headings,
and key-value pairs are defined using = syntax, e.g.:
global_key = "value"
[heading_1]
name = "Luke Skywalker"
age = 53
[heading_1.subheading]
occupation = ["Jedi", "Hermit", "Force Ghost"]
will be digested into the dictionary:
{
'global_key': 'value',
'heading_1': {
'name': 'Luke Skywalker',
'age': 53,
'subheading': {
'occupation': ["Jedi", "Hermit", "Force Ghost"]
}
}
}
Config in Analysis-Runner jobs#
Analysis-runner incorporates a simple interface for config setting. When
setting off a job, the flag --config can be used, pointing to a
config file (local, or within GCP and accessible with current logged-in
credentials).
The --config flag can be used multiple times, which will cause the
argument files to be aggregated in the order
they are defined. When --config is set in this way, the job-runner
performs the following actions:
Locally (where analysis-runner is invoked), a merged configuration file is generated, creating a single dictionary
This dictionary is sent with the job definition to the execution server
The merged data is saved in TOML format to a GCP path
The env. variable CPG_CONFIG_PATH is set to this new TOML location
Within the driver image get_config() can be called safely with no further config setting
If batch jobs are run in containers, passing the environment variable to
those containers will allow the same configuration file to be used
throughout the Hail Batch. The cpg_utils.hail_batch.copy_common_env
function facilitates this environment duplication. Container
authentication is required to make the file path in GCP accessible.
Even without additional configurations, analysis-runner will insert infrastructure and run-specific attributes, e.g.
config_retrieve(['workflow', 'access_level']) e.g. test, or standard (get_access_level() gives the same result)
config_retrieve(['workflow', 'dataset']) e.g. tob-wgs, or acute-care
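To make the nested lookup concrete, here is a minimal sketch of walking a config dictionary along a list of keys. `retrieve_sketch` is a hypothetical stand-in written for this illustration; it is not the cpg-utils implementation, and it raises `KeyError` where the library is documented to raise `ConfigError`.

```python
from typing import Any, Union

_MISSING = object()  # sentinel, so that None can be passed as a real default


def retrieve_sketch(
    key: Union[list[str], str],
    config: dict[str, Any],
    default: Any = _MISSING,
) -> Any:
    """Walk a nested dict along `key` (hypothetical stand-in for config_retrieve)."""
    keys = [key] if isinstance(key, str) else key
    current: Any = config
    for part in keys:
        if isinstance(current, dict) and part in current:
            current = current[part]
        elif default is not _MISSING:
            return default  # fall back as soon as any level is missing
        else:
            raise KeyError(f"Key {part!r} not found")
    return current


# Example values taken from the text above.
example = {"workflow": {"access_level": "test", "dataset": "tob-wgs"}}
print(retrieve_sketch(["workflow", "access_level"], example))
print(retrieve_sketch(["workflow", "missing"], example, default=None))
```

The sentinel is what lets a caller distinguish "no default supplied" from "default is None".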
Config aggregation#
When passing the Analysis-runner multiple configs, the configs defined earlier are used as a base that is updated with values from configs defined later. New content is added, and content with the exact same key is updated/replaced, e.g.
Base file:
[file]
name = "first.toml"
[content]
square = 4
Second file:
[file]
name = "second.toml"
[content]
triangle = 3
Result:
[file]
name = "second.toml"
[content]
square = 4
triangle = 3
It’s important to note that the config files are loaded ‘left-to-right’, so when multiple configuration files are loaded, only the right-most value for any overlapping keys will be retained.
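The left-to-right behaviour described above can be sketched as a recursive dictionary update. `merge_sketch` is a hypothetical helper written for this illustration, assuming nested tables are merged key by key; it is not the function cpg-utils uses internally.

```python
from typing import Any


def merge_sketch(base: dict[str, Any], update: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict with `update` layered over `base`, recursing into tables."""
    merged = dict(base)
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides hold a table: merge their contents key by key.
            merged[key] = merge_sketch(merged[key], value)
        else:
            # New key, or a plain value: the later config wins.
            merged[key] = value
    return merged


# The two files from the example above, already parsed into dictionaries.
first = {"file": {"name": "first.toml"}, "content": {"square": 4}}
second = {"file": {"name": "second.toml"}, "content": {"triangle": 3}}

result = merge_sketch(first, second)
print(result)
```

Swapping the argument order would leave `name` as "first.toml", which is why the order of the --config flags matters.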
Reading config#
To use the cpg_utils.config functions, import get_config into
any code:
from cpg_utils.config import get_config
The first call to get_config sets the global config dictionary and
returns the content; subsequent calls just return the cached
dictionary. Using the merged result from the aggregation example:
assert get_config()['file']['name'] == 'second.toml'
Because configuration is loaded lazily, start-up overhead is minimal, but a file with invalid content will only cause a failure at the first get_config call rather than at start-up.
Config outside Analysis-Runner#
The config utility can be used outside analysis-runner and CPG
infrastructure, requiring the user to manually set the config file(s) to
be read. Configuration files can be set in two ways:
Set the CPG_CONFIG_PATH environment variable
Use set_config_paths to point to one or more config TOMLs:
from cpg_utils.config import set_config_paths
You can refer to the example configuration TOML in this repository and use it as a template.
API Reference#
Provides access to config variables.
- cpg_utils.config.append_config_paths(config_paths: list[str]) None[source]#
Append to the list of config paths. Values in the new configs take precedence over those from the existing CPG_CONFIG_PATH when the configs are merged.
- cpg_utils.config.config_retrieve(key: list[str] | str, default: Any | None = <class 'cpg_utils.config.UnsuppliedDefault'>, config: frozendict[str, ~typing.Any] | dict[str, ~typing.Any] | None=None) Any[source]#
Retrieve key from config, assuming nested key specified as a list of strings.
>> config_retrieve(['workflow', 'access_level'], config={'workflow': {'access_level': 'test'}})
'test'
>> config_retrieve(['workflow', 'access_level'], config={}, default='default')
'default'
>> config_retrieve('workflow', config={})
ConfigError("Key 'workflow' not found in {}")
>> config_retrieve(['key1', 'key2', 'key3'], config={'key1': {'key2': {}}})
ConfigError('Key "key3" not found in {} (path: key1 -> key2)')
Allow None as default value:
>> config_retrieve(['key1', 'key2', 'key3'], config={}, default=None) is None
True
- cpg_utils.config.cpg_test_dataset_path(suffix: str, category: str | None = None, dataset: str | None = None) str[source]#
CPG-specific method to get corresponding test paths when running from the main namespace.
- cpg_utils.config.dataset_path(suffix: str, category: str | None = None, dataset: str | None = None, test: bool = False) str[source]#
Returns a full path for the current dataset, given a category and a path suffix.
This is useful for specifying input files, as in contrast to the output_path function, dataset_path does _not_ take the workflow/output_prefix config variable into account.
Assumes a config structure like the one below, which is auto-generated by the analysis-runner:
```toml
[workflow]
access_level = "standard"

[storage.default]
default = "gs://thousand-genomes-main"
web = "gs://cpg-thousand-genomes-main-web"
analysis = "gs://cpg-thousand-genomes-main-analysis"
tmp = "gs://cpg-thousand-genomes-main-tmp"
web_url = "https://main-web.populationgenomics.org.au/thousand-genomes"

[storage.thousand-genomes]
default = "gs://cpg-thousand-genomes-main"
web = "gs://cpg-thousand-genomes-main-web"
analysis = "gs://cpg-thousand-genomes-main-analysis"
tmp = "gs://cpg-thousand-genomes-main-tmp"
web_url = "https://main-web.populationgenomics.org.au/thousand-genomes"
```
Examples
Assuming that the analysis-runner has been invoked with --dataset fewgenomes --access-level test:
> from cpg_utils.hail_batch import dataset_path
> dataset_path('1kg_densified/combined.mt')
'gs://cpg-fewgenomes-test/1kg_densified/combined.mt'
> dataset_path('1kg_densified/report.html', 'web')
'gs://cpg-fewgenomes-test-web/1kg_densified/report.html'
> dataset_path('1kg_densified/report.html', 'web', test=True)
'gs://cpg-fewgenomes-test-web/1kg_densified/report.html'
> dataset_path('1kg_densified/report.html', 'web_url')
'https://main-web.populationgenomics.org.au/fewgenomes/1kg_densified/report.html'
Notes
If you specify test=True, the workflow/access_level config variable is required
- Parameters:
suffix (str) – A path suffix to append to the bucket.
category (str, optional) – A category like "tmp", "web", etc., defaults to "default" if omitted.
dataset (str, optional) – Dataset name, takes precedence over the workflow/dataset config variable
test (bool) – Return “test” namespace version of the path
- Return type:
str
- cpg_utils.config.get_config(print_config: bool = False) frozendict[str, Any][source]#
Returns the configuration dictionary. Consider using config_retrieve(keys) instead.
Call set_config_paths beforehand to override the default path. See read_configs for the path value semantics.
Notes
Caches the result based on the config paths alone.
- Return type:
dict
- cpg_utils.config.get_config_paths() list[str][source]#
Returns the config paths that are used by subsequent calls to get_config.
If set_config_paths has not been called, the value of the CPG_CONFIG_PATH environment variable is used instead.
- Return type:
list[str]
- cpg_utils.config.get_cpg_namespace(access_level: str | None = None) str[source]#
Get storage namespace from the access level.
- cpg_utils.config.get_gcloud_set_project(gcp_project: str | None = None) str[source]#
Get the gcloud command to set the project.
- cpg_utils.config.image_path(key: str, version: str | list[str] | None = None, repository: str | None = None) str[source]#
Returns a path to a container image for the given key (i.e., image name) and version.
Examples
>> image_path('bcftools', '1.16-1')
'australia-southeast1-docker.pkg.dev/cpg-common/images/bcftools:1.16-1'
- Parameters:
key (str) – Specifies the image name. When version is not specified: Describes the key within the images config section. Can list sections separated with ‘/’.
version (str or list[str], optional) – Specifies the desired image version, e.g., ‘1.18-1’, either directly as a version number string or indirectly via a config key list which will be used to retrieve a version number string via config_retrieve.
repository (str, optional) – The suffix (e.g., ‘dev’ for images-dev) of an artifact registry repository to be used instead of the default production images repository.
Using image_path(key) without giving version is deprecated. In future, specifying it will be required.
- Return type:
str
- cpg_utils.config.output_path(suffix: str, category: str | None = None, dataset: str | None = None, test: bool = False) str[source]#
Returns a full path for the given category and path suffix.
In contrast to the dataset_path function, output_path takes the workflow/output_prefix config variable into account.
Examples
If using the analysis-runner, the workflow/output_prefix would be set to the value provided using the --output argument, e.g.:
analysis-runner --dataset fewgenomes --access-level test --output 1kg_pca/v42 ...
will use '1kg_pca/v42' as the base path to build upon in this method:
> from cpg_utils.hail_batch import output_path
> output_path('loadings.ht')
'gs://cpg-fewgenomes-test/1kg_pca/v42/loadings.ht'
> output_path('report.html', 'web')
'gs://cpg-fewgenomes-test-web/1kg_pca/v42/report.html'
Notes
Requires the workflow/output_prefix config variable to be set, in addition to the requirements for dataset_path.
- Parameters:
suffix (str) – A path suffix to append to the bucket + output directory.
category (str, optional) – A category like "tmp", "web", etc., defaults to "default" if omitted.
dataset (str, optional) – Dataset name, takes precedence over the workflow/dataset config variable
test (bool, optional) – If True, generate a test bucket path. Defaults to False.
- Return type:
str
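The difference between dataset_path and output_path can be sketched with plain dictionary lookups. The two functions below are hypothetical stand-ins written for this illustration; they assume the bucket is read from the storage.default section and joined with "/", and they ignore the dataset and test parameters. They are not the cpg-utils implementation.

```python
from typing import Any, Optional

# Assumed config for --dataset fewgenomes --access-level test --output 1kg_pca/v42
EXAMPLE_CONFIG: dict[str, Any] = {
    "workflow": {
        "dataset": "fewgenomes",
        "access_level": "test",
        "output_prefix": "1kg_pca/v42",
    },
    "storage": {
        "default": {
            "default": "gs://cpg-fewgenomes-test",
            "web": "gs://cpg-fewgenomes-test-web",
        },
    },
}


def dataset_path_sketch(
    config: dict[str, Any], suffix: str, category: Optional[str] = None
) -> str:
    """Join the bucket for `category` with `suffix`; output_prefix is ignored."""
    bucket = config["storage"]["default"][category or "default"]
    return f"{bucket}/{suffix}"


def output_path_sketch(
    config: dict[str, Any], suffix: str, category: Optional[str] = None
) -> str:
    """Same as above, but with workflow/output_prefix inserted before `suffix`."""
    prefix = config["workflow"]["output_prefix"].strip("/")
    return dataset_path_sketch(config, f"{prefix}/{suffix}", category)


print(dataset_path_sketch(EXAMPLE_CONFIG, "1kg_densified/combined.mt"))
print(output_path_sketch(EXAMPLE_CONFIG, "loadings.ht"))
print(output_path_sketch(EXAMPLE_CONFIG, "report.html", "web"))
```

Under these assumptions the printed paths match the documented examples, which shows that the only difference between the two functions is the output prefix.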
- cpg_utils.config.prepend_config_paths(config_paths: list[str]) None[source]#
Prepend to the list of config paths. Analogous to dict.setdefault: values in the current CPG_CONFIG_PATH take precedence over the provided config_paths when the configs are merged.
- cpg_utils.config.read_configs(config_paths: list[str]) frozendict[str, Any][source]#
Creates a merged configuration from the given config paths. This does NOT affect the state used by get_config.
For a list of configurations (e.g. [‘base.toml’, ‘override.toml’]), the configurations get applied from left to right. I.e. the first config gets updated by values of the second config, etc.
- Return type:
dict
- cpg_utils.config.reference_path(key: str) str[source]#
Returns a path to a reference resource using key in config’s “references” section.
Examples
>> reference_path('vep_mount')
'gs://cpg-common-main/references/vep/105.0/mount'
>> reference_path('broad/genome_calling_interval_lists')
'gs://cpg-common-main/references/hg38/v0/wgs_calling_regions.hg38.interval_list'
Assuming config structure as follows:
```toml
[references]
vep_mount = 'gs://cpg-common-main/references/vep/105.0/mount'

[references.broad]
genome_calling_interval_lists = 'gs://cpg-common-main/references/hg38/v0/wgs_calling_regions.hg38.interval_list'
```
- Parameters:
key (str) – Describes the key within the references config section. Can list sections separated with ‘/’.
- Return type:
str
- cpg_utils.config.set_config_paths(config_paths: list[str]) None[source]#
Sets the config paths that are used by subsequent calls to get_config.
If this isn’t called, the value of the CPG_CONFIG_PATH environment variable is used instead.
- Parameters:
config_paths (list[str]) – A list of cloudpathlib-compatible paths to TOML files containing configurations.
- cpg_utils.config.try_get_ar_guid()[source]#
Attempts to get the AR GUID from the environment.
This is a fallback for when the AR GUID is not available in the config.