CPG Infrastructure Developer Guide#

Our CPG Infrastructure is our way of defining resources to make our analysis possible. This guide is intended for developers who want to contribute to the CPG Infrastructure. See the CPG Infrastructure User Guide for information on how to use the CPG Infrastructure.

Repository structure#

The core of our infrastructure is defined in cpg-infrastructure, and is made up of 3 levels.

  1. CPGInfrastructure:
    • Infrastructure defined ONCE CPG wide, so all datasets can reuse them.

    • Manages group membership, see below for more details about that.

    • Coordinates the level below.

  2. CPGDatasetInfrastructure
    • Infrastructure defined ONCE for a specific dataset.

    • Pretty light, mostly the metamist project.

  3. CPGDatasetCloudInfrastructure
    • All the resources for a dataset, for each specific cloud provider.
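
The nesting of the three levels can be pictured with a minimal sketch. The class names `OrgInfra`, `DatasetInfra` and `DatasetCloudInfra`, their fields and the `deploy` method are hypothetical stand-ins chosen for illustration; they are not the real constructors or methods of the cpg-infrastructure classes.

```python
from dataclasses import dataclass


@dataclass
class DatasetCloudInfra:
    """Stand-in for level 3 (CPGDatasetCloudInfrastructure): one dataset on one cloud."""

    dataset: str
    cloud: str

    def deploy(self) -> str:
        # A real implementation would declare Pulumi resources here.
        return f"{self.dataset}/{self.cloud}"


@dataclass
class DatasetInfra:
    """Stand-in for level 2 (CPGDatasetInfrastructure): defined once per dataset."""

    dataset: str
    clouds: list[str]

    def deploy(self) -> list[str]:
        # Fan out to one cloud-specific object per configured cloud.
        return [DatasetCloudInfra(self.dataset, cloud).deploy() for cloud in self.clouds]


@dataclass
class OrgInfra:
    """Stand-in for level 1 (CPGInfrastructure): defined once, coordinates the level below."""

    datasets: dict[str, list[str]]  # dataset name -> clouds to deploy to

    def deploy(self) -> list[str]:
        deployed: list[str] = []
        for name, clouds in self.datasets.items():
            deployed.extend(DatasetInfra(name, clouds).deploy())
        return deployed
```

For example, `OrgInfra({"dataset-1": ["gcp"]}).deploy()` walks from the organisation level down to a single dataset/cloud pair.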

Configuration#

The CPG configures our infrastructure in a private repository, which is structured like:

  • config-infrastructure.yaml:
    • Contains general configuration with org constants and cloud provider details.

  • config-datasets.yaml:
    • Contains dataset specific configuration, like the dataset name, and the cloud environments to use for each dataset.

    • This matches the config.

    dataset:
      # other datasets this one needs to READ but not write
      depends_on_readonly:
        - dataset-1
        - dataset-2
      # other datasets this one needs to READ + WRITE
      depends_on:
        - dataset-3
      enable_release: true
      enable_shared_project: true
      # read the user.rst docs on these params
      is_internal_dataset: true
      gcp:
        project: dataset-gcp-project-id
    
  • users.yaml
    • Contains the users and their roles in the CPG infrastructure.

    • This is used to manage group memberships.

    firstname.lastname:
      can_access_internal_dataset_logs: true
      gcp:
        hail_batch_username: hailbatchusername
        # an email recognised by google identity, to add to google groups
        id: firstname.lastname@populationgenomics.org.au
    
  • projects.yaml
    • A list of legal projects / protocols that we use to double check people have access to the datasets. This isn’t used in the infra code directly, it’s a process check on top which blocks CI merges.

  • Pulumi.production.yaml
    • Contains the Pulumi configuration for the production environment.

    • Basically no extra information, except a pointer to the config.

  • datasets/$DATASET/*
    • budgets.yaml
      • Contains the budgets for the dataset.

      • See [Budgets](../budgets.md) for more information.

    • members.yaml
      • A map of a dataset group, to a list of users.

      • See the setup_externally_specified_members method for its implementation.

    • repositories.yaml:
      • A list of repositories whose code you can run through the analysis-runner for this dataset. (Note: there are default repositories.)

      • This is implemented in tokens/main.py in cpg-infra-private.
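
As a rough illustration of how the dataset block in config-datasets.yaml could be modelled and sanity-checked, here is a minimal sketch. It assumes the YAML has already been parsed into plain dictionaries. The `DatasetEntry` class, its default values and the `check_dependencies` helper are hypothetical and are not the real CPGDatasetConfig.

```python
from dataclasses import dataclass, field, fields


@dataclass
class DatasetEntry:
    """Hypothetical model of one entry in config-datasets.yaml."""

    name: str
    depends_on: list[str] = field(default_factory=list)  # READ + WRITE
    depends_on_readonly: list[str] = field(default_factory=list)  # READ only
    enable_release: bool = False  # defaults here are assumptions
    enable_shared_project: bool = False
    is_internal_dataset: bool = False


def parse_datasets(raw: dict[str, dict]) -> dict[str, DatasetEntry]:
    """Build entries from parsed YAML, ignoring keys this sketch does not model."""
    known = {f.name for f in fields(DatasetEntry)} - {"name"}
    return {
        name: DatasetEntry(name=name, **{k: v for k, v in body.items() if k in known})
        for name, body in raw.items()
    }


def check_dependencies(datasets: dict[str, DatasetEntry]) -> list[str]:
    """Report dependencies that are undefined or listed at both access levels."""
    errors: list[str] = []
    for ds in datasets.values():
        for dep in ds.depends_on + ds.depends_on_readonly:
            if dep not in datasets:
                errors.append(f"{ds.name}: unknown dependency {dep!r}")
        for dep in sorted(set(ds.depends_on) & set(ds.depends_on_readonly)):
            errors.append(f"{ds.name}: {dep!r} is listed as both read-only and read-write")
    return errors
```

A check like this would sit alongside, not replace, the process checks described for projects.yaml.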

Group membership#

Groups are used to control access to resources and services. In most places, we add a group to a cloud resource and let the cloud provider handle the rest. In other cases (like the web service, metamist and the analysis-runner), we can’t simply look up the group membership from the cloud provider, because it’s slow and it’s tedious to add members not defined within that cloud. For these cases we keep a members cache: a file that stores the group members.
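
To make the idea of a members cache concrete, here is a minimal sketch that stores each group's members in a JSON file and answers membership checks from that file instead of calling the cloud provider. The file naming scheme and the functions `write_members_cache` and `is_member` are assumptions made for illustration; the real cache location and format may differ.

```python
import json
from pathlib import Path


def _cache_path(cache_dir: Path, group: str) -> Path:
    # Hypothetical layout: one JSON file per group.
    return cache_dir / f"{group}-members.json"


def write_members_cache(cache_dir: Path, group: str, members: list[str]) -> Path:
    """Write the de-duplicated, sorted member list for a group."""
    path = _cache_path(cache_dir, group)
    path.write_text(json.dumps(sorted(set(members))))
    return path


def is_member(cache_dir: Path, group: str, user: str) -> bool:
    """Check membership from the cache; a missing cache means no access."""
    path = _cache_path(cache_dir, group)
    if not path.exists():
        return False
    return user in json.loads(path.read_text())
```

Treating a missing cache file as "not a member" keeps the check fail-closed, which seems the safer default for an access decision.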

We bring all group memberships under cpg-infrastructure, which allows us to:

  • unwrap groups completely, hence we know the full list of users / accounts that need to access specific resources.

  • better version control and store history of group memberships. In the CPG’s configuration, we have CODEOWNER policies to manage who can add / remove users from groups.

In code, we have a wrapper around a group, which holds a list of GroupMember objects, each of which is either a user or another group. We track which cloud we need to create the group for. At any time you can request the full list of users; however, the ID component might be a Pulumi output value, e.g. you can’t know a service-account ID before you create it.

All dataset infrastructure methods can create groups, and add members to them, and at the very end, these group members are finalised, synced to the cloud resource, and any access group caches are updated.
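
The unwrapping of nested groups into a flat list of users can be sketched as follows. The `User` and `Group` classes and the `expand` function are hypothetical simplifications: IDs are plain strings here, whereas in the real code an ID may be a Pulumi output that is only known after the resource is created.

```python
from dataclasses import dataclass, field
from typing import Optional, Union


@dataclass(frozen=True)
class User:
    id: str  # plain string in this sketch; may be a Pulumi output in practice


@dataclass
class Group:
    name: str
    members: list[Union[User, "Group"]] = field(default_factory=list)

    def add_member(self, member: Union[User, "Group"]) -> None:
        self.members.append(member)


def expand(group: Group, _seen: Optional[set[str]] = None) -> list[str]:
    """Return the sorted, de-duplicated user IDs reachable from a group."""
    seen = set() if _seen is None else _seen
    if group.name in seen:
        # Already visited: guards against groups that contain each other.
        return []
    seen.add(group.name)

    ids: set[str] = set()
    for member in group.members:
        if isinstance(member, Group):
            ids.update(expand(member, seen))
        else:
            ids.add(member.id)
    return sorted(ids)
```

Because expansion happens on demand, members can keep being added by any dataset infrastructure method until the final step, when the flat list is computed once and synced.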

How does it actually get deployed?#

In cpg-infrastructure-private, we have a deploy.yaml GitHub action. This:

  • Installs the latest cpg-infrastructure package

  • Runs the pulumi up command:
    • This internally runs the __main__.py file, which is the entry point for the pulumi program. This packages up all the configuration in the private repo into:

      1. A CPGInfrastructureConfig object, which contains users.

      2. A list[CPGDatasetConfig], which contains the datasets to spin up.

Pulumi does the rest!
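
The packaging step performed by the entry point can be sketched roughly as below. This assumes the YAML files have already been loaded into dictionaries. `UserSketch`, `InfraConfigSketch`, `DatasetConfigSketch` and `package_config` are hypothetical names with far fewer fields than the real CPGInfrastructureConfig and CPGDatasetConfig.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UserSketch:
    username: str
    gcp_id: Optional[str] = None


@dataclass
class InfraConfigSketch:
    """Stand-in for the organisation-wide config, which carries the users."""

    users: dict[str, UserSketch]


@dataclass
class DatasetConfigSketch:
    """Stand-in for one dataset's config."""

    dataset: str
    gcp_project: Optional[str] = None


def package_config(
    users_raw: dict[str, dict], datasets_raw: dict[str, dict]
) -> tuple[InfraConfigSketch, list[DatasetConfigSketch]]:
    """Turn parsed users.yaml and config-datasets.yaml content into two objects."""
    users = {
        name: UserSketch(username=name, gcp_id=(body.get("gcp") or {}).get("id"))
        for name, body in users_raw.items()
    }
    datasets = [
        DatasetConfigSketch(dataset=name, gcp_project=(body.get("gcp") or {}).get("project"))
        for name, body in sorted(datasets_raw.items())
    ]
    return InfraConfigSketch(users=users), datasets
```

The two return values mirror the two inputs listed above: one object for the whole organisation and one entry per dataset.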

Previewing the workflow locally#

These instructions will set you up to preview the production stack locally. You can combine this with the cpg-infra dev stack guide to create your own developer stack for infrastructure testing.

Warning

Pulumi tries to update the “last previewed” timestamp on the state file in the bucket. You may not have write access to this file; this isn’t a problem, as we’d prefer NOT to write back to this file.

If you do have write access, CAUTION: if the pulumi preview is cancelled, it probably won’t write the state file correctly, so it can appear like the state file is damaged. You can restore an older version of the state file as the pulumi-state bucket is versioned.

Prerequisites:

  • Have installed:
    • Pulumi

    • google-cloud-sdk

    • azure-cli

  • Get and export the PULUMI_CONFIG_PASSPHRASE (ask a CPG team member).

  • Have both cpg-infrastructure and cpg-infrastructure-private cloned locally

  • Set up a Python virtual environment with cpg-infrastructure installed.
    • It’s usually better to run pip install -e . in the cpg-infrastructure directory.

Note that we set these environment variables:

  • PULUMI_EXPERIMENTAL: So we can use the skip checkpoints feature

  • PULUMI_SKIP_CHECKPOINTS: Allows us to skip checkpoints which take forever!

Some extra notes:

  • The --non-interactive flag is important because the interactive mode breaks with the number of resources we have.

  • The --diff flag is important because it shows us what Pulumi is going to change (needing it for this is a bug: pulumi/pulumi#12162).

pulumi login gs://cpg-pulumi-state/

# this is using the production stack, see this guide for alternative stacks
# https://github.com/populationgenomics/cpg-infrastructure-private/blob/main/docs/development.md

pulumi stack init production

PULUMI_EXPERIMENTAL=true PULUMI_SKIP_CHECKPOINTS=true pulumi preview \
    --non-interactive --diff -p 20
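
If you prefer to drive the preview from Python, the command above can be wrapped as in the following sketch. The helper names `preview_command`, `preview_env` and `run_preview` are hypothetical; the flags and environment variables are the ones used in the shell command above. The sketch assumes `pulumi login` and stack selection have already been done.

```python
import os
import subprocess
from typing import Mapping, Optional


def preview_command(parallel: int = 20) -> list[str]:
    """Build the same pulumi preview invocation as the shell command."""
    return ["pulumi", "preview", "--non-interactive", "--diff", "-p", str(parallel)]


def preview_env(base: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Copy the environment and enable the skip-checkpoints feature."""
    env = dict(os.environ if base is None else base)
    env["PULUMI_EXPERIMENTAL"] = "true"
    env["PULUMI_SKIP_CHECKPOINTS"] = "true"
    return env


def run_preview(project_dir: str, parallel: int = 20) -> int:
    """Run the preview in project_dir and return pulumi's exit code."""
    if "PULUMI_CONFIG_PASSPHRASE" not in os.environ:
        raise RuntimeError("export PULUMI_CONFIG_PASSPHRASE before previewing")
    result = subprocess.run(
        preview_command(parallel),
        cwd=project_dir,
        env=preview_env(),
        check=False,
    )
    return result.returncode
```

Keeping command and environment construction in separate functions makes them easy to inspect without actually invoking Pulumi.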