CPG Infrastructure User Guide#
This doc is a user guide for cpg-infrastructure configuration. See the CPG Infrastructure Developer Guide. for information on how to developer the CPG infrastructure.
Overview#
The CPG infrastructure is a set of resources that make our analysis possible. It includes resources like buckets for data storage, groups, budget management, and all the permissions to access these resources.
The cpg-infrastructure-private is where all of our configuration lives. You can modify these files to change the configuration of the infrastructure, and it gets automatically released on merge.
Quick guide#
This section has a bunch of quick guides to get you started!
How do I add someone to a dataset?#
Ensure the user is added to the
users.yamlfile.- Ensure the user is in a protocol/project in the
projects.yamlfile that covers the dataset. This requires an approval from the PM team (to ensure onboarding requirements are met).
- Ensure the user is in a protocol/project in the
- Ensure the user is added to the correct group in the
config/DATASET/members.yamlfile. This requires an approval from the dataset owner.
- Ensure the user is added to the correct group in the
Example PR: populationgenomics/cpg-infrastructure-private#340
What groups can I add a user to?#
web-accessAccess to web reports via the web server (main-web.populationgenomics.org.au/dataset/…)
uploadRead + Write permissions to the upload buckets
metadata-accessRead metadata in metamist, including access to the metamist UI
metadata-writeRead + write metadata in metamist (includes test projects)
analysisRead + contributor roles in metamist
Read permissions to the main analysis buckets
List permissions to all other main buckets
Read + write permissions to test buckets
Other permissions directly related to analysis, eg: dataproc logs
data-managerAll permissions from analysis
Read access to main buckets
Write access to metadata
Ability to generate credentials for the shared projects
release-accessAbility to read from the release bucket
How do I add a new dataset?#
You can run the handy scripts/create_dataset.py script, which creates _most_ of the entries needed to spin up a dataset on merge.
Some common options:
--dataset DATASET: The name of the dataset.--add-as-rare-disease-dependency: Add the dataset as a dependency to the rare disease dataset (which basically adds it to seqr).--budget 100: The monthly budget in AUD for the dataset.
python scripts/create_dataset.py \
--dataset $DATASET \
-- budget 100 \
--add-as-rare-disease-dependency
Common errors, and how to fix them#
User USER is not in the users.yaml file#
A user was added to a members.yaml file, but the system couldn’t find them in the users.yaml file. Make sure you add the user to the users.yaml file.
DATASET: Has user USER but, but there are no projects / protocols#
A dataset didn’t have any protocols / projects in the projects.yaml file. You’ll need to talk to the PM team about which protocol best covers your new dataset.
(This error is more common on dataset creation).
DATASET: USER does not have any of the required projects: PROJECT1, PROJECT2#
A user was added to a dataset, but they didn’t have any of the required projects / protocols in the projects.yaml file. You’ll need to add them to one of the listed projects, and get approval from the PM team.
You should reach out to your team-lead / the PM team for help with this.
What’s the projects.yaml?#
The projects.yaml file is a list of legal projects / protocols that we use to double check people have access to the datasets. This isn’t used in the infra code directly, it’s a process check on top which blocks CI merges.
For example, if USER1 wants access to DATASET, they must fulfill the legal onboarding requirements of PROJECT1, which could include undertaking specific training, or signing a legal agreement.
Note:
there may be multiple protocols that cover a dataset (eg: a project may have a protocol for each institution).
a protocol may cover multiple datasets, which may often happen for some grants.
The projects.yaml file is maintained by the PM team, hence any change to that file must be approved by the PM team.
A project looks like this:
protocol-name:
name: A test protocol that covers the dataset
datasets:
- DATASET1
members:
- user.name
- user2.name
If this file isn’t filled out correctly, you will get an automated comment on your PR. Some common error messages with the solution are listed above.