Welcome to accession’s documentation!

accession is a Python module and command line tool for submitting genomics pipeline analysis output files and metadata to the ENCODE Portal.

Installation

Note: installation requires Python >= 3.8

$ pip install accession

Next, provide your API keys from the ENCODE portal:

$ export DCC_API_KEY=XXXXXXXX
$ export DCC_SECRET_KEY=yyyyyyyyyyy
It is highly recommended to set the DCC_LAB and DCC_AWARD environment variables for ease of use. These correspond to the lab and award identifiers given by the ENCODE portal, e.g. /labs/foo/ and U00HG123456, respectively.
$ export DCC_LAB=XXXXXXXX
$ export DCC_AWARD=yyyyyyyyyyy
If you are accessioning workflows produced using the Caper local backend, then installation is complete. However, if using WDL metadata from pipeline runs on Google Cloud, you will also need to authenticate with Google Cloud. Run the following two commands and follow the prompts:
$ gcloud auth login --no-launch-browser
$ gcloud auth application-default login --no-launch-browser
If you would like to be able to pass Caper workflow IDs or labels you will need to configure access to the Caper server. If you are invoking accession from a machine where you already have Caper set up, and you have the Caper configuration file available at ~/.caper/default.conf, then there is no extra setup required. If the Caper server is on another machine, you will need to configure HTTP access to it by setting the hostname and port values in the Caper conf file.
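For the remote-server case, a minimal sketch of ~/.caper/default.conf might look like the following; the IP address is a placeholder, and the port should match the port your Caper server listens on:
hostname=10.0.0.5
port=8000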
(Optional) Finally, to enable the use of Cloud Tasks to upload files from Google Cloud Storage to AWS S3, set the following two environment variables. If either of them is not set, then files will be uploaded from the same machine that the accessioning code runs on. For more information on how to set up Cloud Tasks and the upload service, see the docs for the gcs-s3-transfer-service.
$ export ACCESSION_CLOUD_TASKS_QUEUE_NAME=my-queue
$ export ACCESSION_CLOUD_TASKS_QUEUE_REGION=us-west1
To accession workflows produced on the AWS backend you will need to set up AWS credentials. The easiest way to do this is to install the AWS CLI and run aws configure.
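For example (aws configure will prompt for your access key ID, secret access key, and default region):
$ pip install awscli
$ aws configure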

Usage

$ accession -m metadata.json \
            -p mirna \
            -s dev

Please see the docs for greater detail on these input parameters.

Deploying on Google Cloud

First, authenticate with Google Cloud via gcloud auth login if needed. Then install the API client with pip install google-api-python-client; it is recommended to do this inside a venv. Next, create the firewall rule and deploy the instance by running python deploy.py --project $PROJECT; this will also install the accession package. Finally, SSH onto the new instance and run gcloud auth login to authenticate on the instance.
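As a sketch, the full sequence of commands might look like the following, assuming you are running from a clone of the repository that contains deploy.py:
$ gcloud auth login
$ python3 -m venv venv && source venv/bin/activate
$ pip install google-api-python-client
$ python deploy.py --project $PROJECT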
For Caper integration, once the instance is up, SSH onto it and create the Caper conf file at ~/.caper/default.conf, using the private IP of the Caper VM instance as the hostname and 8000 for the port. For the connection to work, the Caper VM will need to have the tag caper-server. Also note that the deployment assumes the Cromwell server port is set to 8000.
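A sketch of that conf file, assuming the Caper VM's private IP is 10.128.0.2 (a placeholder):
hostname=10.128.0.2
port=8000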

AWS Notes

To enable S3 to S3 copy from the pipeline buckets to the ENCODE buckets, ensure that the pipeline bucket policy grants read access to the ENCODE account. Here is an example policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DelegateS3AccessGet",
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::618537831167:root",
                    "arn:aws:iam::159877419961:root"
                ]
            },
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::PIPELINE-BUCKET/*"
        },
        {
            "Sid": "DelegateS3AccessList",
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::618537831167:root",
                    "arn:aws:iam::159877419961:root"
                ]
            },
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::PIPELINE-BUCKET"
        }
    ]
}

Detailed Argument Description

Metadata Json

This is an output of a pipeline analysis run. This can either be an actual JSON file or a Caper workflow ID/label. To use Caper IDs you must have set up access to a Caper server on the machine you are running accession from. For details on configuring this, see Installation. For details about the metadata format, see the Cromwell documentation.
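For example, assuming access to a Caper server is configured as described in Installation, a workflow label can be passed in place of a file path (my-workflow-label is a placeholder):
$ accession -m my-workflow-label \
            -p mirna \
            -s dev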

Server

prod and dev indicate the server that the files are accessioned to. dev points to https://test.encodedcc.org. The server parameter can also be passed explicitly as test.encodedcc.org or encodeproject.org.
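For example, to pass the production hostname explicitly:
$ accession -m metadata.json \
            -p mirna \
            -s encodeproject.org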

Lab and Award

These are unique identifiers that are expected to already exist on the ENCODE Portal. It is recommended to specify them as the environment variables DCC_LAB and DCC_AWARD, respectively. However, you may also specify them on the command line using the --lab and --award flags. Specifying these parameters with flags will override any values in your environment. The values correspond to the lab and award identifiers given by the ENCODE portal, e.g. /labs/foo/ and U00HG123456.
To set these variables in your environment, run the following two commands in your shell. You may wish to add these to your ~/.bashrc, ~/.bash_profile, ~/.zshrc, or similar so that you don’t need to set them every time.
$ export DCC_LAB=XXXXXXXX
$ export DCC_AWARD=yyyyyyyyyyy
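Alternatively, a sketch of specifying them on the command line, using the same example identifiers as above:
$ accession -m metadata.json \
            -p mirna \
            -s dev \
            --lab /labs/foo/ \
            --award U00HG123456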

Pipeline Type

Use the --pipeline-type argument to identify the pipeline type to be accessioned, for instance mirna. This name is used to identify the appropriate steps JSON to use as the accessioning template, so if you use mirna as above the code will look for the corresponding template at accession_steps/mirna_steps.json.
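As a sketch, the expected layout for the mirna example above would be:
accession_steps/
    mirna_steps.json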

Dry Run

The --dry-run flag can be used to determine if any files that would be accessioned have md5sums matching those of files already on the portal. When this flag is specified, nothing will be accessioned, and log messages will be printed indicating files from the WDL metadata with md5 matches on the portal. Specifying this flag overrides the --force option, so nothing will be accessioned if both flags are specified.
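For example:
$ accession -m metadata.json \
            -p mirna \
            -s dev \
            --dry-run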

Force Accessioning of MD5-Identical Files

The -f/--force flag can be used to force accessioning even if there are md5-identical files already on the portal. The default behavior is to not accession anything if md5 duplicates are detected.
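For example:
$ accession -m metadata.json -p mirna -s dev --force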

Logging Options

The --no-log-file flag and --log-file-path argument allow for some control over the accessioning code's logging. --no-log-file will always skip logging to a file, even if a log file path is specified. The code will log to stdout in either case. --log-file-path defaults to accession.log. Log files are always appended to, never overwritten.
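For example, to log to a custom file (my_run.log is a placeholder) or to skip file logging entirely:
$ accession -m metadata.json -p mirna -s dev --log-file-path my_run.log
$ accession -m metadata.json -p mirna -s dev --no-log-file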

Accession Steps Template Format

Accession Steps

The accession steps JSON file specifies the task and file names in the output metadata JSON and the order in which the files and metadata will be submitted. The accessioning code will selectively submit the specified files to the ENCODE Portal. You can find the appropriate accession steps for your pipeline run in the accession_steps directory.

A single step is configured in the following way:

{
    "dcc_step_version": "/analysis-step-versions/kundaje-lab-atac-seq-trim-align-filter-step-v-1-0/",
    "dcc_step_run": "atac-seq-trim-align-filter-step-run-v1",
    "requires_replication": "true",
    "wdl_task_name": "filter",
    "wdl_files": [
        {
          "callbacks": ["maybe_preferred_default"]
          "filekey": "nodup_bam",
          "output_type": "alignments",
          "file_format": "bam",
          "quality_metrics": ["cross_correlation", "samtools_flagstat"],
          "derived_from_files": [
            {
              "derived_from_task": "trim_adapter",
              "derived_from_filekey": "fastqs",
              "derived_from_inputs": true,
              "disallow_tasks": ["crop"]
            }
          ]
        }
    ]
}

dcc_step_version and dcc_step_run must exist on the portal.

requires_replication indicates that a given step and files should only be accessioned if the experiment is replicated.

wdl_task_name is the name of the task that has the files to be accessioned.

wdl_files specifies the set of files to be accessioned.

filekey is a variable that stores the file path in the metadata file.

output_type, file_format, and file_format_type are ENCODE-specific metadata that are required by the Portal.

quality_metrics is a list of methods that will be called during the accessioning to attach quality metrics to the file.

callbacks is an array of strings referencing methods of the appropriate Accession subclass. This can be used to change or add arbitrary properties on the file.

derived_from_files specifies the list of files the current file being accessioned derives from. The parent files must have been accessioned before the current file can be submitted.

derived_from_inputs is used when indicating that the parent files were not produced during the pipeline analysis. Instead, these files are initial inputs to the pipeline. Raw fastqs and genome references are examples of such files.

derived_from_output_type is required in the case that the parent file has a possible duplicate.

disallow_tasks can be used to tell the accessioning code not to search particular branches of the workflow digraph when searching up for parent files. This is useful for situations where the workflow exhibits a diamond dependency leading to unwanted files in the derived_from.
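For example, a sketch of a derived_from entry that uses derived_from_output_type to disambiguate a parent file; the task name, filekey, and output type here are hypothetical:

{
    "derived_from_task": "align",
    "derived_from_filekey": "bam",
    "derived_from_output_type": "unfiltered alignments"
}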
