Welcome to accession's documentation!¶
accession is a Python module and command line tool for submitting genomics pipeline analysis output files and metadata to the ENCODE Portal.
Installation¶
Note: installation requires Python >= 3.8
$ pip install accession
Next, provide your API keys from the ENCODE portal:
$ export DCC_API_KEY=XXXXXXXX
$ export DCC_SECRET_KEY=yyyyyyyyyyy
Then set your lab and award identifiers as given by the ENCODE portal, e.g. /labs/foo/ and U00HG123456, respectively:
$ export DCC_LAB=XXXXXXXX
$ export DCC_AWARD=yyyyyyyyyyy
If your pipeline outputs are on Google Cloud Storage, authenticate with gcloud:
$ gcloud auth login --no-launch-browser
$ gcloud auth application-default login --no-launch-browser
If you are invoking accession from a machine where you already have Caper set up, and you have the Caper configuration file available at ~/.caper/default.conf, then there is no extra setup required. If the Caper server is on another machine, you will need to configure HTTP access to it by setting the hostname and port values in the Caper conf file.
To use Google Cloud Tasks for file uploads, set the queue name and region:
$ export ACCESSION_CLOUD_TASKS_QUEUE_NAME=my-queue
$ export ACCESSION_CLOUD_TASKS_QUEUE_REGION=us-west1
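Since all of this configuration is carried through environment variables, it can help to verify them before launching a run. The following is an illustrative helper (not part of accession itself) that fails fast if any of the required portal credentials are unset:

```python
import os

# The environment variables described above (hypothetical helper, not
# accession's internal code).
REQUIRED_VARS = ["DCC_API_KEY", "DCC_SECRET_KEY", "DCC_LAB", "DCC_AWARD"]

def check_environment(environ=os.environ):
    """Return the required configuration, raising if anything is unset."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise EnvironmentError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return {name: environ[name] for name in REQUIRED_VARS}
```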
Usage¶
$ accession -m metadata.json \
-p mirna \
-s dev
Please see the docs for greater detail on these input parameters.
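The shape of the invocation above can be sketched with a minimal argument parser. This is an illustration of the three flags shown, not accession's actual CLI code:

```python
import argparse

# Sketch of the command line shown above; destination names are chosen
# for illustration.
def build_parser():
    parser = argparse.ArgumentParser(prog="accession")
    parser.add_argument("-m", dest="metadata", required=True,
                        help="path to the pipeline metadata JSON")
    parser.add_argument("-p", dest="pipeline_type", required=True,
                        help="pipeline type, e.g. mirna")
    parser.add_argument("-s", dest="server", default="dev",
                        help="target server, e.g. dev or prod")
    return parser
```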
Deploying on Google Cloud¶
AWS Notes¶
To enable S3 to S3 copy from the pipeline buckets to the ENCODE buckets, ensure that the pipeline bucket policy grants read access to the ENCODE account. Here is an example policy:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DelegateS3AccessGet",
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::618537831167:root",
                    "arn:aws:iam::159877419961:root"
                ]
            },
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::PIPELINE-BUCKET/*"
        },
        {
            "Sid": "DelegateS3AccessList",
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::618537831167:root",
                    "arn:aws:iam::159877419961:root"
                ]
            },
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::PIPELINE-BUCKET"
        }
    ]
}
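A policy like the one above can be sanity-checked before it is applied. The following sketch (not part of accession; the account ARNs are the ones from the example policy) verifies that both ENCODE principals are granted the two required actions:

```python
import json

# The two ENCODE AWS account principals from the example policy above.
ENCODE_PRINCIPALS = {
    "arn:aws:iam::618537831167:root",
    "arn:aws:iam::159877419961:root",
}

def grants_encode_read(policy_json):
    """Return True if the policy grants s3:GetObject and s3:ListBucket to
    both ENCODE account principals. Illustrative check only."""
    policy = json.loads(policy_json)
    granted = set()
    for statement in policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        principals = statement.get("Principal", {}).get("AWS", [])
        if isinstance(principals, str):
            principals = [principals]
        if ENCODE_PRINCIPALS.issubset(set(principals)):
            actions = statement.get("Action", [])
            if isinstance(actions, str):
                actions = [actions]
            granted.update(actions)
    return {"s3:GetObject", "s3:ListBucket"}.issubset(granted)
```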
Detailed Argument Description¶
Metadata Json¶
The metadata JSON is the Cromwell workflow metadata produced by the pipeline run that accession will submit files from. For details on configuring this, see Installation. For details about the metadata format, see the Cromwell documentation.
Server¶
prod and dev indicate the server that the files are being accessioned to. dev points to https://test.encodedcc.org. The server parameter can also be explicitly passed as test.encodedcc.org or encodeproject.org.
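The alias-to-hostname mapping described above can be summarized as a small lookup; this is a sketch of the behavior (prod resolving to encodeproject.org is inferred from the hostnames named above), not accession's internal code:

```python
# Sketch of the server alias resolution described above.
SERVER_ALIASES = {
    "dev": "test.encodedcc.org",
    "prod": "encodeproject.org",
}

def resolve_server(server):
    """Accept either an alias (dev/prod) or an explicit hostname."""
    return SERVER_ALIASES.get(server, server)
```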
Lab and Award¶
The lab and award are normally read from the environment variables DCC_LAB and DCC_AWARD, respectively. However, you may also specify them on the command line using the --lab and --award flags. Specifying these parameters with flags will override any values in your environment. The values correspond to the lab and award identifiers given by the ENCODE portal, e.g. /labs/foo/ and U00HG123456. Add the exports to ~/.bashrc, ~/.bash_profile, ~/.zshrc, or similar to configure them for your shell so you don't need to set them every time:
$ export DCC_LAB=XXXXXXXX
$ export DCC_AWARD=yyyyyyyyyyy
Pipeline Type¶
The --pipeline-type argument identifies the pipeline type to be accessioned, for instance mirna. This name is used to identify the appropriate steps JSON to use as the accessioning template, so if you use mirna as above the code will look for the corresponding template at accession_steps/mirna_steps.json.
Dry Run¶
The --dry-run flag can be used to determine whether any files that would be accessioned have md5sums matching those of files already on the portal. When this flag is specified, nothing will be accessioned, and log messages will be printed indicating files from the WDL metadata with md5 matches on the portal. Specifying this flag overrides the --force option; as such, nothing will be accessioned if both flags are specified.
Force Accessioning of MD5-Identical Files¶
The -f/--force flag can be used to force accessioning even if there are md5-identical files already on the portal. The default behavior is to not accession anything if md5 duplicates are detected.
Logging Options¶
Accession Steps Template Format¶
Accession Steps¶
A single step is configured in the following way:
{
    "dcc_step_version": "/analysis-step-versions/kundaje-lab-atac-seq-trim-align-filter-step-v-1-0/",
    "dcc_step_run": "atac-seq-trim-align-filter-step-run-v1",
    "requires_replication": "true",
    "wdl_task_name": "filter",
    "wdl_files": [
        {
            "callbacks": ["maybe_preferred_default"],
            "filekey": "nodup_bam",
            "output_type": "alignments",
            "file_format": "bam",
            "quality_metrics": ["cross_correlation", "samtools_flagstat"],
            "derived_from_files": [
                {
                    "derived_from_task": "trim_adapter",
                    "derived_from_filekey": "fastqs",
                    "derived_from_inputs": true,
                    "disallow_tasks": ["crop"]
                }
            ]
        }
    ]
}
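Malformed step entries can be caught early by checking for the fields a step needs. The following validator is an illustration based on the example above, not part of accession itself:

```python
def validate_step(step):
    """Check that a single accession step carries the expected fields.
    Illustrative only; the authoritative format is defined by accession."""
    required = {"dcc_step_version", "dcc_step_run", "wdl_task_name", "wdl_files"}
    missing = required - step.keys()
    if missing:
        raise ValueError(f"step missing keys: {sorted(missing)}")
    for wdl_file in step["wdl_files"]:
        for key in ("filekey", "output_type", "file_format"):
            if key not in wdl_file:
                raise ValueError(f"wdl_files entry missing key: {key}")
    return True
```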
dcc_step_version and dcc_step_run must exist on the portal.
requires_replication indicates that a given step and its files should only be accessioned if the experiment is replicated.
wdl_task_name is the name of the task that has the files to be accessioned.
wdl_files specifies the set of files to be accessioned.
filekey is a variable that stores the file path in the metadata file.
output_type, file_format, and file_format_type are ENCODE-specific metadata that are required by the Portal.
quality_metrics is a list of methods that will be called during accessioning to attach quality metrics to the file.
callbacks is an array of strings referencing methods of the appropriate Accession subclass. This can be used to change or add arbitrary properties on the file.
derived_from_files specifies the list of files the current file being accessioned derives from. The parent files must have been accessioned before the current file can be submitted.
derived_from_inputs indicates that the parent files were not produced during the pipeline analysis; instead, they are initial inputs to the pipeline. Raw fastqs and genome references are examples of such files.
derived_from_output_type is required in the case that the parent file has a possible duplicate.
disallow_tasks can be used to tell the accessioning code not to search particular branches of the workflow digraph when searching up for parent files. This is useful for situations where the workflow exhibits a diamond dependency leading to unwanted files in the derived_from.
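The search described above can be pictured as walking up the workflow digraph from the current task, skipping any branches named in disallow_tasks. The following is a simplified sketch of that traversal (the task graph and names are hypothetical, echoing the diamond of filter, crop, and trim_adapter from the example), not accession's implementation:

```python
from collections import deque

def reaches_parent(parents, start_task, target_task, disallow_tasks=()):
    """Walk upward from start_task through a child -> [parent tasks] map,
    skipping disallowed branches; report whether target_task is reachable."""
    seen = set()
    queue = deque(parents.get(start_task, []))
    while queue:
        task = queue.popleft()
        if task in disallow_tasks or task in seen:
            continue
        seen.add(task)
        queue.extend(parents.get(task, []))
    return target_task in seen

# Diamond dependency: filter derives from both align and crop, which both
# derive from trim_adapter.
parents = {
    "filter": ["align", "crop"],
    "align": ["trim_adapter"],
    "crop": ["trim_adapter"],
}
```

With disallow_tasks=("crop",), the crop branch is pruned, yet trim_adapter stays reachable via align, which is exactly the behavior wanted for diamond dependencies.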