======================================
Welcome to accession's documentation!
======================================

.. include:: ../README.rst
    :start-after: short-intro-begin
    :end-before: short-intro-end

Detailed Argument Description
=============================

Metadata JSON
--------------

| This is the output of a pipeline analysis run. It can be either an actual JSON file or a Caper workflow ID/label. To use Caper IDs, you must have set up access to a Caper server on the machine you are running ``accession`` from. For details on configuring this, see :ref:`installation`. For details about metadata, see the Cromwell documentation.

Server
------

``prod`` and ``dev`` indicate the server to which the files are accessioned. ``dev`` points to ``test.encodedcc.org``. The server parameter can also be passed explicitly as ``test.encodedcc.org`` or ``encodeproject.org``.

Lab and Award
-------------

| These are unique identifiers that are expected to be already present on the ENCODE Portal. It is recommended to specify them as the environment variables ``DCC_LAB`` and ``DCC_AWARD``, respectively. However, you may also specify them on the command line using the ``--lab`` and ``--award`` flags. Values given with these flags override any values in your environment. The values correspond to the lab and award identifiers given by the ENCODE Portal, e.g. ``/labs/foo/`` and ``U00HG123456``.

| To set these variables in your environment, run the following two commands in your shell. You may wish to add them to your ``~/.bashrc``, ``~/.bash_profile``, ``~/.zshrc``, or similar so that you don't need to set them every time.

.. code-block:: console

    $ export DCC_LAB=XXXXXXXX
    $ export DCC_AWARD=yyyyyyyyyyy

Pipeline Type
-------------

| Use the ``--pipeline-type`` argument to identify the pipeline type to be accessioned, for instance ``mirna``. This name is used to select the steps JSON that serves as the accessioning template, so if you use ``mirna`` as above, the code will look for the corresponding template at ``accession_steps/mirna_steps.json``.

Dry Run
-------

| The ``--dry-run`` flag can be used to determine whether any files that would be accessioned have md5sums matching those of files already on the portal. When this flag is specified, nothing is accessioned, and log messages are printed indicating which files from the WDL metadata have md5 matches on the portal. This flag overrides the ``--force`` option, so nothing is accessioned if both flags are specified.

Force Accessioning of MD5-Identical Files
------------------------------------------

| The ``-f/--force`` flag can be used to force accessioning even if there are md5-identical files already on the portal. The default behavior is to not accession anything if md5 duplicates are detected.

Logging Options
----------------

| The ``--no-log-file`` flag and ``--log-file-path`` argument allow some control over the accessioning code's logging. ``--no-log-file`` always skips logging to a file, even if a log file path is specified. The code logs to stdout in either case. ``--log-file-path`` defaults to ``accession.log``. Log files are always appended to, never overwritten.

Accession Steps Template Format
===============================

Accession Steps
---------------

| The accession steps JSON file specifies the task and file names in the output metadata JSON and the order in which the files and metadata will be submitted. The accessioning code selectively submits the specified files to the ENCODE Portal. The steps templates are looked up under ``accession_steps/``, one per pipeline type.

A single step is configured in the following way:

.. code-block:: javascript

    {
        "dcc_step_version": "/analysis-step-versions/kundaje-lab-atac-seq-trim-align-filter-step-v-1-0/",
        "dcc_step_run": "atac-seq-trim-align-filter-step-run-v1",
        "requires_replication": "true",
        "wdl_task_name": "filter",
        "wdl_files": [
            {
                "callbacks": ["maybe_preferred_default"],
                "filekey": "nodup_bam",
                "output_type": "alignments",
                "file_format": "bam",
                "quality_metrics": ["cross_correlation", "samtools_flagstat"],
                "derived_from_files": [
                    {
                        "derived_from_task": "trim_adapter",
                        "derived_from_filekey": "fastqs",
                        "derived_from_inputs": true,
                        "disallow_tasks": ["crop"]
                    }
                ]
            }
        ]
    }

``dcc_step_version`` and ``dcc_step_run`` must exist on the portal.

``requires_replication`` indicates that a given step and its files should only be accessioned if the experiment is replicated.

``wdl_task_name`` is the name of the task that has the files to be accessioned.

``wdl_files`` specifies the set of files to be accessioned.

``filekey`` is the name of the variable that stores the file path in the metadata file.

``output_type``, ``file_format``, and ``file_format_type`` are ENCODE-specific metadata that are required by the Portal.

``quality_metrics`` is a list of methods that will be called during accessioning to attach quality metrics to the file.

``callbacks`` is an array of strings referencing methods of the appropriate ``Accession`` subclass. This can be used to change or add arbitrary properties on the file.

``derived_from_files`` specifies the list of files that the file being accessioned derives from. The parent files must have been accessioned before the current file can be submitted.

``derived_from_inputs`` indicates that the parent files were not produced during the pipeline analysis. Instead, these files are initial inputs to the pipeline. Raw fastqs and genome references are examples of such files.

``derived_from_output_type`` is required when the parent file has a possible duplicate.

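As a rough illustration of how the fields described so far fit together, the following Python sketch reads a trimmed copy of the step above and lists, for each file to be accessioned, the parent files it derives from. The helper names ``is_true`` and ``summarize_step`` are hypothetical and are not part of the ``accession`` package; the sketch also assumes that boolean-like fields may appear either as JSON booleans or as the strings ``"true"``/``"false"``, since the example above uses both forms.

.. code-block:: python

    import json

    # A trimmed copy of the step shown above, kept inline so the sketch is
    # self-contained.
    STEP_JSON = """
    {
        "wdl_task_name": "filter",
        "wdl_files": [
            {
                "filekey": "nodup_bam",
                "output_type": "alignments",
                "file_format": "bam",
                "derived_from_files": [
                    {
                        "derived_from_task": "trim_adapter",
                        "derived_from_filekey": "fastqs",
                        "derived_from_inputs": true
                    }
                ]
            }
        ]
    }
    """


    def is_true(value):
        """Treat JSON ``true`` and the string "true" alike (an assumption)."""
        return str(value).lower() == "true"


    def summarize_step(step):
        """Return a list of (filekey, parents) pairs for one step.

        Each parent is a (task, filekey, origin) tuple, where origin says
        whether the parent is an initial pipeline input or a pipeline output.
        """
        summary = []
        for wdl_file in step["wdl_files"]:
            parents = []
            for parent in wdl_file.get("derived_from_files", []):
                if is_true(parent.get("derived_from_inputs")):
                    origin = "pipeline input"
                else:
                    origin = "pipeline output"
                parents.append(
                    (
                        parent["derived_from_task"],
                        parent["derived_from_filekey"],
                        origin,
                    )
                )
            summary.append((wdl_file["filekey"], parents))
        return summary


    if __name__ == "__main__":
        for filekey, parents in summarize_step(json.loads(STEP_JSON)):
            print(filekey, parents)

A listing like this can help check a new template by eye before running with ``--dry-run``; it does not replace the validation performed by the accessioning code itself.
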
``disallow_tasks`` can be used to tell the accessioning code not to search particular branches of the workflow digraph when searching up for parent files. This is useful for situations where the workflow exhibits a diamond dependency leading to unwanted files in the ``derived_from``.

Table of Contents
==================

.. toctree::
    :maxdepth: 2

    development
    release
    license
    changelog

Indices and tables
==================

* :ref:`genindex`
* :ref:`search`