Developer Guidelines

Pull Request (PR) Guidelines

Every PR should have a corresponding JIRA ticket, and the PR title should begin with the ticket number, e.g. PIP-123 accession things. PRs can have as many commits as you like; they will be squashed when merging into dev anyway. Your PR must pass on CircleCI, otherwise it will not be reviewed.

Local Development Environment

First, get an up-to-date checkout of the accession repository:

$ git clone git@github.com:ENCODE-DCC/accession.git

or, if you prefer to use Git over HTTPS:

$ git clone https://github.com/ENCODE-DCC/accession.git

Using a virtual environment is highly recommended here. Change into the newly created directory and install an editable version of accession along with its tests and docs requirements:

$ cd accession
$ python -m venv venv && source venv/bin/activate  # or your favorite environment manager
(venv) $ pip install -e '.[dev]'
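
The pre-commit hooks mentioned later in this guide can also be installed so they run automatically on each commit. This assumes pre-commit is available in your environment (install it with pip install pre-commit if the dev extras do not already pull it in):

(venv) $ pre-commit install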

To run tests:

$ tox -e py38
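
If the py38 environment forwards positional arguments to pytest (a common tox setup, but check tox.ini to confirm), you can pass pytest options after -- to run a subset of tests; the path and keyword below are only placeholders:

$ tox -e py38 -- tests -k mirna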

To autoformat and lint your code, run the following:

$ tox -e lint

To check the documentation builds, run:

$ tox -e docs

The built documentation can then be found in docs/_build/html/.
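
To view it locally, open the generated index page in a browser; assuming a standard Sphinx layout, on macOS that would be:

$ open docs/_build/html/index.html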

Writing Templates

Wherever possible, it is recommended to reuse parts of existing Jsonnet templates. You can find more information about Jsonnet at https://jsonnet.org/. On macOS, you can install it with Homebrew via brew install jsonnet.

When making changes to the Jsonnet files, generate the new templates like so:

$ tox -e conf

Make sure to rerun tox -e lint afterwards to pretty-format the resulting JSON before committing and pushing your code.

Building and Running Docker Images

The Docker images for encoded are parametrized by the branch name or release tag, e.g. ENCD-1234-foo-bar or v90.0. To build one manually, run the following command (relative to the repo root):

$ docker build . -f docker/Dockerfile -t [IMAGE TAG] --build-arg ENCODED_RELEASE_TAG=[TAG OR BRANCH]
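
For example, to build from a hypothetical feature branch using the image tag that the run command below expects:

$ docker build . -f docker/Dockerfile -t encoded-docker:test --build-arg ENCODED_RELEASE_TAG=ENCD-1234-foo-bar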

To run the local app, map a free host port on your machine (here 8000) to port 8000 of the container:

$ docker run --rm -d -p 8000:8000 encoded-docker:test
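
The app may take a while to initialize; once it is up, you can sanity-check that it responds on the mapped port (assuming the defaults above):

$ curl -s http://localhost:8000/ | head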

Writing tests

You should always write tests when adding new code, and update existing tests when you modify code. Throughout the repo's test suite in the /tests folder you can find examples of different uses of mocking that can help you write your own tests.
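
As a purely illustrative sketch (the helper and test names below are hypothetical, not taken from this repo), a unit test can use unittest.mock to stub out the portal connection so the test never makes a real network call:

from unittest import mock


def get_experiment_status(connection, accession):
    """Hypothetical helper that fetches an experiment and returns its status."""
    return connection.get(f"/experiments/{accession}/")["status"]


def test_get_experiment_status():
    # Stand in for the real portal connection with a mock returning canned data
    connection = mock.Mock()
    connection.get.return_value = {"status": "released"}
    assert get_experiment_status(connection, "ENCSR543MWW") == "released"
    connection.get.assert_called_once_with("/experiments/ENCSR543MWW/")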

Integration tests are more complicated to assemble, but there is infrastructure in place to make them easier to write, although they still require adding a lot of data to the repo. The required pieces of data are listed below, assuming you already have a Cromwell pipeline run:

1. Get the Cromwell metadata JSON file and put it in the tests/data folder.

2. Add the base64-encoded md5sums for each gs:// file in the metadata JSON to the tests/data/gcloud_md5s.json file. You can obtain these with gsutil hash. The following command will parse the metadata and give you the gsutil output for each unique object:

$ cat tests/data/mirna_replicated_metadata.json | tr -s ' ' | tr ' ' '\n' | tr -d ',' | egrep "^\"gs://" | sort | uniq | grep "\." | xargs gsutil hash > hashes.txt

You will then need to encode this as JSON in the aforementioned file. Here is some Python that will print out "file": "md5sum" entries you can copy-paste into the JSON. IMPORTANT: try not to duplicate keys in this JSON file. While duplicate keys are technically valid JSON and wouldn't be caught by either Python or the pre-commit hooks, they could be very confusing for someone looking in the file. Using a proper JSON linter will tell you if duplicate keys exist.

with open("hashes.txt") as f:
    data = f.readlines()

buffer = []
for i, line in enumerate(data):
    if (i - 1) % 3 == 0:
        continue
    elif i % 3 == 0:
        x = line.split(' for ')[-1].strip().rstrip(':')
        buffer.append(f'"gs://encode-processing/{x}"')
    elif (i + 1) % 3 == 0:
        y = line.split('(md5):')[-1].strip()
        buffer.append(f'"{y}",')
    if len(buffer) == 2:
        print(": ".join(buffer))
        buffer = []
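
For illustration only, an entry in tests/data/gcloud_md5s.json ends up looking roughly like this (the path and hash here are made up):

    "gs://encode-processing/mirna/align/some_output.bam": "XrY7u+Ae7tCTyyK7j1rNww==",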
3. Download and add any QC JSON files from the metadata to the tests/data/files folder.

4. Add the appropriate test inserts to the tests/data/inserts folder. For any given experiment, you will likely need to add experiment, replicate, library, file, biosample, donor, biosample_type, lab, and award inserts. You need only add raw files (fastqs) and reference files. If you are testing a new assay, you will also need to add analysis_step_version, analysis_step, software_version, and software inserts. The easiest way to add them is to get the JSON from the portal <https://www.encodeproject.org> with frame=edit, copy that JSON into the insert, and then copy the UUID from the portal into the insert as well. You will want to replace any "user" properties with the dummy user in the inserts; see the existing inserts for examples. I also delete any instances of the documents property to avoid needing to add them to the inserts, since they don't affect the accessioning.

You will need to rebuild the Docker image in order to add the inserts to the local test server. You may see errors loading the test data when starting the container; you can see the exact errors by looking at the container logs, then fix the inserts and rebuild.
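
For example, with the container started as above:

$ docker ps  # find the container ID
$ docker logs [CONTAINER ID]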

5. Add the expected results to the tests/data/validation/{ASSAY}/files.json file. If you already have an accessioned example on a demo, you can simply GET the files with frame=embedded and copy-paste them into the validation JSON. The frame parameter is important, and saves us from needing separate validation files for the analysis_step_runs and quality_metrics. You will need to put the reference files in there as well, if they aren't there already (those are OK to use frame=object).

6. That's a lot of data to manage. Fortunately, writing the tests themselves should be very simple. The accessioner_factory fixture will take care of setup and teardown of the test, including the test's Docker container. Here is an example of a microRNA test:

def test_accession_mirna_unreplicated(accessioner_factory):
    factory = accessioner_factory(
        metadata_file="mirna_unreplicated_metadata.json", assay_name="mirna"
    )
    accessioner, expected_files = next(factory)
    accessioner.accession_steps()
    validate_accessioning(
        accessioner, expected_files, expected_num_files=6, dataset="ENCSR543MWW"
    )

Here, validate_accessioning is just a function that takes care of all the assertions, and can be reused by your tests as well. expected_num_files is the number of new files that you expect the accessioning to post.