Revision 19/22
Sarah Guthrie, 04/07/2016 10:42 PM
Writing a Script Calling a Third Party Tool¶
Case study: FastQC¶
- Building an environment able to run FastQC
  - Writing a Dockerfile
  - Building a docker image from the Dockerfile
  - Uploading the docker image to an Arvados instance
- Writing a crunch script that runs FastQC (in the docker image)
  - Calling FastQC
  - Where to place temporary files
  - Writing output data
- Writing a pipeline template to run the crunch script
Writing a Dockerfile¶
Docker has thorough documentation on writing Dockerfiles, which we recommend reading for instructions on getting to the finished product below. Dockerfiles, as explained by Docker:
"Docker can build images automatically by reading the instructions from a Dockerfile. A Dockerfile is a text document that contains all the commands a user could call on the command line to assemble an image. Using docker build users can create an automated build that executes several command-line instructions in succession. This page (https://docs.docker.com/engine/reference/builder/) describes the commands you can use in a Dockerfile. When you are done reading this page, refer to the Dockerfile Best Practices (https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/) for a tip-oriented guide."
- A reference for Dockerfiles: https://docs.docker.com/engine/reference/builder/
- Dockerfile best practices: https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/
We strongly recommend keeping your Dockerfiles in the git repository with the crunch scripts that run inside the docker images created by them.
Dockerfile that installs FastQC:
FROM arvados/jobs
USER root
RUN apt-get -q update && apt-get -qy install \
    fontconfig \
    openjdk-6-jre-headless \
    perl \
    unzip \
    wget
USER crunch
RUN mkdir /home/crunch/fastqc
RUN cd /home/crunch/fastqc && \
    wget --quiet http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.4.zip && \
    unzip /home/crunch/fastqc/fastqc_v0.11.4.zip
How to build a docker image from a Dockerfile¶
Once you have a Dockerfile, you can use the docker build command to build the image from the Dockerfile instructions. The final argument is the directory that contains the Dockerfile, not the Dockerfile itself.
docker build -t username/imagename path/to/Dockerfile/
How to upload a docker image to Arvados¶
Once the docker image is built, you can use the Arvados CLI (http://doc.arvados.org/sdk/cli/index.html) command arv keep docker to upload the image to an Arvados cluster.
arv keep docker username/imagename
How to call an external tool from a crunch script¶
We strongly recommend using the subprocess module for calling external tools. If the output is small and written to standard output, subprocess.check_output will check that the tool completed successfully (raising an exception if it exits with a non-zero status) and return the standard output.
import subprocess
foo = subprocess.check_output(['echo', 'foo'])
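Both check_output and check_call raise subprocess.CalledProcessError when the tool exits with a non-zero status. A minimal sketch of handling that failure follows; the child command is a stand-in (the Python interpreter itself, exiting with status 3), not part of the FastQC workflow:

```python
import subprocess
import sys

# Stand-in for an external tool that fails: a child Python process
# that exits with status 3.
failing_cmd = [sys.executable, '-c', 'import sys; sys.exit(3)']

try:
    subprocess.check_output(failing_cmd)
    exit_status = 0
except subprocess.CalledProcessError as err:
    # err.returncode holds the tool's exit status; err.cmd holds the command.
    exit_status = err.returncode

print(exit_status)
```

In a crunch script it is often reasonable to let the exception propagate, so that a failed tool fails the task instead of producing partial output.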
If the output is big, subprocess.check_call can redirect it to a file while ensuring the tool completed successfully.
import subprocess
with open('foo', 'w') as outfile:
    subprocess.check_call(['head', '-c', '1234567', '/dev/urandom'], stdout=outfile)
FastQC writes its results to files, in the directory specified by the -o flag if one is given, rather than to standard output, so we can use subprocess.check_call:
import subprocess
import arvados
# Grab the file path pointing to the file to run fastqc on
fastq_file = arvados.getjobparam('input_fastq_file')
cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file]
subprocess.check_call(cmd)
Where to put temporary files¶
Each crunch task has a temporary directory, available as the tmpdir attribute of the current task:
import arvados
task = arvados.current_task()
tmpdir = task.tmpdir
Inside the code:
import subprocess
import arvados
task = arvados.current_task()
tmpdir = task.tmpdir
# Grab the file path pointing to the file to run fastqc on
fastq_file = arvados.getjobparam('input_fastq_file')
cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', tmpdir]
subprocess.check_call(cmd)
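To try the output-directory pattern outside of Crunch, a directory from tempfile.mkdtemp can stand in for task.tmpdir. In this sketch the tool is a hypothetical stand-in (a short Python program that writes report.txt into the directory given after -o), not FastQC itself, and input.fastq is a made-up file name that is never opened:

```python
import os
import shutil
import subprocess
import sys
import tempfile

# Stand-in for task.tmpdir when running outside of Crunch.
tmpdir = tempfile.mkdtemp()

# Hypothetical stand-in tool: writes report.txt into the directory after '-o'.
tool_source = (
    "import os, sys\n"
    "outdir = sys.argv[sys.argv.index('-o') + 1]\n"
    "with open(os.path.join(outdir, 'report.txt'), 'w') as f:\n"
    "    f.write('ok\\n')\n"
)
cmd = [sys.executable, '-c', tool_source, 'input.fastq', '-o', tmpdir]
subprocess.check_call(cmd)

# List what the tool produced in the temporary directory.
produced = sorted(os.listdir(tmpdir))
print(produced)

# Remove the stand-in directory.
shutil.rmtree(tmpdir)
```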
How to write data directly to Keep (Using TaskOutputDir)¶
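TaskOutputDir provides a Keep-backed directory: files written under outdir.path are stored in Keep, and outdir.manifest_text() returns the manifest to record as the task output.
import arvados
import arvados.crunch
outdir = arvados.crunch.TaskOutputDir()
# Write to outdir.path
# Record the manifest as this task's output
arvados.current_task().set_output(outdir.manifest_text())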
Inside the code:
import subprocess
import arvados
import arvados.crunch
outdir = arvados.crunch.TaskOutputDir()
# Grab the file path pointing to the file to run fastqc on
fastq_file = arvados.getjobparam('input_fastq_file')
cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', outdir.path]
subprocess.check_call(cmd)
# Record the manifest as this task's output
arvados.current_task().set_output(outdir.manifest_text())
When TaskOutputDir is not the correct choice¶
- If the tool writes symbolic links or named pipes, which are not supported by FUSE
- If the I/O access patterns do not perform well with FUSE
  - This occurs in Tophat, which opens 20 file handles on multiple files that it writes out
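One way to check the first condition is to run the tool once into a local scratch directory and look for symbolic links or named pipes in what it wrote. A minimal sketch using only the standard library follows; the function name find_unsupported_entries is made up for illustration, and the demonstration assumes a POSIX system:

```python
import os
import stat
import tempfile

def find_unsupported_entries(root):
    """Return sorted paths (relative to root) that are symlinks or named pipes."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            mode = os.lstat(path).st_mode  # lstat does not follow symlinks
            if stat.S_ISLNK(mode) or stat.S_ISFIFO(mode):
                found.append(os.path.relpath(path, root))
    return sorted(found)

# Demonstration on a scratch directory (POSIX only: uses symlink and mkfifo).
scratch = tempfile.mkdtemp()
with open(os.path.join(scratch, 'plain.txt'), 'w') as f:
    f.write('data\n')
os.symlink('plain.txt', os.path.join(scratch, 'link.txt'))
os.mkfifo(os.path.join(scratch, 'pipe'))

print(find_unsupported_entries(scratch))
```

An empty result does not say anything about the second condition; I/O performance still has to be judged by running the tool.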
In these cases, have the tool write to a local directory, then open a collection writer and write the files and/or directory trees to Keep:
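import arvados
collection_writer = arvados.collection.CollectionWriter()
collection_writer.write_file('foo.txt')
# bar_directory_path is a placeholder for the path to a local directory tree
collection_writer.write_directory_tree(bar_directory_path)
# Record the finished collection as this task's output
arvados.current_task().set_output(collection_writer.finish())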
Inside the code:
import os
import subprocess
import arvados
task = arvados.current_task()
tmpdir = task.tmpdir
outdir_path = os.path.join(tmpdir, 'out')
os.mkdir(outdir_path)
# Grab the file path pointing to the file to run fastqc on
fastq_file = arvados.getjobparam('input_fastq_file')
cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', outdir_path]
subprocess.check_call(cmd)
# Upload everything FastQC wrote to the local output directory
collection_writer = arvados.collection.CollectionWriter()
collection_writer.write_directory_tree(outdir_path)
task.set_output(collection_writer.finish())
The final crunch script¶
import multiprocessing
import subprocess
import arvados
import arvados.crunch
outdir = arvados.crunch.TaskOutputDir()
# Grab the file path pointing to the file to run fastqc on
fastq_file = arvados.getjobparam('input_fastq_file')
# Grab the number of threads available
num_threads = multiprocessing.cpu_count()
cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', outdir.path, '-t', str(num_threads)]
subprocess.check_call(cmd)
# Record the manifest as this task's output
arvados.current_task().set_output(outdir.manifest_text())
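The command construction in the final script can also be factored into a small function, which makes it testable without an Arvados cluster. This is a sketch, not part of the original script: the function name build_fastqc_cmd and the example paths are hypothetical, while the FastQC path and the -o and -t flags are the ones used in the script.

```python
import multiprocessing

# Install location of FastQC, as set up by the Dockerfile.
FASTQC_PATH = '/home/crunch/fastqc/FastQC/fastqc'

def build_fastqc_cmd(fastq_file, outdir_path, num_threads=None):
    """Build the argument list for running FastQC on one input file."""
    if num_threads is None:
        # Default to the number of CPUs reported by the system.
        num_threads = multiprocessing.cpu_count()
    return ['perl', FASTQC_PATH, fastq_file,
            '-o', outdir_path,
            '-t', str(num_threads)]

# Hypothetical example paths, for illustration only.
cmd = build_fastqc_cmd('/tmp/sample.fastq', '/tmp/out', num_threads=4)
print(cmd)
```

The returned list can be passed straight to subprocess.check_call, as in the script.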
Writing a pipeline template to run the crunch script¶
...
Updated by Sarah Guthrie almost 9 years ago · 22 revisions