Writing a Script Calling a Third Party Tool » History » Revision 6
Sarah Guthrie, 04/06/2016 08:11 PM
- Table of contents
- Writing a Script Calling a Third Party Tool
- Writing a Dockerfile
- How to build a docker image from a Dockerfile
- How to upload a docker image to Arvados
- How to call an external tool from a crunch script
- Where to put temporary files
- How to write data directly to Keep (Using TaskOutputDir)
- When TaskOutputDir is not the correct choice
- Putting it all together
Writing a Script Calling a Third Party Tool¶
Case study: FastQC
Good tips include:
- Keep the Dockerfile in the git repository
Writing a Dockerfile¶
Docker has some wonderful documentation for building Dockerfiles:
- A reference for Dockerfiles: https://docs.docker.com/engine/reference/builder/
- Dockerfile best practices: https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/
From Docker:
"""
Docker can build images automatically by reading the instructions from a Dockerfile. A Dockerfile is a text document that contains all the commands a user could call on the command line to assemble an image. Using docker build users can create an automated build that executes several command-line instructions in succession.
This page describes the commands you can use in a Dockerfile. When you are done reading this page, refer to the Dockerfile Best Practices (https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/) for a tip-oriented guide.
"""
FROM arvados/jobs
USER root
RUN apt-get -q update && apt-get -qy install \
    fontconfig \
    openjdk-6-jre-headless \
    perl \
    unzip \
    wget
USER crunch
RUN mkdir /home/crunch/fastqc
RUN cd /home/crunch/fastqc && \
    wget --quiet http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.4.zip && \
    unzip /home/crunch/fastqc/fastqc_v0.11.4.zip
How to build a docker image from a Dockerfile¶
docker build -t username/imagename path/to/Dockerfile/
How to upload a docker image to Arvados¶
arv keep docker username/imagename
How to call an external tool from a crunch script¶
We strongly recommend using the subprocess module for calling external tools. If the output is small and written to standard out, using subprocess.check_output will ensure the tool completed successfully and return the standard output.
import subprocess
foo = subprocess.check_output(['echo','foo'])
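As a minimal sketch of what "completed successfully" means here: subprocess.check_output raises subprocess.CalledProcessError when the tool exits with a non-zero status, so a failure cannot pass silently. The helper name run_and_capture is hypothetical (it is not part of the Arvados SDK), and the child commands are Python one-liners used only so the example runs without any external tool installed.

```python
import subprocess
import sys


def run_and_capture(cmd):
    """Run cmd and return (succeeded, stdout_bytes).

    Hypothetical helper for illustration; not part of the Arvados SDK.
    """
    try:
        return True, subprocess.check_output(cmd)
    except subprocess.CalledProcessError as err:
        # err.output holds whatever the tool printed before it failed
        return False, err.output


# Stand-ins for a real tool: one command that succeeds, one that fails
ok, out = run_and_capture([sys.executable, '-c', 'print("foo")'])
failed_ok, _ = run_and_capture([sys.executable, '-c', 'import sys; sys.exit(3)'])
```

In a crunch script it is usually simpler to let the exception propagate so the task fails visibly; catching it as above is only useful when the script has its own recovery or reporting step.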
If the output is big, subprocess.check_call can redirect it to a file while ensuring the tool completed successfully.
import subprocess
with open('foo', 'w') as outfile:
    subprocess.check_call(['head', '-c', '1234567', '/dev/urandom'], stdout=outfile)
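The same pattern can be wrapped in a small function and checked locally. This is a sketch under stated assumptions: run_to_file is a hypothetical helper name, a temporary directory stands in for the real working directory, and a Python one-liner stands in for a tool that produces a large amount of output.

```python
import os
import subprocess
import sys
import tempfile


def run_to_file(cmd, out_path):
    """Run cmd with stdout redirected to out_path; return the file size in bytes.

    Hypothetical helper. check_call raises CalledProcessError on a non-zero exit.
    """
    with open(out_path, 'wb') as outfile:
        subprocess.check_call(cmd, stdout=outfile)
    return os.path.getsize(out_path)


# A Python one-liner stands in for a tool with large output
workdir = tempfile.mkdtemp()
big_output_cmd = [sys.executable, '-c',
                  'import sys; sys.stdout.write("x" * 1234567)']
size = run_to_file(big_output_cmd, os.path.join(workdir, 'foo'))
```

Because the output goes straight from the child process to the file, the script never holds the whole output in memory, which is the reason to prefer this over check_output for large results.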
FastQC writes its results to files rather than to standard output (in its default location, or in the directory specified by the -o flag), so we can use subprocess.check_call:
import subprocess
import arvados
#Grab the file path pointing to the file to run fastqc on
fastq_file = arvados.getjobparam('input_fastq_file')
cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file]
subprocess.check_call(cmd)
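Building the argument list in one place makes it easier to add options later. The following is a sketch only: build_fastqc_cmd is a hypothetical helper, and it uses just the two FastQC flags that appear on this page (-o for the output directory and -t for the thread count). The default path is the install location from the Dockerfile above.

```python
def build_fastqc_cmd(fastq_files, outdir=None, threads=None,
                     fastqc_path='/home/crunch/fastqc/FastQC/fastqc'):
    """Return the argument list for a FastQC run (hypothetical helper).

    Passing a list, not a shell string, avoids quoting problems with
    file names that contain spaces.
    """
    cmd = ['perl', fastqc_path]
    cmd.extend(fastq_files)
    if outdir is not None:
        cmd.extend(['-o', outdir])
    if threads is not None:
        # subprocess requires every argument to be a string
        cmd.extend(['-t', str(threads)])
    return cmd


cmd = build_fastqc_cmd(['sample.fastq'], outdir='/tmp/out', threads=2)
```

The resulting list can be passed directly to subprocess.check_call(cmd).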
Where to put temporary files¶
Put temporary files in the task's temporary directory, available as task.tmpdir:
import arvados
task = arvados.current_task()
tmpdir = task.tmpdir
Inside the code:
import subprocess
import arvados
task = arvados.current_task()
tmpdir = task.tmpdir
#Grab the file path pointing to the file to run fastqc on
fastq_file = arvados.getjobparam('input_fastq_file')
cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', tmpdir]
subprocess.check_call(cmd)
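Some tools expect their output directory to exist before they run, so it can help to create a dedicated subdirectory under the temporary directory first. A minimal sketch, assuming a hypothetical helper make_scratch_subdir; outside Arvados, tempfile.mkdtemp() stands in for arvados.current_task().tmpdir so the example can be tried locally.

```python
import os
import tempfile


def make_scratch_subdir(tmpdir, name):
    """Create tmpdir/name if it does not exist and return its path.

    Hypothetical helper for illustration.
    """
    path = os.path.join(tmpdir, name)
    if not os.path.isdir(path):
        os.makedirs(path)
    return path


# Stand-in for task.tmpdir when running outside a crunch job
tmpdir = tempfile.mkdtemp()
fastqc_out = make_scratch_subdir(tmpdir, 'fastqc_out')
again = make_scratch_subdir(tmpdir, 'fastqc_out')  # second call changes nothing
```

The returned path can then be passed to the tool's output flag, for example as the value after -o in the FastQC command.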
How to write data directly to Keep (Using TaskOutputDir)¶
import arvados
import arvados.crunch
import os
import subprocess
outdir = arvados.crunch.TaskOutputDir()
with open(os.path.join(outdir.path, 'foo'), 'w') as outfile:
    subprocess.check_call(['head', '-c', '1234567', '/dev/urandom'], stdout=outfile)
arvados.task_set_output(outdir.manifest_text())
When TaskOutputDir is not the correct choice¶
- If the tool writes symbolic links or named pipes, which are not supported by FUSE
- If the tool's I/O access patterns do not perform well over FUSE
  - This occurs in Tophat, which opens 20 file handles on multiple files that it writes out
In those cases, write the output to the task's temporary directory and upload it with a CollectionWriter:
import arvados
import os
import subprocess
task = arvados.current_task()
tmpdir = task.tmpdir
os.mkdir(os.path.join(tmpdir, 'out'))
with open(os.path.join(tmpdir, 'out', 'foo.txt'), 'w') as outfile:
    subprocess.check_call(['head', '-c', '1234567', '/dev/urandom'], stdout=outfile)
collection_writer = arvados.collection.CollectionWriter()
# Add a single file by path (random_file.txt is a placeholder and must exist)
collection_writer.write_file('random_file.txt')
# Add everything under the local output directory
collection_writer.write_directory_tree(os.path.join(tmpdir, 'out'))
arvados.task_set_output(collection_writer.finish())
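To find out whether a tool falls under the first case (symbolic links or named pipes), one option is to run it once into a local directory and scan the result. This is a sketch under assumptions: find_unsupported_entries is a hypothetical helper, and the demonstration relies on os.symlink, so it is meant for POSIX systems.

```python
import os
import stat
import tempfile


def find_unsupported_entries(root):
    """Return sorted paths (relative to root) of symlinks and named pipes.

    Hypothetical helper. os.lstat is used so that a symlink is inspected
    itself instead of the file it points to.
    """
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            mode = os.lstat(full).st_mode
            if stat.S_ISLNK(mode) or stat.S_ISFIFO(mode):
                found.append(os.path.relpath(full, root))
    return sorted(found)


# Demonstration on a scratch directory with one regular file and one symlink
root = tempfile.mkdtemp()
with open(os.path.join(root, 'plain.txt'), 'w') as f:
    f.write('data')
os.symlink(os.path.join(root, 'plain.txt'), os.path.join(root, 'link.txt'))
unsupported = find_unsupported_entries(root)
```

An empty result suggests the tool's output layout is compatible with TaskOutputDir; it says nothing about the second case, which depends on how the tool reads and writes while it runs.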
Putting it all together¶
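import glob
import os
import subprocess
import arvados
this_task = arvados.current_task()
# Assumption: results go to a scratch directory under the task's tmpdir
outdirpath = os.path.join(this_task.tmpdir, 'out')
os.mkdir(outdirpath)
# Assumption: the thread count is chosen by the script author
num_threads = 1
cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc']
fq_files = sorted(glob.glob('*.fq*'))
fastq_files = sorted(glob.glob('*.fastq*'))
cmd.extend(fq_files+fastq_files)
cmd.extend(['-o', outdirpath, '-t', str(num_threads)])
# Popen and wait (rather than check_call) let us record a failure in the task record
fastqc_pipe = subprocess.Popen(cmd)
fastqc_pipe.wait()
coll_writer = arvados.CollectionWriter()
coll_writer.write_directory_tree(outdirpath)
pdh = coll_writer.finish()
body = {'output':pdh, 'success':fastqc_pipe.returncode==0, 'progress':1.0}
arvados.api('v1').job_tasks().update(uuid=this_task['uuid'], body=body).execute()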
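The input-gathering step of the script can be tried without Arvados or FastQC installed. A minimal sketch, assuming a hypothetical helper find_fastq_files that applies the same two glob patterns to an explicit directory instead of the current working directory:

```python
import glob
import os
import tempfile


def find_fastq_files(directory):
    """Return *.fq* matches followed by *.fastq* matches, each group sorted.

    Hypothetical helper mirroring the glob calls in the script above.
    """
    fq_files = sorted(glob.glob(os.path.join(directory, '*.fq*')))
    fastq_files = sorted(glob.glob(os.path.join(directory, '*.fastq*')))
    return fq_files + fastq_files


# Scratch directory with three sequence files and one unrelated file
indir = tempfile.mkdtemp()
for name in ['b.fastq.gz', 'a.fq', 'c.fq.gz', 'notes.txt']:
    open(os.path.join(indir, name), 'w').close()

names = [os.path.basename(p) for p in find_fastq_files(indir)]
```

It is worth checking that the list is not empty before launching the tool, since an empty list would start FastQC with no input files.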
Updated by Sarah Guthrie almost 10 years ago · 22 revisions