



Sarah Guthrie, 04/06/2016 08:49 PM

Writing a Script Calling a Third Party Tool

Case study: FastQC

Writing a Dockerfile

We strongly recommend keeping each Dockerfile in the same git repository as the crunch scripts that run inside the Docker image built from it.

Docker has some wonderful documentation for building Dockerfiles:

Explanation about Dockerfiles from docker:

Docker can build images automatically by reading the instructions from a Dockerfile. A Dockerfile is a text document that contains all the commands a user could call on the command line to assemble an image. Using docker build users can create an automated build that executes several command-line instructions in succession.

This page describes the commands you can use in a Dockerfile. When you are done reading this page, refer to the Dockerfile Best Practices for a tip-oriented guide.

Dockerfile that installs FastQC:

FROM arvados/jobs

USER root

RUN apt-get -q update && apt-get -qy install \
  fontconfig \
  openjdk-6-jre-headless \
  perl \
  unzip

USER crunch

RUN mkdir /home/crunch/fastqc
RUN cd /home/crunch/fastqc && \
    wget --quiet && \
    unzip /home/crunch/fastqc/

How to build a docker image from a Dockerfile

docker build -t username/imagename path/to/Dockerfile/

How to upload a docker image to Arvados

arv keep docker username/imagename

How to call an external tool from a crunch script

We strongly recommend using the subprocess module to call external tools. If the output is small and written to standard output, subprocess.check_output returns the standard output and raises an exception if the tool exits with a non-zero status.

import subprocess
foo = subprocess.check_output(['echo','foo'])

If the output is big, subprocess.check_call can redirect it to a file while ensuring the tool completed successfully.

import subprocess
with open('foo', 'w') as outfile:
    subprocess.check_call(['head', '-c', '1234567', '/dev/urandom'], stdout=outfile)
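
Both check_output and check_call raise subprocess.CalledProcessError when the tool exits with a non-zero status. The following is a minimal sketch of catching that error so the failing command and its exit status are logged before the script stops. The helper name run_tool is hypothetical (it is not part of the Arvados SDK), and sys.executable stands in for a real tool.

```python
import subprocess
import sys


def run_tool(cmd):
    """Run cmd and return its standard output as bytes.

    Hypothetical helper: logs the exit status and re-raises if the tool fails.
    """
    try:
        return subprocess.check_output(cmd)
    except subprocess.CalledProcessError as err:
        # err.cmd is the command that was run; err.returncode is its exit status.
        sys.stderr.write("command %r exited with status %d\n"
                         % (err.cmd, err.returncode))
        raise


if __name__ == "__main__":
    # Stand-in for a real tool: a Python one-liner that prints to stdout.
    out = run_tool([sys.executable, "-c", "print('foo')"])
    sys.stdout.write(out.decode())
```

Re-raising keeps the failure visible to crunch, so the job is not silently marked as successful.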

FastQC writes to the current output directory or the output directory specified by the -o flag, so we can use subprocess.check_call:

import subprocess
import arvados

# Grab the file path pointing to the file to run FastQC on
fastq_file = arvados.getjobparam('input_fastq_file')

cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file]

# Run FastQC; raises subprocess.CalledProcessError if it fails
subprocess.check_call(cmd)

Where to put temporary files

import arvados

task = arvados.current_task()
tmpdir = task.tmpdir
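
When trying a script outside of a crunch job, there is no task object to ask for a temporary directory. One possible approach, sketched below, is to fall back to a fresh directory from the standard tempfile module. The helper name scratch_dir and the "fastqc-" prefix are assumptions made for illustration; only task.tmpdir comes from the text above.

```python
import os
import shutil
import tempfile


def scratch_dir(task=None):
    """Return a directory for temporary files.

    Hypothetical helper: uses task.tmpdir when a crunch task object is
    passed in, otherwise creates a new directory with tempfile.mkdtemp(),
    which the caller is responsible for removing.
    """
    if task is not None:
        return task.tmpdir
    return tempfile.mkdtemp(prefix="fastqc-")


if __name__ == "__main__":
    tmpdir = scratch_dir()           # no crunch task: local fallback
    print(os.path.isdir(tmpdir))
    shutil.rmtree(tmpdir)            # clean up the local directory ourselves
```

Inside a crunch script the call would be scratch_dir(arvados.current_task()), which simply returns the task's own temporary directory.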

Inside the code:

import subprocess
import arvados

task = arvados.current_task()
tmpdir = task.tmpdir

# Grab the file path pointing to the file to run FastQC on
fastq_file = arvados.getjobparam('input_fastq_file')

cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', tmpdir]

# Run FastQC, writing its reports into the task's temporary directory
subprocess.check_call(cmd)

How to write data directly to Keep (Using TaskOutputDir)

import arvados
import arvados.crunch

outdir = arvados.crunch.TaskOutputDir()

# Write to outdir.path


Inside the code:

import subprocess
import arvados
import arvados.crunch

outdir = arvados.crunch.TaskOutputDir()

# Grab the file path pointing to the file to run FastQC on
fastq_file = arvados.getjobparam('input_fastq_file')

cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', outdir.path]

# Run FastQC, writing its reports directly into the Keep-backed directory
subprocess.check_call(cmd)


When TaskOutputDir is not the correct choice

  • If the tool writes symbolic links or named pipes, which FUSE does not support
  • If the tool's I/O access patterns perform poorly over FUSE
    • This occurs in TopHat, which opens 20 file handles on multiple files that it writes out
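
One way to check the first condition is to run the tool once into an ordinary local directory and then look for symbolic links or named pipes in what it wrote. The sketch below does that with the standard os and stat modules. The helper name find_unsupported_entries is hypothetical, and the demonstration builds a throwaway directory instead of running a real tool.

```python
import os
import shutil
import stat
import tempfile


def find_unsupported_entries(root):
    """Return sorted paths under root that are symbolic links or named pipes.

    Hypothetical helper for inspecting a tool's output after a trial run
    in an ordinary local directory.
    """
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            mode = os.lstat(path).st_mode   # lstat: do not follow symlinks
            if stat.S_ISLNK(mode) or stat.S_ISFIFO(mode):
                found.append(path)
    return sorted(found)


if __name__ == "__main__":
    # Build a small example tree: one regular file, one symlink, one FIFO.
    demo = tempfile.mkdtemp()
    open(os.path.join(demo, "report.html"), "w").close()
    os.symlink("report.html", os.path.join(demo, "latest"))
    os.mkfifo(os.path.join(demo, "pipe"))   # POSIX only
    for path in find_unsupported_entries(demo):
        print(path)
    shutil.rmtree(demo)
```

An empty result only rules out the first condition; slow I/O patterns such as the TopHat case still have to be judged by timing a real run.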

Open a collection writer, write files and/or directory trees:

import arvados

collection_writer = arvados.collection.CollectionWriter()

Inside the code:

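import subprocess
import arvados
import os

task = arvados.current_task()
tmpdir = task.tmpdir

outdir_path = os.path.join(tmpdir, 'out')
# FastQC does not create the -o directory, so make it first
os.mkdir(outdir_path)

# Grab the file path pointing to the file to run FastQC on
fastq_file = arvados.getjobparam('input_fastq_file')

cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', outdir_path]

# Run FastQC into the local directory
subprocess.check_call(cmd)

# Copy the finished output directory into Keep and record it as the task output
collection_writer = arvados.collection.CollectionWriter()
collection_writer.write_directory_tree(outdir_path)
task.set_output(collection_writer.finish())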

Putting it all together

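import multiprocessing
import subprocess
import arvados
import arvados.crunch

outdir = arvados.crunch.TaskOutputDir()

# Grab the file path pointing to the file to run FastQC on
fastq_file = arvados.getjobparam('input_fastq_file')

# Grab the number of threads available
num_threads = multiprocessing.cpu_count()

cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', outdir.path, '-t', str(num_threads)]

# Run FastQC; raises subprocess.CalledProcessError if it fails
subprocess.check_call(cmd)

# Record the contents of the Keep-backed directory as the task output
arvados.current_task().set_output(outdir.manifest_text())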
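
The command list grows by one flag in each section above (-o for the output directory, -t for the thread count). A small helper can build it in one place, which also makes the argument handling easy to check without running FastQC. This is only a sketch: build_fastqc_cmd and the FASTQC constant are hypothetical names, and the path is the one used in the snippets on this page.

```python
import multiprocessing

# Install path used in the snippets on this page; adjust if FastQC lives elsewhere.
FASTQC = '/home/crunch/fastqc/FastQC/fastqc'


def build_fastqc_cmd(fastq_file, outdir=None, threads=None):
    """Return the argument list for a FastQC run (hypothetical helper)."""
    cmd = ['perl', FASTQC, fastq_file]
    if outdir is not None:
        cmd += ['-o', outdir]
    if threads is not None:
        # subprocess arguments must be strings, so convert the integer.
        cmd += ['-t', str(threads)]
    return cmd


if __name__ == "__main__":
    print(build_fastqc_cmd('sample.fastq', outdir='/tmp/out',
                           threads=multiprocessing.cpu_count()))
```

The returned list can be passed directly to subprocess.check_call.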


Updated by Sarah Guthrie almost 9 years ago · 22 revisions