Bryan Cosca, 04/14/2016 07:10 PM


Pipeline Optimization

Crunchstat Summary

crunchstat-summary is an Arvados tool that helps you choose optimal configurations for Arvados jobs and pipeline instances. It helps you choose the runtime_constraints specified under each job in the pipeline template, and it graphs general statistics for the job, such as CPU usage, RAM usage, and Keep network traffic over the duration of the job.

How to install crunchstat-summary

$ git clone https://github.com/curoverse/arvados.git
$ cd arvados/tools/crunchstat-summary/
$ python setup.py build
$ python setup.py install --user

How to use crunchstat-summary

$ ./bin/crunchstat-summary --help
usage: crunchstat-summary [-h]
                          [--job UUID | --pipeline-instance UUID | --log-file LOG_FILE]
                          [--skip-child-jobs] [--format {html,text}]
                          [--verbose]

Summarize resource usage of an Arvados Crunch job

optional arguments:
  -h, --help            show this help message and exit
  --job UUID            Look up the specified job and read its log data from
                        Keep (or from the Arvados event log, if the job is
                        still running)
  --pipeline-instance UUID
                        Summarize each component of the given pipeline
                        instance
  --log-file LOG_FILE   Read log data from a regular file
  --skip-child-jobs     Do not include stats from child jobs
  --format {html,text}  Report format
  --verbose, -v         Log more information (once for progress, twice for
                        debug)
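
If you want to call crunchstat-summary from a script, the options in the help output above map onto an argument list fairly directly. The following is a minimal sketch, not part of the tool: the helper name crunchstat_summary_cmd and the JOB_UUID placeholder are made up for illustration, and only flags shown in the help output are used.

```python
def crunchstat_summary_cmd(job=None, pipeline_instance=None, log_file=None,
                           fmt="text", skip_child_jobs=False, verbosity=0):
    """Build a crunchstat-summary argument list (hypothetical helper)."""
    sources = [("--job", job),
               ("--pipeline-instance", pipeline_instance),
               ("--log-file", log_file)]
    chosen = [(flag, value) for flag, value in sources if value is not None]
    # The usage line shows the three input sources as alternatives,
    # so refuse to combine them.
    if len(chosen) > 1:
        raise ValueError("give at most one of job, pipeline_instance, log_file")
    if fmt not in ("html", "text"):
        raise ValueError("format must be 'html' or 'text'")

    cmd = ["crunchstat-summary"]
    for flag, value in chosen:
        cmd.extend([flag, value])
    cmd.extend(["--format", fmt])
    if skip_child_jobs:
        cmd.append("--skip-child-jobs")
    # --verbose may be repeated: once for progress, twice for debug.
    cmd.extend(["--verbose"] * verbosity)
    return cmd


# JOB_UUID is a placeholder, not a real job.
print(" ".join(crunchstat_summary_cmd(job="JOB_UUID", fmt="html")))
```

The resulting list can be handed to subprocess.run() or subprocess.check_output() once crunchstat-summary is installed and on your PATH.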

--format text mode
Use the node recommendations and the recommended Keep cache size.

--format html mode
Check whether you're CPU bound or I/O bound.
Check whether tasks are behaving unexpectedly, e.g. the GATK Queue case.

When to pipe and when to write to Keep
In general, writing straight to Keep will reap benefits. If you run crunchstat-summary --format html and you see Keep I/O stopping once in a while, then you're CPU bound. If you see CPU usage level off while keep-read or keep-write takes too long, then you're I/O bound.
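
The rule of thumb above can be written down as a rough check over sampled values. This is only a sketch under assumptions: the sample format (CPU utilization between 0 and 1, Keep bytes per second), the function name classify_bottleneck, and both thresholds are hypothetical, and crunchstat-summary itself does not produce this classification.

```python
def classify_bottleneck(samples, busy_cpu=0.9, idle_cpu=0.5):
    """Guess whether a job looks CPU bound or I/O bound.

    samples: list of (cpu_utilization, keep_bytes_per_second) pairs.
    The thresholds are illustrative guesses, not tuned values.
    """
    # CPU is busy while Keep traffic has stopped: the job is waiting on compute.
    keep_stalled = sum(1 for cpu, keep in samples
                       if cpu >= busy_cpu and keep == 0)
    # CPU has levelled off below capacity while Keep traffic continues:
    # the job is waiting on reads or writes.
    cpu_stalled = sum(1 for cpu, keep in samples
                      if cpu <= idle_cpu and keep > 0)
    if keep_stalled > cpu_stalled:
        return "cpu-bound"
    if cpu_stalled > keep_stalled:
        return "io-bound"
    return "unclear"


# Made-up samples: CPU stays near 100% and Keep traffic pauses twice.
print(classify_bottleneck([(0.95, 0), (0.97, 1000000), (0.96, 0)]))
```

In practice you would read the same pattern off the HTML graphs by eye; the code just makes the two conditions explicit.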

Choosing the right number of jobs

Each job must output a collection, so if you don't want to output a file, then

Job Optimization

How to optimize the number of tasks when you don't have native multithreading

Tools like GATK have native multithreading, which you enable by passing an option such as -t.
Tools like VarScan and FreeBayes don't have native multithreading, so you need a workaround. Some tools accept -L/--intervals to restrict work to certain loci. If you have a BED file you can split reads on, you can create a new task per interval.
example here
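
A minimal sketch of the per-interval split, assuming a plain three-column BED file and a tool that takes chrom:start-end interval strings through -L. The function names, the parameter dictionary layout, and the input.bam placeholder are hypothetical; actually queuing one Crunch task per entry is left out.

```python
def bed_to_intervals(bed_lines):
    """Turn BED records into chrom:start-end strings, one per interval."""
    intervals = []
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines, comments, and track/browser header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        # BED starts are 0-based and ends are exclusive, while
        # chrom:start-end strings are 1-based and inclusive,
        # so only the start coordinate shifts by one.
        intervals.append("{}:{}-{}".format(chrom, start + 1, end))
    return intervals


def task_parameters(bed_lines, input_path):
    """Build one parameter dict per interval (hypothetical layout)."""
    return [{"input": input_path, "interval": interval}
            for interval in bed_to_intervals(bed_lines)]


bed = ["chr1\t0\t1000", "chr1\t1000\t2000", "chr2\t500\t1500"]
for params in task_parameters(bed, "input.bam"):
    # Each dict would become one task that runs the tool with -L <interval>.
    print(params["interval"])
```

Each task then works on a single interval, and the per-interval outputs are merged once all tasks finish.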

Piping between tools or writing to a tmpdir

Creating pipes between tools has sometimes been shown to be faster than writing to and reading from disk. Feel free to pipe your tools together, for example using subprocess.PIPE in the Python subprocess module.
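
As a rough illustration of the subprocess.PIPE approach, the sketch below connects two processes without an intermediate file. The two commands are stand-ins (small Python one-liners) so the example runs anywhere; in a real job they would be your own tools. The helper name run_piped is made up for this example.

```python
import subprocess
import sys


def run_piped(producer_cmd, consumer_cmd):
    """Run producer_cmd | consumer_cmd and return the consumer's stdout bytes."""
    producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(consumer_cmd, stdin=producer.stdout,
                                stdout=subprocess.PIPE)
    # Close our copy of the pipe so the producer sees a broken pipe
    # if the consumer exits early.
    producer.stdout.close()
    output, _ = consumer.communicate()
    producer_rc = producer.wait()
    if producer_rc != 0:
        raise subprocess.CalledProcessError(producer_rc, producer_cmd)
    if consumer.returncode != 0:
        raise subprocess.CalledProcessError(consumer.returncode, consumer_cmd)
    return output


# Stand-in commands: the first prints two lines, the second sorts its input.
PRODUCER = [sys.executable, "-c", "print('chr2'); print('chr1')"]
CONSUMER = [sys.executable, "-c",
            "import sys; sys.stdout.writelines(sorted(sys.stdin))"]

print(run_piped(PRODUCER, CONSUMER).decode().split())
```

Checking both return codes matters here: with a pipe, a failure in the first tool can otherwise go unnoticed because only the last tool's output is read.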

Updated by Bryan Cosca almost 9 years ago · 31 revisions