Bug #3373


[Sample pipelines] Improve SNV pipeline to accept example exome fastq data (2 pairs of reads) as a single input collection.

Added by Tom Clegg over 10 years ago. Updated over 10 years ago.

Status: Resolved
Priority: Normal
Assigned To: Peter Amstutz
Category: Sample Pipelines
Target version: 2014-08-27 Sprint
Start date: 07/30/2014
Due date:
% Done: 100%
Estimated time: (Total: 4.00 h)
Story points: 2.0

Subtasks (5): 0 open, 5 closed

Task #3488: Review 3373-improve-gatk3-snv-pipeline (Resolved, Peter Amstutz, 07/30/2014)
Task #3417: Split by read pair, perform parallel alignment, merge bam files (Resolved, Peter Amstutz, 07/30/2014)
Task #3548: Update pipeline to use arvados repository once merged (Resolved, Peter Amstutz, 07/30/2014)
Task #3547: Add Haplotype caller (Resolved, Peter Amstutz, 07/30/2014)
Task #3549: Use GATK 3.2 (Resolved, Peter Amstutz, 07/30/2014)

Related issues (1): 0 open, 1 closed

Related to Arvados - Story #3015: Make gatk3 pipeline template (Resolved, Peter Amstutz, 06/24/2014)

#1

Updated by Peter Amstutz over 10 years ago

  • Assigned To set to Peter Amstutz
#2

Updated by Peter Amstutz over 10 years ago

  • Status changed from New to In Progress
#3

Updated by Peter Amstutz over 10 years ago

  • Status changed from In Progress to Resolved
#4

Updated by Tom Clegg over 10 years ago

In decompress-all.py
  • I think it would be clearer to do all the error handling for re.match right away (and emit an error message!) rather than indenting the rest of the program. I also think the "result" variable could be given a more descriptive name. And I think you forgot to import sys. (Thanks, Python.)
    import sys
    ...
    infile_parts = re.match(r"(^[a-f0-9]{32}\+\d+)(\+\S+)*(/.*)(/[^/]+)$", input_file)
    if infile_parts is None:
        print >>sys.stderr, "Failed to parse input filename '%s' as a Keep file" % input_file
        sys.exit(1)
    
    cr = arvados.CollectionReader(infile_parts.group(1))
    ...
    
  • I suspect the third regexp group should have a ? after it, so {hash}/foo.gz works (see the check after this list).
  • This looks suspiciously like "dtrx failed" is always interpreted as "file is uncompressed, pass through unchanged", which seems overly optimistic:
        if rc == 0:
            # dtrx succeeded: write the extracted tree as the task output
            out = arvados.CollectionWriter()
            out.write_directory_tree(outdir, max_manifest_depth=0)
            task.set_output(out.finish())
        else:
            # dtrx failed: pass the input through unchanged (the
            # "assume it was never compressed" behavior questioned above)
            task.set_output(streamname + filereader.as_manifest()[1:])
    
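To illustrate the ? fix: a quick check, using a made-up locator, that the amended pattern accepts a file both with and without a stream directory:

    import re

    # Amended pattern with the optional third group: the stream-directory
    # part may now be absent entirely.
    KEEP_FILE_RE = re.compile(r"(^[a-f0-9]{32}\+\d+)(\+\S+)*(/.*)?(/[^/]+)$")

    flat = KEEP_FILE_RE.match("d41d8cd98f00b204e9800998ecf8427e+0/foo.gz")
    deep = KEEP_FILE_RE.match("d41d8cd98f00b204e9800998ecf8427e+0/subdir/foo.gz")
    assert flat is not None and flat.group(4) == "/foo.gz"   # {hash}/foo.gz now parses
    assert deep is not None and deep.group(3) == "/subdir"   # stream directory still captured
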
In split_fastq
  • We appear to silently ignore any fastq.gz files that aren't in the top-level stream of the manifest. Why not process everything regardless of stream name? If we can't do that correctly for some reason, it's probably better to fail rather than silently drop data.
  • If paired and unpaired reads appear in the input, we should not silently drop the unpaired reads. Perhaps the logic could be simplified to something like "for each file: if _1 then do a pair; elif _2 then skip; elif fastq then process unpaired", and check afterward that the number of files processed is equal to the number of files encountered (in order to detect a _2 with no corresponding _1). See the sketch after this list.
  • This sort of thing should go to stderr: print "Finish piece ./_%s/%s (%s %s)"
  • Chunking size should be configurable (I see that chunking isn't even possible to enable, but this seems like a trivial change)
  • Add brief comment to chunking code to indicate development state (the only clue I see is that it's hard-coded to False, which suggests it is either untested or known to be broken)
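
A minimal sketch of that pairing walk (process_pair and process_unpaired are hypothetical stand-ins, not the actual split_fastq functions):

    import re
    import sys

    def process_pair(r1, r2):          # hypothetical stand-in for the paired handler
        sys.stderr.write("pair: %s %s\n" % (r1, r2))

    def process_unpaired(r):           # hypothetical stand-in for the unpaired handler
        sys.stderr.write("unpaired: %s\n" % r)

    def split_reads(filenames):
        # Assumes filenames holds only fastq names. Pair up _1/_2 files,
        # process lone fastqs as unpaired, and fail on a _2 with no _1.
        names = set(filenames)
        processed = 0
        for name in sorted(names):
            if re.search(r"_1\.fastq(\.gz)?$", name):
                mate = name.replace("_1.fastq", "_2.fastq", 1)
                if mate not in names:
                    sys.stderr.write("no _2 mate found for %s\n" % name)
                    sys.exit(1)
                process_pair(name, mate)
                processed += 2
            elif re.search(r"_2\.fastq(\.gz)?$", name):
                pass                   # consumed together with its _1 mate
            elif re.search(r"\.fastq(\.gz)?$", name):
                process_unpaired(name)
                processed += 1
        if processed != len(names):    # catches a _2 that had no _1
            sys.stderr.write("some input files were not processed\n")
            sys.exit(1)
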
run-command
  • This isn't new in this branch, but it looks like an extraneous space: subst.default_subs["link "] = sub_link
  • Print diagnostics to stderr, not stdout
  • There's heavy use of a few different variables called "p" here. Could we have a more descriptive name, at least for the global?
arv-run-pipeline-instance
  • I think .to_json is no longer appropriate here (and in the 3 other similar instances), now that we're asking api_client to serialize body_object for us:
                             :body_object => {
                               :pipeline_instance => attributes.to_json
                             },
#5

Updated by Tom Clegg over 10 years ago

  • Category set to Sample Pipelines
  • Status changed from Resolved to In Progress
#6

Updated by Peter Amstutz over 10 years ago

  • Target version changed from 2014-08-06 Sprint to 2014-08-27 Sprint
#7

Updated by Peter Amstutz over 10 years ago

  • Target version deleted (2014-08-27 Sprint)

Tom Clegg wrote:

In decompress-all.py
  • I think it would be clearer to do all the error handling for re.match right away (and emit an error message!) rather than indenting the rest of the program. I also think the "result" variable could be given a more descriptive name. And I think you forgot to import sys. (Thanks, Python.)

Fixed

  • I suspect the third regexp group should have a ? after it, so {hash}/foo.gz works.

Fixed

  • This looks suspiciously like "dtrx failed" is always interpreted as "file is uncompressed, pass through unchanged", which seems overly optimistic
<tetron> brett: does dtrx distinguish between "this file type is unknown/not an archive" and "this file appears to be corrupt?" 
<brett> No.  It is not so easy to distinguish between those cases.
In split_fastq
  • We appear to silently ignore any fastq.gz files that aren't in the top-level stream of the manifest. Why not process everything regardless of stream name? If we can't do that correctly for some reason, it's probably better to fail rather than silently drop data.

Fixed.

  • If paired and unpaired reads appear in the input, we should not silently drop the unpaired reads. Perhaps the logic could be simplified to something like "for each file: if _1 then do a pair; elif _2 then skip; elif fastq then process unpaired", and check afterward that the number of files processed is equal to the number of files encountered (in order to detect a _2 with no corresponding _1).

Fixed.

  • This sort of thing should go to stderr: print "Finish piece ./_%s/%s (%s %s)"

Removed some debugging prints and changed the rest to go to stderr.

  • Chunking size should be configurable (I see that chunking isn't even possible to enable, but this seems like a trivial change)

Left that alone until I revisit chunking with a more efficient algorithm and it becomes worth enabling.

  • Add brief comment to chunking code to indicate development state (the only clue I see is that it's hard-coded to False, which suggests it is either untested or known to be broken)

Added comment.

run-command
  • This isn't new in this branch, but it looks like an extraneous space: subst.default_subs["link "] = sub_link

That's intentional. In the absence of more sophisticated parsing, the various special-case substitutions are detected with str.startswith(), so the space is there to avoid situations like matching the beginning of "$(link_name)" and calling the handler with "_name" as the argument.
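
A toy illustration of why the space matters (this is just the shape of the dispatch, not run-command's actual table):

    def sub_link(arg):
        return "made link to %s" % arg

    # The trailing space means "link " matches "$(link foo)"-style expressions
    # only, never identifiers that merely begin with "link".
    default_subs = {"link ": sub_link}

    def substitute(expr):
        # expr is the text inside $(...), e.g. "link foo" or "link_name"
        for prefix, func in default_subs.items():
            if expr.startswith(prefix):
                return func(expr[len(prefix):])
        return None                    # no special-case substitution applies

    assert substitute("link foo") == "made link to foo"
    assert substitute("link_name") is None   # "link" without the space would
                                             # wrongly match and pass "_name"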

  • Print diagnostics to stderr, not stdout

Uses the Python logging module now, configured to go to stderr.
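
Roughly along these lines (the handler and format here are a guess at the setup, not the exact code):

    import logging
    import sys

    logger = logging.getLogger('run-command')
    handler = logging.StreamHandler(sys.stderr)   # explicit, though stderr is the default
    handler.setFormatter(logging.Formatter('%(name)s %(levelname)s: %(message)s'))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    logger.info("diagnostics now go to stderr, leaving stdout for real output")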

  • There's heavy use of a few different variables called "p" here. Could we have a more descriptive name, at least for the global?

Fixed: the global 'p' is now 'taskp'.

arv-run-pipeline-instance
  • I think .to_json is no longer appropriate here (and in the 3 other similar instances), now that we're asking api_client to serialize body_object for us:

Fixed.

#8

Updated by Peter Amstutz over 10 years ago

  • Target version set to 2014-08-27 Sprint
#9

Updated by Tom Clegg over 10 years ago

Peter Amstutz wrote:

  • This looks suspiciously like "dtrx failed" is always interpreted as "file is uncompressed, pass through unchanged", which seems overly optimistic

<tetron> brett: does dtrx distinguish between "this file type is unknown/not an archive" and "this file appears to be corrupt?"
<brett> No. It is not so easy to distinguish between those cases.

We should pattern-match the filename to decide whether to try decompressing it, and then demand the expected behavior. This is more predictable and safer:
  • If something is called *.gz but isn't actually compressed (or decompression fails for some other reason), you fail.
  • If something is called *.txt and is compressed, we pass it through unchanged. If you really expected this to give you transparent decompression, you fail.

It looks like dtrx doesn't have anything like --list-supported-extensions, but we can make a pretty good guess that \.(gz|Z|bz2|tgz|tbz|zip|rar|7z|cab|deb|rpm|cpio|gem)$ should work.
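
A sketch of that rule (the dtrx invocation and flag are illustrative, not decompress-all.py's actual code):

    import re
    import subprocess
    import sys

    COMPRESSED_RE = re.compile(r"\.(gz|Z|bz2|tgz|tbz|zip|rar|7z|cab|deb|rpm|cpio|gem)$")

    def extract_or_passthrough(path):
        if not COMPRESSED_RE.search(path):
            return False               # *.txt etc.: pass through unchanged, by design
        rc = subprocess.call(["dtrx", "--noninteractive", path])  # flag illustrative
        if rc != 0:
            # The name promised compressed data, so failure is fatal --
            # never "assume it was already uncompressed".
            sys.stderr.write("decompression of %s failed (exit %d)\n" % (path, rc))
            sys.exit(1)
        return True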

I'm pushing on this because I've already seen silently ignored decompression failures result in many days of wasted compute time downstream, not to mention troubleshooting time. Thanks, pigz, for exiting 0 so much.

In this case, if we assume failure means "already uncompressed", we get an unpredictable script which sometimes outputs "foo.txt.gz" and sometimes outputs "foo.txt", depending on whether some transient system error occurred, whether dtrx was installed, etc. Surely we'd rather be predictable, even if it means sacrificing the ability to transparently decompress gzip data that is disguised as *.txt. :P

Everything else LGTM, thanks!

#10

Updated by Anonymous over 10 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 40 to 100

Applied in changeset arvados|commit:c5be0c3abd926adc54e0c1de65e8dfdd25a84ea1.
