Bug #3113
Some crunch tasks miss some input files and have others duplicated
Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Start date: 07/09/2014
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: 1.0
Description
When running pipeline instance qr1hi-d1hrv-qh1k1mg3qqokv1s on collection 0938aa40406992ee7c02666b2708fbce+73567, some input files are not processed and others are processed more than once.
The program from the above pipeline consists of a single job, calling 'one_task_per_input_file' on the collection:
#!/usr/bin/env python
#
# Simple program to expose dropped input files and duplicated input files.
#
import arvados
import os
import sys
import subprocess as sp

arvados.job_setup.one_task_per_input_file( if_sequence=0, and_end_task=True, input_as_path=True )

this_job = arvados.current_job()
this_task = arvados.current_task()
this_task_input = this_task['parameters']['input']

work_dir = os.environ['CRUNCH_SRC']
mount_dir = os.environ['TASK_KEEPMOUNT']

input_filename = arvados.get_task_param_mount('input')

out_dir = os.path.join( arvados.current_task().tmpdir, "output" )
os.mkdir( out_dir )
out_fn = os.path.join( out_dir, "dummyfile" )

print "INPUTFILE:", input_filename

dummyExec = os.path.join( work_dir, "crunch_scripts/multipleFileBug/writedummy.sh" )
pOut = sp.check_output( [ dummyExec, "dummytext", out_fn ] )

out = arvados.CollectionWriter()
out.write_directory_tree( out_dir, max_manifest_depth=0 )
this_task.set_output( out.finish() )

sys.exit(0)
The output log 'd3bee5816a311e5cee846e06ef2b97ec+89' should contain 863 unique file paths on the 'INPUTFILE:' lines; instead it contains 860. The input file 'chr8_band2_s6200000_e12700000.bedGraph' appears four times when it should appear only once. The files 'chr11_band4_s16200000_e21700000.bedGraph', 'chr13_band6_s23300000_e25500000.bedGraph' and 'chr18_band0_s0_e2900000.bedGraph' should appear but are missing.
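The check described above can be reproduced with a short script. The following is only a sketch, not part of the original report. It assumes the log has been saved as plain text, that each relevant line contains the 'INPUTFILE:' marker followed by the file path, and that the expected file names are already known (for example, taken from the collection manifest). The function name `audit_input_files` and the sample lines are hypothetical.

```python
import os
from collections import Counter

MARKER = "INPUTFILE:"  # marker printed by the job script above


def audit_input_files(log_lines, expected_names):
    """Return (duplicated, missing) for a sequence of log lines.

    duplicated maps a file name to its count when it appears more than once;
    missing is the sorted list of expected names that never appear.
    Files are compared by base name (an assumption; the log prints full paths).
    """
    seen = Counter()
    for line in log_lines:
        # Anything before the marker (timestamps, task ids) is ignored.
        _, found, rest = line.partition(MARKER)
        if not found:
            continue
        path = rest.strip()
        if path:
            seen[os.path.basename(path)] += 1
    duplicated = {name: n for name, n in seen.items() if n > 1}
    missing = sorted(set(expected_names) - set(seen))
    return duplicated, missing


if __name__ == "__main__":
    # Made-up sample lines for illustration; not taken from the real log.
    sample_log = [
        "stderr INPUTFILE: /keep/example/a.bedGraph",
        "stderr INPUTFILE: /keep/example/a.bedGraph",
        "stderr INPUTFILE: /keep/example/b.bedGraph",
        "stderr unrelated line",
    ]
    expected = ["a.bedGraph", "b.bedGraph", "c.bedGraph"]
    duplicated, missing = audit_input_files(sample_log, expected)
    print("duplicated:", duplicated)
    print("missing:", missing)
```

Run against the real log, a file reported four times would show up in `duplicated` with a count of 4, and each skipped file would be listed in `missing`.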