Bug #3113

closed

Some crunch tasks miss some input files and have others duplicated

Added by Abram Connelly over 10 years ago. Updated over 10 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Start date: 07/09/2014
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: 1.0

Description

When running pipeline instance qr1hi-d1hrv-qh1k1mg3qqokv1s on collection 0938aa40406992ee7c02666b2708fbce+73567, some input files do not get processed and others are duplicated.

The program from the above pipeline consists of a single job, calling 'one_task_per_input_file' on the collection:

#!/usr/bin/env python
#
# Simple program to expose dropped input files and duplicated input files.
#

import arvados
import os
import sys
import subprocess as sp

arvados.job_setup.one_task_per_input_file( if_sequence=0, and_end_task=True, input_as_path=True )

this_job = arvados.current_job()
this_task = arvados.current_task()
this_task_input = this_task['parameters']['input']

work_dir  = os.environ['CRUNCH_SRC']
mount_dir = os.environ['TASK_KEEPMOUNT']

input_filename = arvados.get_task_param_mount('input')

out_dir = os.path.join( arvados.current_task().tmpdir, "output" )
os.mkdir( out_dir )

out_fn = os.path.join( out_dir, "dummyfile" )
print "INPUTFILE:", input_filename

dummyExec = os.path.join( work_dir, "crunch_scripts/multipleFileBug/writedummy.sh" )
pOut = sp.check_output( [ dummyExec, "dummytext", out_fn ] )

out = arvados.CollectionWriter()
out.write_directory_tree( out_dir, max_manifest_depth=0 )
this_task.set_output( out.finish() )
sys.exit(0)

The output log 'd3bee5816a311e5cee846e06ef2b97ec+89' should contain 863 unique file paths on its 'INPUTFILE:' lines; instead it contains 860. The input file 'chr8_band2_s6200000_e12700000.bedGraph' appears four times when it should appear only once. The files 'chr11_band4_s16200000_e21700000.bedGraph', 'chr13_band6_s23300000_e25500000.bedGraph' and 'chr18_band0_s0_e2900000.bedGraph' should appear but are missing.
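For anyone wanting to repeat this check, the following is a minimal sketch of how the 'INPUTFILE:' lines of a job log could be compared against the expected list of input files. It is not part of the pipeline: the function names, the sample log lines and their prefixes, and the sample file names are all hypothetical, and the only assumption about the log format is that each relevant line contains 'INPUTFILE:' followed by a path.

```python
import os
from collections import Counter

MARKER = "INPUTFILE:"


def input_files_from_log(log_lines):
    """Return the basename of every path printed after an 'INPUTFILE:' marker."""
    names = []
    for line in log_lines:
        idx = line.find(MARKER)
        if idx == -1:
            continue  # not a line produced by the script's print statement
        path = line[idx + len(MARKER):].strip()
        if path:
            names.append(os.path.basename(path))
    return names


def compare_with_expected(seen_names, expected_names):
    """Return (duplicated, missing): names seen more than once, and names never seen."""
    counts = Counter(seen_names)
    duplicated = {name: n for name, n in counts.items() if n > 1}
    missing = sorted(set(expected_names) - set(counts))
    return duplicated, missing


# Hypothetical sample data, for illustration only (not taken from the real log).
sample_log = [
    "task 1 stderr INPUTFILE: /mnt/input/a.bedGraph",
    "task 2 stderr INPUTFILE: /mnt/input/b.bedGraph",
    "task 3 stderr INPUTFILE: /mnt/input/a.bedGraph",
    "task 3 stderr some unrelated message",
]
sample_expected = ["a.bedGraph", "b.bedGraph", "c.bedGraph"]

duplicated, missing = compare_with_expected(
    input_files_from_log(sample_log), sample_expected
)
print("duplicated:", duplicated)
print("missing:", missing)
```

Run against the real log and the file list of the input collection, a check of this kind would be expected to report one duplicated name and three missing names, matching the counts described above.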


Subtasks 1 (0 open, 1 closed)

Task #3218: Review 3113-qsequence-serial | Resolved | Brett Smith | 07/09/2014
