Bug #3113 (Closed)
Some crunch tasks miss some input files and have others duplicated
100% done
Description
When running pipeline instance qr1hi-d1hrv-qh1k1mg3qqokv1s on collection 0938aa40406992ee7c02666b2708fbce+73567, some input files do not get processed and others are duplicated.
The pipeline consists of a single job, which calls 'one_task_per_input_file' on the collection:
#!/usr/bin/env python
#
# Simple program to expose dropped input files and duplicated input files.
#
import arvados
import os
import sys
import subprocess as sp

arvados.job_setup.one_task_per_input_file( if_sequence=0, and_end_task=True, input_as_path=True )

this_job = arvados.current_job()
this_task = arvados.current_task()
this_task_input = this_task['parameters']['input']

work_dir = os.environ['CRUNCH_SRC']
mount_dir = os.environ['TASK_KEEPMOUNT']

input_filename = arvados.get_task_param_mount('input')

out_dir = os.path.join( arvados.current_task().tmpdir, "output" )
os.mkdir( out_dir )
out_fn = os.path.join( out_dir, "dummyfile" )

print "INPUTFILE:", input_filename

dummyExec = os.path.join( work_dir, "crunch_scripts/multipleFileBug/writedummy.sh" )
pOut = sp.check_output( [ dummyExec, "dummytext", out_fn ] )

out = arvados.CollectionWriter()
out.write_directory_tree( out_dir, max_manifest_depth=0 )
this_task.set_output( out.finish() )

sys.exit(0)
The output log 'd3bee5816a311e5cee846e06ef2b97ec+89' should contain 863 unique file paths on the 'INPUTFILE:' lines; instead it contains 860. The input file 'chr8_band2_s6200000_e12700000.bedGraph' appears four times when it should appear only once. The files 'chr11_band4_s16200000_e21700000.bedGraph', 'chr13_band6_s23300000_e25500000.bedGraph' and 'chr18_band0_s0_e2900000.bedGraph' should appear but are missing.
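For reference, a minimal Python 3 sketch for tallying the 'INPUTFILE:' lines to spot the duplicates and omissions; the filename 'crunch.log' is just a placeholder for a local copy of the job log:

# Count how often each path appears on the 'INPUTFILE:' lines of the job log.
from collections import Counter

counts = Counter()
with open('crunch.log') as log:
    for line in log:
        if 'INPUTFILE:' in line:
            counts[line.split('INPUTFILE:', 1)[1].strip()] += 1

print(len(counts), "unique input paths")
for path, n in sorted(counts.items()):
    if n > 1:
        print(path, "appears", n, "times")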
Updated by Tim Pierce over 10 years ago
I haven't been able to reproduce this issue. I cloned git@git.qr1hi.arvadosapi.com:abram.git, copied the multipleFileBug directory into my 4xphq repository, and ran pipeline https://workbench.4xphq.arvadosapi.com/pipeline_instances/4xphq-d1hrv-022dunoh2pt5sdd. The pipeline completed successfully, but generated the empty collection and did not produce a diagnostic log. According to the job API response, the expected 864 tasks were run successfully.
Updated by Tim Pierce over 10 years ago
- Category set to Crunch
- Status changed from New to Rejected
- Target version changed from 2014-07-16 Sprint to Deferred
Rejecting this bug -- let's revisit it if we can reproduce the problem reliably.
Updated by Tim Pierce over 10 years ago
- Status changed from Rejected to New
- Target version changed from Deferred to 2014-08-06 Sprint
Returning this to "new" -- there are other bugs (#3212 at least) preventing me from adequately determining whether this can be reproduced. Will attempt to reproduce again after that has been resolved.
Updated by Brett Smith over 10 years ago
It looks like the SDK set up the tasks correctly:
>>> tasks = api.job_tasks().list(filters=[['job_uuid', '=', 'qr1hi-8i9sb-j0gyutpct11dcnb'], ['sequence', '=', '1']], limit=1000).execute()
>>> len(tasks['items'])
863
>>> inputs = set(t['parameters']['input'] for t in tasks['items'])
>>> len(inputs)
863
>>> any(inp for inp in inputs if inp.endswith('chr13_band6_s23300000_e25500000.bedGraph'))
True
>>> any(inp for inp in inputs if inp.endswith('chr11_band4_s16200000_e21700000.bedGraph'))
True
Updated by Brett Smith over 10 years ago
Here's the problem:
>>> set(t['qsequence'] for t in tasks['items'])
set([None])
crunch-job fetches tasks from the API server ordered by qsequence. If qsequence is never set, the ordering is not stable between requests, so crunch-job can receive some tasks twice and others not at all.
JobTask on the API server has a callback, after_update :assign_created_job_tasks_qsequence_if_succeeded. It seems likely this callback is buggy and needs fixing.
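To see why an unstable ordering breaks paginated fetches, here is a toy Python sketch (not Arvados or crunch-job code): when every row shares the same NULL sort key, the server is free to return the rows in a different order on each page request, so offset-based paging can return some rows twice and skip others entirely.

#!/usr/bin/env python3
# Toy illustration only: paginate with OFFSET over rows whose ordering is not
# stable between requests (as when every task has qsequence NULL).
import random

rows = list(range(20))      # pretend these are 20 job tasks
page_size = 5
fetched = []
for offset in range(0, len(rows), page_size):
    random.shuffle(rows)    # the "server" orders the rows differently each time
    fetched.extend(rows[offset:offset + page_size])

print("fetched:", len(fetched), "unique:", len(set(fetched)))
# Usually prints fewer unique rows than fetched, i.e. duplicates plus misses.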
Updated by Brett Smith over 10 years ago
- Assigned To changed from Tim Pierce to Brett Smith
Updated by Tom Clegg over 10 years ago
- Target version changed from 2014-08-06 Sprint to 2014-07-16 Sprint
Updated by Tom Clegg over 10 years ago
In services/api/test/unit/job_task_test.rb, I think it would be slightly safer to use create! instead of create here. Otherwise (I think) it's possible for object creation to be abandoned due to some filter/hook that runs earlier than set_default_qsequence, which would let the test pass silently without really checking anything:

task = JobTask.create(qsequence: 99)

That's the only complaint I can come up with.
(As I got to the end of the diff I was thinking, "all I can think to add is a test to establish that a client-supplied qsequence doesn't get overwritten with the default" ...and there it was right at the end of the diff. Thanks!)
Updated by Tom Clegg over 10 years ago
- Status changed from In Progress to Resolved
Updated by Brett Smith over 10 years ago
Now that my bugfix has been deployed to staging, I tested it out. I used the following crunch script, which just spawns a bunch of tasks (how many is specified as a job parameter) and has each one write out a line. With enough tasks spawned, duplicate or missing tasks should be noticeable.
#!/usr/bin/env python
import arvados

output = arvados.CollectionWriter()
output.set_current_file_name("log.txt")
task_count = int(arvados.getjobparam('task_count'))
task_num_fmt = '{{:0{}d}}'.format(len(str(task_count)))

def seq0(cur_task):
    task_body = {'job_uuid': arvados.current_job()['uuid'],
                 'created_by_job_task_uuid': cur_task['uuid'],
                 'sequence': 1,
                 'parameters': {'task_num': None}}
    make_task = arvados.api('v1').job_tasks().create
    for n in xrange(1, task_count + 1):
        task_body['parameters']['task_num'] = n
        make_task(body=task_body).execute()
    output.write("Origin task created {} tasks\n".format(task_count))

def seq1(cur_task):
    task_num_s = task_num_fmt.format(cur_task['parameters']['task_num'])
    output.write("Task {} has qsequence {:09d}\n".format(
        task_num_s, int(cur_task['qsequence'])))

_current_task = arvados.current_task()
_seq_func = locals()['seq{:d}'.format(int(_current_task['sequence']))]
_seq_func(_current_task)
_current_task.set_output(output.finish())
I used the following script to check the output and make sure that every task had one output line (and that their qsequence numbers increased monotonically):
#!/usr/bin/env python3
import fileinput
import sys

def numbers():
    global expected
    lines = fileinput.input()
    line1 = next(lines)
    expected = int(line1.split()[3])
    for line in lines:
        words = line.split()
        yield int(words[1]), int(words[4])

task_nums = set()
last_qseq = -1
for tnum, qseq in numbers():
    if tnum in task_nums:
        print("WARNING:", tnum, "duplicated")
    else:
        task_nums.add(tnum)
    if qseq <= last_qseq:
        print("WARNING: bad qseq step", last_qseq, qseq, "from", tnum)
    else:
        last_qseq = qseq

diff = set(range(1, expected + 1)).difference(task_nums)
if diff:
    print(diff)
    sys.exit(1)
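For context, the checker assumes the log format written by the crunch script above, along these lines (the numbers here are made up; only the word positions matter, with word 3 of the first line being the task count and words 1 and 4 of each later line being the task number and qsequence):

Origin task created 1000 tasks
Task 0001 has qsequence 000000001
Task 0002 has qsequence 000000002

Because it uses fileinput, the checker can be given the log file as a command-line argument or fed the log on standard input.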
Invoked to run ~1000 tasks, this job tickled the bug immediately on qr1hi (one task was run five times!), but three runs on staging passed with flying colors. With these results, I believe that the previous bugfix actually fixed the bug, so I'm closing this ticket.