Bug #3113 (Closed)
Some crunch tasks miss some input files and have others duplicated
100% done
Description
When running pipeline instance qr1hi-d1hrv-qh1k1mg3qqokv1s on collection 0938aa40406992ee7c02666b2708fbce+73567, some input files do not get processed and others are duplicated.
The pipeline consists of a single job, which calls 'one_task_per_input_file' on the collection:
#!/usr/bin/env python
#
# Simple program to expose dropped input files and duplicated input files.
#
import arvados
import os
import sys
import subprocess as sp

arvados.job_setup.one_task_per_input_file( if_sequence=0, and_end_task=True, input_as_path=True )

this_job = arvados.current_job()
this_task = arvados.current_task()
this_task_input = this_task['parameters']['input']

work_dir = os.environ['CRUNCH_SRC']
mount_dir = os.environ['TASK_KEEPMOUNT']

input_filename = arvados.get_task_param_mount('input')

out_dir = os.path.join( arvados.current_task().tmpdir, "output" )
os.mkdir( out_dir )
out_fn = os.path.join( out_dir, "dummyfile" )

print "INPUTFILE:", input_filename

dummyExec = os.path.join( work_dir, "crunch_scripts/multipleFileBug/writedummy.sh" )
pOut = sp.check_output( [ dummyExec, "dummytext", out_fn ] )

out = arvados.CollectionWriter()
out.write_directory_tree( out_dir, max_manifest_depth=0 )
this_task.set_output( out.finish() )

sys.exit(0)
The output log 'd3bee5816a311e5cee846e06ef2b97ec+89' should contain 863 unique file paths on the 'INPUTFILE:' lines; instead it contains 860. The input file 'chr8_band2_s6200000_e12700000.bedGraph' appears four times when it should appear only once. The files 'chr11_band4_s16200000_e21700000.bedGraph', 'chr13_band6_s23300000_e25500000.bedGraph' and 'chr18_band0_s0_e2900000.bedGraph' should appear but are missing.
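For reference, a minimal Python 3 sketch for tallying the 'INPUTFILE:' lines to spot the duplicates and omissions; the filename 'crunch.log' is just a placeholder for a local copy of the job log:

# Count how often each path appears on the 'INPUTFILE:' lines of the job log.
from collections import Counter

counts = Counter()
with open('crunch.log') as log:
    for line in log:
        if 'INPUTFILE:' in line:
            counts[line.split('INPUTFILE:', 1)[1].strip()] += 1

print(len(counts), "unique input paths")
for path, n in sorted(counts.items()):
    if n > 1:
        print(path, "appears", n, "times")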
Updated by Tim Pierce over 10 years ago
I haven't been able to reproduce this issue. I cloned git@git.qr1hi.arvadosapi.com:abram.git, copied the multipleFileBug directory into my 4xphq repository, and ran pipeline https://workbench.4xphq.arvadosapi.com/pipeline_instances/4xphq-d1hrv-022dunoh2pt5sdd. The pipeline completed successfully, but generated the empty collection and did not produce a diagnostic log. According to the job API response, the expected 864 tasks were run successfully.
Updated by Tim Pierce over 10 years ago
- Category set to Crunch
- Status changed from New to Rejected
- Target version changed from 2014-07-16 Sprint to Deferred
Rejecting this bug -- let's revisit it if we can reproduce the problem reliably.
Updated by Tim Pierce over 10 years ago
- Status changed from Rejected to New
- Target version changed from Deferred to 2014-08-06 Sprint
Returning this to "new" -- there are other bugs (#3212 at least) preventing me from adequately determining whether this can be reproduced. Will attempt to reproduce again after that has been resolved.
Updated by Brett Smith over 10 years ago
It looks like the SDK set up the tasks correctly:
>>> tasks = api.job_tasks().list(filters=[['job_uuid', '=', 'qr1hi-8i9sb-j0gyutpct11dcnb'], ['sequence', '=', '1']], limit=1000).execute()
>>> len(tasks['items'])
863
>>> inputs = set(t['parameters']['input'] for t in tasks['items'])
>>> len(inputs)
863
>>> any(inp for inp in inputs if inp.endswith('chr13_band6_s23300000_e25500000.bedGraph'))
True
>>> any(inp for inp in inputs if inp.endswith('chr11_band4_s16200000_e21700000.bedGraph'))
True
Updated by Brett Smith over 10 years ago
Here's the problem:
>>> set(t['qsequence'] for t in tasks['items'])
set([None])
crunch-job fetches tasks from the API server ordered by qsequence. If qsequence is never set, the ordering is not stable between requests, so crunch-job can receive some tasks twice and others not at all.
JobTask on the API server has a callback, after_update :assign_created_job_tasks_qsequence_if_succeeded. It seems likely this callback is buggy and needs fixing.
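To see why an unstable ordering breaks paginated fetches, here is a toy Python sketch (not Arvados or crunch-job code): when every row shares the same NULL sort key, the server is free to return the rows in a different order on each page request, so offset-based paging can return some rows twice and skip others entirely.

#!/usr/bin/env python3
# Toy illustration only: paginate with OFFSET over rows whose ordering is not
# stable between requests (as when every task has qsequence NULL).
import random

rows = list(range(20))      # pretend these are 20 job tasks
page_size = 5
fetched = []
for offset in range(0, len(rows), page_size):
    random.shuffle(rows)    # the "server" orders the rows differently each time
    fetched.extend(rows[offset:offset + page_size])

print("fetched:", len(fetched), "unique:", len(set(fetched)))
# Usually prints fewer unique rows than fetched, i.e. duplicates plus misses.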
Updated by Brett Smith over 10 years ago
- Assigned To changed from Tim Pierce to Brett Smith
Updated by Tom Clegg over 10 years ago
- Target version changed from 2014-08-06 Sprint to 2014-07-16 Sprint
Updated by Tom Clegg over 10 years ago
In services/api/test/unit/job_task_test.rb, I think it would be slightly safer to use create! instead of create here. Otherwise (I think) it's possible for object creation to be abandoned due to some filter/hook that runs earlier than set_default_qsequence, which would let the test pass silently without really checking anything:

task = JobTask.create(qsequence: 99)

That's the only complaint I can come up with.
(As I got to the end of the diff I was thinking, "all I can think to add is a test to establish that a client-supplied qsequence doesn't get overwritten with the default" ...and there it was right at the end of the diff. Thanks!)
Updated by Tom Clegg over 10 years ago
- Status changed from In Progress to Resolved
Updated by Brett Smith over 10 years ago
Now that my bugfix has been deployed to staging, I tested it out. I used the following crunch script, which just spawns a bunch of tasks (how many is specified as a job parameter) and has each one write out a line. With enough tasks spawned, duplicate or missing tasks should be noticeable.
#!/usr/bin/env python
import arvados

output = arvados.CollectionWriter()
output.set_current_file_name("log.txt")
task_count = int(arvados.getjobparam('task_count'))
task_num_fmt = '{{:0{}d}}'.format(len(str(task_count)))

def seq0(cur_task):
    task_body = {'job_uuid': arvados.current_job()['uuid'],
                 'created_by_job_task_uuid': cur_task['uuid'],
                 'sequence': 1,
                 'parameters': {'task_num': None}}
    make_task = arvados.api('v1').job_tasks().create
    for n in xrange(1, task_count + 1):
        task_body['parameters']['task_num'] = n
        make_task(body=task_body).execute()
    output.write("Origin task created {} tasks\n".format(task_count))

def seq1(cur_task):
    task_num_s = task_num_fmt.format(cur_task['parameters']['task_num'])
    output.write("Task {} has qsequence {:09d}\n".format(
        task_num_s, int(cur_task['qsequence'])))

_current_task = arvados.current_task()
_seq_func = locals()['seq{:d}'.format(int(_current_task['sequence']))]
_seq_func(_current_task)
_current_task.set_output(output.finish())
I used the following script to check the output and make sure that every task had one output line (and that their qsequence numbers increased monotonically):
#!/usr/bin/env python3
import fileinput
import sys

def numbers():
    global expected
    lines = fileinput.input()
    line1 = next(lines)
    expected = int(line1.split()[3])
    for line in lines:
        words = line.split()
        yield int(words[1]), int(words[4])

task_nums = set()
last_qseq = -1
for tnum, qseq in numbers():
    if tnum in task_nums:
        print("WARNING:", tnum, "duplicated")
    else:
        task_nums.add(tnum)
    if qseq <= last_qseq:
        print("WARNING: bad qseq step", last_qseq, qseq, "from", tnum)
    else:
        last_qseq = qseq

diff = set(range(1, expected + 1)).difference(task_nums)
if diff:
    print(diff)
    sys.exit(1)
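For context, the checker assumes the log format written by the crunch script above, along these lines (the numbers here are made up; only the word positions matter, with word 3 of the first line being the task count and words 1 and 4 of each later line being the task number and qsequence):

Origin task created 1000 tasks
Task 0001 has qsequence 000000001
Task 0002 has qsequence 000000002

Because it uses fileinput, the checker can be given the log file as a command-line argument or fed the log on standard input.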
Invoked to run ~1000 tasks, this job tickled the bug immediately on qr1hi (one task was run five times!), but three runs on staging passed with flying colors. With these results, I believe that the previous bugfix actually fixed the bug, so I'm closing this ticket.