Bug #4044
Status: Closed
[Crunch] job spawns tasks but never runs them, gets stuck waiting for first task to end, even though it has already ended.
% Done: 100%
Description
The first task in job 9tee4-8i9sb-jykwrdj3quy00t5 creates a whole lot of new tasks. Then it exits successfully (no docker container left running, and the task gets marked as finished in the API server), but somehow the job does not notice and does not get updated. The job then sits around waiting for the first task to finish.
The job record is stuck with 1 running, 0 todo.
All the job tasks are created properly though. The first one, as an example, is 9tee4-ot0gb-vmlrb41ai6iswlr.
I verified that while the job is stuck, it is not trying to read data from Keep.
Note: this particular job has a malformed input collection (the files are in a flat hierarchy, rather than split up per directory). It will never complete successfully. So the bug is that Crunch gets into this weird state with a stuck job; it should fail explicitly if it can't run the tasks.
This happened again on job 9tee4-8i9sb-50r7da1j0ks4db1
Updated by Ward Vandewege over 10 years ago
- Subject changed from [Crunch] job spawns tasks but never runs them to [Crunch] job spawns tasks but never runs them, gets stuck waiting for first
Updated by Ward Vandewege over 10 years ago
- Subject changed from [Crunch] job spawns tasks but never runs them, gets stuck waiting for first to [Crunch] job spawns tasks but never runs them, gets stuck waiting for first task to end, even though it has already ended.
Updated by Abram Connelly over 10 years ago
Though it may be unrelated: since the files got dumped into the base directory instead of being split up into their own sub-directories, the blocks associated with individual filenames got interleaved during the collation at the end.
Updated by Abram Connelly over 10 years ago
Another clue: I gave 'run-command' the 'PersonToTileLibraryShim' script to run but it didn't exist. The actual script is 'PersonToTileLibraryShim.sh' (in job 9tee4-8i9sb-50r7da1j0ks4db1)
Updated by Abram Connelly over 10 years ago
On investigation, it looks like line 127 in 'run-command' is causing the issue:
122 try:
123     if "task.foreach" in jobp:
124         if arvados.current_task()['sequence'] == 0:
125             var = jobp["task.foreach"]
126             items = get_items(jobp, jobp[var])
127             logging.info("parallelizing on %s with items %s" % (var, items))
128             if items is not None:
129                 for i in items:
130                     params = copy.copy(jobp)
131                     params[var] = i
132                     arvados.api().job_tasks().create(body={
133                         'job_uuid': arvados.current_job()['uuid'],
That is, the 'logging.info("parallelizing on %s with items %s" % (var, items))' line. The 'items' variable is a list of length 863, with each entry about 60-80 characters. Taking that line out causes the job to fail (as expected); putting it back in, the job hangs. When that 'logging.info(...)' call is in place, execution in run-command does not proceed past that point.
Updated by Abram Connelly over 10 years ago
Updating to the latest 'run-command' and 'crunchutil' in the crunch scripts appears to solve the issue. The pipeline/job is still failing, but for 'valid' reasons; the job is not hanging anymore.
Using 'run-command' and 'crunchutil' from git hash 5d006b95fdcbb48afffef8272b4d2071acb5221b.
Here is a diff from the version of 'run-command' and 'subst.py' I was using to the new version:
$ diff run-command $HOME/prog/arvados/crunch_scripts/run-command
4c4,9
< logging.basicConfig(level=logging.INFO, format="run-command: %(message)s")
---
>
> logger = logging.getLogger('run-command')
> log_handler = logging.StreamHandler()
> log_handler.setFormatter(logging.Formatter("run-command: %(message)s"))
> logger.addHandler(log_handler)
> logger.setLevel(logging.INFO)
127c132
< logging.info("parallelizing on %s with items %s" % (var, items))
---
> logger.info("parallelizing on %s with items %s" % (var, items))
168,169c173,178
< logging.info("{}{}{}".format(' '.join(cmd), (" < " + stdinname) if stdinname is not None else "", (" > " + stdoutname) if stdoutname is not None else ""))
<
---
> logger.info("{}{}{}".format(' '.join(cmd), (" < " + stdinname) if stdinname is not None else "", (" > " + stdoutname) if stdoutname is not None else ""))
> except subst.SubstitutionError as e:
> logger.error(str(e))
> logger.error("task parameters were:")
> logger.error(pprint.pformat(taskp))
> sys.exit(1)
171,173c180,182
< logging.exception("caught exception")
< logging.error("task parameters was:")
< logging.error(pprint.pformat(taskp))
---
> logger.exception("caught exception")
> logger.error("task parameters were:")
> logger.error(pprint.pformat(taskp))
189c198
< logging.critical("terminating on signal %s" % sig.sig)
---
> logger.critical("terminating on signal %s" % sig.sig)
192c201
< logging.info("completed with exit code %i (%s)" % (rcode, "success" if rcode == 0 else "failed"))
---
> logger.info("completed with exit code %i (%s)" % (rcode, "success" if rcode == 0 else "failed"))
195c204
< logging.exception("caught exception")
---
> logger.exception("caught exception")
205c214
< logging.info("the following output files will be saved to keep:")
---
> logger.info("the following output files will be saved to keep:")
209c218
< logging.info("start writing output to keep")
---
> logger.info("start writing output to keep")
219c228
< outcollection = robust_put.upload(outdir)
---
> outcollection = robust_put.upload(outdir, logger)
$
$
$ diff crunchutil/subst.py $HOME/prog/arvados/crunch_scripts/crunchutil/subst.py
3a4,6
> class SubstitutionError(Exception):
> pass
>
31c34
< raise Exception("Substitution error, mismatched parentheses {}".format(c))
---
> raise SubstitutionError("Substitution error, mismatched parentheses {}".format(c))
49c52
< raise Exception("$(glob): No match on '%s'" % v)
---
> raise SubstitutionError("$(glob): No match on '%s'" % v)
60d62
< #print("c is", c)
62,70c64,75
< if m is not None:
< v = do_substitution(p, c[m[0]+2 : m[1]])
< var = True
< for sub in subs:
< if v.startswith(sub):
< r = subs[sub](v[len(sub):])
< var = False
< break
< if var:
---
> if m is None:
> return c
>
> v = do_substitution(p, c[m[0]+2 : m[1]])
> var = True
> for sub in subs:
> if v.startswith(sub):
> r = subs[sub](v[len(sub):])
> var = False
> break
> if var:
> if v in p:
71a77,82
> else:
> raise SubstitutionError("Unknown variable or function '%s' while performing substitution on '%s'" % (v, c))
> if r is None:
> raise SubstitutionError("Substitution for '%s' is null while performing substitution on '%s'" % (v, c))
> if not isinstance(r, basestring):
> raise SubstitutionError("Substitution for '%s' must be a string while performing substitution on '%s'" % (v, c))
73,75c84
< c = c[:m[0]] + r + c[m[1]+1:]
< else:
< return c
---
> c = c[:m[0]] + r + c[m[1]+1:]
Updated by Tom Clegg over 10 years ago
- Status changed from New to In Progress
- Story points set to 1.0
Updated by Ward Vandewege over 10 years ago
- Target version changed from Bug Triage to 2014-10-08 sprint
Updated by Tom Clegg about 10 years ago
There's still an unresolved mystery in 9tee4-8i9sb-50r7da1j0ks4db1: after doing approximately nothing for ~4 minutes, crunchstat says "user -1" and then goes silent for 10 minutes until the job is cancelled.
2014-10-01_19:15:46 9tee4-8i9sb-50r7da1j0ks4db1 11998 0 stderr crunchstat: cpuacct.stat user 0
2014-10-01_19:15:46 9tee4-8i9sb-50r7da1j0ks4db1 11998 0 stderr crunchstat: cpuacct.stat sys 0
2014-10-01_19:15:56 9tee4-8i9sb-50r7da1j0ks4db1 11998 0 stderr crunchstat: cpuacct.stat user -1
2014-10-01_19:15:56 9tee4-8i9sb-50r7da1j0ks4db1 11998 0 stderr crunchstat: cpuacct.stat sys 0
2014-10-01_21:25:10 9tee4-8i9sb-50r7da1j0ks4db1 11998 Job cancelled at 2014-10-01T21:25:10Z by user 9tee4-tpzed-2l9nsfm4mgltv4c
2014-10-01_21:25:10 9tee4-8i9sb-50r7da1j0ks4db1 11998 wait for last 1 children to finish
2014-10-01_21:25:10 9tee4-8i9sb-50r7da1j0ks4db1 11998 0 sending 2x signal 2 to pid 12429
Reading crunchstat.go, it seems that if cpuacct/stat starts saying "user 0" -- or, more likely, an error occurs during os.Open, ioutil.ReadAll, or fmt.Sscanf -- crunchstat will print a negative CPU usage and then recede into "don't output anything" mode until cpuacct/stat starts reporting non-zero numbers in the expected format.
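For illustration only (this is not the crunchstat code, and the cgroup path below is an assumption), a cpuacct.stat read that surfaces open/read/parse failures instead of emitting a bogus value and going quiet might look like this:

package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"os"
)

// readCPUAcct reads user/system jiffies from a cpuacct.stat file, returning an
// error on any open/read/parse failure rather than a garbage value.
// (Illustrative sketch; names and path are assumptions, not crunchstat's code.)
func readCPUAcct(path string) (user, sys int64, err error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, 0, err
	}
	defer f.Close()
	b, err := ioutil.ReadAll(f)
	if err != nil {
		return 0, 0, err
	}
	if _, err := fmt.Sscanf(string(b), "user %d\nsystem %d", &user, &sys); err != nil {
		return 0, 0, fmt.Errorf("parse %s: %v", path, err)
	}
	return user, sys, nil
}

func main() {
	user, sys, err := readCPUAcct("/sys/fs/cgroup/cpuacct/docker/cpuacct.stat")
	if err != nil {
		// Report the failure instead of printing a negative number and going silent.
		log.Printf("cpuacct.stat unreadable: %v", err)
		return
	}
	fmt.Printf("crunchstat: cpuacct.stat user %d sys %d\n", user, sys)
}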
IIRC it was reported that the docker container was not running during this 10-minute period. This would be consistent with the "cpuacct/stat cannot be opened+read+parsed" behavior. This suggests crunchstat failed to notice that its child had exited.
Should we put the cmd.Wait() block before the 2× <-finish_chan?
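To make the ordering question concrete, here is a minimal sketch (names like finish, outR/outW are hypothetical, and this is not the actual crunchstat plumbing) in which the child is reaped before the output-copying goroutines are drained, so an early exit is noticed even if a copier is still blocked:

package main

import (
	"fmt"
	"io"
	"os"
	"os/exec"
)

func main() {
	cmd := exec.Command("sh", "-c", "echo out; echo err >&2")

	// Hand the child explicit pipe write ends so Wait() only reaps the process
	// and does not depend on the copier goroutines finishing.
	outR, outW, _ := os.Pipe()
	errR, errW, _ := os.Pipe()
	cmd.Stdout = outW
	cmd.Stderr = errW

	finish := make(chan bool, 2)
	go func() { io.Copy(os.Stdout, outR); finish <- true }()
	go func() { io.Copy(os.Stderr, errR); finish <- true }()

	if err := cmd.Start(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Close the parent's copies of the write ends so the copiers see EOF
	// once the child exits.
	outW.Close()
	errW.Close()

	// Reap the child first: its exit is noticed even if a copier is blocked...
	err := cmd.Wait()
	fmt.Fprintln(os.Stderr, "child exited, err =", err)

	// ...then drain the two copier goroutines.
	<-finish
	<-finish
}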
Updated by Tom Clegg about 10 years ago
4044-crunchstat-wait @ 292856a
This branch fixes/simplifies a bunch of pipe/channel stuff in crunchstat. I haven't found a way to reproduce the hang, but I did fix a few bugs that will at least prevent crunchstat from going silent (and reporting negative CPU usage) due to unreadable cgroup files. I also removed some unnecessary attempts to synchronize stdout/stderr channels that could conceivably contribute to a deadlock condition.
Here's a quick example of testing crunchstat using the "ubuntu" docker image, in the absence of a proper test suite:
GOPATH="/tmp/tmp.O0vXh97Iqx" go get git.curoverse.com/arvados.git/services/crunchstat
rm /tmp/cid
echo "I am stdin" | \
  /tmp/tmp.O0vXh97Iqx/bin/crunchstat -cgroup-root=/sys/fs/cgroup -cgroup-parent=docker -cgroup-cid=/tmp/cid -poll=1000 \
  docker run --rm=true --cidfile=/tmp/cid --attach=stdin --attach=stdout --attach=stderr -i ubuntu \
  stdbuf --output=0 --error=0 --input=0 \
  sh -c 'wc >&2; sleep 3; echo ok'; echo DONE
Updated by Tom Clegg about 10 years ago
- Category set to Crunch
- Assigned To set to Tom Clegg
Updated by Tom Clegg about 10 years ago
Still looking for consistent results sending data into docker run on stdin:
tom.shell.9tee4:~# docker -v
Docker version 1.1.2, build d84a070
tom.shell.9tee4:~# echo foo | docker run -a stdin -a stdout ubuntu wc
0 0 0
tom.shell.9tee4:~# echo foo | docker run -i -a stdout ubuntu wc
[hang, ^C ignored, ^P ignored -> docker kill]
tom.shell.9tee4:~# echo foo | docker run -i -a stdout ubuntu wc
[hang, ^C ignored, ^P ignored -> docker kill]
tom.shell.9tee4:~# echo foo | docker run -i -a stdout ubuntu wc
[hang, ^C ignored, ^P ignored -> docker kill]
tom.shell.9tee4:~# echo foo | docker run -a stdin -a stdout ubuntu wc
0 0 0
Aha(?):
tom.shell.9tee4:~# echo foo | docker run -i -a stdin -a stdout ubuntu wc
1 1 4

It seems that:
docker run flags | result
-i               | hang
-a stdin         | stdin is empty in container
-i -a stdin      | stdin works
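For what it's worth, a wrapper launching docker from Go would need to pass both flags; a small sketch (the image and command are arbitrary, and this just mirrors the table above, not any crunchstat code):

package main

import (
	"os"
	"os/exec"
	"strings"
)

func main() {
	// Per the table: both -i and -a stdin were needed for piped stdin to reach
	// the container (observed with Docker 1.1.2; not a general guarantee).
	cmd := exec.Command("docker", "run", "-i", "-a", "stdin", "-a", "stdout", "ubuntu", "wc")
	cmd.Stdin = strings.NewReader("foo\n")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.Run()
}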
Updated by Peter Amstutz about 10 years ago
Looking at 292856a
- I suspect that we're going to get stderr and stdout stepping on each other in the logs, at least until we do #4028 so that the "synchronize lines" logic can happen downstream. I guess we will find out if this is actually a problem or not.
- If there's an Open() error reading any of the cgroup stat files, it uses "continue" to go back to the top of the loop and won't do any of the subsequent stats.
Updated by Tom Clegg about 10 years ago
Peter Amstutz wrote:
Looking at 292856a
- I suspect that we're going to get stderr and stdout stepping on each other in the logs, at least until we do #4028 so that the "synchronize lines" logic can happen downstream. I guess we will find out if this is actually a problem or not.
Indeed, there's a risk that log lines will get mangled by crunch-job when tasks write to stdout. But crunchstat is the wrong place to tackle that issue. (For one thing, even if crunchstat is careful to write whole lines to stderr and stdout, there's still no guarantee that crunch-job won't mangle them.)
- If there's an Open() error reading any of the cgroup stat files, it uses "continue" to go back to the top of the loop and won't do any of the subsequent stats.
Yes. This probably means the cgroup no longer exists (or something else is terribly wrong) and there's no point writing N error messages per time interval. Added comments @ 5ed4ae4
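Roughly the shape being described, as a sketch (the path and loop structure are assumptions, not the code in 292856a/5ed4ae4):

package main

import (
	"log"
	"os"
	"time"
)

func main() {
	cgroupDir := "/sys/fs/cgroup/cpuacct/docker" // assumed path, for illustration
	for range time.Tick(time.Second) {
		f, err := os.Open(cgroupDir + "/cpuacct.stat")
		if err != nil {
			// The cgroup has probably gone away (or something is terribly wrong);
			// report once and skip the remaining stats for this interval rather
			// than emitting one error message per statistic.
			log.Printf("open: %v", err)
			continue
		}
		// ... read and report cpuacct.stat, then the other stat files ...
		f.Close()
	}
}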
Updated by Tom Clegg about 10 years ago
- Target version changed from 2014-10-08 sprint to 2014-10-29 sprint
Updated by Tom Clegg about 10 years ago
- Status changed from In Progress to Feedback
Updated by Tom Clegg about 10 years ago
- Status changed from Feedback to Resolved