Project

General

Profile

Actions

Bug #8810

closed

[Crunch] `docker load` fails to connect to endpoint; srun exits 0

Added by Brett Smith almost 9 years ago. Updated almost 9 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
04/05/2016
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-

Description

2016-03-22_16:33:38 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074  stderr starting: ['srun','--nodelist=compute11','/bin/bash','-o','pipefail','-ec',' if ! docker.io images -q --no-trunc --all | grep -qxF d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83; then     arv-get 17b65db74aae73465b5e286d1cdb0e23\\+798\\/d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83\\.tar | docker.io load fi ']
2016-03-22_16:33:40 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074  stderr Post http:///var/run/docker.sock/v1.20/images/load: EOF.
2016-03-22_16:33:40 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074  stderr * Are you trying to connect to a TLS-enabled daemon without TLS?
2016-03-22_16:33:40 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074  stderr * Is your docker daemon up and running?
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074  stderr srun: error: Node failure on compute11
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074  stderr srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074  load docker image: exit 0

From here the job continued running and generating errors until the UID 0 check failed. Instead crunch-job should detect this error and exit such that crunch-dispatch retries the job.


Subtasks 1 (0 open1 closed)

Task #8888: Review 8810-crunch-improve-docker-loading-wipResolvedBrett Smith04/05/2016

Actions

Related issues 2 (1 open1 closed)

Related to Arvados - Bug #8811: [Crunch] `srun --nodes=1` reports "Unable to create job step: Required node not available (down or drained)" and exits 1ResolvedBrett Smith03/31/2016

Actions
Related to Arvados - Bug #8869: [Crunch] Job was repeatedly retried on same bad compute node until abandonedNew03/31/2016

Actions
Actions #1

Updated by Brett Smith almost 9 years ago

  • Description updated (diff)
Actions #2

Updated by Brett Smith almost 9 years ago

  • Target version set to Arvados Future Sprints
Actions #3

Updated by Brett Smith almost 9 years ago

Note the large time gap between the Docker error and the node failure. That seems suspicious.

Actions #4

Updated by Brett Smith almost 9 years ago

See 86f7740, which might fix this.

Actions #5

Updated by Brett Smith almost 9 years ago

  • Status changed from New to Feedback
Actions #6

Updated by Brett Smith almost 9 years ago

Need to check the Docker API docs: is the failing command the one in the if condition? If so, that would explain why it exits 0. We could add else; exit ${PIPESTATUS[0]} to return the exit code of docker images.

Actions #7

Updated by Brett Smith almost 9 years ago

  • Status changed from Feedback to In Progress
  • Assigned To set to Brett Smith
  • Target version changed from Arvados Future Sprints to 2016-04-13 sprint
Actions #8

Updated by Tom Clegg almost 9 years ago

We have this

if ! docker images | grep
then
  arv-get | docker load
fi

If docker images fails, then we end up calling arv-get. I suspect this is incorrect. We should probably abort if docker images fails: we're probably just wasting our time running docker load.

We probably want something more like this:

if docker images | grep; then
  true
elif [[ ${PIPESTATUS[0]} == 0 ]]; then
  arv-get | docker load
else
  exit ${PIPESTATUS[0]}
fi

However, I don't think this is the bug we're seeing here.

The "node failure" and "job step aborted" messages suggest slurm (not our shell script) is behaving badly by exiting 0.

But it's also possible that docker load really did succeed:
  1. docker images failed, and exited non-zero ...so docker images | grep failed ...so we tried docker load
  2. docker load succeeded, and exited zero
  3. Node failed during srun teardown
  4. Slurm printed error messages about the node failure
  5. Slurm propagated the docker load exit status
  6. crunch-job asked slurm to do other things with the node
  7. of course the other things failed because the node had failed

In this case, the bug amounts to "if a node fails between times A and B, we get "node failure" error messages before the node failure impacts our ability to run programs on that node". In that case, it seems much less worrisome than "exited 0 on failure".

Actions #9

Updated by Brett Smith almost 9 years ago

Tom Clegg wrote:

We probably want something more like this:

[snip]

The branch I have up for review implements that.

In this case, the bug amounts to "if a node fails between times A and B, we get "node failure" error messages before the node failure impacts our ability to run programs on that node". In that case, it seems much less worrisome than "exited 0 on failure".

I believe we changed our behavior here in 86f7740. Now crunch-job will "see" the node failure message, and treat the setup as failed as a result. This seems like an improvement to me. If a node failure happened during a setup task, things are unlikely to go well when we start dispatching actual tasks to it. Better to fail as soon as we see trouble, to save time and avoid making logs more confusing with additional error messages.

Actions #10

Updated by Tom Clegg almost 9 years ago

8810-crunch-improve-docker-loading-wip @ df1447d LGTM. Thanks.

86f7740

Indeed, that seems to have already fixed the "confusing sequence of error messages" aspect. If I'm following the timing properly, #8810 and #8811 are different problems arising from the same underlying bug, "failed to notice early signs that the node is broken, and failed the job as a result", and both are fixed by the 8811 branch. And 8810 happened to help us see & fix another bug.

Actions #11

Updated by Brett Smith almost 9 years ago

Tom Clegg wrote:

Indeed, that seems to have already fixed the "confusing sequence of error messages" aspect. If I'm following the timing properly, #8810 and #8811 are different problems arising from the same underlying bug, "failed to notice early signs that the node is broken, and failed the job as a result", and both are fixed by the 8811 branch. And 8810 happened to help us see & fix another bug.

Yeah, Peter made a similar comment at last week's backlog grooming, so I think we're all agreed on that now. Thanks.

Actions #12

Updated by Brett Smith almost 9 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:2d18e31468b269e526cb6e4918912b60a48540b2.

Actions

Also available in: Atom PDF