Bug #4314

closed

[Crunch] Figure out why this job was marked Failed unexpectedly

Added by Bryan Cosca about 10 years ago. Updated about 10 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Start date: 10/24/2014
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: 0.5

Description

examples: qr1hi-8i9sb-cha7spydhjauvvq qr1hi-8i9sb-w5hjmuq7vicng11 qr1hi-8i9sb-vt7mb676a4htd6k

They all seemed to have this error in common: "error: Unable to allocate resources: Requested nodes are busy". Ward said that there were two crunch dispatchers running, and shutting one down seemed to fix it.

When the jobs end, they usually get a 403 Permission error and cannot write their output to Keep.


Subtasks 3 (0 open, 3 closed)

Task #4373: Diagnose and fix | Resolved | Peter Amstutz | 10/24/2014
Task #4457: Review 4314-crunch-token-expire | Resolved | - | 10/24/2014
Task #4718: Review 4314-trigger-job-update | Resolved | Radhika Chippada | 10/24/2014

Related issues 2 (0 open, 2 closed)

Related to Arvados - Bug #4310: [Crunch] crunch-dispatch --jobs locking is broken | Resolved | Peter Amstutz | 11/06/2014
Related to Arvados - Bug #4334: [Crunch] crunch-dispatch should not allocate Jobs to nodes in the idle* SLURM state | Resolved | Peter Amstutz | 10/28/2014

Actions #1

Updated by Radhika Chippada about 10 years ago

  • Target version set to Bug Triage
Actions #2

Updated by Tom Clegg about 10 years ago

  • Target version changed from Bug Triage to Arvados Future Sprints
Actions #3

Updated by Tom Clegg about 10 years ago

  • Subject changed from Failed Jobs still continue running to [Crunch] Once a job has failed, crunch-dispatch should not run it
  • Story points set to 0.5
Actions #4

Updated by Tom Clegg about 10 years ago

  • Target version changed from Arvados Future Sprints to 2014-11-19 sprint
Actions #5

Updated by Tom Clegg about 10 years ago

Could be a duplicate of #4310

Actions #6

Updated by Ward Vandewege about 10 years ago

  • Assigned To set to Peter Amstutz
Actions #7

Updated by Peter Amstutz about 10 years ago

There are two different errors here.

Actions #8

Updated by Tom Clegg about 10 years ago

The timing here makes the #4310 explanation less than 100% convincing. What would make crunch-dispatch take any interest in a job that has had state=='Running' for 37 minutes? (Process suspended???)

2014-10-24_19:59:31 qr1hi-8i9sb-vt7mb676a4htd6k 14114  start
...
2014-10-24_20:37:11 qr1hi-8i9sb-vt7mb676a4htd6k 14114  Job state unexpectedly changed to Failed
Actions #9

Updated by Tom Clegg about 10 years ago

4314-crunch-token-expire looks good to merge, but it doesn't explain the failures reported here, does it?

Actions #10

Updated by Peter Amstutz about 10 years ago

Some additional sleuthing shows that qr1hi-8i9sb-vt7mb676a4htd6k changed from "Running" to "Failed" at 2014-10-24T19:59:32Z, which suggests that it was the result of a race between crunch-dispatchers, but crunch-job didn't notice it until the task had completed 35 minutes later.

Actions #11

Updated by Tom Clegg about 10 years ago

  • Subject changed from [Crunch] Once a job has failed, crunch-dispatch should not run it to [Crunch] Figure out why this job was marked Failed unexpectedly
Actions #12

Updated by Tom Clegg about 10 years ago

  • Target version changed from 2014-11-19 sprint to 2014-12-10 sprint
Actions #13

Updated by Peter Amstutz about 10 years ago

Finally figured this one out. crunch-job only checks the job state if the file listed in "CRUNCH_REFRESH_TRIGGER" has been touched recently. It does this for cancellations, but not for other state changes, so even though the job was marked "failed" almost immediately due to the crunch-dispatcher race, it didn't notice until it completed on its own. 7e4a195 fixes that (unexpected state changes will be treated as cancellations).

("crunch_refresh_trigger" is an unfortunate backchannel method of communicating from API server to crunch-job. In the future, when we need crunch-dispatch to run on a separate instance from the API server, this will need to use websockets.)
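
For illustration only: a minimal Python sketch of the behavior described above, not the actual crunch-job code (which is not shown in this ticket). The function names, the `fetch_job` callback, and the job field names `state` and `cancelled_at` are assumptions made for the example. It shows a poller that re-reads the job record only when the trigger file has been touched since the last check, and then treats any state other than "Running" the same way as an explicit cancellation.

```python
import os


def trigger_touched_since(trigger_path, last_check):
    """Return True if the trigger file exists and was modified after last_check.

    last_check is a Unix timestamp. A missing file counts as "not touched".
    """
    try:
        return os.stat(trigger_path).st_mtime > last_check
    except FileNotFoundError:
        return False


def should_stop(job):
    """Decide whether the running job should be stopped.

    Assumed record shape: a dict with a "state" string and an optional
    "cancelled_at" timestamp. Any state other than "Running" is treated
    like a cancellation, so an unexpected "Failed" is noticed right away.
    """
    return job.get("cancelled_at") is not None or job.get("state") != "Running"


def poll_once(trigger_path, last_check, fetch_job):
    """Re-read the job (via the hypothetical fetch_job callback) only if the
    trigger file was touched, and report whether the job should be stopped."""
    if not trigger_touched_since(trigger_path, last_check):
        return False
    return should_stop(fetch_job())
```

In this sketch the caller would pass the path named by CRUNCH_REFRESH_TRIGGER and a callback that fetches the current job record, and would call `poll_once` periodically while tasks are running.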

Actions #14

Updated by Radhika Chippada about 10 years ago

Review feedback:

Discussed the one-liner update with Peter for background info, and the update looks good to me.

Also, all API server tests passed.

My only comment was that the update sits very close to the comment "# TODO: Remove the following case block when old ...". We agreed to create a separate ticket to clean up the old code.

LGTM.

Actions #15

Updated by Peter Amstutz about 10 years ago

  • Status changed from New to Resolved
  • % Done changed from 67 to 100

Applied in changeset arvados|commit:4211e34c99a068e8beb0baa6522c655c35b47b20.
