Bug #4368

[Crunch] Improve node failure detection and job retry logic

Added by Brett Smith about 10 years ago. Updated almost 10 years ago.

Status: Closed
Priority: Normal
Assigned To: -
Category: Crunch
Target version: -
Start date: 10/31/2014
Due date:
% Done: 0%
Estimated time:
Story points: -

Description

Crunch is supposed to detect when a job fails because its node failed, and retry the job on a different node. Unfortunately, it's not coping well when Node Manager shuts down nodes. There are a couple of ways this can go:

  • Between the time Node Manager issues the shutdown command and the time SLURM notices that the node is down, Crunch may try to dispatch work to the node. #4334 will shrink this window, but there's no way to close it completely. In this case, the initial node allocation will usually fail. This is the most common failure mode we're seeing for jobs right now.
  • There may be a rapid succession of events where Crunch assigns work to a node, the node enters a shutdown window, and Node Manager decides to shut it down before it sees the new assignment. In this case, the initial allocation succeeds, and the job may even officially start, but it will die unceremoniously before long.

Crunch should be able to detect both these cases. It should not mark the job failed, but instead retry it on a functional node.
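To make the desired behavior concrete, here is a minimal sketch of the retry policy in Python. It is not Crunch's actual code: `Outcome`, `Result`, `run_with_retry`, the `attempt` callback, and the `max_node_failures` limit are all hypothetical names invented for illustration. The idea is only that a node failure (whether at allocation time or mid-run) leads to another attempt on a different node, while a genuine job error does not.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, Iterable, List


class Outcome(Enum):
    SUCCESS = auto()
    JOB_ERROR = auto()     # the job itself failed; retrying will not help
    NODE_FAILURE = auto()  # allocation failed, or the node died mid-run


@dataclass
class Result:
    outcome: Outcome
    node: str


def run_with_retry(
    nodes: Iterable[str],
    attempt: Callable[[str], Outcome],
    max_node_failures: int = 3,  # hypothetical limit, not a Crunch setting
) -> List[Result]:
    """Try the job on successive nodes until it finishes or the limit is hit.

    Returns the history of attempts; the last entry is the final outcome.
    """
    history: List[Result] = []
    failures = 0
    for node in nodes:
        outcome = attempt(node)
        history.append(Result(outcome, node))
        if outcome is not Outcome.NODE_FAILURE:
            break  # success or a genuine job error: stop either way
        failures += 1
        if failures >= max_node_failures:
            break  # give up rather than retry forever
    return history


# Example: the first node is being shut down, the second one works.
def fake_attempt(node: str) -> Outcome:
    return Outcome.NODE_FAILURE if node == "compute0" else Outcome.SUCCESS


history = run_with_retry(["compute0", "compute1", "compute2"], fake_attempt)
print([(r.node, r.outcome.name) for r in history])
# prints [('compute0', 'NODE_FAILURE'), ('compute1', 'SUCCESS')]
```

The hard part in practice is the classification step hidden inside `attempt`: deciding that a failure was the node's fault rather than the job's. The sketch assumes that decision has already been made.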

Peter brought up this issue of Node Manager causing job failures during the code review, and the result was #4127, a proposed addition to the Node API to safely declare shutdowns. On reflection, better detection and handling of node failures is something we've always wanted Crunch to do anyway. It would be useful in contexts beyond Node Manager intentionally shutting down nodes, and it would probably take similar or even less development time: there's no new API to document and test and no clients to update, just internal Crunch logic.


Related issues 4 (0 open, 4 closed)

Related to Arvados - Bug #4334: [Crunch] crunch-dispatch should not allocate Jobs to nodes in the idle* SLURM state (Resolved, Peter Amstutz, 10/28/2014)
Related to Arvados - Story #4127: [API] Nodes have a method to request and record shutdowns (Closed, 07/17/2014, 07/17/2014)
Related to Arvados - Feature #2881: [OPS] Basic node manager that can start/stop compute nodes based on demand (Resolved, Brett Smith, 07/16/2014)
Related to Arvados - Bug #4380: [Node Manager] Should drain nodes via SLURM before terminating them (Resolved, Brett Smith, 11/11/2014)
