Bug #4368: [Crunch] Improve node failure detection and job retry logic - Arvados

Bug #4368

closed

[Crunch] Improve node failure detection and job retry logic

Added by Brett Smith over 11 years ago. Updated about 11 years ago.

Status: Closed
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Story points:

Description

Ideally, Crunch detects when a job fails because a node failed, and retries the job on a different node. Unfortunately, it's not coping well when Node Manager shuts down nodes. There are a couple of ways this can go:

In between the time Node Manager issues the shutdown command, and the time that SLURM notices that the node is down, Crunch may decide to try to dispatch work to the node. #4334 will shrink this window of time, but there's no way to close it completely. In this case, the initial node allocation will usually fail, and this is the most common failure mode we're seeing for jobs right now.
There may be a rapid succession of events where Crunch assigns work to a node, the node enters a shutdown window, and Node Manager decides to shut it down before it sees the new assignment. In this case, the initial allocation succeeds, and the job may even officially start, but it will die unceremoniously before long.

Crunch should be able to detect both these cases. It should not mark the job failed, but instead retry it on a functional node.
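To make the desired behavior concrete, here is a minimal sketch of the retry decision. It is not Crunch's actual code: `Outcome`, `Result`, `run_with_node_retry`, the `run_on_node` callback, and the `max_attempts` limit are all hypothetical names invented for illustration. The only idea taken from this issue is that a node failure (a failed allocation, or a node dying mid-run) should move the job to another node instead of marking it failed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable, Optional


class Outcome(Enum):
    SUCCESS = "success"
    JOB_ERROR = "job_error"        # the job itself failed; a new node won't help
    NODE_FAILURE = "node_failure"  # allocation failed, or the node died mid-run


@dataclass
class Result:
    outcome: Outcome
    node: Optional[str]  # node that produced the final outcome, if any
    attempts: int


def run_with_node_retry(
    run_on_node: Callable[[str], Outcome],  # hypothetical dispatch callback
    nodes: Iterable[str],
    max_attempts: int = 3,                  # assumed limit, not from the issue
) -> Result:
    """Try the job on successive nodes until it runs on a functional one."""
    attempts = 0
    for node in nodes:
        if attempts >= max_attempts:
            break
        attempts += 1
        outcome = run_on_node(node)
        if outcome is not Outcome.NODE_FAILURE:
            # Success or a genuine job error: either way, stop retrying.
            return Result(outcome, node, attempts)
        # Node failure: do not mark the job failed; try the next node.
    # Every attempt hit a node failure, so report that rather than a job error.
    return Result(Outcome.NODE_FAILURE, None, attempts)
```

Both cases described above would map to `NODE_FAILURE` in this sketch; how Crunch would actually tell a dead node apart from a job that crashed on its own is the hard part and is not addressed here.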

Peter brought up this issue of Node Manager causing job failures during the code review, and the result was #4127, a proposed addition to the Node API to safely declare shutdowns. Reflecting on it some more, better detection and handling of node failures is something we've always wanted Crunch to do anyway. It would be useful in contexts beyond Node Manager intentionally shutting down nodes, and it would probably take similar or even less development time: there's no new API to document and test and no clients to update, just internal Crunch logic.

Related issues 4 (0 open — 4 closed)

Updated by Brett Smith over 11 years ago

Target version changed from Bug Triage to Arvados Future Sprints

Updated by Brett Smith over 11 years ago

Made lower priority in favor of #4380.

Updated by Brett Smith about 11 years ago

Status changed from New to Closed
Target version deleted (~~Arvados Future Sprints~~)

Now that #4380 is done and we've seen great results, I'm closing this issue. This issue's description is very focused on the interaction between Node Manager and Crunch, and the work done on #4380 has addressed both potential problems outlined here.

The work named in this issue's subject line remains to be done, but given the issue description and background, I think it'd be better to let that work be taken over by stories that started out more generally, like #5064.

Related to Arvados - Bug #4334: [Crunch] crunch-dispatch should not allocate Jobs to nodes in the idle* SLURM state	Resolved	Peter Amstutz	10/28/2014
Related to Arvados - Idea #4127: [API] Nodes have a method to request and record shutdowns	Closed		07/17/2014	07/17/2014
Related to Arvados - Feature #2881: [OPS] Basic node manager that can start/stop compute nodes based on demand	Resolved	Brett Smith	07/16/2014
Related to Arvados - Bug #4380: [Node Manager] Should drain nodes via SLURM before terminating them	Resolved	Brett Smith	11/11/2014
