Bug #18262: [crunch-run] handle out-of-diskspace on the compute node better - Arvados

Bug #18262

open

[crunch-run] handle out-of-diskspace on the compute node better

Added by Ward Vandewege over 4 years ago. Updated about 2 years ago.

Status: New
Priority: Normal
Assigned To:
Category:
Target version: Future
Story points:
Release: Postponed
Release relationship: Auto

Description

When a job consumes all available disk space on a compute node that was not started with a particular scratch space requirement (i.e. no extra partition was added), bad things happen because the job fills up the node's root partition.

In one example today, a workflow filled up the root partition (which was tiny). That caused /etc/resolv.conf to be emptied on the next DHCP renew (sigh), which left crunch-run unable to find the API server and the keepstores. As a result, the container failed with truncated logs, and without explicitly being marked as such. It looked as if crunch-run was crashing until we caught the compute node in the act, which was a bit of a debugging adventure.

Can we somehow restrict the amount of disk space the container is allowed to use?
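
To make the question concrete, here is a minimal sketch of a free-space check that a supervising process could run periodically against the root partition. It is not what crunch-run does today, and it only detects the condition rather than enforcing a limit; a hard cap would probably need something like a dedicated scratch partition or a filesystem quota. The names `Usage` and `low_on_space`, and the threshold parameter, are made up for illustration.

```python
import shutil
from typing import NamedTuple


class Usage(NamedTuple):
    """Same shape as the result of shutil.disk_usage (values in bytes)."""
    total: int
    used: int
    free: int


def low_on_space(path, min_free_bytes, usage_fn=shutil.disk_usage):
    """Return True when the filesystem holding `path` has fewer than
    `min_free_bytes` bytes free.

    `usage_fn` is injectable so the check can be exercised without
    actually filling a disk; by default it queries the real filesystem.
    """
    usage = usage_fn(path)
    return usage.free < min_free_bytes


if __name__ == "__main__":
    # Hypothetical threshold: complain when less than 1 GiB is free on "/".
    if low_on_space("/", 1 << 30):
        print("root partition is nearly full")
```

A supervisor that sees this return True could cancel the container and record the reason explicitly, instead of letting the node degrade until the logs are truncated.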

Updated by Ward Vandewege over 4 years ago

Description updated (diff)

Updated by Peter Amstutz about 3 years ago

Release set to 60

Updated by Peter Amstutz about 2 years ago

Target version set to Future
