Bug #18262

open

[crunch-run] handle out-of-diskspace on the compute node better

Added by Ward Vandewege about 3 years ago. Updated almost 2 years ago.

Status:
New
Priority:
Normal
Assigned To:
-
Category:
-
Target version:
-
Start date:
Due date:
% Done:

0%

Estimated time:
Story points:
-
Release:
Release relationship:
Auto

Description

When a job consumes all available disk space on a compute node that was not started with a particular scratch space requirement (i.e. no extra partition was added), the job fills up the node's root partition, and unrelated things on the node start to break.

In one example today, a workflow filled up the root partition (which was tiny). That caused /etc/resolv.conf to be emptied on the next DHCP renew (sigh), which left crunch-run unable to find the API server and keepstores. As a result, the container failed with truncated logs, and without explicitly being marked as such. It looked as if crunch-run was crashing until we caught the compute node in the act, which was a bit of a debugging adventure.

Can we somehow restrict the amount of disk space the container is allowed to use?
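
One possible direction, sketched here only as an illustration: poll the free space on the partition the container writes to, and stop the container cleanly (with an explicit error) before the partition is completely full. The sketch below is not how crunch-run works; the names `low_on_space` and `watch`, the threshold, and the polling interval are all hypothetical, and it uses only Python's standard `shutil.disk_usage`.

```python
import shutil
import time


def low_on_space(path, min_free_bytes, usage_fn=shutil.disk_usage):
    """Return True if the filesystem holding `path` has fewer than
    `min_free_bytes` bytes free. `usage_fn` is injectable for testing."""
    return usage_fn(path).free < min_free_bytes


def watch(path, min_free_bytes, on_low, interval=5.0, max_checks=None,
          usage_fn=shutil.disk_usage, sleep_fn=time.sleep):
    """Poll free space on `path`. Call `on_low(path)` and return True the
    first time free space drops below the threshold. Return False if
    `max_checks` polls pass without hitting it (None means poll forever)."""
    checks = 0
    while max_checks is None or checks < max_checks:
        if low_on_space(path, min_free_bytes, usage_fn):
            # Hypothetical hook: this is where a supervisor could stop the
            # container and record "out of disk space" as the failure reason.
            on_low(path)
            return True
        checks += 1
        sleep_fn(interval)
    return False
```

A supervisor could call something like `watch("/", 1 << 30, stop_container)`, where `stop_container` is a placeholder for whatever cancels the job. Polling only narrows the window: a fast writer can still fill the disk between checks, so a hard limit (a separate scratch partition or a filesystem quota) would be the more robust fix.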

Actions #1

Updated by Ward Vandewege about 3 years ago

  • Description updated (diff)
Actions #2

Updated by Lucas Di Pentima almost 2 years ago

  • Release set to 60