Bug #7444: [Crunch] Docker container not removed when job canceled, filling disk - Arvados

closed

Added by Brett Smith over 9 years ago. Updated over 9 years ago.

Status: Resolved
Priority: Normal
Assigned To: Tom Clegg
Category: Crunch
Target version: 2015-11-11 sprint
Start date: 10/02/2015
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: 2.0

Description

We use docker run --rm to ensure that Docker containers are removed after tasks are finished, to prevent compute nodes from filling up with unused volumes. However, docker run --rm is handled by the Docker client. It simply makes the necessary API calls to remove the container after it exits.

Crunch's cancel code kills the Docker client, so the client never makes those removal calls. If a user cancels a job, the container will hang around, along with its volumes. We just had a situation where compute nodes on a cluster filled their /tmp partitions because a user was canceling many jobs, leaving the nodes full of finished Docker containers and their large tmp volumes.

Make sure that when Crunch cancels a job, the corresponding Docker container is removed.

Implementation

Extend crunch-job to stop using --rm, and name containers after the task. Append ".$try_number" to the name to avoid name collisions when tasks are retried.
Extend the Docker cleaner service to listen for container stop events, and immediately destroy those containers. Sysadmins who want to debug Docker on compute nodes are expected to stop the Docker cleaner service to do that.
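The two steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the branches on this ticket: the function names are hypothetical, `client` is assumed to offer `remove_container(id, v=True)` the way docker-py's low-level API client does, and the sketch assumes that a container exit shows up as an event whose `status` is `die`.

```python
def container_name(task_uuid, try_number):
    """Name a container after its task; the try number avoids collisions on retry."""
    return "{}.{}".format(task_uuid, try_number)


def docker_run_args(image, task_uuid, try_number, command):
    """Build a `docker run` command line that names the container and omits --rm."""
    name = container_name(task_uuid, try_number)
    return ["docker", "run", "--name", name, image] + list(command)


def remove_stopped_containers(client, events):
    """Remove each container as soon as an event reports that it exited.

    `events` is any iterable of decoded event dicts, for example the stream
    returned by client.events(decode=True) (assumed, not taken from the ticket).
    Returns the IDs that were removed, in order.
    """
    removed = []
    for event in events:
        # Assumption: an exited container is reported with status "die".
        if event.get("status") != "die":
            continue
        # v=True asks Docker to remove the container's volumes as well.
        client.remove_container(event["id"], v=True)
        removed.append(event["id"])
    return removed
```

Because removal happens in the long-running cleaner rather than in the Docker client, killing the client on cancel no longer leaves the container behind.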

Subtasks 5 (0 open — 5 closed)

Updated by Brett Smith over 9 years ago

Subject changed from [Crunch] Job containers not removed consistently, filling disk to [Crunch] Docker container not removed when job canceled, filling disk
Description updated (diff)

Updated by Brett Smith over 9 years ago

Description updated (diff)
Story points set to 2.0

Updated by Brett Smith over 9 years ago

Target version changed from Arvados Future Sprints to 2015-10-28 sprint

Updated by Peter Amstutz over 9 years ago

Assigned To set to Peter Amstutz

Updated by Brett Smith over 9 years ago

Target version changed from 2015-10-28 sprint to Arvados Future Sprints

Updated by Tom Clegg over 9 years ago

Assigned To changed from Peter Amstutz to Tom Clegg
Target version changed from Arvados Future Sprints to 2015-11-11 sprint

Updated by Tom Clegg over 9 years ago

Naming containers sounds like a good idea anyway, but seems tangential. Unless dockercleaner is supposed to pay attention to the names, perhaps in order to exempt non-Crunch containers from automatic removal...?

Updated by Tom Clegg over 9 years ago

Should dockercleaner also delete all stopped containers that are already present when it starts up? This would help keep a long-running (e.g., bare metal) worker node clean.

If/when we do add this, I think it should have a separate command line flag, to support a workflow like

1. Turn off dockercleaner
2. Run a job
3. Turn on dockercleaner --leave-existing-containers
4. Inspect the container left behind by the above job, but let subsequent jobs get cleaned up

Until then, there's "docker ps --filter status=exited --format {{.ID}} | xargs docker rm".
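A startup sweep with the proposed opt-out could look something like the sketch below. It is a guess at the shape, not an implemented feature: `--leave-existing-containers` is only the flag name suggested in this comment, the function names are hypothetical, and `client` is assumed to offer `containers(all=True, filters=...)` and `remove_container(id, v=True)` as docker-py's low-level API client does.

```python
import argparse


def parse_args(argv):
    """Parse the cleaner's command line, including the proposed opt-out flag."""
    parser = argparse.ArgumentParser(description="Docker cleaner sketch")
    parser.add_argument(
        "--leave-existing-containers", action="store_true",
        help="do not remove containers that had already exited at startup")
    return parser.parse_args(argv)


def sweep_exited_containers(client, leave_existing):
    """Remove containers that are already stopped when the cleaner starts.

    Returns the IDs that were removed; returns an empty list when the
    caller asked to leave existing containers alone.
    """
    if leave_existing:
        return []
    removed = []
    # all=True includes stopped containers; the filter keeps only exited ones.
    for container in client.containers(all=True, filters={"status": "exited"}):
        client.remove_container(container["Id"], v=True)  # v=True drops volumes too
        removed.append(container["Id"])
    return removed
```

With the flag set, a container left behind for debugging survives the restart, while containers that exit afterwards are still handled by the event listener.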

Updated by Tom Clegg over 9 years ago

7444-dockercleaner-containers @ e10ccab

7444-no-docker-rm @ 07beca7

Updated by Brett Smith over 9 years ago

Tom Clegg wrote:

Naming containers sounds like a good idea anyway, but seems tangential.

You are right, it is not necessary for the dockercleaner changes. I previously had an implementation idea based on naming containers predictably and having crunch-job remove them. This is basically a remnant of that; there was still interest in naming as a debugging aid.