Bug #13022
crunch-run broken container loop
Status: Closed
% Done: 100%
Description
https://workbench.9tee4.arvadosapi.com/container_requests/9tee4-xvhdp-vopb57pt6o9eij1#Log
Failed partway through initialization:
2018-02-01T20:05:03.402107528Z While attaching container stdout/stderr streams: cannot connect to the Docker daemon. Is 'docker daemon' running on this host?: dial unix /var/run/docker.sock: connect: no such file or directory
2018-02-01T20:05:03.470730548Z Running [arv-mount --unmount-timeout=8 --unmount /tmp/crunch-run.9tee4-dz642-gobx4a24ihi8xpj.743593838/keep576772597]
Then it gets stuck in a loop trying to re-run the container:
2018-02-01T20:06:03.263329220Z Creating Docker container
2018-02-01T20:06:03.267277338Z While creating container: Error response from daemon: Conflict. The name "/9tee4-dz642-gobx4a24ihi8xpj" is already in use by container d2fd14fd8d99ff51fb31b489c285eb767a0309cc64d37317250ce5c0ee7b5802. You have to remove (or rename) that container to be able to reuse that name.
2018-02-01T20:06:03.345808678Z Running [arv-mount --unmount-timeout=8 --unmount /tmp/crunch-run.9tee4-dz642-gobx4a24ihi8xpj.248318477/keep062669320]
In addition, arv-mount apparently gets terminated (maybe by slurm doing killpg?) but the run directory is left in /tmp and there is a dangling mountpoint in mtab.
Looking at compute0.9tee4, I saw evidence (garbage in /tmp) that this has happened before.
Updated by Peter Amstutz almost 7 years ago
- Status changed from New to In Progress
Updated by Peter Amstutz almost 7 years ago
- Status changed from In Progress to New
Updated by Tom Clegg almost 7 years ago
End of slurm-2297.out on compute0.9tee4, whose temp dir was not removed:
9tee4-dz642-ml66hs2c3ook85z 2018-02-02T21:18:46.856481052Z Complete
9tee4-dz642-ml66hs2c3ook85z 2018-02-02T21:18:47.260490146Z Running [arv-mount --unmount-timeout=8 --unmount /tmp/crunch-run.9tee4-dz642-ml66hs2c3ook85z.721044901/keep566145728]
slurmstepd: error: *** JOB 2297 CANCELLED AT 2018-02-02T21:18:47 *** on compute0
slurmstepd: error: _slurm_cgroup_destroy: problem deleting step cgroup path /sys/fs/cgroup/freezer/slurm/uid_0/job_2297/step_batch: Device or resource busy
Calling stopSignals() before CleanupDirs() means we abandon CleanupDirs() when crunch-dispatch-slurm sends a[nother] TERM signal after the container has exited.
AFAIK we always want to do an orderly shutdown no matter when we get SIGTERM, so the solution seems to be:
- remove stopSignals() entirely
- hold cStateLock in CommitLogs() to prevent the signal handler from using CrunchLog while CommitLogs is closing it and swapping it out for a new one
Updated by Tom Clegg almost 7 years ago
13022-tmp-cleanup @ a6228b5228c807fcee897da58d18ee542e930d77
Updated by Anonymous almost 7 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|0fab8a581c4a711408150ed64ce909d9afda7829.
Updated by Tom Clegg almost 7 years ago
- Related to Bug #13095: when slurm murders a crunch2 job because it exceeds the memory limit, the container is left with a null `log` added