Bug #14977

closed

[arvados-dispatch-cloud] kill crunch-run procs for containers that are deleted or have state=Cancelled when dispatcher starts up

Added by Tom Clegg almost 6 years ago. Updated over 5 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Start date: 03/18/2019
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -
Release relationship: Auto

Description

Currently, a container that has state==Cancelled when arvados-dispatch-cloud starts up will never be added to the container queue, even if its UUID appears on an instance's probe result. Also, a container that has been deleted from the database will never have an entry added/updated in the dispatcher's container queue.

The scheduler's sync() func is responsible for killing unneeded crunch-run processes, but it only looks at the container queue, so these crunch-run processes are allowed to run forever.

Proposed solution:

In (*scheduler.Scheduler)sync(), kill anything returned by sch.pool.Running() that isn't returned by sch.queue.Entries(). This should be safe from "kill crunch-run before seeing its UUID in the queue" races:
  • at least one "get entire queue from controller/database" has succeeded before the first call to sync()
  • UUIDs are added to Running() only during (*Scheduler)runQueue(), which does not run concurrently with (*Scheduler)sync().
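
The comparison proposed above can be sketched as follows. This is only an illustration under stated assumptions, not the dispatcher's actual code: `staleContainers`, `killStale`, and the `kill` callback are hypothetical names, `running` stands in for whatever sch.pool.Running() returns, and `queued` stands in for the set of UUIDs in sch.queue.Entries().

```go
package main

import "sort"

// staleContainers returns the UUIDs that appear in running (a stand-in for
// the result of sch.pool.Running()) but not in queued (a stand-in for the
// UUIDs present in sch.queue.Entries()). Both parameter shapes are
// assumptions made for this sketch, not the real Arvados types.
func staleContainers(running []string, queued map[string]bool) []string {
	var stale []string
	for _, uuid := range running {
		if !queued[uuid] {
			stale = append(stale, uuid)
		}
	}
	sort.Strings(stale) // deterministic order for logging and tests
	return stale
}

// killStale invokes kill (a hypothetical callback) once for every
// crunch-run process whose container is not in the queue.
func killStale(running []string, queued map[string]bool, kill func(uuid string)) {
	for _, uuid := range staleContainers(running, queued) {
		kill(uuid)
	}
}
```

With running containers "c-cancelled", "c-running", and "c-deleted" and a queue holding only "c-running", `killStale` would call `kill` for "c-cancelled" and "c-deleted". The two conditions listed above are what make it safe to treat "not in the queue" as "unneeded".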

In (*container.Queue)poll(), if a container's UUID is in the local queue but is not returned by the API calls that request that specific UUID, delete it from the local queue. The "get missing containers" loop will need to be more careful to avoid accidentally deleting containers when the API server chooses to return less than a full page of results.
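
One way to make the "get missing containers" loop tolerate short pages is to keep re-requesting the UUIDs that have not come back yet, and to treat a UUID as deleted only once a request returns none of the outstanding UUIDs. The sketch below assumes a hypothetical `fetchFunc` in place of the real API call; `findDeleted` is likewise an illustrative name, not a function in the Arvados source.

```go
package main

import "sort"

// fetchFunc is a hypothetical stand-in for the API call that requests
// specific container UUIDs. It may return fewer items than requested,
// even when more of the requested containers exist.
type fetchFunc func(uuids []string) (found []string, err error)

// findDeleted returns the UUIDs from missing that the API no longer knows
// about. A short page alone is never taken as proof of deletion: the
// outstanding UUIDs are requested again until a request makes no progress.
func findDeleted(missing []string, fetch fetchFunc) ([]string, error) {
	pending := make(map[string]bool, len(missing))
	for _, uuid := range missing {
		pending[uuid] = true
	}
	for len(pending) > 0 {
		batch := make([]string, 0, len(pending))
		for uuid := range pending {
			batch = append(batch, uuid)
		}
		found, err := fetch(batch)
		if err != nil {
			// On error, report nothing as deleted; the caller can retry later.
			return nil, err
		}
		progress := false
		for _, uuid := range found {
			if pending[uuid] {
				delete(pending, uuid)
				progress = true
			}
		}
		if !progress {
			// None of the outstanding UUIDs came back: they are gone.
			break
		}
	}
	deleted := make([]string, 0, len(pending))
	for uuid := range pending {
		deleted = append(deleted, uuid)
	}
	sort.Strings(deleted)
	return deleted, nil
}
```

The caller would then remove each returned UUID from the local queue. The loop terminates because every iteration either shrinks the pending set or exits.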


Files

14977.png (17.7 KB), Tom Clegg, 03/15/2019 08:46 PM

Subtasks 1 (0 open, 1 closed)

Task #14982: Review 14977-kill-if-not-in-queue (Resolved, Peter Amstutz, 03/18/2019)

Related issues 1 (0 open, 1 closed)

Blocks Arvados - Story #14807: [arvados-dispatch-cloud] Features/fixes needed before first production deploy (Resolved, Tom Clegg, 01/29/2019)
