- crunch-dispatch-slurm cancels the slurm job as soon as it notices the container has priority=0. crunch-run catches SIGTERM and tries to write the buffered output and logs, but (according to sample logs) seems to give up 30-40 seconds later without actually writing them.
As for "seems to give up 30-40 seconds later": KillWait=30 is set in su92l's slurm config.
"To cancel a job, invoke scancel without --signal option. This will send first a SIGCONT to all steps to eventually wake them up followed by a SIGTERM, then wait the KillWait duration defined in the slurm.conf file and finally if they have not terminated send a SIGKILL. This gives time for the running job/step(s) to clean up." -- slurm docs
(Note "time to clean up" is not necessarily enough time to clean up.)
- even if crunch-run gets that far, it seems apiserver would refuse to update the output or log field of a container whose state is Cancelled.
(It would be helpful to check server logs to determine whether crunch-run is in fact getting as far as calling collections#create and containers#update.)
In the reported example, crunch-run did not get this far.
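For reference, a minimal Go sketch of the constraint described above (not the actual crunch-run code; names and structure are assumed): after scancel sends SIGTERM, the process has roughly KillWait seconds to flush whatever it wants to report before slurm follows up with SIGKILL, which cannot be caught.

    package main

    import (
        "log"
        "os"
        "os/signal"
        "syscall"
        "time"
    )

    func main() {
        sigc := make(chan os.Signal, 1)
        signal.Notify(sigc, syscall.SIGTERM)

        <-sigc // scancel sends SIGCONT (to wake suspended steps), then SIGTERM
        log.Print("caught SIGTERM; trying to save buffered logs and output")

        done := make(chan struct{})
        go func() {
            // ...stop the container, write the log and output collections,
            // update the container record...
            close(done)
        }()

        select {
        case <-done:
            log.Print("cleanup finished in time")
        case <-time.After(30 * time.Second): // KillWait=30 in su92l's slurm.conf
            // In practice this branch is moot: slurm's SIGKILL arrives at about
            // the same moment and ends the process mid-cleanup.
            log.Print("gave up")
        }
    }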
There are a few other oddities here:
- crunch-run ignores every signal after the first one, even if its attempt to stop the container in response to that first signal failed
- crunch-run uses ContainerStop with a 10 nanosecond timeout, which means "send TERM; if still alive after 10ns, send KILL". This was probably intended to be 10 seconds, the default for the "docker container stop" CLI command, but we pass 10ns to the docker client library, which rounds it down to "t=0" for the docker API call. (Does docker take this to mean "no timeout" or "timeout immediately"? See the sketch after this list.)
- crunch-run doesn't log anything when ContainerStop succeeds, so we don't know how long that docker API call took (it might have taken 0s, 10s, 30s, ...).
- crunch-run didn't log either "Container exited with code: ..." or "container wait: ...", which seems to mean ContainerWait() didn't return within KillWait=30 seconds of the first scancel.
- https://github.com/moby/moby/issues/34879 says short timeouts (like 0) can cause "docker stop" to take ~60 seconds even though the container stops quickly. crunch-run's stop() function holds the cStateLock until ContainerStop() returns, although even that doesn't seem to explain why crunch-run didn't say "Container exited with code: ..." within 30 seconds.
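Here is the sketch referenced above: the suspected timeout pitfall, using assumed names and the docker client API of that era (not the actual crunch-run code). time.Duration counts nanoseconds, so a bare 10 is 10ns, and the client truncates it to whole seconds ("t=0") when building the API request.

    package example

    import (
        "context"
        "time"

        "github.com/docker/docker/client"
    )

    // stopContainer illustrates the bug: time.Duration(10) is 10 nanoseconds,
    // not 10 seconds. The docker client rounds the value down to whole seconds
    // for the /containers/{id}/stop API call, so the daemon sees t=0.
    func stopContainer(ctx context.Context, docker *client.Client, containerID string) error {
        timeout := time.Duration(10) // almost certainly meant 10 * time.Second
        return docker.ContainerStop(ctx, containerID, &timeout)
    }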
...but the basic problem is that crunch-run gets only a slurm-config-determined amount of time to write logs and output, after which it is killed with no chance to report anything. For most programs this at least ensures jobs can be killed even if they ignore signals, but not in this case, because crunch-run isn't the only thing that needs to be killed. If crunch-run is still alive because it's trying to kill the docker container, killing crunch-run looks final from slurm's perspective, but it doesn't account for the docker container itself, which can then stay alive/running for a long time without being visible to slurm or Arvados.
According to experiments, calling scancel --signal=TERM on a queued job waits for the job to start, then sends TERM. This seems to mean the appropriate choice of scancel command depends on whether the slurm job is running:
"scancel --name=$uuid --signal=TERM --state=running"
...if the job is running. "Notify crunch-run that we want to stop/cancel the container."
"scancel --name=$uuid --signal=TERM --state=suspended"
...if the job is suspended. Unless/until we're ready to implement "suspend" by pausing the container, we should probably be ignoring SIGSTOP (slurm will send a STOP signal but currently that will just pause crunch-run, not the container).
"scancel --name=$uuid --state=pending"
...if the slurm job is not running yet (and again with --state=suspended for good measure?). "Tell slurm not to bother starting our job."
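A hypothetical sketch (assumed names, not existing crunch-dispatch-slurm code) of choosing the scancel invocation according to the cases above:

    package example

    import "os/exec"

    // scancelCmd picks the scancel invocation for a container's slurm job,
    // per the cases above, based on the job's current state.
    func scancelCmd(jobName, state string) *exec.Cmd {
        switch state {
        case "RUNNING":
            // Notify crunch-run that we want to stop/cancel the container.
            return exec.Command("scancel", "--name="+jobName, "--signal=TERM", "--state=running")
        case "SUSPENDED":
            return exec.Command("scancel", "--name="+jobName, "--signal=TERM", "--state=suspended")
        default:
            // Job hasn't started yet: tell slurm not to bother starting it.
            return exec.Command("scancel", "--name="+jobName, "--state=pending")
        }
    }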