Actions
Feature #22770
openImprove logging and error reporting when crunch-run fails to load a Docker image/start container
Story points:
-
Description
Because of an ops bug, compute nodes did not have enough space for larger Docker images. See #22617. This led to the following series of events on a compute node:
Apr 10 08:49:47 dockerd[745]: time="2025-04-10T12:49:47.089168237Z" level=info msg="Layer sha256:5bcad8f8b858991e09c1ea59d2b1d8a1fcb424445493825972918b48c5170289 cleaned up" Apr 10 08:49:47 dockerd[745]: time="2025-04-10T12:49:47.089960686Z" level=info msg="Layer sha256:9adaba72eca85f66961ee4fb5c0b646b078377439690fa94827bf35a5f0cfd5e cleaned up" Apr 10 08:49:47 dockerd[745]: time="2025-04-10T12:49:47.090707511Z" level=info msg="Layer sha256:ccab9ec9909359830f26cd442a80b08ab27a8fd34881495da7a17ac224ceebf1 cleaned up" Apr 10 08:49:47 dockerd[745]: time="2025-04-10T12:49:47.191874552Z" level=info msg="Layer sha256:739482a9723dbee6dbdad6b669090e96d5f57e2aa27c131015cce8969d0d4efa cleaned up" Apr 10 08:49:48 crunch-run[1136]: tordo-dz642-oljtukc4zlwwwpl 2025-04-10T12:49:48.166087313Z loaded image: response {"errorDetail":{"message":"write /usr/local/lib/R/site-library/BH/include/boost/archive/iterators/xml_escape.hpp: no space left on device"},"error":"write /usr/local/lib/R/site-library/BH/include/boost/archive/iterators/xml_escape.hpp: no space left on device"} Apr 10 08:49:48 crunch-run[1136]: tordo-dz642-oljtukc4zlwwwpl 2025-04-10T12:49:48.166160514Z Creating Docker container Apr 10 08:49:48 crunch-run[1136]: tordo-dz642-oljtukc4zlwwwpl 2025-04-10T12:49:48.169728752Z error in Run: While creating container: Error response from daemon: No such image: sha256:e7b5bddabcf740c7a382991098225f067134790b480049b0b8dd53bc38d8814b Apr 10 08:49:48 crunch-run[1136]: tordo-dz642-oljtukc4zlwwwpl 2025-04-10T12:49:48.206013218Z Running [arv-mount --unmount-timeout=8 --unmount /tmp/crunch-run.tordo-dz642-oljtukc4zlwwwpl.1732696392/keep2513045995] Apr 10 08:49:48 crunch-run[1136]: 2025-04-10T12:49:48.475241603Z fusermount: failed to unmount /tmp/crunch-run.tordo-dz642-oljtukc4zlwwwpl.1732696392/keep2513045995: Invalid argument Apr 10 08:49:48 systemd[1]: tmp-crunch\x2drun.tordo\x2ddz642\x2doljtukc4zlwwwpl.1732696392-keep2513045995.mount: Succeeded. Apr 10 08:49:48 systemd[837]: tmp-crunch\x2drun.tordo\x2ddz642\x2doljtukc4zlwwwpl.1732696392-keep2513045995.mount: Succeeded. Apr 10 08:49:50 crunch-run[1136]: tordo-dz642-oljtukc4zlwwwpl 2025-04-10T12:49:50.758562807Z crunch-run finished Apr 10 08:49:50 crunch-run[1136]: tordo-dz642-oljtukc4zlwwwpl: While creating container: Error response from daemon: No such image: sha256:e7b5bddabcf740c7a382991098225f067134790b480049b0b8dd53bc38d8814b
As far as anybody has been able to tell, none of this information got propagated up to where anyone else can see it. Even in the a-d-c logs, there's just:
{"ClusterID":"tordo","ContainerUUID":"tordo-dz642-oljtukc4zlwwwpl","InstanceTypes":"m52xlarge","PID":934892,"Priority":562948209132255239,"State":"Queued","level":"info","msg":"adding container to queue","time":"2025-04-10T12:46:29.029767789Z"}
{"ClusterID":"tordo","ContainerUUID":"tordo-dz642-oljtukc4zlwwwpl","InstanceType":"m52xlarge","PID":934892,"level":"info","msg":"creating new instance","time":"2025-04-10T12:46:29.111442317Z"}
{"Address":"10.253.254.169","ClusterID":"tordo","ContainerUUID":"tordo-dz642-oljtukc4zlwwwpl","Instance":"i-0f982e9dcd477e85b","InstanceType":"m52xlarge","PID":934892,"level":"info","msg":"crunch-run process started","time":"2025-04-10T12:47:58.362721191Z"}
{"Address":"10.253.254.169","ClusterID":"tordo","ContainerUUID":"tordo-dz642-oljtukc4zlwwwpl","Instance":"i-0f982e9dcd477e85b","InstanceType":"m52xlarge","PID":934892,"level":"info","msg":"crunch-run process ended","time":"2025-04-10T12:48:16.412237895Z"}
{"ClusterID":"tordo","ContainerUUID":"tordo-dz642-oljtukc4zlwwwpl","PID":934892,"Priority":562948209132255239,"State":"Locked","level":"info","msg":"requeueing locked container because crunch-run exited","time":"2025-04-10T12:48:20.385099242Z"}
{"Address":"10.253.254.202","ClusterID":"tordo","ContainerUUID":"tordo-dz642-oljtukc4zlwwwpl","Instance":"i-01f7ed400b848c7d1","InstanceType":"m52xlarge","PID":934892,"level":"info","msg":"crunch-run process started","time":"2025-04-10T12:48:56.489456546Z"}
{"Address":"10.253.254.202","ClusterID":"tordo","ContainerUUID":"tordo-dz642-oljtukc4zlwwwpl","Instance":"i-01f7ed400b848c7d1","InstanceType":"m52xlarge","PID":934892,"level":"info","msg":"crunch-run process ended","time":"2025-04-10T12:49:56.612067002Z"}
{"ClusterID":"tordo","ContainerUUID":"tordo-dz642-oljtukc4zlwwwpl","PID":934892,"Priority":562948209132255239,"State":"Locked","level":"info","msg":"requeueing locked container because crunch-run exited","time":"2025-04-10T12:50:00.944068316Z"}
{"Address":"10.253.254.202","ClusterID":"tordo","ContainerUUID":"tordo-dz642-oljtukc4zlwwwpl","Instance":"i-01f7ed400b848c7d1","InstanceType":"m52xlarge","PID":934892,"level":"info","msg":"crunch-run process started","time":"2025-04-10T12:50:46.132529204Z"}
{"Address":"10.253.254.202","ClusterID":"tordo","ContainerUUID":"tordo-dz642-oljtukc4zlwwwpl","Instance":"i-01f7ed400b848c7d1","InstanceType":"m52xlarge","PID":934892,"level":"info","msg":"crunch-run process ended","time":"2025-04-10T12:51:36.312149612Z"}
{"ClusterID":"tordo","ContainerUUID":"tordo-dz642-oljtukc4zlwwwpl","PID":934892,"Priority":562948209132255239,"State":"Locked","level":"info","msg":"requeueing locked container because crunch-run exited","time":"2025-04-10T12:51:41.705465690Z"}
{"Address":"10.253.254.202","ClusterID":"tordo","ContainerUUID":"tordo-dz642-oljtukc4zlwwwpl","Instance":"i-01f7ed400b848c7d1","InstanceType":"m52xlarge","PID":934892,"level":"info","msg":"crunch-run process started","time":"2025-04-10T12:51:51.849035610Z"}
{"Address":"10.253.254.202","ClusterID":"tordo","ContainerUUID":"tordo-dz642-oljtukc4zlwwwpl","Instance":"i-01f7ed400b848c7d1","InstanceType":"m52xlarge","PID":934892,"level":"info","msg":"crunch-run process ended","time":"2025-04-10T12:52:46.211803691Z"}
{"ClusterID":"tordo","ContainerUUID":"tordo-dz642-oljtukc4zlwwwpl","PID":934892,"Priority":562948209132255239,"State":"Locked","level":"info","msg":"requeueing locked container because crunch-run exited","time":"2025-04-10T12:52:52.187972146Z"}
{"Address":"10.253.254.202","ClusterID":"tordo","ContainerUUID":"tordo-dz642-oljtukc4zlwwwpl","Instance":"i-01f7ed400b848c7d1","InstanceType":"m52xlarge","PID":934892,"level":"info","msg":"crunch-run process started","time":"2025-04-10T12:53:02.335367913Z"}
{"Address":"10.253.254.202","ClusterID":"tordo","ContainerUUID":"tordo-dz642-oljtukc4zlwwwpl","Instance":"i-01f7ed400b848c7d1","InstanceType":"m52xlarge","PID":934892,"level":"info","msg":"crunch-run process ended","time":"2025-04-10T12:53:56.611799367Z"}
{"ClusterID":"tordo","ContainerUUID":"tordo-dz642-oljtukc4zlwwwpl","PID":934892,"Priority":562948209132255239,"State":"Locked","level":"info","msg":"requeueing locked container because crunch-run exited","time":"2025-04-10T12:54:02.652553134Z"}
{"Address":"10.253.254.202","ClusterID":"tordo","ContainerUUID":"tordo-dz642-oljtukc4zlwwwpl","Instance":"i-01f7ed400b848c7d1","InstanceType":"m52xlarge","PID":934892,"level":"info","msg":"crunch-run process started","time":"2025-04-10T12:54:06.162243174Z"}
{"Address":"10.253.254.202","ClusterID":"tordo","ContainerUUID":"tordo-dz642-oljtukc4zlwwwpl","Instance":"i-01f7ed400b848c7d1","InstanceType":"m52xlarge","PID":934892,"level":"info","msg":"crunch-run process ended","time":"2025-04-10T12:54:56.312156013Z"}
{"ClusterID":"tordo","ContainerUUID":"tordo-dz642-oljtukc4zlwwwpl","PID":934892,"Priority":562948209132255239,"State":"Locked","level":"info","msg":"requeueing locked container because crunch-run exited","time":"2025-04-10T12:55:03.155131892Z"}
{"Address":"10.253.254.202","ClusterID":"tordo","ContainerUUID":"tordo-dz642-oljtukc4zlwwwpl","Instance":"i-01f7ed400b848c7d1","InstanceType":"m52xlarge","PID":934892,"level":"info","msg":"crunch-run process started","time":"2025-04-10T12:55:13.292038431Z"}
{"Address":"10.253.254.202","ClusterID":"tordo","ContainerUUID":"tordo-dz642-oljtukc4zlwwwpl","Instance":"i-01f7ed400b848c7d1","InstanceType":"m52xlarge","PID":934892,"level":"info","msg":"crunch-run process ended","time":"2025-04-10T12:56:05.988126775Z"}
{"ClusterID":"tordo","ContainerUUID":"tordo-dz642-oljtukc4zlwwwpl","PID":934892,"Priority":562948209132255239,"State":"Locked","level":"info","msg":"requeueing locked container because crunch-run exited","time":"2025-04-10T12:56:13.596571135Z"}
{"Address":"10.253.254.202","ClusterID":"tordo","ContainerUUID":"tordo-dz642-oljtukc4zlwwwpl","Instance":"i-01f7ed400b848c7d1","InstanceType":"m52xlarge","PID":934892,"level":"info","msg":"crunch-run process started","time":"2025-04-10T12:56:13.783122797Z"}
{"Address":"10.253.254.202","ClusterID":"tordo","ContainerUUID":"tordo-dz642-oljtukc4zlwwwpl","Instance":"i-01f7ed400b848c7d1","InstanceType":"m52xlarge","PID":934892,"level":"info","msg":"crunch-run process ended","time":"2025-04-10T12:57:06.312407334Z"}
{"ClusterID":"tordo","ContainerUUID":"tordo-dz642-oljtukc4zlwwwpl","PID":934892,"Priority":562948209132255239,"State":"Locked","level":"info","msg":"requeueing locked container because crunch-run exited","time":"2025-04-10T12:57:13.949101991Z"}
{"Address":"10.253.254.202","ClusterID":"tordo","ContainerUUID":"tordo-dz642-oljtukc4zlwwwpl","Instance":"i-01f7ed400b848c7d1","InstanceType":"m52xlarge","PID":934892,"level":"info","msg":"crunch-run process started","time":"2025-04-10T12:57:24.090142141Z"}
{"Address":"10.253.254.202","ClusterID":"tordo","ContainerUUID":"tordo-dz642-oljtukc4zlwwwpl","Instance":"i-01f7ed400b848c7d1","InstanceType":"m52xlarge","PID":934892,"level":"info","msg":"crunch-run process ended","time":"2025-04-10T12:58:16.229224379Z"}
{"ClusterID":"tordo","ContainerUUID":"tordo-dz642-oljtukc4zlwwwpl","PID":934892,"Priority":562948209132255239,"State":"Locked","level":"info","msg":"requeueing locked container because crunch-run exited","time":"2025-04-10T12:58:24.340884782Z"}
{"Address":"10.253.254.202","ClusterID":"tordo","ContainerUUID":"tordo-dz642-oljtukc4zlwwwpl","Instance":"i-01f7ed400b848c7d1","InstanceType":"m52xlarge","PID":934892,"level":"info","msg":"crunch-run process started","time":"2025-04-10T12:58:24.702358239Z"}
{"Address":"10.253.254.202","ClusterID":"tordo","ContainerUUID":"tordo-dz642-oljtukc4zlwwwpl","Instance":"i-01f7ed400b848c7d1","InstanceType":"m52xlarge","PID":934892,"Reason":"state=Cancelled","level":"info","msg":"killing crunch-run process","time":"2025-04-10T12:59:14.660141765Z"}
{"Address":"10.253.254.202","ClusterID":"tordo","ContainerUUID":"tordo-dz642-oljtukc4zlwwwpl","Instance":"i-01f7ed400b848c7d1","InstanceType":"m52xlarge","PID":934892,"Signal":15,"level":"info","msg":"sending signal","time":"2025-04-10T12:59:14.660184635Z"}
{"Address":"10.253.254.202","ClusterID":"tordo","ContainerUUID":"tordo-dz642-oljtukc4zlwwwpl","Instance":"i-01f7ed400b848c7d1","InstanceType":"m52xlarge","PID":934892,"level":"info","msg":"crunch-run process ended","time":"2025-04-10T12:59:14.678604095Z"}
{"ClusterID":"tordo","ContainerUUID":"tordo-dz642-oljtukc4zlwwwpl","PID":934892,"State":"Cancelled","level":"info","msg":"container finished -- dropping from queue","time":"2025-04-10T12:59:14.715785560Z"}
{"ClusterID":"tordo","ContainerUUID":"tordo-dz642-oljtukc4zlwwwpl","PID":934892,"State":"Cancelled","level":"info","msg":"dropping container from queue","time":"2025-04-10T12:59:14.715822850Z"}
And since this is happening at the "container retry" level of Arvados, none of this appears in user-visible logs either.
- The way it's logged, crunch-run seems to be OK with the "failed to load image" error response from Docker. That should be detected and reported immediately.
- The "can't start container: no such image" error should've also been reported up where it's more visible.
Updated by Tom Clegg 7 months ago
- Related to Bug #22771: crunch-run handle failure to load image & mark node as broken when out of disk space added
Updated by Lucas Di Pentima 7 months ago
Had 2 instances this week where this could have been super helpful to me.
Actions