Bug #18026
[singularity] crunch-run image cache race condition
Status: Closed (100% done)
Description
This is from https://dev.arvados.org/issues/17755#note-26:
I think the first two failures are due to a race condition between two crunch-run processes trying to convert and cache the singularity image.
This attempt found the collection, but apparently before the cached image had been created (which is not supposed to happen):
2021-08-06T21:26:16.608212665Z Using Docker image id "sha256:337550d506a3fc77e30292bba95108f1cd34a33719f0dd997d0de0540522def7"
2021-08-06T21:26:16.608240987Z Loading Docker image from keep
2021-08-06T21:26:17.188261275Z building singularity image
2021-08-06T21:26:17.223285265Z [singularity build /tmp/crunch-run.tordo-dz642-pez87oegh5fgbs7.099258167/keep207354666/by_uuid/tordo-4zz18-t0wx4utpwx4ligf/image.sif docker-archive:///tmp/crunch-run-singularity-917581112/image.tar]
2021-08-06T21:26:26.466105517Z INFO: Starting build...
2021-08-06T21:26:26.466105517Z FATAL: While performing build: conveyor failed to get: Error loading tar component 337550d506a3fc77e30292bba95108f1cd34a33719f0dd997d0de0540522def7.json: open /tmp/crunch-run-singularity-917581112/image.tar: no such file or directory
2021-08-06T21:26:26.466234171Z error in Run: While loading container image: exit status 255
2021-08-06T21:26:26.466268708Z error in CaptureOutput: error scanning files to copy to output: lstat "/var/spool/cwl": lstat /tmp/crunch-run.tordo-dz642-pez87oegh5fgbs7.099258167/tmp701045228: no such file or directory
2021-08-06T21:26:26.605452965Z Cancelled
On the second attempt, it tried to create a collection with the same temporary name (identical down to the exact timestamp?) and that failed:
2021-08-06T21:26:47.149336231Z Executing container 'tordo-dz642-amjt50vnz4qyn4n'
...
2021-08-06T21:26:47.972965997Z error in Run: While loading container image: error creating 'singularity image for sha256:337550d506a3fc77e30292bba95108f1cd34a33719f0dd997d0de0540522def7 2021-08-06T21:26:47Z' collection: request failed: https://tordo.arvadosapi.com/arvados/v1/collections: 422 Unprocessable Entity: //railsapi.internal/arvados/v1/collections: 422 Unprocessable Entity: #<ActiveRecord::RecordNotUnique: PG::UniqueViolation: ERROR: duplicate key value violates unique constraint "index_collections_on_owner_uuid_and_name"
2021-08-06T21:26:47.972965997Z DETAIL: Key (owner_uuid, name)=(tordo-j7d0g-7p82g804nk5l7gx, singularity image for sha256:337550d506a3fc77e30292bba95108f1cd34a33719f0dd997d0de0540522def7 2021-08-06T21:26:47Z) already exists.
2021-08-06T21:26:47.972965997Z : INSERT INTO "collections" ("owner_uuid", "created_at", "modified_by_user_uuid", "modified_at", "portable_data_hash", "updated_at", "uuid", "manifest_text", "name", "properties", "delete_at", "file_names", "trash_at", "current_version_uuid") VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14) RETURNING "id"> (req-8t57dqc95orqsvelydce)
2021-08-06T21:26:48.136224600Z Cancelled
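For context on the second failure: the collection name in the error includes a timestamp with only one-second resolution ("... 2021-08-06T21:26:47Z"), so two crunch-run processes that reach the caching step within the same second construct an identical (owner_uuid, name) pair and trip the unique index. A rough illustration of how such a name would be built (hypothetical variable names, not the actual crunch-run code):

package main

import (
	"fmt"
	"time"
)

func main() {
	// Hypothetical illustration: the cache collection name combines the image
	// digest with a second-resolution timestamp, so two processes running in
	// the same second produce the same name and collide on the
	// "index_collections_on_owner_uuid_and_name" unique index.
	imageID := "sha256:337550d506a3fc77e30292bba95108f1cd34a33719f0dd997d0de0540522def7"
	name := fmt.Sprintf("singularity image for %s %s", imageID, time.Now().UTC().Format(time.RFC3339))
	fmt.Println(name) // e.g. "singularity image for sha256:3375... 2021-08-06T21:26:47Z"
}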
Updated by Ward Vandewege over 3 years ago
- Related to Story #17755: Test singularity support on a cloud cluster by running some real workflows added
Updated by Peter Amstutz over 3 years ago
- Subject changed from [singularity] crunch-run race condition to [singularity] crunch-run image cache race condition
- Category set to Crunch
Updated by Peter Amstutz over 3 years ago
17755-singularity-race @ 22d2299b6d6280726c9692b72e2109a4fa768969
- Adds ensure_unique_name to prevent two processes from creating the same "in progress" cache collection at the same time (see the sketch below)
- Adds a Stat() check on the docker image tarfile that singularity is about to read. This is a harmless (but possibly also useless) nudge on arv-mount to let it know we are about to read the file.
On further research, I'm still unsure about what happened in the first case (where it gets "no such file or directory" on both reading the docker image and trying to read the output directory). The arv-mount.txt log is empty. This makes me suspicious that arv-mount is either not ready or has crashed. The Stat() check will give a slightly more direct signal about whether the kernel thinks the Docker image tarfile exists.
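For illustration, here is a minimal sketch of what the ensure_unique_name change amounts to, using the Go SDK's generic request helper. The function, variable names, and surrounding code are assumptions, not the actual crunch-run implementation. With ensure_unique_name set, the API server renames the new collection instead of returning the 422 seen above when another process has just created one with the same name.

package example

import (
	"fmt"
	"time"

	"git.arvados.org/arvados.git/sdk/go/arvados"
)

// createCacheCollection is a hypothetical sketch: create the "in progress"
// image cache collection with ensure_unique_name, so that two crunch-run
// processes starting in the same second cannot collide on the
// (owner_uuid, name) unique index.
func createCacheCollection(client *arvados.Client, ownerUUID, imageID string) (arvados.Collection, error) {
	var coll arvados.Collection
	name := fmt.Sprintf("singularity image for %s %s", imageID, time.Now().UTC().Format(time.RFC3339))
	err := client.RequestAndDecode(&coll, "POST", "arvados/v1/collections", nil, map[string]interface{}{
		"collection": map[string]interface{}{
			"owner_uuid": ownerUUID,
			"name":       name,
		},
		"ensure_unique_name": true,
	})
	return coll, err
}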
Updated by Peter Amstutz over 3 years ago
- Status changed from New to In Progress
Updated by Tom Clegg over 3 years ago
Peter Amstutz wrote:
> 17755-singularity-race @ 22d2299b6d6280726c9692b72e2109a4fa768969
> - Adds ensure_unique_name to prevent two processes from creating the same "in progress" cache collection at the same time

LGTM

> - Adds a Stat() check on the docker image tarfile that singularity is about to read. This is a harmless (but possibly also useless) nudge on arv-mount to let it know we are about to read the file.
> On further research, I'm still unsure about what happened in the first case (where it gets "no such file or directory" on both reading the docker image and trying to read the output directory). The arv-mount.txt log is empty. This makes me suspicious that arv-mount is either not ready or has crashed. The Stat() check will give a slightly more direct signal about whether the kernel thinks the Docker image tarfile exists.

Perhaps "start arv-mount and use it without waiting for it to be ready" worked reliably in the past only because our Docker image checking/loading inserted a substantial delay between "start" and "use", but now that image loading uses the fuse mount, we're sometimes hitting it before it's ready.
If that's true, I suppose we need a "wait for arv-mount to be ready" polling loop or something. Meanwhile/regardless, the Stat() check seems like it will be either harmless or helpful in revealing what's happening.

LGTM
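For reference, a readiness check of that kind can be as simple as retrying Stat() on a path inside the mount until it succeeds or a deadline passes. The sketch below is hypothetical (assumed probe path, timeout, and poll interval), not the existing crunch-run loop.

package example

import (
	"fmt"
	"os"
	"time"
)

// waitForMount is a hypothetical "wait for arv-mount to be ready" poll: keep
// retrying Stat() on a path that should exist once the fuse mount is serving
// requests, and give up after the deadline.
func waitForMount(probePath string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		_, err := os.Stat(probePath)
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("arv-mount not ready after %v: %w", timeout, err)
		}
		time.Sleep(100 * time.Millisecond)
	}
}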
Updated by Tom Clegg over 3 years ago
Uh, on second thought
- we already have a "wait for arv-mount to be ready" polling loop, as Peter pointed out in chat
- shouldn't the new Stat() call be
- if _, err := os.Stat(dockerImageID); err != nil {
+ if _, err := os.Stat(imageTarballPath); err != nil {
?
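For context, imageTarballPath is presumably the same path that gets passed to singularity build as the docker-archive source (see the command in the log above). A hypothetical sketch of the corrected check in place, with assumed names and no claim to match the actual crunch-run code:

package example

import (
	"fmt"
	"os"
	"os/exec"
)

// buildSIF is a hypothetical sketch: stat the tarball that singularity is
// about to read, so a missing or not-yet-ready file produces a clear error
// (and gives arv-mount a nudge) before the build is started.
func buildSIF(sifPath, imageTarballPath string) error {
	if _, err := os.Stat(imageTarballPath); err != nil {
		return fmt.Errorf("docker image tarball is not readable: %w", err)
	}
	cmd := exec.Command("singularity", "build", sifPath, "docker-archive://"+imageTarballPath)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}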
Updated by Peter Amstutz over 3 years ago
Tom Clegg wrote:
> Uh, on second thought
> - we already have a "wait for arv-mount to be ready" polling loop, as Peter pointed out in chat
> - shouldn't the new Stat() call be
> [...]
> ?

Fixed.
17755-singularity-race @ adc1cd0362bb9fd201b8c470eaf816ca5c212c72
Updated by Peter Amstutz over 3 years ago
The other problem (accessing the keep mount) seems to be separate; it is being investigated in #18027.
Updated by Peter Amstutz over 3 years ago
- Related to Bug #18027: [crunch-run] upload failed (singularity) added
Updated by Peter Amstutz over 3 years ago
- Status changed from In Progress to Resolved