Bug #22771

closed

crunch-run handle failure to load image & mark node as broken when out of disk space

Added by Peter Amstutz 12 months ago. Updated 11 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Crunch
Target version:
Story points:
-
Release relationship:
Auto

Description

From #22617

Apr 10 08:49:48 crunch-run[1136]: tordo-dz642-oljtukc4zlwwwpl 2025-04-10T12:49:48.166087313Z loaded image: response {"errorDetail":{"message":"write /usr/local/lib/R/site-library/BH/include/boost/archive/iterators/xml_escape.hpp: no space left on device"},"error":"write /usr/local/lib/R/site-library/BH/include/boost/archive/iterators/xml_escape.hpp: no space left on device"}
  1. We get a response with "errorDetail", but the Docker SDK doesn't set the "err" return value. We should unmarshal the response, check whether "error" is non-empty, and if so return it as an error.
  2. The substring "no space left on device" should be added to the broken node blacklist.

Subtasks 1 (0 open, 1 closed)

Task #22796: Review 22617-docker-load-error (Resolved, Peter Amstutz, 04/28/2025)

Related issues 1 (1 open, 0 closed)

Related to Arvados - Feature #22770: Improve logging and error reporting when crunch-run fails to load a Docker image/start container (New)
Actions #1

Updated by Peter Amstutz 12 months ago

  • Position changed from -950389 to -950377
  • Status changed from New to In Progress
Actions #2

Updated by Peter Amstutz 12 months ago

  • Status changed from In Progress to New
  • Category set to Crunch
  • Subject changed from Handle failure to load image & mark node as broken when out of disk space to crunch-run handle failure to load image & mark node as broken when out of disk space
Actions #3

Updated by Peter Amstutz 12 months ago

  • Description updated (diff)
Actions #4

Updated by Peter Amstutz 12 months ago

  • Target version changed from Development 2025-04-16 to Development 2025-04-30
Actions #5

Updated by Tom Clegg 11 months ago

  • Assigned To set to Tom Clegg
Actions #6

Updated by Tom Clegg 11 months ago

  • Subtask #22796 added
Actions #7

Updated by Tom Clegg 11 months ago

22617-docker-load-error @ 28e67f3dc610b8882c24a1060c8983f2fe3cd25e -- developer-run-tests: #4755

  • All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
    • ✅ Propagate errors from the docker daemon response
    • ✅ Add "no space left on device" to list of phrases that indicate a broken node
    • ✨ Improve some repetitive error messages ("while loading image: while loading image: ...")
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • n/a
  • Code is tested and passing, both automated and manual; what manual testing was done is described.
    • ✅ Add test for error propagation
  • New or changed UI/UX has gotten feedback from stakeholders.
    • n/a
  • Documentation has been updated.
    • n/a
  • Behaves appropriately at the intended scale (describe intended scale).
    • n/a
  • Considered backwards and forwards compatibility issues between client and server.
    • n/a
  • Follows our coding standards and GUI style guidelines.
Actions #8

Updated by Peter Amstutz 11 months ago

LGTM. There's a separate recurring issue with the tests failing with "ValueError: cannot mmap an empty file", but this isn't the only branch affected by that.

Actions #9

Updated by Peter Amstutz 11 months ago

  • Release set to 78
Actions #10

Updated by Peter Amstutz 11 months ago

  • Status changed from New to In Progress
Actions #11

Updated by Tom Clegg 11 months ago

Rebased to fix wrong issue # in commit messages, and merged as 22771-docker-load-error.

Actions #12

Updated by Tom Clegg 11 months ago

  • Status changed from In Progress to Resolved
Actions #13

Updated by Tom Clegg 7 months ago

  • Related to Feature #22770: Improve logging and error reporting when crunch-run fails to load a Docker image/start container added