Bug #22971

closed

a-d-c apparently lost track of a node with an unused instance type

Added by Brett Smith 10 months ago. Updated about 2 months ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Story points: -
Release relationship: Auto

Description

Steps to reproduce:

  1. Have a-d-c start a node and have it running.
  2. Remove that node's instance type from /etc/arvados/config.yml.
  3. Restart a-d-c with the node still running.

A user did this, and a-d-c appears to have stopped supervising the node started in step 1. The node was left running for months.


Subtasks 1 (0 open, 1 closed)

Task #23393: Review 22971-drain-unknown-instance-type (Resolved, Tom Clegg, 01/21/2026)
Actions #1

Updated by Brett Smith 10 months ago

  • Description updated (diff)
Actions #2

Updated by Tom Clegg 3 months ago

source:lib/dispatchcloud/worker/pool.go

func (wp *Pool) sync(threshold time.Time, instances []cloud.Instance) {
        //...
                itTag := inst.Tags()[wp.tagKeyPrefix+tagKeyInstanceType]
                it, ok := wp.instanceTypes[itTag]
                if !ok {
                        wp.logger.WithField("Instance", inst.ID()).Errorf("unknown InstanceType tag %q --- ignoring", itTag)
                        continue
                }
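
To illustrate why the `continue` above would explain the report, here is a minimal, self-contained sketch of the same lookup-and-skip pattern. The `Instance` and `InstanceType` types and the `syncSkippingUnknown` function are hypothetical simplifications for illustration, not the real dispatchcloud types. The point is only that an instance whose type tag is no longer in the configured map never reaches the tracked set, so nothing downstream would ever consider shutting it down.

```go
package main

import "fmt"

// InstanceType and Instance are simplified, hypothetical stand-ins for the
// real dispatchcloud types; only the fields needed for this sketch exist.
type InstanceType struct{ Name string }

type Instance struct {
	ID      string
	TypeTag string // value of the InstanceType tag on the cloud instance
}

// syncSkippingUnknown mimics the pattern quoted above: an instance whose
// type tag is missing from the configured map is skipped with "continue",
// so it is never added to the set of tracked instances.
func syncSkippingUnknown(configured map[string]InstanceType, instances []Instance) (tracked, ignored []string) {
	for _, inst := range instances {
		if _, ok := configured[inst.TypeTag]; !ok {
			ignored = append(ignored, inst.ID)
			continue
		}
		tracked = append(tracked, inst.ID)
	}
	return tracked, ignored
}

func main() {
	// "large" has been removed from the configuration, as in step 2.
	configured := map[string]InstanceType{"small": {Name: "small"}}
	instances := []Instance{
		{ID: "i-1", TypeTag: "small"},
		{ID: "i-2", TypeTag: "large"},
	}
	tracked, ignored := syncSkippingUnknown(configured, instances)
	fmt.Println("tracked:", tracked)
	fmt.Println("ignored:", ignored)
}
```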
Actions #3

Updated by Tom Clegg 3 months ago

  • Target version set to Development 2026-01-21
  • Assigned To set to Tom Clegg
  • Status changed from New to In Progress
Actions #4

Updated by Tom Clegg 3 months ago

Actions #5

Updated by Tom Clegg 2 months ago

  • Subtask #23393 added
Actions #6

Updated by Brett Smith 2 months ago · Edited

Tom Clegg wrote in #note-4:

22971-drain-unknown-instance-type @ 0a53ff54e368c4fddcb00b139fdfb6891d099982 -- developer-run-tests: #5003

I would just like to wordsmith the logs a little. I have two concerns:

  1. The next time we see this issue months from now, and we're just reading the logs, I don't think it will be clear that "invalid" means "not in cluster configuration."
  2. I feel less strongly about this, but I think it would be nice to write the action first, then explain why it was taken.

Taken together, something like: "< draining | holding > instance with unconfigured type and IdleBehavior = %s". Happy to workshop it in chat if you want. This can be merged once we're on the same page.
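
As a rough sketch of the action-first wording proposed above, the message could be built like this. The `unconfiguredTypeLog` helper, the string constants, and the rule that only a "hold" value maps to "holding" are assumptions made for illustration; they are not taken from the merged branch.

```go
package main

import "fmt"

// Hypothetical IdleBehavior values used only for this sketch.
const (
	idleRun   = "run"
	idleHold  = "hold"
	idleDrain = "drain"
)

// unconfiguredTypeLog puts the action first and the reason second, following
// the suggested wording. Assumption: an instance is held only when its
// IdleBehavior is "hold"; any other value results in draining.
func unconfiguredTypeLog(idleBehavior string) string {
	action := "draining"
	if idleBehavior == idleHold {
		action = "holding"
	}
	return fmt.Sprintf("%s instance with unconfigured type and IdleBehavior = %s", action, idleBehavior)
}

func main() {
	for _, b := range []string{idleRun, idleHold, idleDrain} {
		fmt.Println(unconfiguredTypeLog(b))
	}
}
```

Leading with the action means someone scanning the logs sees what a-d-c did before reading why, and "unconfigured type" says directly that the type is absent from the cluster configuration.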

Actions #7

Updated by Tom Clegg 2 months ago

  • Status changed from In Progress to Resolved
Actions #8

Updated by Brett Smith about 2 months ago

  • Release set to 84