Bug #22971
closed
a-d-c apparently lost track of a node with an unused instance type
Description
Steps to reproduce:
- Have a-d-c start a node and have it running.
- Remove that node's instance type from /etc/arvados/config.yml.
- Restart a-d-c with the node still running.
A user did this and it seems like a-d-c stopped supervising the node started in step 1. It was left running for months.
Updated by Tom Clegg 3 months ago
source:lib/dispatchcloud/worker/pool.go
func (wp *Pool) sync(threshold time.Time, instances []cloud.Instance) {
	//...
		itTag := inst.Tags()[wp.tagKeyPrefix+tagKeyInstanceType]
		it, ok := wp.instanceTypes[itTag]
		if !ok {
			wp.logger.WithField("Instance", inst.ID()).Errorf("unknown InstanceType tag %q --- ignoring", itTag)
			continue
		}
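To illustrate the effect of the `continue` in the excerpt above, here is a minimal, self-contained sketch. It is not the actual a-d-c code: the `instance` type, the `syncKnown` function, and the bare `InstanceType` tag key are hypothetical stand-ins, and the real pool does much more per instance.

```go
package main

import "fmt"

// instance is a hypothetical stand-in for cloud.Instance: just an ID and tags.
type instance struct {
	id   string
	tags map[string]string
}

// syncKnown mimics the skip shown in the excerpt. It returns the IDs of the
// instances that would still be supervised and the IDs that would be ignored
// because their instance type tag is not in the configured set.
func syncKnown(configured map[string]bool, instances []instance) (tracked, ignored []string) {
	for _, inst := range instances {
		itTag := inst.tags["InstanceType"] // assumed tag key, for illustration only
		if !configured[itTag] {
			// Same effect as the "continue" in the excerpt: nothing else
			// in the loop ever sees this instance.
			ignored = append(ignored, inst.id)
			continue
		}
		tracked = append(tracked, inst.id)
	}
	return tracked, ignored
}

func main() {
	// "large" was removed from the configuration while i-2 was still running.
	configured := map[string]bool{"small": true}
	instances := []instance{
		{id: "i-1", tags: map[string]string{"InstanceType": "small"}},
		{id: "i-2", tags: map[string]string{"InstanceType": "large"}},
	}
	tracked, ignored := syncKnown(configured, instances)
	fmt.Println("tracked:", tracked)
	fmt.Println("ignored:", ignored)
}
```

In this sketch `i-2` ends up in `ignored`. That matches the report: once the type is gone from the config, the running node is skipped on every sync and nothing shuts it down.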
Updated by Tom Clegg 3 months ago
- Target version set to Development 2026-01-21
- Assigned To set to Tom Clegg
- Status changed from New to In Progress
22971-drain-unknown-instance-type @ 0a53ff54e368c4fddcb00b139fdfb6891d099982 -- developer-run-tests: #5002
Updated by Tom Clegg 3 months ago
22971-drain-unknown-instance-type @ 0a53ff54e368c4fddcb00b139fdfb6891d099982 -- developer-run-tests: #5003
Updated by Brett Smith 2 months ago · Edited
Tom Clegg wrote in #note-4:
22971-drain-unknown-instance-type @ 0a53ff54e368c4fddcb00b139fdfb6891d099982 -- developer-run-tests: #5003
I would just like to wordsmith the logs a little. I have two concerns:
- The next time we see this issue months from now, and we're just reading the logs, I don't think it will be clear that "invalid" means "not in cluster configuration."
- I feel less strongly about this but I think it would be nice to write the action first then explain why it was taken.
Taken together, something like:
"< draining | holding > instance with unconfigured type and IdleBehavior = %s"
Happy to workshop it in chat if you want. This can be merged once we're on the same page.
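A rough sketch of how the suggested wording could be produced, action first and reason second. This is not the merged change: the `idleBehavior` type, its constants, and `unconfiguredTypeMessage` are hypothetical names, and the rule that only "hold" leads to "holding" (with everything else draining) is an assumption made for illustration.

```go
package main

import "fmt"

// idleBehavior is a hypothetical local type standing in for the worker's
// IdleBehavior setting.
type idleBehavior string

const (
	idleBehaviorRun   idleBehavior = "run"
	idleBehaviorHold  idleBehavior = "hold"
	idleBehaviorDrain idleBehavior = "drain"
)

// unconfiguredTypeMessage builds the log line in the suggested shape: the
// action taken comes first, then the reason.
// Assumption: an instance set to "hold" is held; anything else is drained.
func unconfiguredTypeMessage(ib idleBehavior) string {
	action := "draining"
	if ib == idleBehaviorHold {
		action = "holding"
	}
	return fmt.Sprintf("%s instance with unconfigured type and IdleBehavior = %s", action, ib)
}

func main() {
	for _, ib := range []idleBehavior{idleBehaviorRun, idleBehaviorHold, idleBehaviorDrain} {
		fmt.Println(unconfiguredTypeMessage(ib))
	}
}
```

Saying "unconfigured type" rather than "invalid" addresses the first concern above: someone reading the logs later can tell that the type is missing from the cluster configuration.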
Updated by Tom Clegg 2 months ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|aecd972c2a68ae84e8b2eb75c8860709bdea092f.