Bug #4751
closed
[Node Manager] Can erroneously pair cloud nodes with stale Arvados node records
Added by Brett Smith about 10 years ago.
Updated almost 10 years ago.
Estimated time:
(Total: 1.00 h)
Description
Node Manager pairs cloud nodes with Arvados node records based solely on an IP address match. See arvnodeman.computenode.dispatch.ComputeNodeMonitorActor.offer_arvados_pair.
It can happen that a cloud node comes up with an IP address that happens to match a stale Arvados node record. Make the testing stricter so there's no pairing in this case.
- Description updated (diff)
- Target version changed from Bug Triage to Arvados Future Sprints
I think there are basically two possible approaches:
- EC2 compute nodes, at least, put their EC2 id in the Arvados node record's info. If we check against that, we can't go wrong, but the downside is that we have to reimplement this check for every cloud driver.
- Check that the Arvados node's first_ping_at is greater than the cloud node's boot time before accepting a pairing. This is completely generic, and very safe, although it could still go wrong if total garbage data is getting into the node records.
I think I prefer #2, but I wanted to note the alternatives at least.
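For illustration, here is a rough sketch of what approach #2 might look like. This is not the actual Node Manager code: the can_pair function, the dictionary keys on the cloud node (private_ip, boot_time), and the timestamp format are all assumptions made for the example. Only ip_address and first_ping_at are meant to mirror fields on the Arvados node record.

```python
from datetime import datetime, timezone


def parse_timestamp(value):
    """Parse a UTC timestamp like '2015-03-04T12:05:00Z' (assumed format)."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def can_pair(cloud_node, arvados_node):
    """Return True only if the IPs match and the record pinged after boot.

    cloud_node is a hypothetical dict with 'private_ip' and 'boot_time'
    (a timezone-aware datetime). arvados_node is a dict with 'ip_address'
    and 'first_ping_at' (a timestamp string, or None if it never pinged).
    """
    # The existing check: the IP addresses must match.
    if arvados_node.get("ip_address") != cloud_node["private_ip"]:
        return False
    # The stricter check: a record that has never pinged, or that first
    # pinged before this cloud node booted, must belong to an older node.
    first_ping = arvados_node.get("first_ping_at")
    if first_ping is None:
        return False
    return parse_timestamp(first_ping) > cloud_node["boot_time"]


boot = datetime(2015, 3, 4, 12, 0, 0, tzinfo=timezone.utc)
cloud = {"private_ip": "10.0.0.5", "boot_time": boot}
fresh = {"ip_address": "10.0.0.5", "first_ping_at": "2015-03-04T12:05:00Z"}
stale = {"ip_address": "10.0.0.5", "first_ping_at": "2015-03-01T08:00:00Z"}
```

With these sample values, can_pair(cloud, fresh) accepts the pairing, while can_pair(cloud, stale) rejects it even though the IP address matches. Nothing here depends on the cloud driver beyond knowing the node's boot time.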
- Target version changed from Arvados Future Sprints to 2015-03-11 sprint
Moving to current sprint because it came up again during science support, and it's likely to become more pressing now that we've increased our max_nodes setting.
I feel like this came up in an earlier code review (discussing the pitfalls of reusing compute node records generally), so it's good to tighten up the check.
4751-node-manager-stricter-node-pairing-wip LGTM
- Status changed from New to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:6be95f5c3a2fcbe6321bba52c20393060e33e637.
- Assigned To set to Brett Smith