Bug #4751
closed
[Node Manager] Can erroneously pair cloud nodes with stale Arvados node records
100%
Description
Node Manager pairs cloud nodes with Arvados node records based solely on an IP address match. See arvnodeman.computenode.dispatch.ComputeNodeMonitorActor.offer_arvados_pair.
It can happen that a cloud node comes up with an IP address that happens to match a stale Arvados node record. Make the testing stricter so there's no pairing in this case.
Updated by Tom Clegg about 10 years ago
- Target version changed from Bug Triage to Arvados Future Sprints
Updated by Brett Smith almost 10 years ago
I think there are basically two possible approaches:
- EC2 compute nodes, at least, put their EC2 id in the Arvados node record's info. If we check against that, we can't go wrong—but it has the downside of meaning we have to reimplement this check for every cloud driver.
- Check that the Arvados node's first_ping_at is later than the cloud node's boot time before accepting a pairing. This is completely generic, and very safe, although it could still go wrong if total garbage data is getting into the node records.
I think I prefer #2, but I wanted to note the alternatives at least.
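For illustration, the second approach could look roughly like the sketch below. It is not the code from the eventual changeset. The function names `parse_timestamp` and `may_pair` are hypothetical, and it assumes the Arvados node record is a dict with `ip_address` and `first_ping_at` fields, where the timestamp is a UTC string such as `2015-03-02T18:01:30Z`.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_timestamp(value: Optional[str]) -> Optional[datetime]:
    """Parse a UTC timestamp like '2015-03-02T18:01:30Z' (assumed format)."""
    if not value:
        return None
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def may_pair(cloud_ip: str, cloud_boot_time: datetime, arvados_node: dict) -> bool:
    """Hypothetical stricter pairing test.

    Accept a pairing only if the IP addresses match AND the record's
    first ping happened after the cloud node booted. A stale record that
    merely shares the IP address has an older first_ping_at and is rejected.
    """
    if arvados_node.get("ip_address") != cloud_ip:
        return False
    first_ping = parse_timestamp(arvados_node.get("first_ping_at"))
    if first_ping is None:
        # The node has never pinged, so it cannot be this cloud node yet.
        return False
    return first_ping > cloud_boot_time


boot = datetime(2015, 3, 2, 18, 0, 0, tzinfo=timezone.utc)
stale = {"ip_address": "10.0.0.5", "first_ping_at": "2015-02-20T09:00:00Z"}
fresh = {"ip_address": "10.0.0.5", "first_ping_at": "2015-03-02T18:01:30Z"}
print(may_pair("10.0.0.5", boot, stale), may_pair("10.0.0.5", boot, fresh))
```

Because the comparison uses only the record's ping time and the cloud node's boot time, nothing in it is specific to one cloud driver, which is the appeal of this option over checking the EC2 id.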
Updated by Brett Smith almost 10 years ago
- Target version changed from Arvados Future Sprints to 2015-03-11 sprint
Moving to current sprint because it came up again during science support, and it's likely to become more pressing now that we've increased our max_nodes setting.
Updated by Peter Amstutz almost 10 years ago
I feel like this came up in an earlier code review (discussing the pitfalls of reusing compute node records generally), so it's good to tighten up the check.
4751-node-manager-stricter-node-pairing-wip LGTM
Updated by Brett Smith almost 10 years ago
- Status changed from New to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:6be95f5c3a2fcbe6321bba52c20393060e33e637.