Story #8561
closed[Node Manager] Pair nodes based on ec2_instance_id rather than IP address
100%
Description
Our ping script in cloud nodes ensure that compute nodes set their "ec2_instance_id" when they ping Arvados. This is a unique identifier for the node in the cloud, although exactly where it comes from in the API varies across this cloud. It is stored in the info field of the Arvados node record.
Node Manager currently compares the IP address of the cloud node record and the Arvados node record to pair nodes. Determining a node's IP address requires an extra API call in Azure, and is relatively expensive. Instead, Node Manager should compare the cloud node's unique identifier with the Arvados node's ec2_instance_id. This may require a new method on cloud node drivers to account for differences across clouds (e.g., on one cloud the identifier is "id" and on another it's "name").
- Add a new class method
node_id(cls, node)
to the cloud node drivers inarvnodeman/computenode/driver
.- In the base driver in
__init__.py
, this should raise NotImplementedError. - In the EC2 and GCE drivers, this returns
node.id
. - In the Azure driver, this returns
node.name
.
- In the base driver in
- In
arvnodeman/computenode/dispatch/__init__.py
, in theoffer_arvados_pair
method, replace the condition(arvados_node['ip_address'] in self.cloud_node.private_ips)
with(arvados_node['info'].get('ec2_instance_id') == self._cloud.node_id(self.cloud_node))
.
Set the flag for the Azure libcloud driver to tell it that it no longer needs to query node IP addresses, to improve the performance of Node Manager on Azure.
- In Node Manager's Azure compute node driver's list_nodes method, when we call
super(ComputeNodeDriver, self).list_nodes()
, add the argumentex_fetch_nic=False
tolist_nodes()
.
Updated by Brett Smith almost 9 years ago
- Target version set to Arvados Future Sprints
Updated by Brett Smith almost 9 years ago
- Target version changed from Arvados Future Sprints to 2016-03-16 sprint
Updated by Brett Smith almost 9 years ago
- Assigned To set to Radhika Chippada
Brett to detail the necessary changes.
Updated by Radhika Chippada almost 9 years ago
Brett: d43df73d4429ddd4c00c7ff47e10f9f2595d30b0 implements the steps in the story description.
However, 9 tests are failing. They are failing because arvados_node['info'].get('ec2_instance_id') is None in the test. Do we need to update the testutil.py to set this to node.id? I played with setting it in cloud_node_mock, but this is not the right place / manner. Thanks.
Updated by Brett Smith almost 9 years ago
Radhika Chippada wrote:
However, 9 tests are failing. They are failing because arvados_node['info'].get('ec2_instance_id') is None in the test. Do we need to update the testutil.py to set this to node.id? I played with setting it in cloud_node_mock, but this is not the right place / manner. Thanks.
You'll need to set it in arvados_node_mock, inside the node['info'] dictionary. The same place where ping_secret is currently set.
Updated by Radhika Chippada almost 9 years ago
You'll need to set it in arvados_node_mock, inside the node['info'] dictionary. The same place where ping_secret is currently set.
i had already tried this, but the issue I am having is what do I set 'ec2_instance_id' as? If I use node id (node_num), I am getting '2' for this where as node.id is '140130973094224'
How do I determine the right value for ec2_instance_id? Thanks.
Updated by Brett Smith almost 9 years ago
First, a bit of background: the Node Manager code is pretty consistent about using "arvados_node" and "cloud_node" names. arvados_node refers to a Node record that came from the Arvados API server, while cloud node refers to information about a node that came from the local cloud API.
In each pairing test, you'll need to ensure that ec2_instance_id matches the right attribute on the mock cloud node object that the test creates. On EC2 and GCE, the right attribute is the cloud node's id attribute. On Azure, it's the cloud node's name attribute. (Because of this split, you'll probably need to write a separate test for each cloud driver.)
As long as you make sure those fields line up in the mock objects before the tests start calling actual Node Manager code, you should be good to go.
Updated by Radhika Chippada almost 9 years ago
- Status changed from New to In Progress
(After much hair pulling) I learned that the issue is with our mock test setup. We need to add "self.cloud_factory().node_id.return_value" to test_daemon.py
Making progress ...
Updated by Radhika Chippada almost 9 years ago
Branch 8561-node-pairing b720184c408dee0e3e9bbafb424bd736703be172
- Made the code updates suggested in the description.
- There were 9 test failures that needed some test setup updates.
- Needed to add "self.cloud_factory().node_id.return_value" to test_daemon.py. I had hardcoded '2' for this (for now). Does it need to be dynamic using node_num or something? I could not tell how to get "node_num" in this context. Please clarify. Down to 2 test failures with this.
- Please shed light on how to get the other two failing tests to pass: test_node_pairing_after_arvados_update from test_daemon and test_arvados_node_match from test_computenode_dispatch
Thanks.
Updated by Radhika Chippada almost 9 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:5768e5a7559c22f86d89be65245b0d269f06cc6c.