Story #8561
closed
[Node Manager] Pair nodes based on ec2_instance_id rather than IP address
Added by Brett Smith almost 9 years ago.
Updated almost 9 years ago.
Assigned To:
Radhika Chippada
Estimated time:
(Total: 0.00 h)
Description
Our ping script in cloud nodes ensure that compute nodes set their "ec2_instance_id" when they ping Arvados. This is a unique identifier for the node in the cloud, although exactly where it comes from in the API varies across this cloud. It is stored in the info field of the Arvados node record.
Node Manager currently compares the IP address of the cloud node record and the Arvados node record to pair nodes. Determining a node's IP address requires an extra API call in Azure, and is relatively expensive. Instead, Node Manager should compare the cloud node's unique identifier with the Arvados node's ec2_instance_id. This may require a new method on cloud node drivers to account for differences across clouds (e.g., on one cloud the identifier is "id" and on another it's "name").
- Add a new class method
node_id(cls, node)
to the cloud node drivers in arvnodeman/computenode/driver
.
- In the base driver in
__init__.py
, this should raise NotImplementedError.
- In the EC2 and GCE drivers, this returns
node.id
.
- In the Azure driver, this returns
node.name
.
- In
arvnodeman/computenode/dispatch/__init__.py
, in the offer_arvados_pair
method, replace the condition (arvados_node['ip_address'] in self.cloud_node.private_ips)
with (arvados_node['info'].get('ec2_instance_id') == self._cloud.node_id(self.cloud_node))
.
Set the flag for the Azure libcloud driver to tell it that it no longer needs to query node IP addresses, to improve the performance of Node Manager on Azure.
- In Node Manager's Azure compute node driver's list_nodes method, when we call
super(ComputeNodeDriver, self).list_nodes()
, add the argument ex_fetch_nic=False
to list_nodes()
.
- Target version set to Arvados Future Sprints
- Target version changed from Arvados Future Sprints to 2016-03-16 sprint
- Assigned To set to Radhika Chippada
Brett to detail the necessary changes.
- Description updated (diff)
Brett: d43df73d4429ddd4c00c7ff47e10f9f2595d30b0 implements the steps in the story description.
However, 9 tests are failing. They are failing because arvados_node['info'].get('ec2_instance_id') is None in the test. Do we need to update the testutil.py to set this to node.id? I played with setting it in cloud_node_mock, but this is not the right place / manner. Thanks.
Radhika Chippada wrote:
However, 9 tests are failing. They are failing because arvados_node['info'].get('ec2_instance_id') is None in the test. Do we need to update the testutil.py to set this to node.id? I played with setting it in cloud_node_mock, but this is not the right place / manner. Thanks.
You'll need to set it in arvados_node_mock, inside the node['info'] dictionary. The same place where ping_secret is currently set.
You'll need to set it in arvados_node_mock, inside the node['info'] dictionary. The same place where ping_secret is currently set.
i had already tried this, but the issue I am having is what do I set 'ec2_instance_id' as? If I use node id (node_num), I am getting '2' for this where as node.id is '140130973094224'
How do I determine the right value for ec2_instance_id? Thanks.
First, a bit of background: the Node Manager code is pretty consistent about using "arvados_node" and "cloud_node" names. arvados_node refers to a Node record that came from the Arvados API server, while cloud node refers to information about a node that came from the local cloud API.
In each pairing test, you'll need to ensure that ec2_instance_id matches the right attribute on the mock cloud node object that the test creates. On EC2 and GCE, the right attribute is the cloud node's id attribute. On Azure, it's the cloud node's name attribute. (Because of this split, you'll probably need to write a separate test for each cloud driver.)
As long as you make sure those fields line up in the mock objects before the tests start calling actual Node Manager code, you should be good to go.
- Status changed from New to In Progress
(After much hair pulling) I learned that the issue is with our mock test setup. We need to add "self.cloud_factory().node_id.return_value" to test_daemon.py
Making progress ...
Branch 8561-node-pairing b720184c408dee0e3e9bbafb424bd736703be172
- Made the code updates suggested in the description.
- There were 9 test failures that needed some test setup updates.
- Needed to add "self.cloud_factory().node_id.return_value" to test_daemon.py. I had hardcoded '2' for this (for now). Does it need to be dynamic using node_num or something? I could not tell how to get "node_num" in this context. Please clarify. Down to 2 test failures with this.
- Please shed light on how to get the other two failing tests to pass: test_node_pairing_after_arvados_update from test_daemon and test_arvados_node_match from test_computenode_dispatch
Thanks.
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:5768e5a7559c22f86d89be65245b0d269f06cc6c.
Also available in: Atom
PDF