Feature #8186
closed
[Node Manager] Support (ephemeral) EBS storage for AWS node types that do not have instance storage, like the M4/C4 classes.
Description
Currently node manager only distinguishes between cloud instance types. Enable the admin to specify the amount of additional storage for specific instance types on AWS.
[Size m4.large]
cores = 2
scratch = 500
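As a rough illustration of how a per-size setting like this could be read, here is a minimal sketch using Python's standard configparser. The helper name read_sizes and the idea of collecting every option as an integer are assumptions made for this example, not a description of how Node Manager actually loads its configuration.

```python
import configparser

# Example configuration text, mirroring the section shown in the description.
EXAMPLE_CONFIG = """
[Size m4.large]
cores = 2
scratch = 500
"""


def read_sizes(text):
    """Hypothetical helper: map each "Size <name>" section to its integer options."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    sizes = {}
    for section in parser.sections():
        if section.startswith("Size "):
            name = section[len("Size "):]
            sizes[name] = {
                key: parser.getint(section, key)
                for key in parser.options(section)
            }
    return sizes


sizes = read_sizes(EXAMPLE_CONFIG)
print(sizes)
```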
Implementation:
Determine how much instance storage is available by default for the node type. If additional space is needed, attach an EBS device.
This is configured by passing ex_blockdevicemappings to libcloud's create_node(), and is documented at https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_BlockDeviceMapping.html
Disks should be VolumeType: 'gp2' (General Purpose SSD), have DeleteOnTermination: true, and specify a VolumeSize: that makes up the difference between instance storage (if any) and the required space.
The compute node boot scripts are expected to discover both instance and EBS storage devices and combine them into a single logical partition / file system. In the above example, after boot time configuration the resulting node should have a single 500 GB file system for scratch space.
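The mapping described above could be built along these lines. This is only a sketch: the function name scratch_block_device_mappings is hypothetical, both sizes are assumed to already be in GB, and the DeviceName default is a placeholder (the EC2 BlockDeviceMapping structure has a DeviceName field, which the description does not mention). The returned list is the kind of value that would be passed as ex_blockdevicemappings.

```python
def scratch_block_device_mappings(required_gb, instance_gb, device_name="/dev/xvdt"):
    """Hypothetical helper: return a block device mapping list that makes up
    the difference between instance storage and the required scratch space.

    Both sizes are assumed to be in GB. An empty list means the instance
    storage alone is enough and no EBS volume is requested.
    """
    extra_gb = required_gb - instance_gb
    if extra_gb <= 0:
        return []
    return [{
        "DeviceName": device_name,        # placeholder default, an assumption
        "Ebs": {
            "VolumeType": "gp2",          # General Purpose SSD
            "DeleteOnTermination": True,  # ephemeral: removed with the node
            "VolumeSize": extra_gb,       # only the missing space
        },
    }]


# m4.large has no instance storage, so the whole 500 GB comes from EBS.
print(scratch_block_device_mappings(500, 0))
```

With a node type that already provides enough instance storage, the helper returns an empty list, so the caller can simply omit the keyword argument.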
Updated by Brett Smith about 10 years ago
- Target version set to Arvados Future Sprints
Updated by Tom Morris almost 9 years ago
- Target version changed from Arvados Future Sprints to 2017-06-21 sprint
Updated by Tom Morris almost 9 years ago
- Target version changed from 2017-06-21 sprint to 2017-07-05 sprint
Updated by Peter Amstutz almost 9 years ago
- Target version changed from 2017-07-05 sprint to 2017-06-21 sprint
- Story points set to 1.0
Updated by Peter Amstutz almost 9 years ago
2017-06-12 19:27:01 ComputeNodeMonitorActor.3be867299275.dynamic.compute.4xphq.arvadosapi.com[12399] DEBUG: Not eligible for shut down because node state is ('unpaired', 'closed', 'boot wait', 'idle exceeded')
2017-06-12 19:27:01 ComputeNodeSetupActor.d5fe26afe32c[12399] INFO: Sending create_node request for node size Medium Instance.
scratch, size 20 4
kw {'ex_userdata': 'https://4xphq.arvadosapi.com/arvados/v1/nodes/4xphq-7ekkf-hp20ntfzt46cuo7/ping?ping_secret=3x5mf7ig73ydwngh2nw6j8xvynffo2us5zuesoqe7kzfeaq4dm', 'ex_blockdevicemappings': [{'Ebs': {'DeleteOnTermination': True, 'VolumeType': 'gp2', 'VolumeSize': 16}}], 'name': 'testing2.4xphq.arvadosapi.com'}
2017-06-12 19:27:02 ComputeNodeSetupActor.d5fe26afe32c[12399] WARNING: Re-raising error (no retry): InvalidBlockDeviceMapping: Missing device name
Traceback (most recent call last):
File "/home/tetron/arvados/services/nodemanager/arvnodeman/computenode/__init__.py", line 78, in retry_wrapper
ret = orig_func(self, *args, **kwargs)
File "/home/tetron/arvados/services/nodemanager/arvnodeman/computenode/dispatch/__init__.py", line 133, in create_cloud_node
self.arvados_node)
File "/home/tetron/arvados/services/nodemanager/arvnodeman/computenode/driver/__init__.py", line 181, in create_node
raise create_error
BaseHTTPError: InvalidBlockDeviceMapping: Missing device name
2017-06-12 19:27:02 ComputeNodeSetupActor.d5fe26afe32c[12399] ERROR: Actor error InvalidBlockDeviceMapping: Missing device name
2017-06-12 19:27:02 ComputeNodeSetupActor.d5fe26afe32c[12399] INFO: finished
Updated by Peter Amstutz almost 9 years ago
Fixed, set scratch space block device to /dev/xvdt
Updated by Lucas Di Pentima almost 9 years ago
- Several tests are failing with this message: AttributeError: 'MockSize' object has no attribute 'scratch'
- File services/nodemanager/arvnodeman/computenode/driver/ec2.py
  - Line 73: Is Arvados/SLURM scratch value always an int? Or would it be convenient to force that division to be an int?
  - Line 79: gp2 Ebs sizes go from 1 to 16384 (as per the documentation), should we cap the requested size between these values?
- It seems that FakeAwsDriver isn’t used on an integration test, missing commit?
Updated by Peter Amstutz almost 9 years ago
Lucas Di Pentima wrote:
- Several tests are failing with this message: AttributeError: 'MockSize' object has no attribute 'scratch'
Fixed.
- File services/nodemanager/arvnodeman/computenode/driver/ec2.py
  - Line 73: Is Arvados/SLURM scratch value always an int? Or would it be convenient to force that division to be an int?
I coerced it to int() and also added +1 to round up.
  - Line 79: gp2 Ebs sizes go from 1 to 16384 (as per the documentation), should we cap the requested size between these values?
Done.
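For readers following the review, the two changes just described (integer coercion with +1 rounding, then capping to the gp2 range) might look roughly like the sketch below. The function and constant names are hypothetical, and the assumption that scratch is given in MB and divided by 1000 is mine; the thread only says that a division is involved.

```python
# gp2 volume size limits in GB, as quoted in the review comment above.
GP2_MIN_GB = 1
GP2_MAX_GB = 16384


def ebs_volume_size_gb(scratch_mb, instance_disk_gb):
    """Hypothetical helper: size of the EBS volume to request, in GB.

    Assumes scratch_mb is in MB and instance_disk_gb is in GB.
    Returns 0 when instance storage already covers the request.
    """
    # Coerce to int and add 1 so a fractional GB is never rounded away.
    scratch_gb = int(scratch_mb / 1000) + 1
    needed_gb = scratch_gb - instance_disk_gb
    if needed_gb <= 0:
        return 0
    # Keep the request inside the range accepted for gp2 volumes.
    return max(GP2_MIN_GB, min(needed_gb, GP2_MAX_GB))


print(ebs_volume_size_gb(19500, 4))
```

Note that int(...) + 1 also adds a gigabyte when the value is already an exact multiple of 1000; math.ceil would avoid that, at the cost of differing from the approach described in the reply.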
- It seems that FakeAwsDriver isn’t used on an integration test, missing commit?
I was using it for manual testing. It really just reports node sizes that look like ec2 nodes instead of the default (which look like Azure node sizes).
Updated by Lucas Di Pentima almost 9 years ago
Just a couple of details:
- Could you add a comment regarding EBS hardcoded limits? Maybe in the future that changes.
- If we're accepting a request with more storage than we can provide, should we log a warning message?
Running services/nodemanager tests locally, one test fails:
======================================================================
ERROR: test_arvados_node_not_cleaned_after_shutdown_cancelled (tests.test_computenode_dispatch_slurm.SLURMComputeNodeShutdownActorTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/lucas/arvados_local/tmp/VENVDIR/local/lib/python2.7/site-packages/mock/mock.py", line 1305, in patched
return func(*args, **keywargs)
File "/home/lucas/arvados_local/services/nodemanager/tests/test_computenode_dispatch.py", line 241, in test_arvados_node_not_cleaned_after_shutdown_cancelled
self.check_success_flag(False, 2)
File "/home/lucas/arvados_local/services/nodemanager/tests/test_computenode_dispatch.py", line 197, in check_success_flag
last_flag = self.shutdown_actor.success.get(self.TIMEOUT)
File "/home/lucas/arvados_local/tmp/VENVDIR/local/lib/python2.7/site-packages/pykka/threading.py", line 52, in get
compat.reraise(*self._data['exc_info'])
File "/home/lucas/arvados_local/tmp/VENVDIR/local/lib/python2.7/site-packages/pykka/compat.py", line 12, in reraise
exec('raise tp, value, tb')
File "/home/lucas/arvados_local/tmp/VENVDIR/local/lib/python2.7/site-packages/pykka/actor.py", line 431, in ask
self.tell(message)
File "/home/lucas/arvados_local/tmp/VENVDIR/local/lib/python2.7/site-packages/pykka/actor.py", line 398, in tell
raise ActorDeadError('%s not found' % self)
ActorDeadError: ComputeNodeShutdownActor (urn:uuid:d7382f42-a9d0-47ec-b5b1-8ee97ccb8255) not found
The rest LGTM. Thanks.
Updated by Peter Amstutz almost 9 years ago
- Status changed from New to Resolved
- % Done changed from 50 to 100
Applied in changeset arvados|commit:f054bc3d7d3d26962e62c2ea7c27214b08e85bb6.