Feature #11100
closed[CWL] Intermediary collection handling can be specified
100%
Description
Background¶
Workflows produce a lot of intermediate collections. For production workflows that are rarely re-run, the job reuse benefits are minimal, instead this is just clutter and takes up storage space that the user would rather not pay for. This is also necessary to support a roll-in/roll-out use case where a cluster only has sufficient storage to store a few complete runs and input and output data are transferred from/to somewhere else.
Requirements¶
Should be able to specify default behavior (retain or trash) but override behavior for output of specific steps.
The final output is always retained. Input should be unaffected.
Intermediate collections need to live as long as they are in use by downstream steps. When intermediate collections are no longer needed by downstream steps, they should be trashed.
Design¶
arvados-cwl-runner submits container requests; when the container completes a collection is created and reported in output_uuid. Arvados-cwl-runner can then set the trash_at field on the collection.
- API server
- Add a "output_ttl" field to container request. This value is in seconds. When the output collection is created for the container request, it should have trash_at and delete_at set now + output_ttl (assume that tokens are issued with expiry times less than trash_at). A value of <= 0 means don't set trash_at.
- Add tests.
- Update documentation
- CWL runner
- When "intermediate output TTL" is provided, container requests are submitted with output_ttl set
- Default behavior is output_ttl is None or 0 (to be consistent with current behavior.)
- When workflow completes successfully, everything marked as intermediate should be trashed immediately. Do not do this on workflow failure.
- Provide command line option to indicate that things shouldn't be delete immediately
- Custom Arvados CWL hint to specify treatment of individual step outputs
- Update documentation
Updated by Peter Amstutz almost 8 years ago
- Description updated (diff)
- Add a "output_ttl" on container request which means output will have trash_at and delete_at set now + output_ttl (assume that tokens are issued with expiry times less than trash_at)
- Intermediate outputs can be used/reused before they are trashed
- CWL runner intermediate steps are submitted with output_ttl
- CWL options to control default treatment of step outputs
Updated by Tom Morris almost 8 years ago
- Assigned To deleted (
Peter Amstutz) - Target version changed from Arvados Future Sprints to 2017-04-12 sprint
- Story points set to 3.0
Updated by Tom Clegg almost 8 years ago
Currently it's an error to set delete_at to a time earlier than blob_signature_ttl. Some time passes between when the container request chooses a delete_at and when the collection validation checks whether delete_at is valid. This means the container request code has to add a few seconds of grace period to avoid failing during slow times. Perhaps it would be better to change the rule so, if a client sets delete_at to a value too soon to achieve safely, it automatically gets extended to the earliest possible time?
This behavior doesn't seem to be documented in either of the likely placesUpdated by Tom Clegg almost 8 years ago
11100-cr-output-ttl @ 4437774e863465c0daa41dfd9716174e18d93122
Updated by Peter Amstutz almost 8 years ago
Tom Clegg wrote:
Currently it's an error to set delete_at to a time earlier than blob_signature_ttl. Some time passes between when the container request chooses a delete_at and when the collection validation checks whether delete_at is valid. This means the container request code has to add a few seconds of grace period to avoid failing during slow times. Perhaps it would be better to change the rule so, if a client sets delete_at to a value too soon to achieve safely, it automatically gets extended to the earliest possible time?
So instead of a validation error, it would just adjust it to a valid value. Yes.
Updated by Tom Clegg almost 8 years ago
With "earliest possible delete_at":
11100-cr-output-ttl @ 65121f8db54a1ed15207d050e1f48c5fc26d646b
Updated by Peter Amstutz almost 8 years ago
Documentation of the output_ttl
should specify units.
Should the code to adjust delete_at to the earliest valid time be performed before_validation
so that the actual validation
doesn't modify the record? (Not sure if rails is opinionated about validations being pure). (default_trash_interval does something similar and is in before_validation).
Rest LGTM
Updated by Tom Clegg almost 8 years ago
Peter Amstutz wrote:
Documentation of the
output_ttl
should specify units.
Indeed. Got it in Containers API but missed it in docs. Fixed.
Should the code to adjust delete_at to the earliest valid time be performed
before_validation
so that the actualvalidation
doesn't modify the record? (Not sure if rails is opinionated about validations being pure). (default_trash_interval does something similar and is in before_validation).
Not sure about Rails either, but yes, that seems neater. Fixed.
Also added a bit to acknowledge that no tokens will expire before the previous (hence previously validated) value of delete_at.
11100-cr-output-ttl @ ff3bb22d4b2bff5666907a6eeb6cd68cd3cbe22b
Updated by Tom Clegg over 7 years ago
- Assigned To changed from Tom Clegg to Peter Amstutz
Updated by Peter Amstutz over 7 years ago
- Target version changed from 2017-04-12 sprint to 2017-04-26 sprint
Updated by Peter Amstutz over 7 years ago
- Target version changed from 2017-04-26 sprint to 2017-05-10 sprint
Updated by Peter Amstutz over 7 years ago
- Target version changed from 2017-05-10 sprint to 2017-05-24 sprint
Updated by Lucas Di Pentima over 7 years ago
Reviewed 8645888f1 - branch 11100-cwl-set-output-ttl
Some minor comments:
- File
doc/user/cwl/cwl-extensions.html.textile.liquid:L86
- duplicated ‘before’. Also s/recommend/recommended/ ? - File
sdk/cwl/arvados_cwl/arv-cwl-schema.yml:L145
- Same as previous: s/recommend/recommended/ - If there’s no way to ensure that an intermediate output collection doesn’t get trashed while a step depending on it is running, maybe we should add a log warning when every intermediate output is created with a short TTL, for future debugging purposes. Didn’t check if this is done on the API serve side.
Apart from that, LGTM. Local tests ran ok.
Updated by Peter Amstutz over 7 years ago
I decided to add --trash-intermediate/--no-trash-intermediate as distinct options from --intermediate-output-ttl.
Updated documentation.
Not really sure about logging "TTL too short". We could pick some arbitrary number (60 seconds? 3600 seconds?) but it is highly workflow dependent.
Updated by Lucas Di Pentima over 7 years ago
- File
sdk/cwl/arvados_cwl/arvcontainer.py:L178
- Should that check be also on thearvados-cwl-runner
side? - Do you think it would be useful to add a test checking the new argument
--trash-intermediate
and its default behavior? - Instead of logging when output TTL is too short, how about logging a message noticing when creating an output collection with the ttl? (I believe this would be on the api server side, right?).
Updated by Lucas Di Pentima over 7 years ago
Side note: I found a typo (Aravdos) on doc/user/cwl/cwl-runner.html.textile.liquid:129
, maybe you can add that fix to the other doc updates.
Updated by Peter Amstutz over 7 years ago
- Target version changed from 2017-05-24 sprint to 2017-06-07 sprint
Updated by Peter Amstutz over 7 years ago
Lucas Di Pentima wrote:
- File
sdk/cwl/arvados_cwl/arvcontainer.py:L178
- Should that check be also on thearvados-cwl-runner
side?
Done.
- Do you think it would be useful to add a test checking the new argument
--trash-intermediate
and its default behavior?
Yes, as it turned out I actually did forget to propagate it to the runner command line. Test added, and code fixed.
- Instead of logging when output TTL is too short, how about logging a message noticing when creating an output collection with the ttl? (I believe this would be on the api server side, right?).
Added logging CWL about intermediate outputs.
Side note: I found a typo (Aravdos) on doc/user/cwl/cwl-runner.html.textile.liquid:129, maybe you can add that fix to the other doc updates.
Fixed.
Updated by Lucas Di Pentima over 7 years ago
Please check the tests, there's one failing on me, at least when running locally:
====================================================================== FAIL: test_done (tests.test_container.TestContainer) ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/lucas/arvados_local/tmp/VENVDIR/local/lib/python2.7/site-packages/mock/mock.py", line 1305, in patched return func(*args, **keywargs) File "/home/lucas/arvados_local/sdk/cwl/tests/test_container.py", line 415, in test_done arvjob.collect_outputs.assert_called_with("keep:abc+123") File "/home/lucas/arvados_local/tmp/VENVDIR/local/lib/python2.7/site-packages/mock/mock.py", line 925, in assert_called_with raise AssertionError('Expected call: %s\nNot called' % (expected,)) AssertionError: Expected call: mock('keep:abc+123') Not called ----------------------------------------------------------------------
Updated by Peter Amstutz over 7 years ago
- Status changed from In Progress to Resolved
- % Done changed from 75 to 100
Applied in changeset arvados|commit:0da65aef99133cb76661f56f2adac83a77ad8ac9.