Bug #19699
Closed
HTTP download creates collections with too-long names, needs flag to run in runner process after submission
100%
Description
Customer issues that came up:
- Customer needs to run on a lot of samples which means launching a whole bunch of a-c-r processes. The feature that automatically transfers from HTTP to Keep happens before submitting the workflow. This will scale much better if the data transfer happens on the compute node when the workflow actually launches.
- The collection is named "Downloaded from http://..." and if the URL is too long, it will exceed the 255 character limit on collection names. a-c-r needs to account for the limit (probably also including the timestamp that gets added by ensure_unique_name) and trim the name to a valid length so it won't get rejected.
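The trimming described in the second item could look roughly like the sketch below. It is illustrative only, not the code on the branch: the function name, the shape of the suffix reserved for ensure_unique_name, and the use of "..." as a truncation marker are all assumptions.

```python
MAX_NAME_LEN = 255  # collection name limit cited in this issue

# Assumption: ensure_unique_name appends a timestamp suffix shaped
# roughly like this one, so room is reserved for it up front.
UNIQUE_SUFFIX_RESERVE = len(" (2022-11-07T21:46:37.123Z)")


def trimmed_collection_name(url, prefix="Downloaded from ",
                            reserve=UNIQUE_SUFFIX_RESERVE):
    """Build the collection name and trim it to fit within the limit.

    Hypothetical helper: the result is at most MAX_NAME_LEN - reserve
    characters, so a suffix of `reserve` characters can still be appended.
    """
    budget = MAX_NAME_LEN - reserve
    name = prefix + url
    if len(name) <= budget:
        return name
    marker = "..."
    # Keep the start of the name and mark the truncation.
    return name[:budget - len(marker)] + marker
```

Keeping the start of the URL preserves the host and leading path, which is usually the most recognizable part of the name.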
Files
Updated by Peter Amstutz about 2 years ago
- Status changed from New to In Progress
Updated by Peter Amstutz about 2 years ago
- File arvados-cwl-runner-2.5.0.dev20221107214637.tar.gz added
Testing package
Updated by Peter Amstutz about 2 years ago
Updated test package, adds --defer-download and --varying-url-params
Example command line:
arvados-cwl-runner --defer-download --varying-url-params=AWSAccessKeyId,Signature,Expires workflow.cwl params.yml
- --defer-download will perform the download after the workflow is submitted (when the runner process on the compute node actually starts).
- --varying-url-params tells it to ignore these URL query parameters from any HTTP URLs when checking to see if a URL has already been downloaded to Keep.
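As a rough illustration of the --varying-url-params idea, the sketch below removes the listed query parameters from a URL so that two signed URLs for the same object compare equal. It uses only the Python standard library; the function name is hypothetical and this is not necessarily how arvados-cwl-runner does the comparison internally.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_varying_params(url, varying_params):
    """Return url with the named query parameters removed.

    Hypothetical helper for comparing URLs that differ only in
    per-request values such as signatures or expiry times.
    """
    varying = set(varying_params)
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key not in varying]
    # Rebuild the URL with only the stable query parameters.
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

With the parameters from the example command line, two pre-signed URLs for the same file reduce to the same string, so a lookup keyed on that string can find the earlier download.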
Updated by Peter Amstutz about 2 years ago
Updated by Peter Amstutz about 2 years ago
- File arvados-cwl-runner-2.5.0.dev20221109033130.tar.gz added
Updated test package, adds another option --prefer-cached-downloads
Example command line:
arvados-cwl-runner --defer-download --varying-url-params=AWSAccessKeyId,Signature,Expires --prefer-cached-downloads workflow.cwl params.yml
- --defer-download will perform the download after the workflow is submitted (in the runner process on the compute node).
- --varying-url-params tells it to ignore the listed URL query parameters from any HTTP URLs when checking to see if a URL has already been downloaded to Keep.
- --prefer-cached-downloads says that if the URL is found in Keep, use it without any further checking. This means changes in the upstream resource won't be detected, but it also means it will not error out if the upstream resource becomes inaccessible.
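The trade-off behind --prefer-cached-downloads can be pictured with the small decision sketch below. All names are hypothetical and the real runner's logic may differ. The point is only that with the flag set, a cached copy is used without contacting the upstream server at all.

```python
def choose_source(cached_locator, prefer_cached, upstream_unchanged):
    """Decide whether to reuse a cached copy or download again.

    cached_locator: identifier of an earlier download in Keep, or None
        (hypothetical stand-in for the result of a lookup).
    prefer_cached: True when --prefer-cached-downloads was given.
    upstream_unchanged: callable that contacts the server and returns
        True if the resource has not changed; it may raise if the
        server is unreachable.
    """
    if cached_locator is None:
        return "download"
    if prefer_cached:
        # No request is made, so an unreachable server cannot cause an
        # error, but upstream changes also go unnoticed.
        return "use-cache"
    return "use-cache" if upstream_unchanged() else "download"
```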
Updated by Peter Amstutz about 2 years ago
- Target version changed from 2022-11-09 sprint to 2022-11-23 sprint
Updated by Peter Amstutz about 2 years ago
- Related to Bug #19688: Launch registered workflows faster added
Updated by Peter Amstutz about 2 years ago
- File arvados-cwl-runner-2.5.0.dev20221111145442.tar.gz added
Updated package, no changes to downloading behavior, includes bug fix on #19688
Updated by Peter Amstutz about 2 years ago
19699-cwl-http-dl @ 420b3b25875fd56814d1ff9027b9283ff4446571
- See comment 8 for list of new options
- This branch is on #19688 so you should review that one first
Updated by Peter Amstutz about 2 years ago
19699-cwl-http-dl @ c31c6528cac695bc86d4244516e07ea316cac979
Rebased to get test fixes
Updated by Lucas Di Pentima about 2 years ago
Just a couple comments:
- The a-c-r options page (user/cwl/cwl-run-options.html) needs to be updated with these new flags. I think this feature may also deserve a proper doc section, but maybe should not block this story.
- IIRC, compute nodes don't have internet access by default. If this is the case, do you think it would be worth reminding users of this potential issue when documenting --defer-download?
Updated by Peter Amstutz about 2 years ago
Lucas Di Pentima wrote in #note-14:
Just a couple comments:
- The a-c-r options page (user/cwl/cwl-run-options.html) needs to be updated with these new flags. I think this feature may also deserve a proper doc section, but maybe should not block this story.
You're right, I forgot about docs. Let's keep the issue open and I'll follow up.
- IIRC, compute nodes don't have internet access by default. If this is the case, do you think it would be worth reminding users of this potential issue when documenting --defer-download?
arvados-cwl-runner always has network access to the API enabled. Compute nodes can be firewalled off from the general Internet but that's something you need to configure at the gateway level which isn't part of our standard configuration.
Updated by Peter Amstutz about 2 years ago
- Target version changed from 2022-11-23 sprint to 2022-12-07 Sprint
Updated by Peter Amstutz about 2 years ago
19699-cwl-dl-docs @ 3cdc1e47bf435c364644ce8ef792cb42e95ac183
- Update table of options
- Add section about downloading from HTTP
Updated by Lucas Di Pentima about 2 years ago
- At cwl-style.html.textile file:
  - Line 175: The first sentence is the same as the section's title.
  - Line 211: $(runtime.outdir) formatting is missing.
  - Lines 255, 256, 257: Ignoring formatting of flags with ==, I think it would look nicer if they're formatted in monospaced font like other variables, commands, etc.
  - Line 263: I think the example command would be better formatted inside a codeblock.
- Even though the feature is fully described on the guide, I think we could clarify a bit more about its utility, for example: time savings, reduced traffic costs, enhanced automation, wdyt?
The rest LGTM.
Updated by Peter Amstutz about 2 years ago
Lucas Di Pentima wrote in #note-19:
- At cwl-style.html.textile file:
  - Line 175: The first sentence is the same as the section's title.
  - Line 211: $(runtime.outdir) formatting is missing.
  - Lines 255, 256, 257: Ignoring formatting of flags with ==, I think it would look nicer if they're formatted in monospaced font like other variables, commands, etc.
  - Line 263: I think the example command would be better formatted inside a codeblock.
- Even though the feature is fully described on the guide, I think we could clarify a bit more about its utility, for example: time savings, reduced traffic costs, enhanced automation, wdyt?
The rest LGTM.
Addressed above comments
19699-cwl-dl-docs @ da952d583d65e9c6c7ff24ae40c4e0d0a21efd22
Updated by Peter Amstutz about 2 years ago
- Status changed from In Progress to Resolved