Project

General

Profile

Actions

Feature #20650

closed

a-c-r natively supports S3 inputs just like HTTP/S

Added by Brett Smith almost 3 years ago. Updated 6 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Peter Amstutz
Category:
CWL
Target version:
Story points:
-
Release relationship:
Auto

Description

In short, provide a native S3 version of http_to_keep. Users should be able to write S3 sources in their inputs, which a-c-r then writes to keep before executing the workflow.

This should support the existing --defer-downloads option, so the user can decide whether to copy the data when they create the container request, or let it be handled inside the container.


Subtasks 1 (0 open1 closed)

Task #22301: Review 20650-s3-downloaderResolvedLucas Di Pentima04/07/2025Actions

Related issues 1 (0 open1 closed)

Related to Arvados - Feature #22680: store persistent credentials to external resources and provide them to a containerResolvedPeter AmstutzActions
Actions #3

Updated by Peter Amstutz about 2 years ago

  • Target version changed from Future to Development 2024-03-27 sprint
Actions #4

Updated by Peter Amstutz about 2 years ago

  • Target version changed from Development 2024-03-27 sprint to Future
Actions #5

Updated by Peter Amstutz over 1 year ago

  • Target version changed from Future to Development 2024-12-04
Actions #6

Updated by Peter Amstutz over 1 year ago

  • Target version changed from Development 2024-12-04 to Development 2024-11-20
Actions #7

Updated by Peter Amstutz over 1 year ago

  • Assigned To set to Peter Amstutz
Actions #8

Updated by Peter Amstutz over 1 year ago

  • Target version changed from Development 2024-11-20 to Development 2024-12-04
Actions #9

Updated by Peter Amstutz over 1 year ago

  • Tracker changed from Idea to Feature
Actions #10

Updated by Peter Amstutz over 1 year ago

  • Target version changed from Development 2024-12-04 to Development 2025-01-08
Actions #11

Updated by Peter Amstutz over 1 year ago

  • Status changed from New to In Progress
Actions #12

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Development 2025-01-08 to Development 2025-01-29
Actions #13

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Development 2025-01-29 to Development 2025-02-12
Actions #14

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Development 2025-02-12 to Development 2025-02-26
Actions #15

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Development 2025-02-26 to Development 2025-03-19
Actions #16

Updated by Peter Amstutz about 1 year ago

20650-s3-downloader @ 8e08552a406e1d1ff133bd95a92db130b47381aa

Work in progress, just wanted to make sure it was recorded here.

Actions #17

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Development 2025-03-19 to Development 2025-04-02
Actions #18

Updated by Peter Amstutz 12 months ago

Actions #20

Updated by Peter Amstutz 12 months ago

  • Related to Feature #22680: store persistent credentials to external resources and provide them to a container added
Actions #21

Updated by Peter Amstutz 12 months ago

20650-s3-downloader @ ba346458f5ed3d318d44063c559dcbeed82a0ffb

developer-run-tests: #4712

  • All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
    • implemented s3_to_keep with similar behavior to http_to_keep (refactored to share code)
    • arvados-cwl-runner uses s3_to_keep to import from S3
    • arvados-cwl-runner captures local credentials and passes them to the container as a secret file, unless --disable-aws-credential-capture is used
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • the main follow-up is #22680 so that users can use --disable-aws-credential-capture and instead the container gets an AWS role from crunch-run
  • Code is tested and passing, both automated and manual, what manual testing was done is described.
    • Added tests for s3_to_keep
    • Added test for capturing local credentials and passing them as a secret
  • New or changed UX/UX and has gotten feedback from stakeholders.
    • n/a
  • Documentation has been updated.
    • updated "CWL run options" and "Guidelines for writing workflows" pages
  • Behaves appropriately at the intended scale (describe intended scale).
    • no new changes in scale, but similarly to the http download feature and --defer-downloads this spreads the data transfer out to potentially many compute nodes instead of a single client node needed to do all the data transfer, which can greatly improve performance for large batch jobs
  • Considered backwards and forwards compatibility issues between client and server.
    • no changes, the metadata stored for downloads has not changed
  • Follows our coding standards and GUI style guidelines.
    • yes

Users can now provide s3:// URLs in the CWL input document and the are automatically downloaded to Keep, the way http_to_keep already does it.

Because S3 is HTTP underneath and uses HTTP headers to communicate various metadata, I was able to refactor http_to_keep and share code between them.

It gets AWS credentials from the local environment using whatever behavior is implemented in boto3.

It will do the file transfer locally before submitting the container unless --defer-downloads is provided. In that case, it will pass the s3:// to the workflow runner container and capture the credentials from local environment (whatever boto3 would have used).

I went back and forth on whether arvados-python-client should include boto3 as a standard dependency or an optional dependency. On the one hand, it's extra thing to carry around which is inelegant for users who are not using it. On the other hand, it is extra friction to have to request it explicitly for users who do want to use it. I decided to split the difference and make it so that it is only imported on first use when a s3 URL is found, so in the case where there are no s3 URLs, it is not imported at all.

See https://workbench.tordo.arvadosapi.com/processes/tordo-xvhdp-j1qpgyuhap0t83b for example of this in action.

Actions #22

Updated by Peter Amstutz 12 months ago

  • Target version changed from Development 2025-04-02 to Development 2025-04-16
Actions #23

Updated by Lucas Di Pentima 12 months ago

Sorry for the late review. Here're my comments/questions:

  • Trying to understand all the combinations, does this allows to use --defer-downloads and --disable-aws-credentials-capture? Looking at pathmapper.py line 123, it seems that this is not possible, as the credentials would still be passed on to the container.
    • I think this would be a useful combination for cases where the compute nodes have access to the target S3 buckets via an instance profile and IAM role, for example.
    • I think a new test would be needed to try the combination "deferred download without credentials" making sure we're not leaking anything.
  • In case of multiple AWS profiles, it might cause confusion to the user on why they're not able to download from the bucket if the correct credentials are not stored in the default profile.
    • Maybe we need an additional a-c-r argument for this?
  • Would it be useful (and easy enough) to add a validation step to check that the s3 URLs are indeed accessible with the given credentials, when using --defer-downloads?
  • File doc/user/cwl/cwl-style.html.textile.liquid: Typo at line 288: 'Arvado' and missing 'in' at the end of the last sentence.
Actions #24

Updated by Peter Amstutz 12 months ago

Lucas Di Pentima wrote in #note-23:

Sorry for the late review. Here're my comments/questions:

  • Trying to understand all the combinations, does this allows to use --defer-downloads and --disable-aws-credentials-capture? Looking at pathmapper.py line 123, it seems that this is not possible, as the credentials would still be passed on to the container.
    • I think this would be a useful combination for cases where the compute nodes have access to the target S3 buckets via an instance profile and IAM role, for example.

That is exactly how it is intended to work. The logic may be a bit hard to read, but what it is saying is to create a boto session when:

defer_downloads is False (i.e. we are going to download files right now)

or

aws_credential_capture is True (i.e. we are going to pass to the container)

So in the case that defer_downloads is True and aws_credential_capture is False, then it will not create a boto session.

  • I think a new test would be needed to try the combination "deferred download without credentials" making sure we're not leaking anything.

I added a test for this, and it confirmed the logic was working as intended.

  • In case of multiple AWS profiles, it might cause confusion to the user on why they're not able to download from the bucket if the correct credentials are not stored in the default profile.
    • Maybe we need an additional a-c-r argument for this?

I'm assuming users should be able to use the AWS_PROFILE environment variable to select a different profile, but you're right that it might be more accessible if there was add a command line parameter to select a profile.

  • Would it be useful (and easy enough) to add a validation step to check that the s3 URLs are indeed accessible with the given credentials, when using --defer-downloads?

I don't think that is a good idea, doing a bunch of network requests to verify that resources exist still takes time (sometimes a lot of time), which goes against the goal of options like --defer-downloads which is to defer the work and launch the workflow as fast as possible.

  • File doc/user/cwl/cwl-style.html.textile.liquid: Typo at line 288: 'Arvado' and missing 'in' at the end of the last sentence.

20650-s3-downloader @ 5d8f237ce20d9100019cca4394b6f79ef98e171f

developer-run-tests: #4724

Actions #25

Updated by Lucas Di Pentima 12 months ago

This LGTM, thanks!

Actions #26

Updated by Peter Amstutz 12 months ago

  • Status changed from In Progress to Resolved
Actions #27

Updated by Brett Smith 6 months ago

  • Release set to 79
Actions

Also available in: Atom PDF