Feature #23204

a-c-r should be able to map an S3 input to a Directory

Added by Brett Smith 5 months ago. Updated 5 months ago.

Status:
New
Priority:
Normal
Assigned To:
-
Category:
CWL
Target version:
-
Story points:
-

Description

I wrote a workflow that takes a Directory input and uses it under InitialWorkDirRequirement.listing. Then I ran the workflow with that input pointing at an entire S3 bucket (an s3:// URL with no key under it). Early on a-c-r logged this error:

INFO Using Arvados credential […]
INFO S3 downloads will use AWS access key id AKIA[…]
INFO Checking Keep for s3://test-curii-brett
DEBUG Found ETag values {}
DEBUG Sending GET request with headers {}
INFO Beginning download of s3://test-curii-brett
WARNING Download error: [Errno 21] Is a directory: ''
Traceback (most recent call last):
  File "/opt/arvados-py/lib/python3.11/site-packages/arvados_cwl/pathmapper.py", line 184, in v
    results = s3_to_keep(self.arvrunner.api,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/arvados-py/lib/python3.11/site-packages/arvados/_internal/s3_to_keep.py", line 126
    return url_to_keep(api, _Downloader(api, get_botoclient(botosession, unsigned_requests)),
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/arvados-py/lib/python3.11/site-packages/arvados/_internal/to_keep_util.py", line 2
    req = downloader.download(url, headers)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/arvados-py/lib/python3.11/site-packages/arvados/_internal/s3_to_keep.py", line 61,
    self.target = self.collection.open(self.name, "wb")
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/arvados-py/lib/python3.11/site-packages/arvados/collection.py", line 367, in open
    raise IOError(errno.EISDIR, "Is a directory", path)
IsADirectoryError: [Errno 21] Is a directory: ''

The workflow then proceeded to run with an empty directory named d41d8cd98f00b204e9800998ecf8427e+0 (i.e., the empty collection PDH) in the working directory.
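For reference, that name is what you'd expect for the empty collection: the portable data hash is the MD5 of the collection's manifest text followed by "+" and its length in bytes, so an empty manifest yields:

```python
import hashlib

# Arvados portable data hash: MD5 hex digest of the manifest text,
# then "+" and the manifest length in bytes. An empty manifest (b"")
# produces the well-known empty-collection PDH.
manifest_text = b""
pdh = f"{hashlib.md5(manifest_text).hexdigest()}+{len(manifest_text)}"
print(pdh)  # d41d8cd98f00b204e9800998ecf8427e+0
```

So the failed download silently degraded into staging an empty collection rather than aborting the run.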

IMO a-c-r should recursively download the entire bucket into a collection and stage that. This should also work with a subdirectory inside a bucket.
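A sketch of the key-to-path mapping such recursive staging would need — this is not a-c-r's actual code, and it assumes the common S3 convention that zero-byte keys ending in "/" are folder placeholders (the kind of key that, opened naively, gives the empty file name seen in the traceback above):

```python
# Hypothetical helper: map S3 object keys under a prefix to relative
# paths suitable for writing into a collection. Directory placeholder
# keys (ending in "/") are skipped instead of being opened as files.
def keys_to_collection_paths(keys, prefix=""):
    paths = {}
    for key in keys:
        if not key.startswith(prefix):
            continue
        rel = key[len(prefix):].lstrip("/")
        if not rel or key.endswith("/"):
            continue  # skip folder placeholder objects
        paths[key] = rel
    return paths

# With boto3 the keys would come from a list_objects_v2 paginator, e.g.:
#   paginator = client.get_paginator("list_objects_v2")
#   for page in paginator.paginate(Bucket="test-curii-brett", Prefix=prefix):
#       for obj in page.get("Contents", []):
#           ...  # download obj["Key"] into the mapped collection path
```

Filtering on a Prefix is also what makes the subdirectory case fall out naturally: the same listing call with `Prefix="some/dir/"` stages just that subtree.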

Actions #1

Updated by Brett Smith 5 months ago

  • Description updated (diff)
Actions #2

Updated by Brett Smith 5 months ago

  • Release deleted (83)