Feature #12430

open

Crunch2 limit output collection to glob patterns

Added by Peter Amstutz over 6 years ago. Updated over 1 year ago.

Status: New
Priority: Normal
Assigned To: -
Category: -
Target version: -
Start date:
Due date:
% Done: 0%
Estimated time:
Story points: -
Release:
Release relationship: Auto

Description

The current behavior of crunch-run is to upload all files in the output directory. This sometimes results in temporary files being uploaded that are not intended to be part of the output. I propose adding an "output_glob" field: an array of filenames or glob patterns specifying which files and directories should be uploaded.
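To make the proposal concrete, here is a minimal sketch of how an "output_glob" list might filter the contents of an output directory. The function names `matches` and `select_outputs` are hypothetical, and the matching rules (segment-by-segment matching, a pattern that names a directory selects everything under it, no `**` support) are assumptions for illustration, not something crunch-run is known to implement.

```python
from fnmatch import fnmatchcase


def matches(path, pattern):
    """Return True if `pattern` matches `path` or one of its parent directories.

    Assumption: patterns are compared one path segment at a time, so `*`
    never crosses a "/" boundary.
    """
    path_parts = path.split("/")
    pattern_parts = pattern.split("/")
    if len(pattern_parts) > len(path_parts):
        return False
    # zip() stops at the shorter list, so a pattern that matches a leading
    # directory also selects every file below that directory.
    return all(fnmatchcase(seg, pat) for seg, pat in zip(path_parts, pattern_parts))


def select_outputs(paths, output_glob):
    """Keep only the paths matched by at least one pattern in output_glob."""
    return [p for p in paths if any(matches(p, g) for g in output_glob)]


# Hypothetical contents of an output directory, relative to its root.
files = [
    "result.bam",
    "result.bam.bai",
    "tmp/scratch.dat",
    "logs/run.log",
    "logs/debug/trace.log",
]

selected = select_outputs(files, ["*.bam", "logs"])
print(selected)
```

With these assumed rules, the index file and the scratch file are left out, while `result.bam` and everything under `logs` would be uploaded.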


Related issues

Related to Arvados - Bug #9964: [CWL][Crunch2][Crunch] crunchrunner should use CWL globs to output data to keep (New, Tom Morris, 09/07/2016)

Actions #1

Updated by Peter Amstutz over 6 years ago

  • Description updated (diff)
Actions #2

Updated by Tom Clegg over 6 years ago

I'm not keen on this feature. It seems to creep in an awkward direction:
  1. output everything in this dir
  2. output everything in this dir that matches this glob
  3. output everything in this dir that matches any of these globs
  4. output everything in this dir that matches any of these globs, but not this glob
  5. output everything in this dir that matches any of these globs, and apply this path translation

Ideally this can all be done inside the container instead, using the shell or some other programming language of your choice. You could also add a subsequent step to the workflow that rearranges/extracts the desired files (a useful pattern for other situations too, like improving container reuse in downstream work that doesn't need to see the entire output).
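As a rough illustration of the "do it inside the container" approach, the sketch below copies only the wanted files from a scratch working directory into a clean output directory as the last step of a job. The function name `collect_outputs`, the directory layout, and the patterns are all hypothetical; a shell one-liner could serve the same purpose.

```python
import shutil
import tempfile
from pathlib import Path


def collect_outputs(work_dir, output_dir, patterns):
    """Copy files under work_dir that match any pattern into output_dir.

    Relative paths are preserved. Returns the relative paths that were
    copied, in the order they were matched.
    """
    work = Path(work_dir)
    out = Path(output_dir)
    copied = []
    for pattern in patterns:
        for src in sorted(work.glob(pattern)):
            if not src.is_file():
                continue
            dest = out / src.relative_to(work)
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dest)
            copied.append(src.relative_to(work).as_posix())
    return copied


# Demo with a throwaway directory tree standing in for the container's
# working directory and its (initially empty) output directory.
with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp) / "work"
    out = Path(tmp) / "out"
    (work / "reports").mkdir(parents=True)
    for name in ["result.bam", "scratch.tmp", "reports/summary.html"]:
        (work / name).write_text("data")

    copied = collect_outputs(work, out, ["*.bam", "reports/*.html"])
    kept = sorted(p.relative_to(out).as_posix() for p in out.rglob("*") if p.is_file())

print(copied)
print(kept)
```

In this arrangement the tool writes to the working directory, and only the output directory is collected, so `scratch.tmp` never reaches the upload step.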

Actions #3

Updated by Peter Amstutz over 6 years ago

Tom Clegg wrote:

> I'm not keen on this feature. It seems to creep in an awkward direction:
>   1. output everything in this dir
>   2. output everything in this dir that matches this glob
>   3. output everything in this dir that matches any of these globs
>   4. output everything in this dir that matches any of these globs, but not this glob
>   5. output everything in this dir that matches any of these globs, and apply this path translation

Nobody is asking for 4 and 5.

> Ideally this can all be done inside the container instead, using the shell or some other programming language of your choice. You could also add a subsequent step to the workflow that rearranges/extracts the desired files (a useful pattern for other situations too, like improving container reuse in downstream work that doesn't need to see the entire output).

The client for this feature is arvados-cwl-runner (because output globs are defined in the tool wrapper), not individual tools. The specific problem it is intended to solve is programs that produce extra output that we don't want to upload, but that gets uploaded anyway. The obvious solution is to have some way to specify what should and should not be uploaded.
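For illustration, a sketch of how a client such as arvados-cwl-runner might gather the glob patterns from a CWL tool wrapper into a single list for the proposed "output_glob" field. The function `collect_output_globs` is hypothetical and is not taken from arvados-cwl-runner; it assumes the tool document is already parsed into a dict, and it ignores the case where a glob is a CWL expression that must be evaluated first.

```python
def collect_output_globs(tool):
    """Gather outputBinding.glob patterns from a parsed CWL CommandLineTool.

    Handles `outputs` given as a list or as a map, and `glob` given as a
    single string or as a list of strings. Duplicates are dropped while
    preserving first-seen order.
    """
    outputs = tool.get("outputs", [])
    if isinstance(outputs, dict):
        outputs = list(outputs.values())

    globs = []
    for output in outputs:
        if not isinstance(output, dict):
            continue  # shorthand entries such as a bare type string
        binding = output.get("outputBinding") or {}
        glob = binding.get("glob")
        if glob is None:
            continue
        patterns = [glob] if isinstance(glob, str) else list(glob)
        for pattern in patterns:
            if pattern not in globs:
                globs.append(pattern)
    return globs


# A made-up tool wrapper, written as the dict a YAML loader would produce.
tool = {
    "class": "CommandLineTool",
    "outputs": [
        {"id": "aligned", "type": "File", "outputBinding": {"glob": "*.bam"}},
        {"id": "reports", "type": "File[]", "outputBinding": {"glob": ["*.html", "*.bam"]}},
        {"id": "log", "type": "stdout"},
    ],
}

output_glob = collect_output_globs(tool)
print(output_glob)
```

The resulting list is what the client would pass along with the container request, so only files the tool wrapper declares as outputs would be uploaded.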

Actions #4

Updated by Tom Clegg over 6 years ago

I agree arvados-cwl-runner's needs are important but I would still prefer to find a way to accommodate them without feeding the "container request includes a mini-language for munging inputs and outputs in various ways" pattern.

Actions #5

Updated by Lucas Di Pentima over 1 year ago

  • Release set to 60