Feature #12430
open
Crunch2 limit output collection to glob patterns
Added by Peter Amstutz about 7 years ago.
Updated almost 2 years ago.
Release relationship:
Auto
Description
The current behavior for crunch-run is to upload all files in the output directory. This sometimes results in temporary files being uploaded that are not intended to be part of the output. Propose adding an "output_glob" field which is an array of filenames or glob patterns specifying which files and directories should be uploaded.
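To illustrate the proposal, here is a minimal sketch of how an "output_glob" list might select files for upload. It is not crunch-run's implementation. The function name `select_outputs` is hypothetical, and the matching rules are assumptions: `*` is allowed to match across `/`, a pattern that matches a directory selects everything under it, and an empty list keeps the current upload-everything behavior.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def select_outputs(paths, output_glob):
    """Return the paths (relative to the output dir) matched by any pattern.

    Hypothetical helper: the matching semantics here are assumptions,
    not the behavior of crunch-run.
    """
    if not output_glob:
        # Assumption: no patterns means "upload everything", as today.
        return list(paths)

    selected = []
    for path in paths:
        p = PurePosixPath(path)
        # Test the path itself and each parent directory, so that a
        # pattern naming a directory selects all files beneath it.
        candidates = [str(p)] + [
            str(parent) for parent in p.parents if str(parent) != "."
        ]
        if any(fnmatchcase(c, pat) for c in candidates for pat in output_glob):
            selected.append(path)
    return selected


if __name__ == "__main__":
    files = ["out.bam", "out.bam.bai", "tmp/scratch.dat", "results/a.txt"]
    print(select_outputs(files, ["*.bam", "results"]))
```

With these assumed rules, the patterns `["*.bam", "results"]` would keep `out.bam` and `results/a.txt` and skip the index and scratch files.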
I'm not keen on this feature. It seems to creep in an awkward direction:
- output everything in this dir
- output everything in this dir that matches this glob
- output everything in this dir that matches any of these globs
- output everything in this dir that matches any of these globs, but not this glob
- output everything in this dir that matches any of these globs, and apply this path translation
Ideally this can all be done inside the container instead, using the shell or some other programming language of your choice. You could also add a subsequent step to the workflow that rearranges/extracts the desired files (a useful pattern for other situations too, like improving container reuse in downstream work that doesn't need to see the entire output).
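As a rough sketch of the in-container alternative, the last command in the container could prune the output directory before crunch-run collects it. The function name `prune_outputs` and the keep-pattern semantics are assumptions made for illustration. Patterns are matched against each file's path relative to the output directory, and `*` may match across `/`.

```python
import os
from fnmatch import fnmatchcase


def prune_outputs(output_dir, keep_patterns):
    """Delete files under output_dir that match none of keep_patterns.

    Hypothetical cleanup step meant to run inside the container.
    Returns the sorted relative paths of the files it removed.
    """
    removed = []
    # Walk bottom-up so directories emptied by pruning can be removed too.
    for root, _dirs, files in os.walk(output_dir, topdown=False):
        for name in files:
            full = os.path.join(root, name)
            rel = os.path.relpath(full, output_dir).replace(os.sep, "/")
            if not any(fnmatchcase(rel, pat) for pat in keep_patterns):
                os.remove(full)
                removed.append(rel)
        # Drop subdirectories left empty, but never the output dir itself.
        if root != output_dir and not os.listdir(root):
            os.rmdir(root)
    return sorted(removed)
```

A tool wrapper could call something like `prune_outputs(output_dir, ["*.bam"])` as its final step, which keeps the selection logic out of the container request at the cost of modifying every tool that needs it.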
Tom Clegg wrote:
I'm not keen on this feature. It seems to creep in an awkward direction:
- output everything in this dir
- output everything in this dir that matches this glob
- output everything in this dir that matches any of these globs
- output everything in this dir that matches any of these globs, but not this glob
- output everything in this dir that matches any of these globs, and apply this path translation
Nobody is asking for 4 and 5.
Ideally this can all be done inside the container instead, using the shell or some other programming language of your choice. You could also add a subsequent step to the workflow that rearranges/extracts the desired files (a useful pattern for other situations too, like improving container reuse in downstream work that doesn't need to see the entire output).
The client for this feature is arvados-cwl-runner (because output globs are defined in the tool wrapper), not individual tools. The specific problem it is intended to solve is programs that produce extra output that we don't want to upload, but that gets uploaded anyway. The obvious solution is to have some way to specify what should and should not be uploaded.
I agree arvados-cwl-runner's needs are important but I would still prefer to find a way to accommodate them without feeding the "container request includes a mini-language for munging inputs and outputs in various ways" pattern.