Project

General

Profile

Actions

Story #11789

closed

[arv-put] add a --exclude flag to allow specific files/directories to be skipped

Added by Tom Morris over 7 years ago. Updated over 7 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
05/30/2017
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
1.0

Description

arv-put should support a --exclude flag which accepts a glob pattern to exclude from the upload. The desired functionality is to mimic that supported by rsync --exclude. In the case where multiple flags are specified, all the patterns are excluded.

Some examples:
  • exclude a specific directory --exclude Images
  • exclude a specific subdirectory --exclude 'foo/bar' (all paths are relative)
  • exclude all JPGs --exclude '*.jpg'

Subtasks 1 (0 open1 closed)

Task #11825: Review 11789-arvput-exclude-flagResolvedLucas Di Pentima05/30/2017

Actions
Actions #1

Updated by Tom Morris over 7 years ago

  • Description updated (diff)
  • Story points set to 1.0
Actions #2

Updated by Lucas Di Pentima over 7 years ago

  • Assigned To set to Lucas Di Pentima
Actions #3

Updated by Lucas Di Pentima over 7 years ago

  • Status changed from New to In Progress
Actions #4

Updated by Lucas Di Pentima over 7 years ago

The cache filename is calculated from hashing some data, including those file/dir paths that the user passed to the command. I'm filtering those matching paths when doing the first byte count and final upload, and was writing the code to also filter them out when setting up the cache file name, but now I'm in doubt if that's really what we want.

I'm thinking the use case when a big upload is executed and cancelled in the middle because the user forgot to exclude some unneeded files: If we use the --exclude flag on the cache file name calculation, it will start from scratch the upload when the pattern matches some of the provided paths.

Actions #5

Updated by Lucas Di Pentima over 7 years ago

Implementation at 6c1f8e1f0 - Branch 11789-arvput-exclude-flag
Test run: https://ci.curoverse.com/job/developer-run-tests/330/ (no new tests yet)

Before I finish writing the tests I would like to confirm that the behavior is what is expected. I stumbled an issue with how fnmatch works, as it's for only matching file names, and not file paths.

If I use fnmatch.fnmatch() with a pattern like subdir/*.jpg against a file path, it will match subdir/file.jpg and also subdir/subsubdir/anotherfile.jpg and I believe this is not what we want.

So I'm using fnmatch when a filenames type pattern is passed, and for pathnames type patterns I'm simulating what glob.glob() does, but without scanning the filesystem, so a pattern like the one described before will work correctly, only matching *.jpg files/dirs inside subdir/.

If you want to exclude *.jpg files pseudo-recursively, you can use a pattern like: subdir/*/*.jpg to control the depth.

Actions #6

Updated by Peter Amstutz over 7 years ago

  • Style suggestion, bool([]) is False, so instead of this:
path_patterns = []
exclude_paths = path_patterns if len(path_patterns) > 0 else None
if exclude_paths is not None:
    ...

You can just write this:

path_patterns = []
if path_patterns:
    ...
  • I don't like that the "exclude" filtering is implemented twice: once in expected_bytes_for and again in ArvPutUploadJob.start()

Could we move the for path in self.paths: block in ArvPutUploadJob.start() into its own method which does the work of collecting the list of files to upload, applies excludes, and computes the expected bytes? Or alternately, can the exclude logic can be refactored into a function?

  • ArvPutUploadJob takes exclude={'paths': exclude_paths, 'names': exclude_names} and then immediately unpacks it to separate variables exclude_paths and exclude_names. Is there a reason not to pass them through as separate parameters?
  • The help string needs to explain that it uses glob patterns, that filenames and paths have different behavior.
  • I think there might be a bug in pathname_match(): if I have a relative pathname of "bar" and a pattern of "bar/" it looks like it will split the pattern into ["bar", ""] which will fail to match against ["bar"].
Actions #7

Updated by Peter Amstutz over 7 years ago

One more thought, related to my comment about pathname_match(), I think you need to remove both '' and '.' from the pattern before checking the path, for example --exclude "./*.jpg" to exclude only the jpgs in the current directory but not all jpgs.

Actions #8

Updated by Lucas Di Pentima over 7 years ago

  • Target version changed from 2017-06-21 sprint to 2017-07-05 sprint
Actions #9

Updated by Lucas Di Pentima over 7 years ago

Updates at 441ef97e93a951b349356df96d8a6ef604c6cab7
Test run: https://ci.curoverse.com/job/developer-run-tests/361/

Addressed previous suggestions and added some tests.

One corner case need solution: When the user wants to pass a pathname pattern to exclude all files from the immediate subdirectory, currently it isn't possible:

  • dir1/
    • file1.txt
    • file2.txt
    • subdir1/
      • file3.txt

If the user wants to exclude all .txt files inside dir1/, should the exclude parameter be used like this?

arv-put --exclude './*.txt' dir1

Actions #10

Updated by Peter Amstutz over 7 years ago

Lucas Di Pentima wrote:

Updates at 441ef97e93a951b349356df96d8a6ef604c6cab7
Test run: https://ci.curoverse.com/job/developer-run-tests/361/

Addressed previous suggestions and added some tests.

One corner case need solution: When the user wants to pass a pathname pattern to exclude all files from the immediate subdirectory, currently it isn't possible:

  • dir1/
    • file1.txt
    • file2.txt
    • subdir1/
      • file3.txt

If the user wants to exclude all .txt files inside dir1/, should the exclude parameter be used like this?

arv-put --exclude './*.txt' dir1

Yes. Maybe this is as simple as excluding '.'?

pat = [x for x in pattern.split(os.sep) if x != '' and x != '.']

Then it'll turn ['.', '*.txt'] into ['*.txt']

This is just style, but this list/filter/lambda code gets a little bit hairy:

                        dirs[:] = list(filter(
                            lambda d: not any(
                                [pathname_match(os.path.join(root_relpath, d),
                                                pat)
                                 for pat in self.exclude_paths]),
                            dirs))

I think this list comprehension does the same thing?

                        dirs[:] = [d for d in dirs if not any(
                                     pathname_match(os.path.join(root_relpath, d), pat)
                                       for pat in self.exclude_paths)]

Rest LGTM.

Actions #11

Updated by Peter Amstutz over 7 years ago

Wouldn't that be

arv-put --exclude './*.txt' .

Otherwise it would be

arv-put --exclude 'dir1/*.txt' dir1

Or does "dir1" not get added to the manifest? (Seems like it should...)

Also needs a test for --exclude './*.txt'

Actions #12

Updated by Lucas Di Pentima over 7 years ago

Updates at d53ae1523
Test run: https://ci.curoverse.com/job/developer-run-tests/371/

Cleaned up all filter(lambda x: ...) use, replacing them with list comprehensions (as they're being somewhat left behind on Python3).
Added support for pathname exclude patterns that refers to files/dirs below a given input dir argument: We're going to use './' for this, it will differentiate --exclude 'this.file' from --exclude './this.file' in the sense that the former matches all files named that way, and the latter only matches this.file files directly below the given input directories.
Updated tests & help messages.

Actions #13

Updated by Lucas Di Pentima over 7 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:8a4a2ee17807f35eaff8be99cca780ef1fd0ba64.

Actions

Also available in: Atom PDF