Feature #22076: keep-web can create a zipfile on the fly of a collection - Arvados

Actions

Copy link

Feature #22076

closed

keep-web can create a zipfile on the fly of a collection

Added by Peter Amstutz over 1 year ago. Updated 6 months ago.

Status:

Resolved

Priority:

Normal

Assigned To:

Tom Clegg

Category:

Keep

Target version:

Development 2025-05-14

Story points:

Release:

Arvados 3.2.0

Release relationship:

Auto

Description

Accessed by making a POST request to the root of the (WebDAV endpoint) for the collection on keep-web. Works by PDH or UUID.

Should work the same whether using the "inline" or "download only" endpoint. Must be the collection root, not a subdirectory.

Indicate that it should be a zipfile by providing the header Accept: application/zip (confirm that is the right MIME type).

The POST body is either empty (get the whole collection), or a JSON array of strings which are paths within the collection to be included in the zip.

These are files or directories, if a path goes to a directory then it gets the entire contents of that directory. If there is both a reference to a subdirectory and to a specific file within that subdirectory, it gets the whole subdirectory (the file reference is redundant).

The list of files/directories should be sorted so they always download in the same order.

The zip file should be streamed to avoid excessive copying or use of staging storage.

If any of the file paths requested do not exist in the collection, return an error.

Check with customer¶

We probably do not need to support Range requests, this will be confirmed with customer.

We probably don't need to compress the files, but need to check.

Consider including the ".arvados#collection" file in the zip with the collection metadata.

Answers from customer (Feb 3)¶

We probably do not need to support Range requests, this will be confirmed with customer.

Answer: no, the intended use case is people downloading using a browser, which typically don't implement resumable downloads, which would be the main reason to implement Range requests.

We probably don't need to compress the files, but need to check.

Convinced them that there are tradeoffs and isn't necessary for the initial implementation, enabling compression could be a future improvement.

Consider including the ".arvados#collection" file in the zip with the collection metadata.

Agreed that the usefulness of including metadata about the Arvados collection outweighs the risk of end user confusion.

I noted it probably should not be called exactly ".arvados#collection" since that is a special file that is used by some tools to detect directories backed by arv-mount. We should think about what to call it instead -- maybe "collection.json" ?

Subtasks 2 (0 open — 2 closed)

Related issues 1 (0 open — 1 closed)

Actions

Copy link

Updated by Stephen Smith over 1 year ago

Here is a potentially useful streaming zip library: https://github.com/scosman/zipstreamer

Actions

Copy link

Updated by Peter Amstutz over 1 year ago

In fact, the heavy lifting seems to be all in the "archive/zip" standard library module:

https://pkg.go.dev/archive/zip

The standard module even has feature that walks an entire "FS" object and packs all the files -- perfect for packing up a whole collection.

So zipstreamer might not actually add anything we need.

Actions

Copy link

Updated by Peter Amstutz over 1 year ago

Target version changed from Future to Development 2024-12-04

Actions

Copy link

Updated by Peter Amstutz over 1 year ago

Target version changed from Development 2024-12-04 to Development 2025-01-08

Actions

Copy link

Updated by Peter Amstutz over 1 year ago

Description updated (diff)

Actions

Copy link

Updated by Peter Amstutz over 1 year ago

Target version changed from Development 2025-01-08 to Development 2025-01-29

Actions

Copy link

Updated by Peter Amstutz about 1 year ago

Target version changed from Development 2025-01-29 to Development 2025-02-12

Actions

Copy link

Updated by Peter Amstutz about 1 year ago

Target version changed from Development 2025-02-12 to Development 2025-02-26

Actions

Copy link

#10

Updated by Peter Amstutz about 1 year ago

Description updated (diff)

Actions

Copy link

#11

Updated by Peter Amstutz about 1 year ago

Target version changed from Development 2025-02-26 to Development 2025-03-19

Actions

Copy link

#12

Updated by Peter Amstutz about 1 year ago

Target version changed from Development 2025-03-19 to Development 2025-04-02

Actions

Copy link

#13

Updated by Tom Clegg about 1 year ago

Target version changed from Development 2025-04-02 to Development 2025-03-19

Actions

Copy link

#14

Updated by Tom Clegg about 1 year ago

Assigned To set to Tom Clegg

Actions

Copy link

#15

Updated by Brett Smith about 1 year ago

Subtask #22642 added

Actions

Copy link

#16

Updated by Peter Amstutz about 1 year ago

Target version changed from Development 2025-03-19 to Development 2025-04-02

Actions

Copy link

#17

Updated by Tom Clegg 12 months ago

Status changed from New to In Progress

Actions

Copy link

#18

Updated by Peter Amstutz 12 months ago

Target version changed from Development 2025-04-02 to Development 2025-04-16

Actions

Copy link

#19

Updated by Tom Clegg 12 months ago

WIP: 22076-download-zip @ 34406748c8d9c0c42d2e9c1516fb6398c94cee85 -- developer-run-tests: #4728

Actions

Copy link

#20

Updated by Tom Clegg 11 months ago

22076-download-zip @ b070d4b9995b4d18ab2725898785b9f1aa65bcd5 -- developer-run-tests: #4743

All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
- ✅ Requesting the top level directory of a collection with an Accept: application/zip header produces a ZIP file with the contents of the collection
- ✅ Client can specify a subset of files and/or directories to include
  - ✅ POST with Content-Type: application/json -- body can contain a JSON array of file/directory names
  - ✨ POST with Content-Type: application/x-www-form-urlencoded -- body can contain URL-encoded file/directory names, like files=file1&files=file2 (OP didn't specify this but it seems like an easy/helpful thing to support)
  - ✨ GET -- query string can contain file/directory names, like ?files=file1&files=file2 (OP only specified POST, but GET seems to make more sense in the single-file / entire-collection cases where a request body isn't needed)
- ✨ Response header suggests a filename
  - "{collection name}.zip" if downloading the entire collection
  - "{collection name} - {file name}.zip" if downloading a single file specified in the request
  - "{collection name} - {count} files.zip" if downloading a directory or multiple files specified in the request
Anything not implemented (discovered or discussed during work) has a follow-up story.
- n/a
Code is tested and passing, both automated and manual, what manual testing was done is described.
- ✅ automated tests added
New or changed UX/UX and has gotten feedback from stakeholders.
- See below
Documentation has been updated.
- ✅ Added section to WebDAV API page. I would welcome wordsmithing.
Behaves appropriately at the intended scale (describe intended scale).
- I haven't tested download speed, but given that we are not doing any compression, I expect Zip download speed should be about the same as regular download speed, but with less http/tcp overhead per file.
Considered backwards and forwards compatibility issues between client and server.
- n/a
Follows our coding standards and GUI style guidelines.
- ✅

UI/UX changes¶

HTTP interface¶

If a request has the Accept: application/zip header, we accept GET or POST as a download request (accepting POST lets us accept the list of files in the request body, avoiding potential problems with request URI/header size limits).

If passing the list of files in the URI query, or a form-encoded body, the parameter name is "files".

Query/form parameter include_collection_metadata=1 (any non-empty string) causes a collection.json file to be added to the zip archive.

The collection.json file looks like this

{"name":"keep-web zip test collection",
 "portable_data_hash":"6acf043b102afcf04e3be2443e7ea2ba+223",
 "properties":{"sailboat":"⛵"},
 "uuid":"zzzzz-4zz18-uhaeag75fpn1ou2"}

When downloading by PDH, collection.json contains only "portable_data_hash".

"File download" logs¶

As is, when downloading a zip file, we generate a "File download" log entry with

collection_file_path=path/to/file.txt if the zip file contains exactly one file
collection_file_path="" otherwise

We could do more here, e.g., log how many files were downloaded. Do we want to? How should we spell that log entry?

Actions

Copy link

#21

Updated by Peter Amstutz 11 months ago

Target version changed from Development 2025-04-16 to Development 2025-04-30

Actions

Copy link

#22

Updated by Peter Amstutz 11 months ago

Subtask #22810 added

Actions

Copy link

#23

Updated by Peter Amstutz 11 months ago

Given that "files=" might be repeated a bunch of times, e.g. "files=blah1&files=blah2&files=blah3", would there be some benefit to using the shortest possible version "f=" e.g. "f=blah1&f=blah2&f=blah3"

Whet downloading by uuid, is there a reason for collection.json not to include description, created_at, modified_at, etc?

Could we include user info? modified_by_user_uuid isn't great because the user uuid is opaque, but we could include a partial user record:

{
 "name":"keep-web zip test collection",
 "portable_data_hash":"6acf043b102afcf04e3be2443e7ea2ba+223",
 "properties":{"sailboat":"⛵"},
 "uuid":"zzzzz-4zz18-uhaeag75fpn1ou2" 
  "modified_by_user": {
    "uuid": "zzzzz-tpzed-blahblah",
    "full_name": "Test user",
    "username": "testuser",
    "email": "testuser@example.com" 
  }
}

I'm looking at the wikipedia page for the zipfile format ( https://en.wikipedia.org/wiki/ZIP_(file_format) ) and it indicates that you can include a comment in the zip file. Maybe we could include the collection.json contents in the comment? Or some other information about the origin of the zip file?

If I select a directory to download, what filename should I get?

TestZip_SelectDirectory_JSON downloads a directory but doesn't check the disposition. TestZip_SelectFile only downloads a file.

In this example, would selecting the directory yield a zipfile called "keep-web zip test collection - dir1" or just "keep-web zip test collection" ?

We could do more here, e.g., log how many files were downloaded. Do we want to? How should we spell that log entry?

So these are audit logs and we don't want to make them too confusing to interpret. So I think your proposed behavior is fine and maybe add an extra field that indicates it was a multiple-file zip download (I don't think "downloaded 5 files" adds much information, so this would be either the request for files as provided, or just a boolean indicating that multiple files were packed into a zip).

Actions

Copy link

#24

Updated by Tom Clegg 11 months ago

Peter Amstutz wrote in #note-23:

Given that "files=" might be repeated a bunch of times, e.g. "files=blah1&files=blah2&files=blah3", would there be some benefit to using the shortest possible version "f=" e.g. "f=blah1&f=blah2&f=blah3"

IDK, I'm more inclined to make it more obvious than to save a few bytes. For me it's a contest between "file=blah1&file=blah2" and "files=blah1&files=blah2". Or "paths=blah1&..."? I don't think it matters a lot, if the documentation is clear.

Whet downloading by uuid, is there a reason for collection.json not to include description, created_at, modified_at, etc?

No, those should be easy to add.

Could we include user info? modified_by_user_uuid isn't great because the user uuid is opaque, but we could include a partial user record:

Sure, will do.

I'm looking at the wikipedia page for the zipfile format ( https://en.wikipedia.org/wiki/ZIP_(file_format) ) and it indicates that you can include a comment in the zip file. Maybe we could include the collection.json contents in the comment? Or some other information about the origin of the zip file?

Avoiding collisions with actual files is appealing, but I haven't found any conventions for what goes in the comment field -- only a few hints that some tools don't have full support for them. I suspect putting JSON there would make it inconvenient to extract.

If I select a directory to download, what filename should I get?

"{collection name} - {count} files.zip"

TestZip_SelectDirectory_JSON downloads a directory but doesn't check the disposition. TestZip_SelectFile only downloads a file.

In this example, would selecting the directory yield a zipfile called "keep-web zip test collection - dir1" or just "keep-web zip test collection" ?

Added checks to the single-directory tests. attachment; filename="keep-web zip test collection - 2 files"

We could do more here, e.g., log how many files were downloaded. Do we want to? How should we spell that log entry?

So these are audit logs and we don't want to make them too confusing to interpret. So I think your proposed behavior is fine and maybe add an extra field that indicates it was a multiple-file zip download (I don't think "downloaded 5 files" adds much information, so this would be either the request for files as provided, or just a boolean indicating that multiple files were packed into a zip).

We could log multiple_files=true... but we could just as easily log file_count=N, which is strictly more information, and still just as easy to understand, right?

Actions

Copy link

#25

Updated by Peter Amstutz 11 months ago

Tom Clegg wrote in #note-24:

Peter Amstutz wrote in #note-23:

Given that "files=" might be repeated a bunch of times, e.g. "files=blah1&files=blah2&files=blah3", would there be some benefit to using the shortest possible version "f=" e.g. "f=blah1&f=blah2&f=blah3"

IDK, I'm more inclined to make it more obvious than to save a few bytes. For me it's a contest between "file=blah1&file=blah2" and "files=blah1&files=blah2". Or "paths=blah1&..."? I don't think it matters a lot, if the documentation is clear.

Ok, sure, let's stick with "files=" then.

Avoiding collisions with actual files is appealing, but I haven't found any conventions for what goes in the comment field -- only a few hints that some tools don't have full support for them. I suspect putting JSON there would make it inconvenient to extract.

I had another thought, maybe the comment could be "Downloaded from https://..." and then the keep-web URL. That would still provide a little bit of basic information about where the archive came from.

If I select a directory to download, what filename should I get?

"{collection name} - {count} files.zip"

TestZip_SelectDirectory_JSON downloads a directory but doesn't check the disposition. TestZip_SelectFile only downloads a file.

In this example, would selecting the directory yield a zipfile called "keep-web zip test collection - dir1" or just "keep-web zip test collection" ?

Added checks to the single-directory tests. attachment; filename="keep-web zip test collection - 2 files"

So, since your answer is "none of those" that is a little surprising. Why doesn't it say "dir1" ?

We could log multiple_files=true... but we could just as easily log file_count=N, which is strictly more information, and still just as easy to understand, right?

Sounds good to me.

Actions

Copy link

#26

Updated by Tom Clegg 11 months ago

Peter Amstutz wrote in #note-25:

Avoiding collisions with actual files is appealing, but I haven't found any conventions for what goes in the comment field -- only a few hints that some tools don't have full support for them. I suspect putting JSON there would make it inconvenient to extract.

I had another thought, maybe the comment could be "Downloaded from https://..." and then the keep-web URL. That would still provide a little bit of basic information about where the archive came from.

Hm. So "Downloaded from https://download.zzzzz.arvados.api.com/by_id/zzzzz-tpzed-aaaaaaaaaaaaaaa/"? Leaving out the ?files=...&files=... part?

So, since your answer is "none of those" that is a little surprising. Why doesn't it say "dir1" ?

Last time we talked about it, it seemed worthwhile to make a special case for when the request specifies a single file by name, but not for the various other single-file/single-directory cases like

specify a single dir that happens to contain a single file
specify a single dir that contains many files
specify a single dir, as well as a file which happens to be the only file inside the dir
specify multiple dirs, of which one contains a single file, and the rest are empty
don't specify anything, but the collection happens to only contain a single file

There are lots of possibilities, I think we just decided to pick something simple.

Actions

Copy link

#27

Updated by Peter Amstutz 11 months ago

Tom Clegg wrote in #note-26:

Peter Amstutz wrote in #note-25:

Avoiding collisions with actual files is appealing, but I haven't found any conventions for what goes in the comment field -- only a few hints that some tools don't have full support for them. I suspect putting JSON there would make it inconvenient to extract.

I had another thought, maybe the comment could be "Downloaded from https://..." and then the keep-web URL. That would still provide a little bit of basic information about where the archive came from.

Hm. So "Downloaded from https://download.zzzzz.arvados.api.com/by_id/zzzzz-tpzed-aaaaaaaaaaaaaaa/"? Leaving out the ?files=...&files=... part?

Yes. Assuming that setting the comment is easy and the URL is something you already have on hand, so the implementation is easy.

So, since your answer is "none of those" that is a little surprising. Why doesn't it say "dir1" ?

Last time we talked about it, it seemed worthwhile to make a special case for when the request specifies a single file by name, but not for the various other single-file/single-directory cases like

specify a single dir that happens to contain a single file

specify a single dir that contains many files

specify a single dir, as well as a file which happens to be the only file inside the dir

specify multiple dirs, of which one contains a single file, and the rest are empty

don't specify anything, but the collection happens to only contain a single file

There are lots of possibilities, I think we just decided to pick something simple.

It makes sense that trying to choose a particular name to elevate from the selected set of files is pretty hard.

On the other hand, "collection name - 3 files.zip" has the problem that if I download 3 files, then deselect those and download 3 different files, I get a second zip file also called "collection name - 3 files.zip" (and then the browser renames it to "collection name - 3 files (1).zip") and that seems unfortunate.

Here's an idea. How about taking the list of files included in the zip and hashing the list, and including a portion of the hash in the filename? Then you would have something like "collection name - 3 files (abcd1234).zip" and "collection name - 3 files (56678def).zip" which is at least no worse than what you were going to end up with from the browser behavior.

The nice thing is that if you have two zipfiles with the same hash, they are likely to have the same contents, and if you select the same set of files later, you will get a download with the same name (although the files contents might be different -- we could include the collection PDH in the calculation in that case).

Actions

Copy link

#28

Updated by Peter Amstutz 11 months ago

Target version changed from Development 2025-04-30 to Development 2025-05-14

Actions

Copy link

#29

Updated by Peter Amstutz 11 months ago

Blocks Feature #22831: Support for downloading multiple files as a zip file added

Actions

Copy link

#30

Updated by Tom Clegg 11 months ago

~~22076-download-zip @ 13df9a0899d977f991da2e7a616b3c1b7a743a72 -~~ developer-run-tests: #4759 -

merged main → 22076-download-zip @ 229fba59c08cde50f3c5aa0b587fbd7fb20f8ce1 -- developer-run-tests: #4761

All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
- ✅ Requesting the top level directory of a collection with an Accept: application/zip header produces a Zip file with the contents of the collection
- ✅ Client can specify a subset of files and/or directories to include
  - ✅ POST with Content-Type: application/json -- body can contain a JSON array of file/directory names
  - ✨ POST with Content-Type: application/x-www-form-urlencoded -- body can contain URL-encoded file/directory names, like files=file1&files=file2 (OP didn't specify this but it seems like an easy/helpful thing to support)
  - ✨ GET -- query string can contain file/directory names, like ?files=file1&files=file2 (OP only specified POST, but GET seems to make more sense in the single-file / entire-collection cases where a request body isn't needed)
- ✨ Response header suggests a filename
  - "{collection name}.zip" if downloading the entire collection
  - "{collection name} - {file name}.zip" if downloading a single file specified in the request
  - "{collection name} - {count} files ({hash}).zip" if downloading a directory or multiple files specified in the request (hash is 8 hex digits and incorporates PDH and list of filenames, see #note-27)
- ✨ If an include_collection_metadata parameter is given, the archive will also include a collection.json file with metadata about collection and the last user who modified it (see added docs and #note-23)
- ✨ All "file download" logs now have a file_count field: 1 for regular GET reqs, N >= 0 for Zip archives. The log entry for a Zip download containing multiple files has an empty collection_file_path field. The entry for a single-file Zip download that file's path as collection_file_path, just as if it had been downloaded the usual non-Zip way.
- ✨ Zip archive comment is "Downloaded from https://..." (see docs)
Anything not implemented (discovered or discussed during work) has a follow-up story.
- n/a
Code is tested and passing, both automated and manual, what manual testing was done is described.
- ✅ automated tests added
New or changed UX/UX and has gotten feedback from stakeholders.
- ✅ See above
Documentation has been updated.
- ✅ Added section to WebDAV API page. I would welcome wordsmithing.
Behaves appropriately at the intended scale (describe intended scale).
- I haven't tested download speed, but given that we are not doing any compression, I expect Zip download speed should be about the same as regular download speed, but with less http/tcp overhead per file.
Considered backwards and forwards compatibility issues between client and server.
- n/a
Follows our coding standards and GUI style guidelines.
- ✅

Actions

Copy link

#31

Updated by Brett Smith 11 months ago

Interface things¶

I have a couple of interface comments. If you want to fob these off with "we already did interface review, too late," I think that's fair and you're allowed to. But I'll at least note them.

The first is one is a new comment: I wonder if it would help future-proof the POST application/json by accepting an object in the form of {"files": ["filename", ...]} rather than just an array. That way it could easily grow to accept additional parameters in the future. We could potentially even accept include_collection_metadata this way now.

The second one is admittedly a rehash, but: I did try to follow the discussion earlier, and when we talked about including a hash in the filename I thought it was just the PDH, I didn't read closely enough. I think including an semi-arbitrary hash that doesn't obviously tie to any user-visible data is worse usability than just letting the browser add an ordinal suffix. At least with ordinals, if you're looking at a listing from ls or Explorer it's obvious which one you downloaded most recently and therefore probably has the data you're looking for. With an arbitrary hash, you're now forced to cross-reference against file create time.

HTTP compliance/improvements¶

I asked myself "Do we need to worry about what happens if the collection filename includes FAT reserved characters?" Based on my reading of Content-Disposition on MDN, it sounds like the answer is no, the browser should take care of that, but then I learned that it would probably be nicer to provide filename* rather than filename because it provides more consistent parsing of percent sequences. I don't think this is a dealbreaker but if you're up for it I think it would be a nice improvement. It seems straightforward to implement at least.

I suspect you already know that the Accept header can be way more involved than a single MIME type, but we're just doing naive string comparisons here. Same deal here: not a dealbreaker (I suspect we already have this problem elsewhere in the code) but it would be a nice improvement to handle it better. I understand it's less straightforward though.

Readability¶

So like… I feel like something about Go means we often end up writing function definitions that literally span hundreds of lines, like serveZip here. Personally I find this hard to read: when I'm looking at later parts of the function, it might rely on variables that were set pages ago. And coversely, when a variable is set, it's hard to know when I can let it swap out of my brain. Like I agree that the code is straightforward, I don't think it's hard to read because it's complicated, it's just hard to keep track of all the state when there's potentially so much of it.

It's particularly annoying when part of the reason it's so long is because there are several function definitions embedded within it. Breaking those out into named functions seems like an easy starting point for improvement. I get that they're technically all closures now, but to my eyes these all have limited scope where it would be easy to pass in what's necessary:

iswanted could be a function that takes (collfs, reqpaths), builds wanted, then returns the iswanted closure and a possible error for the 404 case.
A separate function could build filepaths by calling WalkDir and just return the final result.
The last function could take (r, w), build its own ZipWriter, then return wrote bool, err error.

I don't think this is the only way to address the concern, I don't want to dictate tho solution, this just seems like low-hanging fruit for a readability win to me.

Nits¶

When we check logs with regexps like .*foo=bar.* we are not actually testing "make sure we logged field foo with value bar," we are testing "make sure we logged a field whose name ends with foo and whose value starts with bar." Having firm anchors at the start and the end helps avoid this, like the cases where we check a fully-quoted string, that's good. I get there are some cases where this is impractically difficult, but if we could add anchors for the cases where it's reasonable, that would be a nice improvement. (If the anchors are straightforward but verbose, we could have partial strings in expectLogsMatch and then build the full regexp inside testZip.)

In serveZip, the result of err = json.NewEncoder(zipf).Encode(m) seems unused. If I'm right about that it would be a nice readability touch-up to explicitly discard the result.

In testZip, do we need to do any resource cleanup on zipr and collection.json?

Thanks.

Actions

Copy link

#32

Updated by Peter Amstutz 11 months ago

From standup discussion May 5:

Changing the JSON API to {"files": [...]} was agreed upon.

Instead of injecting hashes into the download filenames, we decided a better solution is to give the user a chance to choose the filename. This means the keep-web API needs an additional "preferred filename" parameter that takes precedence over the default naming logic (which I suppose shall just stay as originally implemented, without adding hashed values into the name).

Some ideas for the new parameter would be preferred_filename or download_filename or just filename.

Actions

Copy link

#33

Updated by Tom Clegg 11 months ago

Brett Smith wrote in #note-31:

accepting an object in the form of {"files": ["filename", ...]} rather than just an array

(see #note-32) Yes. Changed this & updated docs.

We could potentially even accept include_collection_metadata this way now.

Done.

including an semi-arbitrary hash that doesn't obviously tie to any user-visible data is worse usability than just letting the browser add an ordinal suffix.

(see #note-32)

Reverted this to just "Collection name - 3 files.zip".

✨ Added download_filename parameter, updated docs.

nicer to provide filename* rather than @filename@rainbearer81

mime.FormatMediaType() does this for us. Added a comment mentioning that. Also made sure to check this in one of the tests using download_filename.

Accept header can be way more involved than a single MIME type

Inserted mime.ParseMediaType() here so things like "application/zip; q=0.9" are accepted. We're still requiring that exactly one media type is accepted, though.

We could go further here and interpret things like

Accept: text/html, application/zip; q=0.9 → html file listing
Accept: text/html; q=0.9, application/zip → download Zip
Accept: text/html, application/zip → ?

But this feels like it will quickly turn into "do content negotiation properly in keep-web, at least for generated content", which I'm not opposed to, but probably shouldn't block the basic "download zip" feature which is more or less blocking the Workbench development.

Thoughts on whether we'll ever do this if I add a separate ticket?

function definitions that literally span hundreds of lines

Fair criticism. Moved the closures out. Also added some comments. Now it's only™ 156 lines.

When we check logs with regexps like .*foo=bar.*

Turns out this was easy to fix because

the first key is always timestamp, so other log field names always follow a space
unquoted values are always followed by a space or newline
quoted values always end in " so they were already safe

In serveZip, the result of err = json.NewEncoder(zipf).Encode(m) seems unused. If I'm right about that it would be a nice readability touch-up to explicitly discard the result.

I added the error check. Initially I was thinking "sending a truncated/empty collection.json shouldn't prevent us from sending the rest of the zip download, especially since we've already committed to 200 OK". But now I'm thinking

Most likely it's a write error (client hung up) in which case it doesn't matter whether we continue to send the actual Zip archive content
I can't even think of a way to get a JSON encoding error here given that we decoded this very information from JSON
If somehow it is an encoding error, there is a workaround (request again without the metadata file) and a more obvious failure (empty download with no Zip signature) is better than an error you might not notice (collection.json is empty or truncated but you don't notice until it's too late for workarounds).

Thoughts?

In testZip, do we need to do any resource cleanup on zipr and collection.json?

No cleanup needed on zipr. The collection.json filehandle in principle should be closed, fixed that. Also changed the logic a little so the test fails if collection.json is present but expectMetadata was nil.

22076-download-zip @ 3197a6f011a3a1a606d168f15fed43f23a14903a -- developer-run-tests: #4766

00:17:23.037 time="2025-05-06T15:52:13Z" level=error msg="postgresql connect succeeded but ping failed" error="pq: remaining connection slots are reserved for non-replication superuser connections"

22076-download-zip @ 3197a6f011a3a1a606d168f15fed43f23a14903a -- developer-run-tests: #4767

Added "release database connection" to test teardown.

22076-download-zip @ e8b1df3defdf1b4ad84faafc7ae7baba0d1f5e41 -- developer-run-tests: #4768

Actions

Copy link

#34

Updated by Brett Smith 11 months ago

Tom Clegg wrote in #note-33:

✨ Added download_filename parameter, updated docs.

One general usability thought I have about this is, in Workbench if the user writes a simple string like projectdata, we should append .zip for them. In general I feel like keep-web is probably the best place to do that, just so all clients can benefit from it, but I don't feel super-strongly about that. If you think this would better be handled in clients like Workbench, I won't argue.

(The super-fancy implementation would be: guess the MIME type of the user's provided filename, and if it's not application/zip, append .zip to the given filename.)

(You might argue that users should be allowed to use bad filenames if they want, and I don't disagree exactly, but I'd argue users who want that should override Content-Disposition and set their own output filename with curl or requests or whatever client tool they're using.)

Thoughts on whether we'll ever do [better Accept parsing] if I add a separate ticket?

I try not to predict the future for stuff like this. I think I'm a life member of Team File A Ticket, but reasonable minds may differ and I wouldn't hold up the merge on it.

Thoughts [about the collection.json error check]?

I see both sides of it, I'm fine with this version.

No cleanup needed on zipr. The collection.json filehandle in principle should be closed, fixed that. Also changed the logic a little so the test fails if collection.json is present but expectMetadata was nil.

One readability suggestion here. In c.Check(err == nil, Equals, opts.expectMetadata != nil), suggest adding a comment argument, since without it failures like "expected false, got true" basically require consulting the source.

Thanks.

Actions

Copy link

#35

Updated by Tom Clegg 11 months ago

Brett Smith wrote in #note-34:

One general usability thought I have about this is, in Workbench if the user writes a simple string like projectdata, we should append .zip for them. In general I feel like keep-web is probably the best place to do that, just so all clients can benefit from it, but I don't feel super-strongly about that. If you think this would better be handled in clients like Workbench, I won't argue.

Added this. Sure, Workbench could do it, but I can't think of a reasonable situation where this would be unwanted.

(The super-fancy implementation would be: guess the MIME type of the user's provided filename, and if it's not application/zip, append .zip to the given filename.)

Didn't do the fancy version. The alternate extension would have to be in both the client and server mimetypes db in order for this to help anyone. Forcing literal .zip (well, case-insensitive) seems safer / more predictable.

One readability suggestion here. In c.Check(err == nil, Equals, opts.expectMetadata != nil), suggest adding a comment argument, since without it failures like "expected false, got true" basically require consulting the source.

Added comment. FWIW gocheck does already show the Check line -- but not the Open() call above it. With the update, failure looks like this

zip_test.go:494:
    c.Check(err == nil, Equals, opts.expectMetadata != nil,
        Commentf("collection.json file existence (%v) did not match expectation (%v)", err == nil, opts.expectMetadata != nil))
... obtained bool = true
... expected bool = false
... collection.json file existence (true) did not match expectation (false)

22076-download-zip @ ededfee0f7446a2bbe6cb4f83e36b955e12552a9 -- developer-run-tests: #4769

Actions

Copy link

#36