Feature #22868
Collection and project import/export tool (Closed)
Description
Export collection
- Fetch collection record and save json to disk
- Fetch the keep blocks and save them to disk (matching the same keep block directory layout used by disk cache, keepstore, etc)
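The two export steps above can be sketched with plain filesystem code. All names here are illustrative: the record and block data are passed in as plain Python values, and actually fetching them from the API server and Keep is left out.

```python
import json
import pathlib

def export_collection(record: dict, blocks: dict, outdir: str) -> None:
    """Write a collection record and its Keep blocks under `outdir`.

    `record` is the collection's API record as a dict; `blocks` maps each
    block's MD5 hash to its data bytes. Both are assumptions of this
    sketch -- the real tool would fetch them from the API server and Keep.
    """
    out = pathlib.Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    # Save the collection record as JSON, named by its UUID.
    (out / f"{record['uuid']}.json").write_text(json.dumps(record, indent=2))
    # Mirror the keepstore on-disk layout: each block goes in a
    # subdirectory named by the first three hex digits of its hash.
    for md5sum, data in blocks.items():
        blockdir = out / md5sum[:3]
        blockdir.mkdir(exist_ok=True)
        (blockdir / md5sum).write_bytes(data)
```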
Export project
- Fetch project record and save json to disk
- Fetch collections owned by the project and save them to disk (See above)
- Recursively save subprojects
Import collection
- Load the collection json from disk
- Go through the manifest and upload all the keep blocks
- Create a new collection record
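The second import step amounts to scanning the manifest text for Keep block locators. A minimal sketch (the helper name and regex are mine, not the tool's actual code); the real importer would then upload each block before creating the collection record:

```python
import re

# A Keep locator inside a manifest looks like "<md5>+<size>[+hints...]".
LOCATOR_RE = re.compile(r'\b([0-9a-f]{32})\+(\d+)')

def manifest_block_hashes(manifest_text: str) -> list:
    """Return the MD5 hash of each Keep block a manifest references,
    in order of first appearance and without duplicates."""
    seen = []
    for md5sum, _size in LOCATOR_RE.findall(manifest_text):
        if md5sum not in seen:
            seen.append(md5sum)
    return seen
```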
Import project
- Load the project json from disk
- Create a new project record
- Find all the exported collections owned by that project
- Import those collections
- Find all the exported projects owned by that project
- Recursively import those projects
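The recursive import steps can be sketched as follows. Everything here is hypothetical scaffolding: `create` stands in for the API call that creates a record and returns it, and `records_by_owner` maps an exported owner UUID to the records it owned (the "j7d0g" infix marks a project/group UUID).

```python
def import_project(project: dict, records_by_owner: dict, create) -> dict:
    """Recursively create a project and everything it owns.

    Creates the project first, then re-parents each exported child onto
    the newly created project, recursing into subprojects.
    """
    new_project = create(project)
    for child in records_by_owner.get(project["uuid"], []):
        # Point the child at the freshly created parent's new UUID.
        child = dict(child, owner_uuid=new_project["uuid"])
        if child["uuid"].split("-")[1] == "j7d0g":  # project/group UUID
            import_project(child, records_by_owner, create)
        else:
            create(child)
    return new_project
```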
Updated by Brett Smith 9 months ago
There is a branch 22868-import-export with some work. Need to check how far along it is.
Updated by Brett Smith 9 months ago
- Assigned To changed from Peter Amstutz to Tom Clegg
Tom to evaluate the state of the branch.
Updated by Tom Clegg 9 months ago
- arv-copy code is refactored a bit so it can be shared
- arv-import and arv-export commands exist and are expected to work for projects and collections
- no automated tests
- no web docs (but command line options are documented, which tells most of the story)
- I suspect all have been used successfully on real/dev clusters
The lack of tests is surely related to the fact that we currently have no Python tests that expect/spin up multiple clusters, or use arvados-server boot at all (that's how the controller integration tests spin up multiple clusters).
Updated by Brett Smith 9 months ago
- Target version changed from Development 2025-07-09 to Development 2025-07-23
Updated by Brett Smith 9 months ago
- Assigned To changed from Tom Clegg to Brett Smith
Updated by Brett Smith 8 months ago
- Target version changed from Development 2025-07-23 to Development 2025-08-06
Updated by Brett Smith 8 months ago
I'm really on the fence about how I want to handle this. I'm writing this out in the hope it'll help me think, and maybe let me solicit opinions from others.
The simplest thing we can do is a contrib directory with a requirements.txt that declares a dependency on arvados-python-client ~= 3.2.0dev0, a README that explains how to set up a virtualenv with that, and then standalone import and export scripts. We could optionally add a simple test that round-trips an export and import against the test cluster.
Generalizing the tool
I think the interface of these tools could be a little nicer if the flow was: you export any number of projects and collections to a directory, and then you import the whole directory rather than naming the UUID of a thing to import. That seems easy enough if arv-export organized the data in a way that helped you traverse the graph. The simplest approach would be to encode this in the filename. Imagine if it wrote out objects with filenames following the pattern DEPTH-UUID.json, where DEPTH is a number that counts up from 0. Everything with depth 0 would be the specific objects the user explicitly exported; depth 1 would be direct descendants of depth-0 projects; depth 2 descendants of depth 1; and so on. Then arv-import could scan for filenames matching this pattern and import them in order without any graph traversal logic. The UUID also tells the importer what kind of object it needs to create without actually parsing the JSON (although in principle that's already true).
(Then Keep blocks could be named MD5+SIZE.keepblock. Then we could keep all exported files in a single directory, which makes the export a little easier to transfer around; and every filename would have an extension, which just seems like a nice little courtesy to the rest of the world.)
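A sketch of how arv-import could pick up this DEPTH-UUID.json naming and derive a creation order with no graph traversal logic. The pattern and helper name are hypothetical, and the UUID regex assumes the usual 5-5-15 Arvados format:

```python
import pathlib
import re

# Proposed export filenames, e.g. "0-zzzzz-j7d0g-0123456789abcde.json".
EXPORT_NAME_RE = re.compile(
    r'^(\d+)-([0-9a-z]{5}-[0-9a-z]{5}-[0-9a-z]{15})\.json$')

def import_order(dirname) -> list:
    """Return (depth, uuid) pairs sorted so that shallower objects come
    first -- i.e. each parent project exists before its children."""
    entries = []
    for path in pathlib.Path(dirname).iterdir():
        m = EXPORT_NAME_RE.match(path.name)
        if m:
            entries.append((int(m.group(1)), m.group(2)))
    return sorted(entries)
```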
Option two is: arv-export writes a single arvados-export.json with an array of all the objects that need to be created, in order. The advantage of this approach is that it would make it easier to update the export. For example, imagine you have a project A with a single child project B that in turn has a bunch of children. The user exports project B, then later decides to add project A to the export. In principle arv-export has everything it needs to determine that all it has to do is add project A to the front of arvados-export.json. With depth-encoded filenames, the same update might require a series of renames, which is inherently riskier.
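For illustration, updating a single-file export in this format really is just a list insert. This assumes an arvados-export.json holding a JSON array in creation order; the function name is hypothetical:

```python
import json

def add_parent(export_path: str, parent_record: dict) -> None:
    """Prepend a newly exported ancestor to arvados-export.json.

    The file holds a JSON array of records in creation order, so adding
    an ancestor of the existing contents is an insert at the front --
    no renames, one atomic-ish rewrite of a single file.
    """
    with open(export_path) as f:
        objects = json.load(f)
    objects.insert(0, parent_record)
    with open(export_path, "w") as f:
        json.dump(objects, f, indent=2)
```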
This matters for a non-contrib v1 because the format on disk is sort of implicitly part of the API. Presumably we want arv-import to be able to import data exported by older versions of arv-export, at least for some period of time. Getting the disk format right is part of that. Even if we don't have all the functionality in version 1, having a disk format that supports the functionality we want makes our future lives easier.
Going from there
h-grams have also expressed interest in the ability to create Arvados objects on cluster boot. Most notably service container requests: those are meant to be the main thing users interact with on the cluster, so they should ~always be running. Submitting a new request on boot seems to be the easiest way to accomplish that. I have written a separate script to do this.
But instead of having something separate, arv-import could be the tool to do that, assuming we make the changes above, with just a little more work. I haven't looked closely enough to know how flexible it is about object creation, but at a high level it creates Arvados objects from JSON files, and that's all my script does too. It would be nice to have one tool instead of two.
I think this is okay, and useful, even if arv-export doesn't export containers or requests. This is a relatively niche use. Asking users to write container request JSON by hand in the directory is okay for version 1.
Blue sky
If we really commit to this strategy, arv-copy could be built on top of arv-export and arv-import: it just starts both and orchestrates communication between them. This would save us from having two implementations of the same functionality.
Updated by Brett Smith 8 months ago
Discussed with PGPi. They prioritize getting tools ASAP. I explained my ideas about improving the UX and changing the disk format, and they were cool with that. They would rather get tools now, and they'll happily switch to official ones if/when we provide them.
We are going to do an even smaller version of what I said earlier. We are literally only going to merge the required changes to arv-copy. The rest of the tooling will become scripts inside the PGPi repository instead of Arvados contrib.
Updated by Brett Smith 8 months ago
- Target version changed from Development 2025-08-06 to Development 2025-08-21
Updated by Brett Smith 7 months ago
- Target version changed from Development 2025-08-21 to Development 2025-09-03
Updated by Brett Smith 7 months ago
- Target version changed from Development 2025-09-03 to Development 2025-09-17
Updated by Brett Smith 6 months ago
- Target version changed from Development 2025-09-17 to Development 2025-10-01
Updated by Brett Smith 6 months ago
22868-arv-bootstrap @ 064e7e304c9797a7748ab611cf0755cb09f46172 - developer-run-tests: #4882 - The failure is #22824 which is a known issue.
This adds contrib/bootstrap-tools with the import and export tools Peter wrote, as well as my own arv-seed. With the branch checked out, and a virtualenv activated, you can install it for testing from your checkout by running:
WORKSPACE="$PWD" pip install sdk/python
pip install --no-deps contrib/bootstrap-tools
Then you can run the tools following contrib/bootstrap-tools/README.md.
- All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
- Yes
- Anything not implemented (discovered or discussed during work) has a follow-up story.
- N/A
- Code is tested and passing, both automated and manual, what manual testing was done is described.
- See above. There are barely any arv-copy tests as it is, so adding the infrastructure to test new options would've ballooned the scope of this ticket; that's not done. I did manually test arv-copy and all new tools with this change.
- Tested code incorporates recent main branch changes.
- Yes
- New or changed UI/UX and has gotten feedback from stakeholders.
- I have retained all the flags from Peter's code. I think we can consider me the interface reviewer for those changes, and I'm fine with them.
- Documentation has been updated.
- Yes
- Behaves appropriately at the intended scale (describe intended scale).
- Yes
- Considered backwards and forwards compatibility issues between client and server.
- N/A - pure end client changes
- Follows our coding standards and GUI style guidelines.
- Yes
Updated by Zoë Ma 6 months ago
The main issue I think is that the pyproject.toml file specifies the dependency as "arvados-python-client ~= 3.2.0, >= 3.2.0.dev20250918000000". However, pipx or pip currently cannot pull a version of the Python SDK that satisfies the dependency. So the user would have to install the SDK from source, but this is not straightforward (building the SDK from main using pipx/pip gives Exception: no version information available for arvados-python-client).
Updated by Brett Smith 6 months ago
Zoë Ma wrote in #note-22:
The main issue I think is that the pyproject.toml file specifies the dependency as "arvados-python-client ~= 3.2.0, >= 3.2.0.dev20250918000000". However, pipx or pip currently cannot pull the version of the Python SDK that satisfies the dependency. So the user would have to install the SDK from source, but this is not straightforward (building the SDK from main using pipx/pip gives Exception: no version information available for arvados-python-client).
You need to follow the instructions I provided when I posted the branch for review:
WORKSPACE="$PWD" pip install sdk/python
pip install --no-deps contrib/bootstrap-tools
In other words, to resolve the "no version info" error, you need to set WORKSPACE to the path of your Git checkout when you install the Python SDK. I agree this is not a great experience, but it's the kind of thing that happens when you work out of main sometimes. The version requirement absolutely has to be set this way because the tools require the changes to the Python SDK that are included in this branch. After 3.2.0 is released in a month or so, a plain pip install will work just fine, so I'd rather just wait for that to happen rather than build a custom solution to make the install easier in the meantime.
Updated by Zoë Ma 6 months ago
Sorry I think I missed the obvious in my previous message. Thanks for pointing it out.
In addition, here is what happened during the installation:
1. I installed the Python SDK from the source tree using pip under a venv.
2. Continuing to install the arvados-bootstrap tool from the source tree with a plain pip install ., the dependency check fails even though step 1 installed arvados-python-client 3.2.0.dev20250918194524; the installation of arvados-bootstrap only succeeds when --no-deps is specified on the pip command line.
My impression is that the version specification "arvados-python-client ~= 3.2.0, >= 3.2.0.dev20250918000000" is contradictory. Dev releases are ordered before 3.2.0 IIRC, but the compatible release operator ~= 3.2.0 requires a version of at least 3.2.0. So only releases at or after 3.2.0 can satisfy the constraint, and a 3.2.0 dev release never can.
I wonder if we could remove the ~= 3.2.0 dependency constraint so that the installation can be done by pip without --no-deps?
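For reference, this ordering can be checked directly, assuming the third-party packaging library (the same version machinery pip uses) is available. Under PEP 440, a .devN version sorts before its final release, so it can never satisfy ~= 3.2.0:

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

# PEP 440: a dev release precedes the final release it leads up to.
assert Version("3.2.0.dev20250918000000") < Version("3.2.0")

# The combined requirement from pyproject.toml.
spec = SpecifierSet("~= 3.2.0, >= 3.2.0.dev20250918000000")

# The installed dev SDK satisfies the ">=" clause but fails "~= 3.2.0",
# so the whole specifier rejects it; a post-3.2.0 release is accepted.
print(spec.contains("3.2.0.dev20250918194524", prereleases=True))  # False
print(spec.contains("3.2.1", prereleases=True))  # True
```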
Updated by Zoë Ma 6 months ago
One concern from me is that the name of the Python package (arvados-bootstrap) and the directory name under contrib (bootstrap-tools) don't match, and I suggest this would be easier to fix now than later, by renaming the directory.
The other (slight) concern is that the summary in the README.md file under contrib says "Python scripts to transfer Arvados records between a cluster and filesystem" -- which I feel may no longer be very accurate? My suggestion is to use the simple phrasing 'scripts to initialize an Arvados cluster with data' (taken from the subproject's README) -- maybe with slight variations -- in the contrib/README.md summary as well as in the pyproject.toml description.
Otherwise, they all look good to merge!
Updated by Brett Smith 6 months ago
Zoë Ma wrote:
Thanks to further discussion with Brett, I now better understand the concerns of this issue, and I feel that the concerns I raised in the previous comments are no longer valid.
For the record, we talked about how documentation is only meant to be applicable to final releases. Files like READMEs should not document things like how to install from Git (because we don't generally expect our users to do that).
One concern from me is that the name of the Python package (arvados-bootstrap) and the directory name under contrib (bootstrap-tools) don't match, and I suggest this would be easier to fix now than later, by renaming the directory.
The other (slight) concern is that the summary in the README.md file under contrib says "Python scripts to transfer Arvados records between a cluster and filesystem" -- which I feel may no longer be very accurate?
I made both these changes. I also updated the arvados-python-client dependency in pyproject.toml so you can install from Git now without using --no-deps. I also bumped all the version numbers.
I rebased and now the branch is at 9d2f35e6bb73ac657d9cc9de7c306977c315c1db. developer-run-tests: #4892
Updated by Brett Smith 6 months ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|481682334bdb1779069fea0466e5daa5b145ad16.