Feature #22868
Collection and project import/export tool (Closed)
Description
Export collection
- Fetch collection record and save json to disk
- Fetch the keep blocks and save them to disk (matching the same keep block directory layout used by disk cache, keepstore, etc)
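The two export steps above can be sketched with plain filesystem code. All names here are illustrative: the record and block data are passed in as plain Python values, and actually fetching them from the API server and Keep is left out.

```python
import json
import pathlib

def export_collection(record: dict, blocks: dict, outdir: str) -> None:
    """Write a collection record and its Keep blocks under `outdir`.

    `record` is the collection's API record as a dict; `blocks` maps each
    block's MD5 hash to its data bytes. Both are assumptions of this
    sketch -- the real tool would fetch them from the API server and Keep.
    """
    out = pathlib.Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    # Save the collection record as JSON, named by its UUID.
    (out / f"{record['uuid']}.json").write_text(json.dumps(record, indent=2))
    # Mirror the keepstore on-disk layout: each block goes in a
    # subdirectory named by the first three hex digits of its hash.
    for md5sum, data in blocks.items():
        blockdir = out / md5sum[:3]
        blockdir.mkdir(exist_ok=True)
        (blockdir / md5sum).write_bytes(data)
```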
Export project
- Fetch project record and save json to disk
- Fetch collections owned by the project and save them to disk (See above)
- Recursively save subprojects
Import collection
- Load the collection json from disk
- Go through the manifest and upload all the keep blocks
- Create a new collection record
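The second import step amounts to scanning the manifest text for Keep block locators. A minimal sketch (the helper name and regex are mine, not the tool's actual code); the real importer would then upload each block before creating the collection record:

```python
import re

# A Keep locator inside a manifest looks like "<md5>+<size>[+hints...]".
LOCATOR_RE = re.compile(r'\b([0-9a-f]{32})\+(\d+)')

def manifest_block_hashes(manifest_text: str) -> list:
    """Return the MD5 hash of each Keep block a manifest references,
    in order of first appearance and without duplicates."""
    seen = []
    for md5sum, _size in LOCATOR_RE.findall(manifest_text):
        if md5sum not in seen:
            seen.append(md5sum)
    return seen
```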
Import project
- Load the project json from disk
- Create a new project record
- Find all the exported collections owned by that project
- Import those collections
- Find all the exported projects owned by that project
- Recursively import those projects
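The recursive import steps can be sketched as follows. Everything here is hypothetical scaffolding: `create` stands in for the API call that creates a record and returns it, and `records_by_owner` maps an exported owner UUID to the records it owned (the "j7d0g" infix marks a project/group UUID).

```python
def import_project(project: dict, records_by_owner: dict, create) -> dict:
    """Recursively create a project and everything it owns.

    Creates the project first, then re-parents each exported child onto
    the newly created project, recursing into subprojects.
    """
    new_project = create(project)
    for child in records_by_owner.get(project["uuid"], []):
        # Point the child at the freshly created parent's new UUID.
        child = dict(child, owner_uuid=new_project["uuid"])
        if child["uuid"].split("-")[1] == "j7d0g":  # project/group UUID
            import_project(child, records_by_owner, create)
        else:
            create(child)
    return new_project
```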
Updated by Brett Smith 9 months ago
There is a branch 22868-import-export with some work. Need to check how far along it is.
Updated by Brett Smith 9 months ago
- Assigned To changed from Peter Amstutz to Tom Clegg
Tom to evaluate the state of the branch.
Updated by Tom Clegg 9 months ago
- arv-copy code is refactored a bit so it can be shared
- arv-import and arv-export commands exist and are expected to work for projects and collections
- no automated tests
- no web docs (but command line options are documented, which tells most of the story)
- I suspect all have been used successfully on real/dev clusters
The lack of tests is surely related to the fact that we currently have no Python tests that expect/spin up multiple clusters, or use arvados-server boot at all (that's how the controller integration tests spin up multiple clusters).
Updated by Brett Smith 9 months ago
- Target version changed from Development 2025-07-09 to Development 2025-07-23
Updated by Brett Smith 9 months ago
- Assigned To changed from Tom Clegg to Brett Smith
Updated by Brett Smith 8 months ago
- Target version changed from Development 2025-07-23 to Development 2025-08-06
Updated by Brett Smith 8 months ago
I'm really on the fence about how I want to handle this. I'm writing this out in the hope it'll help me think, and maybe let me solicit opinions from others.
The simplest thing we can do is a contrib directory with a requirements.txt that declares a dependency on arvados-python-client ~= 3.2.0dev0, a README that explains how to set up a virtualenv with that, and then standalone import and export scripts. We could optionally add a simple test that round-trips an export and import against the test cluster.
Generalizing the tool
I think the interface of these tools could be a little nicer if the flow was: you export any number of projects and collections to a directory, and then you import the whole directory rather than naming the UUID of a thing to import. That seems easy enough if arv-export organized the data in a way that helped you traverse the graph. The simplest approach would be to encode this in the filename. Imagine if it wrote out objects with filenames following the pattern DEPTH-UUID.json, where DEPTH is a number that counts up from 0. Everything with depth 0 would be the specific objects the user explicitly exported; depth 1 would be direct descendants of depth-0 projects; depth 2 descendants of depth 1; and so on. Then arv-import could scan for filenames matching this pattern and import them in order without any graph traversal logic. The UUID also tells the importer what kind of object it needs to create without actually parsing the JSON (although in principle that's already true).
(Then Keep blocks could be named MD5+SIZE.keepblock. Then we could keep all exported files in a single directory, which makes the export a little easier to transfer around; and every filename would have an extension, which just seems like a nice little courtesy to the rest of the world.)
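A sketch of how arv-import could pick up this DEPTH-UUID.json naming and derive a creation order with no graph traversal logic. The pattern and helper name are hypothetical, and the UUID regex assumes the usual 5-5-15 Arvados format:

```python
import pathlib
import re

# Proposed export filenames, e.g. "0-zzzzz-j7d0g-0123456789abcde.json".
EXPORT_NAME_RE = re.compile(
    r'^(\d+)-([0-9a-z]{5}-[0-9a-z]{5}-[0-9a-z]{15})\.json$')

def import_order(dirname) -> list:
    """Return (depth, uuid) pairs sorted so that shallower objects come
    first -- i.e. each parent project exists before its children."""
    entries = []
    for path in pathlib.Path(dirname).iterdir():
        m = EXPORT_NAME_RE.match(path.name)
        if m:
            entries.append((int(m.group(1)), m.group(2)))
    return sorted(entries)
```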
Option two is: arv-export writes a single arvados-export.json with an array of all the objects that need to be created, in order. The advantage of this approach is that it would make it easier to update the export. For example, imagine you have a project A with a single child project B that in turn has a bunch of children. The user exports project B, then later decides to add project A to the export. In principle arv-export has everything it needs to determine that all it has to do is add project A to the front of arvados-export.json. With depth-encoded filenames, the same update might require a series of renames, which is inherently riskier.
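For illustration, updating a single-file export in this format really is just a list insert. This assumes an arvados-export.json holding a JSON array in creation order; the function name is hypothetical:

```python
import json

def add_parent(export_path: str, parent_record: dict) -> None:
    """Prepend a newly exported ancestor to arvados-export.json.

    The file holds a JSON array of records in creation order, so adding
    an ancestor of the existing contents is an insert at the front --
    no renames, one atomic-ish rewrite of a single file.
    """
    with open(export_path) as f:
        objects = json.load(f)
    objects.insert(0, parent_record)
    with open(export_path, "w") as f:
        json.dump(objects, f, indent=2)
```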
This matters for a non-contrib v1 because the format on disk is sort of implicitly part of the API. Presumably we want arv-import to be able to import data exported by older versions of arv-export, at least for some period of time. Getting the disk format right is part of that. Even if we don't have all the functionality in version 1, having a disk format that supports the functionality we want makes our future lives easier.
Going from there
h-grams have also expressed interest in the ability to create Arvados objects on cluster boot. Most notably service container requests: those are meant to be the main thing users interact with on the cluster, so they should ~always be running. Submitting a new request on boot seems to be the easiest way to accomplish that. I have written a separate script to do this.
But instead of having something separate, arv-import could be the tool to do that, assuming we make the changes above, with just a little more work. I haven't looked closely enough to know how flexible it is about object creation, but at a high level it creates Arvados objects from JSON files, and that's all my script does too. It would be nice to have one tool instead of two.
I think this is okay, and useful, even if arv-export doesn't export containers or requests. This is a relatively niche use. Asking users to write container request JSON by hand in the directory is okay for version 1.
Blue sky
If we really commit to this strategy, arv-copy could be built on top of arv-export and arv-import: it just starts both and orchestrates communication between them. This would save us from having two implementations of the same functionality.
Updated by Brett Smith 8 months ago
Discussed with PGPi. They prioritize getting tools ASAP. I explained my ideas about improving the UX and changing the disk format, and they were cool with that. They would rather get tools now, and they'll happily switch to official ones if/when we provide them.
We are going to do an even smaller version of what I said earlier. We are literally only going to merge the required changes to arv-copy. The rest of the tooling will become scripts inside the PGPi repository instead of Arvados contrib.
Updated by Brett Smith 8 months ago
- Target version changed from Development 2025-08-06 to Development 2025-08-21
Updated by Brett Smith 7 months ago
- Target version changed from Development 2025-08-21 to Development 2025-09-03
Updated by Brett Smith 7 months ago
- Target version changed from Development 2025-09-03 to Development 2025-09-17
Updated by Brett Smith 6 months ago
- Target version changed from Development 2025-09-17 to Development 2025-10-01
Updated by Brett Smith 6 months ago
22868-arv-bootstrap @ 064e7e304c9797a7748ab611cf0755cb09f46172 - developer-run-tests: #4882 - The failure is #22824 which is a known issue.
This adds contrib/bootstrap-tools with the import and export tools Peter wrote, as well as my own arv-seed. With the branch checked out, and a virtualenv activated, you can install it for testing from your checkout by running:
WORKSPACE="$PWD" pip install sdk/python
pip install --no-deps contrib/bootstrap-tools
Then you can run the tools following contrib/bootstrap-tools/README.md.
- All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
- Yes
- Anything not implemented (discovered or discussed during work) has a follow-up story.
- N/A
- Code is tested and passing, both automated and manual, what manual testing was done is described.
- See above. There are barely any arv-copy tests as it is, so adding the infrastructure to test new options would've ballooned the scope of this ticket; that's not done. I did manually test arv-copy and all new tools with this change.
- Tested code incorporates recent main branch changes.
- Yes
- New or changed UI/UX and has gotten feedback from stakeholders.
- I have retained all the flags from Peter's code. I think we can consider me the interface reviewer for those changes, and I'm fine with them.
- Documentation has been updated.
- Yes
- Behaves appropriately at the intended scale (describe intended scale).
- Yes
- Considered backwards and forwards compatibility issues between client and server.
- N/A - pure end client changes
- Follows our coding standards and GUI style guidelines.
- Yes
Updated by Zoë Ma 6 months ago
The main issue I think is that the pyproject.toml file specifies the dependency as "arvados-python-client ~= 3.2.0, >= 3.2.0.dev20250918000000". However, pipx or pip currently cannot pull a version of the Python SDK that satisfies the dependency. So the user would have to install the SDK from source, but this is not straightforward (building the SDK from main using pipx/pip gives Exception: no version information available for arvados-python-client).
Updated by Brett Smith 6 months ago
Zoë Ma wrote in #note-22:
The main issue I think is that the pyproject.toml file specifies the dependency as "arvados-python-client ~= 3.2.0, >= 3.2.0.dev20250918000000". However, pipx or pip currently cannot pull the version of the Python SDK that satisfies the dependency. So the user would have to install the SDK from source, but this is not straightforward (building the SDK from main using pipx/pip gives Exception: no version information available for arvados-python-client).
You need to follow the instructions I provided when I posted the branch for review:
WORKSPACE="$PWD" pip install sdk/python
pip install --no-deps contrib/bootstrap-tools
In other words, to resolve the "no version info" error, you need to set WORKSPACE to the path of your Git checkout when you install the Python SDK. I agree this is not a great experience, but it's the kind of thing that happens when you work out of main sometimes. The version requirement absolutely has to be set this way because the tools require the changes to the Python SDK that are included in this branch. After 3.2.0 is released in a month or so, a plain pip install will work just fine, so I'd rather just wait for that to happen rather than build a custom solution to make the install easier in the meantime.
Updated by Zoë Ma 6 months ago
Sorry I think I missed the obvious in my previous message. Thanks for pointing it out.
In addition, here is what happened during the installation:
1. I installed the Python SDK from the source tree using pip under a venv.
2. Continuing to install the arvados-bootstrap tool from the source tree with a plain pip install ., the dependency check fails even though step 1 installed arvados-python-client 3.2.0.dev20250918194524; the installation of arvados-bootstrap only succeeds when --no-deps is specified on the pip command line.
My impression is that the version specification "arvados-python-client ~= 3.2.0, >= 3.2.0.dev20250918000000" is contradictory. Dev releases are ordered before 3.2.0 IIRC, but the compatible release operator ~= 3.2.0 requires a version of at least 3.2.0. So only releases at or after 3.2.0 can satisfy the constraint, and a 3.2.0 dev release never can.
I wonder if we could remove the ~= 3.2.0 dependency constraint so that the installation can be done by pip without --no-deps?
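For reference, this ordering can be checked directly, assuming the third-party packaging library (the same version machinery pip uses) is available. Under PEP 440, a .devN version sorts before its final release, so it can never satisfy ~= 3.2.0:

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

# PEP 440: a dev release precedes the final release it leads up to.
assert Version("3.2.0.dev20250918000000") < Version("3.2.0")

# The combined requirement from pyproject.toml.
spec = SpecifierSet("~= 3.2.0, >= 3.2.0.dev20250918000000")

# The installed dev SDK satisfies the ">=" clause but fails "~= 3.2.0",
# so the whole specifier rejects it; a post-3.2.0 release is accepted.
print(spec.contains("3.2.0.dev20250918194524", prereleases=True))  # False
print(spec.contains("3.2.1", prereleases=True))  # True
```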
Updated by Zoë Ma 6 months ago
One concern from me is that the name of the Python package (arvados-bootstrap) and the directory name under contrib (bootstrap-tools) don't match, and I suggest this would be easier to fix now than later, by renaming the directory.
The other (slight) concern is that the summary in the README.md file under contrib says "Python scripts to transfer Arvados records between a cluster and filesystem" -- which I feel may no longer be very accurate? My suggestion is to use the simple phrasing 'scripts to initialize an Arvados cluster with data' (taken from the subproject's README) -- maybe with slight variations -- in the contrib/README.md summary as well as in the pyproject.toml description.
Otherwise, they all look good to merge!
Updated by Brett Smith 6 months ago
Zoë Ma wrote:
Thanks to further discussion with Brett, I now better understand the concerns of this issue, and I feel that the concerns I raised in the previous comments are no longer valid.
For the record, we talked about how documentation is only meant to be applicable to final releases. Files like READMEs should not document things like how to install from Git (because we don't generally expect our users to do that).
One concern from me is that the name of the Python package (arvados-bootstrap) and the directory name under contrib (bootstrap-tools) don't match, and I suggest this would be easier to fix now than later, by renaming the directory.
The other (slight) concern is that the summary in the README.md file under contrib says "Python scripts to transfer Arvados records between a cluster and filesystem" -- which I feel may no longer be very accurate?
I made both these changes. I also updated the arvados-python-client dependency in pyproject.toml so you can install from Git now without using --no-deps. I also bumped all the version numbers.
I rebased and now the branch is at 9d2f35e6bb73ac657d9cc9de7c306977c315c1db. developer-run-tests: #4892
Updated by Brett Smith 6 months ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|481682334bdb1779069fea0466e5daa5b145ad16.