Story #2879 (status: Closed, 100% done)
Crunch uses Docker images stored in Keep, and records Keep locators used for Jobs
Description:
- Users put Docker images in Keep with a one-liner, hopefully docker save imagename | arv-put.
- Users specify that Collection locator as the docker_image constraint on a Job.
- crunch-job installs images from Keep rather than using docker pull.
- crunch-job records the Keep Collection locator used in a new Jobs column (to enable us to support symbolic names in the runtime constraints later).
- Crunch uses information from the new column to decide whether or not a Job is reusable.
Reach goals:
- Support Docker tags or similar functionality in the docker_image runtime constraint. (This gets translated so that only Keep Collection locators get stored in the new column.)
- Provide a two-way mapping between Docker image hashes and Collection locators.
Updated by Brett Smith over 10 years ago
- Project changed from 37 to Arvados
- Assigned To set to Brett Smith
Updated by Brett Smith over 10 years ago
- Subject changed from Component/job can specify a docker container image hash, and full hash and image are recorded permanently in job database much like script_version. to Crunch uses Docker images stored in Keep, and records Keep locators used for Jobs
- Description updated (diff)
Updated by Brett Smith over 10 years ago
Tom and I discussed this some more. One major issue we identified with our current approach is that all of the information about Docker images lives on the compute nodes. They get to decide which specific Docker image to use, and they might even choose different images (e.g., because docker pull got different results on different nodes for some reason). It's also impossible for other parts of the stack to reason about Docker images (i.e., is an image with this name available to use? Can a Job specifying this image name be reused?).
To solve those problems, we plan to write a small utility script to put Docker images in Arvados. It will start with a basic docker save [arguments] | arv-put. After success, it will make name links to capture repository and tag metadata from that image.
That bit of effort brings all kinds of benefits. It lets users specify Docker images as flexibly as they do script versions, and the API server can apply all the same Job reuse logic to Docker images that it does to script versions. It automatically captures provenance data; we'll be able to answer questions like "What Docker image was the most recent from repository FOO at this date?" It will let us create more useful UI around choosing Docker images in the future. So this thin wrapper seems worth the effort.
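A rough sketch of what such a wrapper could look like (the link classes match those discussed later in this story, but the function name, metadata fields, and use of head_uuid here are illustrative assumptions, not the final keepdocker.py interface):

```python
import subprocess

import arvados

def keep_docker_image(repository, tag='latest'):
    """Save a Docker image into Keep and record name links for it."""
    image_name = '{}:{}'.format(repository, tag)
    # The image hash pins down exactly which image we saved.
    image_hash = subprocess.check_output(
        ['docker', 'inspect', '--format={{.Id}}', image_name]).decode().strip()
    # docker save ... | arv-put, as in the one-liner above.
    docker_save = subprocess.Popen(['docker', 'save', image_name],
                                   stdout=subprocess.PIPE)
    locator = subprocess.check_output(
        ['arv-put'], stdin=docker_save.stdout).decode().strip()
    docker_save.wait()
    # Name links capture the repository+tag and hash metadata, using
    # the link classes mentioned later in this story.
    api = arvados.api('v1')
    for link_class, link_name in [('docker_image_hash', image_hash),
                                  ('docker_image_repo+tag', image_name)]:
        api.links().create(body={'link': {
            'head_uuid': locator,
            'link_class': link_class,
            'name': link_name,
        }}).execute()
    return locator
```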
Updated by Tim Pierce over 10 years ago
Review @ c5a9ca239
- sdk/python/arvados/commands/keepdocker.py
  - Document that this script accepts and forwards any options to arv-put, so any changes to the options should be careful not to conflict with arv_put.run_opts.
  - What does find_image_hash do when the image_tag argument is None? It looks like it compares the DockerImage.tag field literally to the None object, but I think that _get_docker_images will ensure that the tag field is always a string. Can you document the intended behavior and confirm that it does what it's supposed to do?
  - If --force is given, but the user has a cached docker save file in ~/.cache/arvados/docker, should --force imply invalidating the cache with a new "docker save"? I can see arguments either way, but it may depend on what goal is served by --force.
  - Not clear to me that storing a Docker image which has already been stored in Arvados should result in an error code. Writing diagnostics to stderr makes sense, but the result is the same as if the user had successfully stored the image in the first place.
Updated by Brett Smith over 10 years ago
Thanks for the feedback. This is all very helpful.
Tim Pierce wrote:
Review @ c5a9ca239
- sdk/python/arvados/commands/keepdocker.py
- Document that this script accepts and forwards any options to arv-put, so any changes to the options should be careful not to conflict with arv_put.run_opts.
Happily argparse takes care of this for us: if you try to assign an option to a parser twice, it raises an exception.
- What does find_image_hash do when the image_tag argument is None? It looks like it compares the DockerImage.tag field literally to the None object, but I think that _get_docker_images will ensure that the tag field is always a string. Can you document the intended behavior and confirm that it does what it's supposed to do?
You're right that passing None will mean the tag search never matches. That way, we only search for a matching hash, and not a repository+tag pair. I've refactored and documented find_image_hash to try to help clarify this; let me know if that's better.
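A hypothetical reconstruction of that behavior (not the actual keepdocker.py code), showing why image_tag=None disables the repository+tag comparison:

```python
def find_image_hash(images, image_search, image_tag=None):
    """Return the hash of the image matching the search, or None.

    `images` is assumed to be an iterable of (hash, repository, tag)
    tuples. Because every tag in `images` is a real string, the
    repository+tag comparison below can never succeed when image_tag
    is None, so a None tag means "match by hash only".
    """
    for image_hash, repository, tag in images:
        if image_hash.startswith(image_search):
            return image_hash  # matched by full or partial hash
        if repository == image_search and tag == image_tag:
            return image_hash  # matched by repository+tag pair
    return None
```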
- If --force is given, but the user has a cached docker save file in ~/.cache/arvados/docker, should --force imply invalidating the cache with a new "docker save"? I can see arguments either way, but it may depend on what goal is served by --force.
I don't think that's necessary. Since we cache images by their Docker hash, any change to the image means we'll refer to a new cache already. The primary goal of --force is to let the user proceed with an upload even if Links in the API server say that it's already there. Maybe because the underlying Collection has been (accidentally?) deleted, or the user knows that it's corrupt. I think using --force should be pretty rare.
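A sketch of why that holds, with assumed paths and helper names: the cache file is keyed by the image hash, so a changed image gets a new hash and therefore a fresh cache entry, with nothing to invalidate.

```python
import os
import subprocess

CACHE_DIR = os.path.expanduser('~/.cache/arvados/docker')

def cached_image_tarball(image_hash):
    """Return a path to a saved image tarball, reusing the cache if present."""
    if not os.path.isdir(CACHE_DIR):
        os.makedirs(CACHE_DIR)
    # Keyed by image hash: a modified image has a different hash and
    # lands in a different file, so --force never needs to re-save.
    path = os.path.join(CACHE_DIR, image_hash + '.tar')
    if not os.path.exists(path):
        with open(path, 'wb') as tarball:
            subprocess.check_call(['docker', 'save', image_hash],
                                  stdout=tarball)
    return path
```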
- Not clear to me that storing a Docker image which has already been stored in Arvados should result in an error code. Writing diagnostics to stderr makes sense, but the result is the same as if the user had successfully stored the image in the first place.
Agreed. I changed this to 0.
Merged master; the branch is now at aba0198.
Updated by Tim Pierce over 10 years ago
Reviewed @ aba0198:
Brett Smith wrote:
- sdk/python/arvados/commands/keepdocker.py
  - Document that this script accepts and forwards any options to arv-put, so any changes to the options should be careful not to conflict with arv_put.run_opts.
Happily argparse takes care of this for us: if you try to assign an option to a parser twice, it raises an exception.
OK, fair.
- What does find_image_hash do when the image_tag argument is None? It looks like it compares the DockerImage.tag field literally to the None object, but I think that _get_docker_images will ensure that the tag field is always a string. Can you document the intended behavior and confirm that it does what it's supposed to do?
You're right that passing None will mean the tag search never matches. That way, we only search for a matching hash, and not a repository+tag pair. I've refactored and documented find_image_hash to try to help clarify this; let me know if that's better.
It does, thanks.
I note now that if find_image_hash is called with a short hash that matches more than one image, it'll return None, which will yield the confusing error message "No image found". How about: let find_image_hash return a list of the images it found, and if there's more than one, print out a list of matching images and exit with an error status?
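A sketch of the suggested behavior, reusing the hypothetical (hash, repository, tag) tuples from the earlier note:

```python
import sys

def find_image_hashes(images, image_search, image_tag=None):
    """Return the set of image hashes matching the search."""
    return set(image_hash for image_hash, repository, tag in images
               if image_hash.startswith(image_search)
               or (repository == image_search and tag == image_tag))

images = []  # e.g., from a hypothetical _get_docker_images() listing

# The caller can now distinguish "not found" from "ambiguous short hash":
matches = find_image_hashes(images, 'abc123')
if not matches:
    sys.exit("error: no image found for 'abc123'")
elif len(matches) > 1:
    sys.exit("error: 'abc123' matches multiple images:\n" +
             '\n'.join(sorted(matches)))
```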
- If --force is given, but the user has a cached docker save file in ~/.cache/arvados/docker, should --force imply invalidating the cache with a new "docker save"? I can see arguments either way, but it may depend on what goal is served by --force.
I don't think that's necessary. Since we cache images by their Docker hash, any change to the image means we'll refer to a new cache already. The primary goal of --force is to let the user proceed with an upload even if Links in the API server say that it's already there. Maybe because the underlying Collection has been (accidentally?) deleted, or the user knows that it's corrupt. I think using --force should be pretty rare.
OK.
Updated by Brett Smith over 10 years ago
Tim Pierce wrote:
I note now that if find_image_hash is called with a short hash that matches more than one image, it'll return None, which will yield the confusing error message "No image found". How about: let find_image_hash return a list of the images it found, and if there's more than one, print out a list of matching images and exit with an error status?
Sounds good to me. Done in 80ed3a2.
Updated by Peter Amstutz over 10 years ago
- In #for_latest_docker_image, except for the first query where it searches directly for a collection uuid hash, the subsequent joins are logically backwards and this makes the rest of the method unnecessarily complicated. My understanding of inner joins in SQL is that the naive algorithm iterates over all rows of the left hand table (in this case collections) and searches on the joined elements of the right hand table (the links). What you're actually trying to express is that you want to select the relevant 'docker_image_repo+tag' or 'docker_image_hash' rows from the links table (the left hand table) and join to get the linked collection (the right hand table). This will enable you to loop through the search results once, find the most recent timestamp, and then use the join to get the associated collection.
- I got a testing failure:
  1) Failure:
  JobTest#test_can't_create_Job_with_Docker_image_locator [/home/peter/work/arvados/services/api/test/unit/job_test.rb:87]:
  ActiveModel::MassAssignmentSecurity::Error expected but nothing was raised.
- Arvados::V1::JobsController#create :find_or_create option doesn't take the docker image into account. At minimum it should probably default to a strict requirement that 'docker_image_locator' matches in order to re-use a job. Ideally it should be similar to script version ranges where the user can specify a range via minimum timestamp, maximum timestamp, and excluded image hashes for a given repository and tag.
Updated by Brett Smith over 10 years ago
Peter Amstutz wrote:
- In #for_latest_docker_image, except for the first query where it searches directly for a collection uuid hash, the subsequent joins are logically backwards and this makes the rest of the method unnecessarily complicated. My understanding of inner joins in SQL is that the naive algorithm iterates over all rows of the left hand table (in this case collections) and searches on the joined elements of the right hand table (the links). What you're actually trying to express is that you want to select the relevant 'docker_image_repo+tag' or 'docker_image_hash' rows from the links table (the left hand table) and join to get the linked collection (the right hand table). This will enable you to loop through the search results once, find the most recent timestamp, and then use the join to get the associated collection.
This sounds good to me, but unfortunately it doesn't interact well with the other suggestion from IRC to add readable_by to part of the search. There's no way to tell readable_by "apply the generated conditions to this table that I'm joining in," it simply applies to the table of the class you call. And we definitely care about readable Collections rather than readable Links.
I guess the way forward is to add an optional table name keyword parameter to readable_by, and use that instead of the class table name when provided. That actually shouldn't be too terrible, since readable_by consistently puts the table name directly in its SQL, so we just have to fill in our new variable (defaulting to the class table name) where we currently call table_name. And then base_search is Link.joins(collections).readable_by(...). Does that sound sensible to you?
- I got a testing failure:
[...]
Hrrm, that's weird. attr_protected should be making sure this initialization isn't possible. I'll have to do some more digging to figure out why we're getting inconsistent results.
- Arvados::V1::JobsController#create :find_or_create option doesn't take the docker image into account.
This is slated for future work on this story; see #2924. I wanted to get this reviewed and stable first so that I could be sure about how the new column gets treated before I start querying against it. crunch-job probably should've come later too, but I started hacking that before I realized the API server bits were a dependency.
Updated by Peter Amstutz over 10 years ago
Brett Smith wrote:
This sounds good to me, but unfortunately it doesn't interact well with the other suggestion from IRC to add readable_by to part of the search. There's no way to tell readable_by "apply the generated conditions to this table that I'm joining in," it simply applies to the table of the class you call. And we definitely care about readable Collections rather than readable Links.
I see, that makes sense and explains why the slightly convoluted logic is necessary. Please add a comment in the code about it.
I guess the way forward is to add an optional table name keyword parameter to readable_by, and use that instead of the class table name when provided. That actually shouldn't be too terrible, since readable_by consistently puts the table name directly in its SQL, so we just have to fill in our new variable (defaulting to the class table name) where we currently call table_name. And then base_search is Link.joins(collections).readable_by(...). Does that sound sensible to you?
I suggest punting on this unless it seems very easy and low-risk.
Hrrm, that's weird. attr_protected should be making sure this initialization isn't possible. I'll have to do some more digging to figure out why we're getting inconsistent results.
Let me know if you need me to look at this more.
- Arvados::V1::JobsController#create :find_or_create option doesn't take the docker image into account.
This is slated for future work on this story; see #2924. I wanted to get this reviewed and stable first so that I could be sure about how the new column gets treated before I start querying against it. crunch-job probably should've come later too, but I started hacking that before I realized the API server bits were a dependency.
Okay, that makes sense.
Updated by Brett Smith over 10 years ago
Peter Amstutz wrote:
Brett Smith wrote:
I guess the way forward is to add an optional table name keyword parameter to readable_by, and use that instead of the class table name when provided. That actually shouldn't be too terrible, since readable_by consistently puts the table name directly in its SQL, so we just have to fill in our new variable (defaulting to the class table name) where we currently call table_name. And then base_search is Link.joins(collections).readable_by(...). Does that sound sensible to you?
I suggest punting on this unless it seems very easy and low-risk.
I gave it a shot and thankfully it was as easy as I hoped.
Hrrm, that's weird. attr_protected should be making sure this initialization isn't possible. I'll have to do some more digging to figure out why we're getting inconsistent results.
Let me know if you need me to look at this more.
I tried to generalize the tests a bit and make them less implementation-specific. I'm curious if that helps.
Merged with master and ready for another look at 512eb32.
Updated by Peter Amstutz over 10 years ago
- readable_by should be applied to both sides of the join (links and collections). You may be able to get away with this by chaining together two calls to readable_by, one for each table.
- for_latest_docker_image is much easier to follow now; thanks for bearing with me.
- One more nitpick: if search_tag.nil? or repo_matches.empty? seems like it should just be if repo_matches.empty?, because if the user specified "abc123" and we found an image labeled "abc123:latest", that's almost certainly what the user intended, not a hash that coincidentally starts with "abc123", even if it is newer.
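In Python terms, the suggested preference looks like this (repo_matches/hash_matches are hypothetical stand-ins for the branch's variables):

```python
# repo_matches: images whose repository+tag matched the search string.
# hash_matches: images whose hash merely starts with the search string.
# Always prefer an explicit repository+tag match; only fall back to
# hash-prefix matches when there are no repo+tag matches at all,
# regardless of whether a coincidental hash match is newer.
matches = repo_matches if repo_matches else hash_matches
```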
Updated by Brett Smith over 10 years ago
Implemented all the suggested changes at f889daa.
Updated by Peter Amstutz over 10 years ago
- Should be able to specify a range of valid images. The user should be able to bracket a time range with earliest/latest timestamps, and blacklist specific versions, just like the git commit logic. Was this feature specifically rejected or just overlooked?
- uuids_for_docker_image needs a comment explaining that it returns a list of uuids sorted in ascending order, and that the last element is the most recent. (Sorting it in descending order since the first element is usually what you want would be more intuitive, but the most important thing is to document the behavior clearly.)
Updated by Tom Clegg over 10 years ago
- Status changed from New to In Progress
- Story points changed from 3.0 to 1.0
Updated by Tom Clegg over 10 years ago
- Target version changed from 2014-06-17 Curating and Crunch to 2014-07-16 Sprint
Updated by Brett Smith over 10 years ago
Peter Amstutz wrote:
- Should be able to specify a range of valid images. The user should be able to bracket a time range with earliest/latest timestamps, and blacklist specific versions, just like the git commit logic. Was this feature specifically rejected or just overlooked?
The current branch has some of this through the "not in docker" functionality. For example, you could filter "in docker debian, not in docker debian:sid" to get all Debian images that aren't unstable. Or you could say "in docker arvados/jobs, not in docker [Docker hash]" to accept all arvados/jobs images except one you know you don't like.
Definitely this isn't as rich as the git commit range functionality. Part of that is because the underlying data model, itself, isn't as rich; git commits live in a directed acyclic graph, while Docker images just exist at specific points in time. You can look at the graph to talk about whether and how git commit A relates to git commit B, but you can't say anything about how an image with one repo+tag pair compares to another.
We could implement timestamp range functionality specifically. That would either have to be another search format for the "in docker" operator/uuids_for_docker_image method to accept; or it could be a separate filter. How were you expecting it to look?
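For example, a sketch of how a client might use those operators through the Python SDK (the "in docker"/"not in docker" operators on docker_image_locator come from this story; passing them through jobs().list like this is an assumption):

```python
import arvados

# All jobs that ran a Debian image other than unstable, filtered on
# the new docker_image_locator column.
jobs = arvados.api('v1').jobs().list(filters=[
    ['docker_image_locator', 'in docker', 'debian'],
    ['docker_image_locator', 'not in docker', 'debian:sid'],
]).execute()
for job in jobs['items']:
    print('%s %s' % (job['uuid'], job['docker_image_locator']))
```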
Updated by Brett Smith over 10 years ago
Peter Amstutz wrote:
- Should be able to specify a range of valid images. The user should be able to bracket a time range with earliest/latest timestamps, and blacklist specific versions, just like the git commit logic. Was this feature specifically rejected or just overlooked?
Per discussion in the office, enhanced the "in docker" filter to accept an array of arguments. We expect this to satisfy most of the real use cases.
- uuids_for_docker_image needs a comment explaining that it returns a list of uuids sorted in ascending order, and that the last element is the most recent. (Sorting it in descending order since the first element is usually what you want would be more intuitive, but the most important thing is to document the behavior clearly.)
Adjusted the actual sorting order and cleaned this up in general. Merged with master, and ready for another look at 00949da.
Updated by Peter Amstutz over 10 years ago
- Status changed from In Progress to New
Got 3 test failures:
[ 39/115] JobTest#test_Job_initialized_with_Docker_image_hash_gets_locat
1) Failure:
JobTest#test_Job_initialized_with_Docker_image_hash_gets_locator [/home/peter/work/arvados/services/api/test/unit/job_test.rb:24]:
<"fa3c1a9cb6783f85f2ecda037e07b8c3+167"> expected but was <"b519d9cb706a29fc7ea24dbea2f05851+249025">.

[ 42/115] JobTest#test_Job_modified_with_Docker_image_hash_gets_locator
2) Failure:
JobTest#test_Job_modified_with_Docker_image_hash_gets_locator [/home/peter/work/arvados/services/api/test/unit/job_test.rb:34]:
<"fa3c1a9cb6783f85f2ecda037e07b8c3+167"> expected but was <"b519d9cb706a29fc7ea24dbea2f05851+249025">.

[ 53/115] JobTest#test_locate_a_Docker_image_with_a_partial_hash = 0.02
3) Failure:
JobTest#test_locate_a_Docker_image_with_a_partial_hash [/home/peter/work/arvados/services/api/test/unit/job_test.rb:71]:
<"fa3c1a9cb6783f85f2ecda037e07b8c3+167"> expected but was <"b519d9cb706a29fc7ea24dbea2f05851+249025">.
Updated by Peter Amstutz over 10 years ago
- Status changed from New to In Progress
Updated by Brett Smith over 10 years ago
- Status changed from In Progress to Resolved
- % Done changed from 72 to 100
Applied in changeset arvados|commit:ebc3fddd97628a50148ed90531b0a18d1cc867f9.