Feature #5414
closed[Keep] Use data stored remotely (not in local keepstore)
67%
Updated by Tom Clegg about 10 years ago
- Category set to Keep
- Assigned To set to Tom Clegg
Updated by Tom Clegg about 10 years ago
5414-keep-service-hints branch @ 9e91179:
- Support
+K@uuid
in Python library - Support
+K@uuid
and+K@prefix
in Go library - Some Go style fixes, and better logging in keepproxy (see commit message)
Updated by Peter Amstutz about 10 years ago
Comments on https://arvados.org/projects/arvados/wiki/Keep_S3_gateway
Does POST /collection
use the API token of the client to create the collection?
If POST /collection
just initiates a indexing process with S3, it should return 202 (Accepted) not 200. "The entity returned with this response SHOULD include an indication of the request's current status and either a pointer to a status monitor or some estimate of when the user can expect the request to be fulfilled."
Suggest additional API GET /collection/{uuid}/status
, POST /collection/{uuid}/cancel@ for monitoring/control of indexing process. The "status" endpoint could be polled, or possibly return streaming status updates using chunked transfer encoding.
Suggest using an in-process database like SQLite to store the {hash, remote object} mapping instead of lots of tiny files.
Suggest GET /collections
to get a list of collections managed by the gateway.
What's the use case for POST /manifest_text
?
I think it would be better to require that the client calling POST /collections
provide the S3 credentials instead of having the server be configured with a single set of credentials. This way, clients can't ask the indexing server to index anything that the client doesn't already have access to.
Neutral on refactoring keepstore/keepproxy, except to mention that I did originally suggest that the keepproxy functionality be a module of keepstore instead of a separate server, but was rejected and it turns out that except for exposing similar REST APIs, there isn't very much overlap that isn't already in the Go SDK, and the actual HTTPServer handlers which do most of the work are completely different.
Updated by Radhika Chippada about 10 years ago
Review comments:
- Mostly looks good and I was able to follow the updates towards the syntax and comment updates. I only have a few minor comments.
- Just an observation. LocalRoots and GatewayRoots doing lock: I am understating this as, even though we are just returning a field from within kc, we want to prevent data changing due to another thread doing SetServiceRoots on the same object for example.
- As long as we are trying to be standards complaint to the extent possible with comments and names etc, I think “pathPattern” would be better than “pathpattern”
- keepclient_test.go
- Rename “func (s *StandaloneSuite) TestGetWithServiceHint(c *C) ...” as “func (s *StandaloneSuite) TestGetWithServiceHintLocalUnused(c *C) ...” to make room for the following test?
- Would it make sense to add a test with both local and gateway server configurations matching hash+"+K@"+uuid and verify that get happens from the gateway configuration? If I am understanding correctly, the gateway config takes higher precedence over the local config?
- It does not seem like we have a test with keep proxy hint? (Please forgive me if I missed it in my reading the code). If not, would it make sense to add one?
- Why did you say “Support +K@uuid in Python library” in note #3? It appears that you also added +K@prefix support in python?
- Does it make sense to add additional tests as listed in keepclient_test.go item above for the python test suite also?
- Peter’s comments above makes me wonder if the S3 interface design is considering writing collection objects into S3. I am assuming that this is not the case and we will continue to use our current object store (postgres) to store collections. Here is an article that talks about S3 and performance with small files: http://blog.mortardata.com/post/58920122308/s3-hadoop-performance
Updated by Tom Clegg about 10 years ago
- Target version changed from 2015-04-01 sprint to Arvados Future Sprints
Updated by Tom Clegg about 10 years ago
Radhika Chippada wrote:
Review comments:
- Mostly looks good and I was able to follow the updates towards the syntax and comment updates. I only have a few minor comments.
- Just an observation. LocalRoots and GatewayRoots doing lock: I am understating this as, even though we are just returning a field from within kc, we want to prevent data changing due to another thread doing SetServiceRoots on the same object for example.
Right, RLock() avoids race conditions if someone calls SetServiceRoots() while we are doing "get pointer". (The existence of the atomic.LoadPointer()
suggests to me we shouldn't assume "get pointer" is atomic.)
- As long as we are trying to be standards complaint to the extent possible with comments and names etc, I think “pathPattern” would be better than “pathpattern”
Agreed. Changed to locatorMatcher... and while I was at it noticed that whole Locator interface was a bit sketchy, and cleaned it up. (Now it returns an error if there was an error, instead of a Locator with empty fields, etc.) AFAICT the only purpose this whole thing actually serves anywhere is to get the size hint in keepproxy.go, but presumably the idea is to use it elsewhere eventually, so I figured it should be at least capable of handling all valid locators without losing info.
- keepclient_test.go
- Rename “func (s *StandaloneSuite) TestGetWithServiceHint(c *C) ...” as “func (s *StandaloneSuite) TestGetWithServiceHintLocalUnused(c *C) ...” to make room for the following test?
- Would it make sense to add a test with both local and gateway server configurations matching hash+"+K@"+uuid and verify that get happens from the gateway configuration? If I am understanding correctly, the gateway config takes higher precedence over the local config?
The client should try the +K@{foo} services in the order specified (without regard to service type), then fall back to the usual local server probe order. I've added a test case where +K@{foo} points to a local service.
- It does not seem like we have a test with keep proxy hint? (Please forgive me if I missed it in my reading the code). If not, would it make sense to add one?
That would be nice... I figured this would require mocking http requests, though, and I didn't quite get that far. (Do we have any examples of this sitting around somewhere?)
- Why did you say “Support +K@uuid in Python library” in note #3? It appears that you also added +K@prefix support in python?
Python SDK already supported this, although I did rearrange it a bit while adding the +K@uuid support.
- Does it make sense to add additional tests as listed in keepclient_test.go item above for the python test suite also?
test_get_with_gateway_hints_in_order()
and test_get_with_remote_proxy_hint()
are meant to cover various permutations; what do you have in mind?
- Peter’s comments above makes me wonder if the S3 interface design is considering writing collection objects into S3. I am assuming that this is not the case and we will continue to use our current object store (postgres) to store collections. Here is an article that talks about S3 and performance with small files: http://blog.mortardata.com/post/58920122308/s3-hadoop-performance
No, there are no plans to store collection records / manifests in S3. (We don't even store them in Keep now, except as scratch space during crunch jobs...)
Updated by Tom Clegg about 10 years ago
- Target version changed from Arvados Future Sprints to Bug Triage
Updated by Tom Clegg about 10 years ago
- Target version changed from Bug Triage to 2015-04-29 sprint
Updated by Radhika Chippada about 10 years ago
Tom:
Just a few minor observation about the updates.
- formatting is off for keepclient.rb (may be you need to rerun go fmt ...)
- would it make sense to also add +K hint to at least one of the TestMakeLocator version in keepclient_test.go?
- Just noticed this file name in git diff and wondering why “zz_load_config.rb” was called as such instead of something like “arv_load_config.rb”? It appears that it was load_config.rb initially … Same question about the other sleepy file
- You asked about keepproxy test mocking: “Do we have any examples of this sitting around somewhere?”. I do not seem to find any such tests. We can revisit this at a later time if adding such test(s) is not feasible at this time.
Updated by Tom Clegg about 10 years ago
Radhika Chippada wrote:
Tom:
Just a few minor observation about the updates.
- formatting is off for keepclient.rb (may be you need to rerun go fmt ...)
Indeed, several formatting things fixed with gofmt -s
, thanks.
- would it make sense to also add +K hint to at least one of the TestMakeLocator version in keepclient_test.go?
Yes. Added a test with a badly formatted +K and an entirely unrecognizable +U, to make sure it's forward-compatible with future kinds of hints.
- Just noticed this file name in git diff and wondering why “zz_load_config.rb” was called as such instead of something like “arv_load_config.rb”? It appears that it was load_config.rb initially … Same question about the other sleepy file
zz_load_config has to run after secret_token.rb (if the latter exists) and zz_preload_models has to run after zz_load_config. Alphabetical naming seems to be the official Rails solution for controlling load order, but I've added some "require_relative" statements and that seems to do what we need, explicitly and without weird filenames.
- You asked about keepproxy test mocking: “Do we have any examples of this sitting around somewhere?”. I do not seem to find any such tests. We can revisit this at a later time if adding such test(s) is not feasible at this time.
OK. This would be a good thing to do but I don't want it to hold up the "service hints" stuff. (I probably should have put the "+K@zzzzz" piece in its own branch in the first place, so "remote proxy" test issues could hold up the "remote proxy" feature without also holding up the "gateway" feature...)
Updated by Radhika Chippada about 10 years ago
Tom, LGTM. And thank you for explaining initializer loading and renaming.
Updated by Tom Clegg almost 10 years ago
- Target version changed from 2015-04-29 sprint to 2015-05-20 sprint
Updated by Tom Clegg almost 10 years ago
- Target version changed from 2015-05-20 sprint to 2015-06-10 sprint
Updated by Brett Smith almost 10 years ago
- Target version changed from 2015-06-10 sprint to Arvados Future Sprints
Updated by Tom Clegg about 8 years ago
- Status changed from In Progress to Closed
Updated by Tom Morris about 6 years ago
- Target version deleted (
Arvados Future Sprints)