Feature #8128
closed[Crunch2] API support for crunch-dispatch
Description
See Container dispatch ("Arvados API support")
Some of the required APIs don't exist yet:- "Locked" state
- auth_uuid field: created automatically when container changes to state="Running"
- arvados.v1.containers.auth(uuid=container.uuid): retrieve the api_client_authentication record corresponding to auth_uuid
- api_client_authorizations.current: retrieve UUID of current token (similar to users.current). This avoids having to put the secret token anywhere other than the Authorization header.
- Filter containers list by API token:
containers.list?filters=[[locked_by_uuid,=,my_token_uuid],[state,in,[Running,Locked]]] - Get list of queued containers:
containers.list?filters=[[state,=,Queued]] - Get UUID of current token:
api_client_authorizations.list?filters=[[api_token,=,my_token]]
Updated by Tom Clegg almost 10 years ago
- Target version set to 2016-05-11 sprint
Updated by Peter Amstutz almost 10 years ago
// Update container status to Running
err := arv.Update("containers", uuid,
arvadosclient.Dict{
"container": arvadosclient.Dict{"state": "Locked"}},
nil)
if err != nil {
log.Printf("Error updating container %v to 'Locked' state: %v", uuid, err)
}
- The comment says the 1st step is to "set container state to locked", but it locks the container record after starting the crunch-run -process. I believe the comment is correct and the code is wrong in this case.
- If it fails to lock the container, it logs an error but continues to run it anyway
- There's nothing here that fetches the container-specific authorization token after it is locked
- The comment says the 1st step is to "set container state to locked", but it locks the container record after submitting it to the slurm queue. I believe the comment is correct and the code is wrong in this case.
- Line 161 ends at column 237, which is a bit excessive even for my 200-character-wide Emacs window.
- There's nothing here that fetches the container-specific authorization token after it is locked
To be continued..
Updated by Peter Amstutz almost 10 years ago
- Hard time following logic in validate_lock, could use some comments.
- Need row-level locking (with_lock) anytime container state field changes. Easy
way to do that with Rails?
Current design seems to be that crunch-run is responsible for fetching auth
token, but crunch-run isn't doing that yet. Crunch-run itself currently runs
with same token as dispatcher. This seems like a bad idea, since the dispatcher token is a long-lived admin token.
Suggest that crunch-run runs with auth token (requires changes to dispatchers, no change to crunch-run). This would imply that both locked_by and auth
token are permitted to update container record.
Dispatchers should honor priority list: "order by priority DESC, modified_at ASC"
Also filter priority > 0
Slurm dispatcher should pass priority on sbatch command line, --priority=<value>
When claiming a Locked container previously owned by me, should check squeue to make sure
the job exists, clean up record if not.
Updated by Peter Amstutz almost 10 years ago
Perhaps we can tweak the containers controller so that all updates take a row level lock?
Updated by Tom Clegg almost 10 years ago
Added row-level lock in container controller.
Added comments in validate_lock.
Fixed the Locked/Running state changes in dispatchers (you're right, I changed the Running transition but I should have added a Locked transition elsewhere).
Pretty sure the Running transition really should be happening inside crunch-run anyway, not dispatch, but I don't want to block the API changes behind that.
The rest of the comments are about improving dispatchers, not the API support or the changes I made to dispatchers here, right?
Updated by Peter Amstutz almost 10 years ago
@ 88d5c65 services/crunch-dispatch-local tests fails
********** Running services/crunch-dispatch-local tests **********
2016/05/11 16:06:08 authSettings: map[ARVADOS_API_HOST:0.0.0.0:37318 ARVADOS_API_HOST_INSECURE:true ARVADOS_API_TOKEN:4axaw8zxe0qm22wa6urpp5nskcne8z88cvbupv653y1njyi05h]
2016/05/11 16:06:10 About to run queued container zzzzz-dz642-queuedcontainer
2016/05/11 16:06:10 Starting container zzzzz-dz642-queuedcontainer
zzzzz-dz642-queuedcontainer
2016/05/11 16:06:10 Finished container run for zzzzz-dz642-queuedcontainer
2016/05/11 16:06:10 After crunch-run process termination, the state is still 'Running' for zzzzz-dz642-queuedcontainer. Updating it to 'Complete'
2016/05/11 16:06:13 Caught signal: interrupt
2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716"
2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716"
2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716"
2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716"
2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716"
2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716"
2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716"
2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716"
2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716"
2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716"
2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716"
2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716"
2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716"
2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716"
2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716"
2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716"
2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716"
2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716"
2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716"
2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716"
2016/05/11 16:06:17 Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state: "arvados API server error: 500: 500 Internal Server Error returned by 127.0.0.1:35716"
----------------------------------------------------------------------
FAIL: /var/lib/arvados/test/GOPATH/src/git.curoverse.com/arvados.git/services/crunch-dispatch-local/crunch-dispatch-local_test.go:96: MockArvadosServerSuite.Test_APIErrorUpdatingContainerState
/var/lib/arvados/test/GOPATH/src/git.curoverse.com/arvados.git/services/crunch-dispatch-local/crunch-dispatch-local_test.go:103:
testWithServerStub(c, apiStubResponses, "echo", "Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state")
/var/lib/arvados/test/GOPATH/src/git.curoverse.com/arvados.git/services/crunch-dispatch-local/crunch-dispatch-local_test.go:155:
c.Check(buf.String(), Matches, `(?ms).*`+expected+`.*`)
... value string = "" +
... "2016/05/11 16:06:15 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" +
... "2016/05/11 16:06:15 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" +
... "2016/05/11 16:06:15 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" +
... "2016/05/11 16:06:15 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" +
... "2016/05/11 16:06:15 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" +
... "2016/05/11 16:06:15 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" +
... "2016/05/11 16:06:15 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" +
... "2016/05/11 16:06:15 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" +
... "2016/05/11 16:06:15 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" +
... "2016/05/11 16:06:15 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" +
... "2016/05/11 16:06:15 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" +
... "2016/05/11 16:06:15 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" +
... "2016/05/11 16:06:15 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" +
... "2016/05/11 16:06:15 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" +
... "2016/05/11 16:06:17 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" +
... "2016/05/11 16:06:17 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" +
... "2016/05/11 16:06:17 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" +
... "2016/05/11 16:06:17 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" +
[50,000 more lines]
... "2016/05/11 16:06:17 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" +
... "2016/05/11 16:06:17 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n" +
... "2016/05/11 16:06:17 Caught signal: terminated\n" +
... "2016/05/11 16:06:17 About to run queued container zzzzz-dz642-xxxxxxxxxxxxxx1\n"
... regex string = "(?ms).*Error updating container zzzzz-dz642-xxxxxxxxxxxxxx1 to 'Locked' state.*"
zzzzz-dz642-xxxxxxxxxxxxxx2
zzzzz-dz642-xxxxxxxxxxxxxx2
zzzzz-dz642-xxxxxxxxxxxxxx2
zzzzz-dz642-xxxxxxxxxxxxxx2
zzzzz-dz642-xxxxxxxxxxxxxx2
zzzzz-dz642-xxxxxxxxxxxxxx2
2016/05/11 16:06:19 After crunch-run process termination, the state is still 'Running' for zzzzz-dz642-xxxxxxxxxxxxxx2. Updating it to 'Complete'
2016/05/11 16:06:19 After crunch-run process termination, the state is still 'Running' for zzzzz-dz642-xxxxxxxxxxxxxx2. Updating it to 'Complete'
zzzzz-dz642-xxxxxxxxxxxxxx2
zzzzz-dz642-xxxxxxxxxxxxxx2
zzzzz-dz642-xxxxxxxxxxxxxx2
zzzzz-dz642-xxxxxxxxxxxxxx2
zzzzz-dz642-xxxxxxxxxxxxxx2
zzzzz-dz642-xxxxxxxxxxxxxx2
OOPS: 4 passed, 1 FAILED
--- FAIL: Test (15.27 seconds)
FAIL
Updated by Tom Clegg almost 10 years ago
- Status changed from New to In Progress
Updated by Tom Clegg almost 10 years ago
Flaky test. Seems to be one of those SIGPIPE races, because we were trying to pipe a "#!/bin/sh\necho foo bar\n" script into a mock sbatch "echo foo". Changed the mock sbatch to "sh" and it hasn't failed for me since, fwiw.
→ cbc54fd
Updated by Tom Clegg almost 10 years ago
- Status changed from In Progress to Resolved
- % Done changed from 80 to 100
Applied in changeset arvados|commit:3ede205bf0e5af7a88e04009cd66ff111e0729c3.