Feature #17170
closedShell into container proof of concept
Added by Peter Amstutz about 4 years ago. Updated over 3 years ago.
100%
Description
Initial proof of concept: go program that runs ssh server that runs bash. Confirm that connecting with ssh gets a working pseudoterminal and that signals and control characters (e.g. SIGWINCH, backspace) work correctly. Fullscreen interactive programs like less
should work.
https://godoc.org/golang.org/x/crypto/ssh#example-NewServerConn
Minimum viable feature, once we've confirm the proof of concept:
- Crunch-run updates the container record with the IP address and port where it is listening (need to add new database column?)
- Connect to crunch-run with ssh
- Accept a token for password authentication
- run "docker exec -ti" on the container with stdin/stdout/stderr connected.
Updated by Peter Amstutz about 4 years ago
I did some reading about PTYs.
The gist seems to be that you have pair of character devices (unfortunately named "master" and "slave" devices). Characters typed at the keyboard are sent to the "master" device. The "master" device forwards those keys to the "slave" device which is connected to stdin of one or more processes (*).
When the master device sees special control codes like ^C, instead of forwarding it directly it turns it into some other kind of action such as sending SIGINT to the process group. Programs that want to use ^C as a regular key (e.g. emacs) tell the PTY through local ioctl() to change the special behavior of those codes.
This seems to suggest that just copying data between ssh session and the PTY created by Docker means that most things will "just work".
For window resizes, the PTY needs to be told about it by something else, using ioctl():
https://stackoverflow.com/questions/63230001/how-does-sigwinch-pass-through-bash
To support both interactive and batch sessions, we want to handle the request a PTY message.
According to the SSH RFC there are special SSH messages for window resize and sending signals
We want to handle translating window resize into the appropriate ioctl().
The SSH messages for sending signals are mostly not needed for interactive sessions, but would be useful for batch sessions.
(*) I think more than one process can have the same stdin, but inbound characters are read only once, so if more than one process was reading stdin at once they would race? Unless there's some notion of a foreground process that is the only one allowed to read from the PTY. However this works exactly, we shouldn't need to worry about it because it happens between the master and slave devices of the PTY.
Updated by Peter Amstutz about 4 years ago
https://pkg.go.dev/golang.org/x/crypto/ssh#example-NewServerConn
This shows how to have a basic ssh server, which includes handling out-of-band requests.
I suspect the normal sequence for an interactive session is something like "session" "req-pty" "shell"
I suspect the normal sequence for a non-interactive session is something like "session" "exec"
Updated by Peter Amstutz about 4 years ago
One additional wrinkle. Normal Docker logging works by connecting to a multiplexed socket (stdout/stderr from the container is routed through the Docker daemon). I don't know what it does when you request a PTY.
It might work the same way as logging for stdout/stderr (maybe same message format for stdin?), and the out of band messaging (sending signals, terminal resize) are API calls to Docker (will need to dig through the client API).
Updated by Tom Clegg about 4 years ago
- Related to Story #17103: Developer shell inside running container added
Updated by Peter Amstutz about 4 years ago
- Target version deleted (
2021-01-06 Sprint)
Updated by Peter Amstutz almost 4 years ago
- Assigned To set to Tom Clegg
- Target version set to 2021-01-20 Sprint
Updated by Tom Clegg almost 4 years ago
- Related to Feature #17206: crunch-run reverse proxies HTTP requests to container added
Updated by Tom Clegg almost 4 years ago
- Related to Story #17207: External access to web services running in containers added
Updated by Tom Clegg almost 4 years ago
Looking for some initial feedback.
17170-container-shell @ dad86790c5f3edbcf38702542ba46cb7a9e7f42a -- developer-run-tests: #2263- All parts are implemented: arvados-dispatch-cloud→crunch-run→controller to set the gateway_address field, and arvados-client→ssh→arvados-client→arvados-controller→crunch-run→docker-exec→container to log in.
- Manual testing is easy on ce8i5 if an auto-deploy hasn't overwritten controller/a-d-c: submit a container request (more interesting if you set {API: true} to get network access), then "go run ./cmd/arvados-client shell $container_request_uuid"
- You can add SSH client options after the UUID: "arvados-client shell $container_request_uuid -v -o controlpath=none"
- You can specify a user account inside the container (default is root): "arvados-client shell nobody@$container_request_uuid"
- You can specify a command: "echo foo | arvados-client shell nobody@$container_request_uuid tr o e"
- ^C, ^Z, and terminal window size change are all working.
- The default "detach" key sequence is ^]^] (instead of docker's default ^P^Q, which I find super annoying because it conflicts with the readline up-arrow). There appears to be no way to disable the key sequence entirely. I haven't looked into what it would take to reattach a detached session; I'm more inclined to recommend screen/tmux. (Thoughts?)
- Currently, permission is "you were the last user to touch all of the CRs associated with this container". We might want to make this less restrictive in future, e.g., support can_login links, but I think this is reasonable for a first release.
- Currently, no attempt to "mark container as not eligible for reuse".
- Uses command line tool ("docker exec -i ...") instead of the Go client: it was easy to write, and will be easy to mock out for testing. Do we have any motivation to use the native client instead, like we do for our other docker interactions in crunch-run?
- Configuration: do we want the site-wide flag to disable it? (Could check flag & reject in controller.)
- Do we want a user-accessible (per-container) flag to disable it?
- Upgrade note: sysadmin must add "connection upgrade" lines to Nginx configuration.
- Tests: need lib/controller/localdb permission test for containers with multiple associated CRs for different users.
- Tests: need a test that uses a real SSH client to run a command via lib/crunchrun gateway.
- Documentation for the client-facing part. (Where? I don't think we even have docs for installing/using the arvados-client program yet.)
Updated by Ward Vandewege almost 4 years ago
Tom Clegg wrote:
Looking for some initial feedback.
17170-container-shell @ dad86790c5f3edbcf38702542ba46cb7a9e7f42a -- developer-run-tests: #2263
- All parts are implemented: arvados-dispatch-cloud→crunch-run→controller to set the gateway_address field, and arvados-client→ssh→arvados-client→arvados-controller→crunch-run→docker-exec→container to log in.
- Manual testing is easy on ce8i5 if an auto-deploy hasn't overwritten controller/a-d-c: submit a container request (more interesting if you set {API: true} to get network access), then "go run ./cmd/arvados-client shell $container_request_uuid"
Yes! It worked! That was pretty awesome.
- You can add SSH client options after the UUID: "arvados-client shell $container_request_uuid -v -o controlpath=none"
- You can specify a user account inside the container (default is root): "arvados-client shell nobody@$container_request_uuid"
- You can specify a command: "echo foo | arvados-client shell nobody@$container_request_uuid tr o e"
- ^C, ^Z, and terminal window size change are all working.
- The default "detach" key sequence is ^]^] (instead of docker's default ^P^Q, which I find super annoying because it conflicts with the readline up-arrow). There appears to be no way to disable the key sequence entirely. I haven't looked into what it would take to reattach a detached session; I'm more inclined to recommend screen/tmux. (Thoughts?)
I think that would be fine to start.
- Currently, no attempt to "mark container as not eligible for reuse".
We should probably figure out what makese sense to do here. Cf. the comment below about allowing people to enable/disable the feature per workflow.
A few notes:
- should probably check that ARVADOS_API_HOST and ARVADOS_API_TOKEN are defined, and error out if not.
- cmd/arvados-client/container_gateway.go should validate that targetUUID is a valid CR/container UUID. I get this when I give it a collection UUID, for instance:
$ go run ./cmd/arvados-client shell ce8i5-4zz18-t1qaidhvwn7rusu error setting up tunnel: server did not provide a tunnel: request failed: http://localhost:8004/arvados/v1/containers/ce8i5-4zz18-t1qaidhvwn7rusu: 404 Not Found: Path not found (req-yewplazjko7817hghxxk) (HTTP 404) kex_exchange_identification: Connection closed by remote host Connection closed by UNKNOWN port 65535 exit status 255
- Also, in the above error, it would probably be better not to leak the internal URL
- It would be nice to drop the last 3 lines when something goes wrong, they don't seem to add anything useful, and the exit code appears to be always 1 (is that intentional?):
kex_exchange_identification: Connection closed by remote host Connection closed by UNKNOWN port 65535 exit status 255
- Wrt permissions checking: currently, permission is "you were the last user to touch all of the CRs associated with this container".
This seems a bit incongruous with the rest of Arvados. Is there a reason to not use normal ownership checks? Can we at least let admins access any container, please?
- on handleSSH in `lib/crunchrun/container_gateway.go`:
// handleSSH connects to an SSH server that runs commands as root in // the container. The tunnel itself can only be created by an // authenticated caller, so the SSH server itself is wide open (any // password or key will be accepted).
I assume that should read "that runs commands as root on the compute node", right?
- (same file/function) defaulting the username to "root" makes sense in Docker land, but isn't going to be right for Singularity. Not sure if that matters at this point.
- You asked: uses command line tool ("docker exec -i ...") instead of the Go client: it was easy to write, and will be easy to mock out for testing. Do we have any motivation to use the native client instead, like we do for our other docker interactions in crunch-run?
I guess this creates a new dependency on the docker executable, which we didn't have before. This could cause some issues with paths etc. Is it much harder to use the native client?
TODO:
- Configuration: do we want the site-wide flag to disable it? (Could check flag & reject in controller.)
Yeah, probably default to disabled, I think. Restarting controller is the only thing that would be needed when the value is changed?
- Do we want a user-accessible (per-container) flag to disable it?
We probably need something here, to make sure that people can still run "pure" containers. Would this be an arvados extension like the ones listed here: https://doc.arvados.org/v2.1/user/cwl/cwl-extensions.html ? Would it still be subject to the global flag? How do we want this to work?
- Upgrade note: sysadmin must add "connection upgrade" lines to Nginx configuration.
Yep.
- Tests: need lib/controller/localdb permission test for containers with multiple associated CRs for different users.
- Tests: need a test that uses a real SSH client to run a command via lib/crunchrun gateway.
OK.
- Documentation for the client-facing part. (Where? I don't think we even have docs for installing/using the arvados-client program yet.)
We're going to add a command line tool reference section, ticket will be forthcoming. That page would encompass what's on https://doc.arvados.org/v2.1/sdk/cli/subcommands.html, as well. Putting a page adjacent to that page with arvados-client documentation would be a fine place in the interim.
Updated by Peter Amstutz almost 4 years ago
- Target version changed from 2021-01-20 Sprint to 2021-02-03 Sprint
Updated by Tom Clegg almost 4 years ago
- Related to Story #17284: [controller] Redact internal RailsAPI URLs from error messages returned to caller added
Updated by Tom Clegg almost 4 years ago
Ward Vandewege wrote:
- Currently, no attempt to "mark container as not eligible for reuse".
We should probably figure out what makese sense to do here. Cf. the comment below about allowing people to enable/disable the feature per workflow.
Perhaps for now we can mention the implications for reuse, leave it globally-disabled by default, and label it "experimental"? That way we can figure out a proper solution for reuse/permissions, and we/others can play with it in the meantime.
- should probably check that ARVADOS_API_HOST and ARVADOS_API_TOKEN are defined, and error out if not.
Yes, all of the arvados-client stuff is awkward this way. I think we can solve this across the board (and for other SDK users too) by having all requests return "config error" instead of a confusing "no host in url" message.
- cmd/arvados-client/container_gateway.go should validate that targetUUID is a valid CR/container UUID. I get this when I give it a collection UUID, for instance:
Fixed, now the error is "target UUID is not a container or container request UUID: %s"
- Also, in the above error, it would probably be better not to leak the internal URL
This is a general arvados-controller habit. I see your point (it's confusing to see an unfamiliar host:port, and potentially reveals internal network details), OTOH it's potentially useful for troubleshooting -- e.g., it tells us the error is coming from railsAPI, not controller. Perhaps it should be configurable? Either way I think we should deal with this as a separate issue, independently of the container-shell story, so I added #17284.
Those 3 lines come from the OpenSSH client. We could...
- It would be nice to drop the last 3 lines when something goes wrong, they don't seem to add anything useful, and the exit code appears to be always 1 (is that intentional?):
kex_exchange_identification: Connection closed by remote host Connection closed by UNKNOWN port 65535 exit status 255
- do a probe first, before exec()ing the ssh client, which would catch the most common errors without extra SSH noise
- write an extra blank line or some other sort of ascii art to make the real message stand out better
- rearrange the tunnel setup so we exec() ssh only after the tunnel is established (but I think this would be way more trouble than it's worth)
- Wrt permissions checking: currently, permission is "you were the last user to touch all of the CRs associated with this container".
This seems a bit incongruous with the rest of Arvados. Is there a reason to not use normal ownership checks? Can we at least let admins access any container, please?
Allowing admin sounds good, yes.
Permissions are special because of container reuse. Normal Arvados permissions mean no user can touch any container: the only thing you can do to a container is cancel an associated container request, and even that doesn't necessarily cancel the container, because it might also be satisfying other [users'] container requests.
This was an attempt to avoid the most obvious cases of accidentally infiltrating and altering part of someone else's workflow. It's not reliable -- e.g., the container could still be used to satisfy future container requests, even after it's been interfered with.
Perhaps for the sake of experimenting, we could do something like this:
Containers:
ShellAccess:
# Any admin can start an interactive shell (with any user ID) in
# any running container, with any user ID
Admin: true
# Any user can start an interactive shell (with any user ID) in
# any running container that isn't also associated with a
# different user's container request
User: true
# Lift the "different user's container request" restriction: allow
# shell logins even in a container that, due to automatic
# container reuse, is also associated with concurrent container
# requests that were submitted by different users.
InSharedContainers: true
"InSharedContainers" feels like a weird name, suggestions appreciated for all of those keys.
// handleSSH connects to an SSH server that runs commands as root in // the container. The tunnel itself can only be created by an // authenticated caller, so the SSH server itself is wide open (any // password or key will be accepted).I assume that should read "that runs commands as root on the compute node", right?
The ssh server itself runs on the host, but it starts commands in the container, not on the compute/worker host. It should say "as a specified user" instead of "as root", though. And perhaps this would be clearer? "... SSH server that allows the caller to run interactive commands inside the container" ?
- (same file/function) defaulting the username to "root" makes sense in Docker land, but isn't going to be right for Singularity. Not sure if that matters at this point.
Yes... I think it's okay to leave it like this for now, and figure out different "default" behavior when we handle Singularity?
- You asked: uses command line tool ("docker exec -i ...") instead of the Go client: it was easy to write, and will be easy to mock out for testing. Do we have any motivation to use the native client instead, like we do for our other docker interactions in crunch-run?
I guess this creates a new dependency on the docker executable, which we didn't have before. This could cause some issues with paths etc. Is it much harder to use the native client?
It looks like we have a couple of options: https://godoc.org/github.com/docker/docker/client#Client.ContainerExecAttach or, if we need anything extra the real CLI tool does, we can import that part of the CLI tool bu calling container.NewExecCommand() from command/container in https://github.com/docker/cli.
- Configuration: do we want the site-wide flag to disable it? (Could check flag & reject in controller.)
Yeah, probably default to disabled, I think. Restarting controller is the only thing that would be needed when the value is changed?
Yes. I figure we should always run the gateway server in crunch-run regardless of config, so it can be enabled for an already-running container by changing config & restarting controller.
- Do we want a user-accessible (per-container) flag to disable it?
We probably need something here, to make sure that people can still run "pure" containers. Would this be an arvados extension like the ones listed here: https://doc.arvados.org/v2.1/user/cwl/cwl-extensions.html ? Would it still be subject to the global flag? How do we want this to work?
I think we'll want the global config flag to be capable of completely disabling the feature, i.e., in order to work you'd need the global config and the per-container flag both set to "allow".
Updated by Tom Clegg almost 4 years ago
17170-container-shell @ 0945afa5523fb45f827750e4d1700df4ff222295 -- developer-run-tests: #2284
- configuration:
- Containers > ShellAccess > User: true = users can access their own containers (but not if shared with other users' CRs)
- Containers > ShellAccess > Admin: true = admins can access any container
- If feature is disabled, user-facing error message says so ("shell access is disabled in config")
- If config is changed, only need to restart arvados-controller
- per-container flag to indicate it has been used: "interactive_session_started"
- upgrade note re nginx conf: added
- test using real ssh client: added
- return more helpful error if target UUID is not a container/CR: done
- avoid 3 lines of irrelevant messages from SSH client when tunnel setup fails: now we start by setting up a throwaway tunnel, and exit early if that doesn't work. I expect this will catch the vast majority of error cases (permission, config, container not running, bad target uuid).
- handleSSH comment: updated
- return more helpful error if ARVADOS_* env vars are not set: done (turns out this wasn't the generic fix I had in mind)
- use native docker client instead of docker cli program (this is a little beyond trivial, and in practice the docker cli tool already exists on worker nodes because it comes with the docker daemon)
- different default user for singularity
- test using container with other users' CRs
- per-container flag to disable
- interim documentation for user-facing part
Updated by Ward Vandewege almost 4 years ago
Some more review comments. I've collected these error messages that seem a bit too cryptic for a user-facing tool:
$ ./arvados-client connect-ssh ce8i5-dz642-xrck78umjn2xeri
error setting up tunnel: server did not provide a tunnel: Path not found (req-1l9riyl6lyahu17kpyaj) (HTTP 404)
$ ./arvados-client connect-ssh ce8i5-dz642-xrck78umjn2xeri
error setting up tunnel: server did not provide a tunnel: container is running but gateway is not available (HTTP 503)
$ ./arvados-client connect-ssh ce8i5-dz642-5whvuvuqnqb88r6
error setting up tunnel: server did not provide a tunnel: gateway is not available, container is locked (HTTP 503)
Can we make these messages more meaningful?
On the contrary, I like this error message, it is clear:
$ ./arvados-client connect-ssh ce8i5-dz642-xrck78umjn2xeri
error setting up tunnel: server did not provide a tunnel: shell access is disabled in config (HTTP 503)
- About the `connect-ssh` subcommand for arvados-client: ss that expected to be useful in a more general way to an end user? If so, we should probably add some more information about that (how can it be used?). In any case, should we add a line like "You almost certainly don't want to use this command directly, use `$0 shell` instead" to the help text?
- lib/config/export.go line 124: why is there a "Containers.ShellAccess" variable here, in addition to the specific ones for Admin and User? I don't see it used anywhere.
Otherwise, LGTM. I tried this version of the code and it worked (on ce8i5), so that was great. I also like the code comments you added here and there, thanks for that.
Updated by Tom Clegg almost 4 years ago
17170-container-shell @ 77c8223f5ddd64cff2b08d0857749644c474946f -- developer-run-tests: #2288
(merged master)
Updated by Peter Amstutz almost 4 years ago
- Target version changed from 2021-02-03 Sprint to 2021-02-17 sprint
Updated by Tom Clegg almost 4 years ago
Ward Vandewege wrote:
error setting up tunnel: server did not provide a tunnel: Path not found (req-1l9riyl6lyahu17kpyaj) (HTTP 404)
I'm guessing this is the usual "permission denied = not found" thing (the container exists but you're not allowed to know it exists). I wonder if we should fix this back at the controller by having all of the "get/update/delete one object" methods return a message like "object with that UUID does not exist, or you lack permission to read it" with any 404. (I don't think it should be every client's job to explain that.)
This might also happen if controller is an old version that doesn't have the connect endpoint. Should we cater to that case?
$ ./arvados-client connect-ssh ce8i5-dz642-xrck78umjn2xeri
error setting up tunnel: server did not provide a tunnel: container is running but gateway is not available (HTTP 503)
Changed to "container is running but gateway is not available -- installation problem or feature not supported". I think this means either the container started before a-d-c was upgraded, or the cluster is using crunch-dispatch-slurm which doesn't support this.
$ ./arvados-client connect-ssh ce8i5-dz642-5whvuvuqnqb88r6
error setting up tunnel: server did not provide a tunnel: gateway is not available, container is locked (HTTP 503)
Changed to "container is not running yet (state is "Locked")" (if Queued/Locked)
...or "container has ended (state is "Cancelled")" (if Complete/Cancelled)
- About the `connect-ssh` subcommand for arvados-client: ss that expected to be useful in a more general way to an end user? If so, we should probably add some more information about that (how can it be used?). In any case, should we add a line like "You almost certainly don't want to use this command directly, use `$0 shell` instead" to the help text?
Good point. I don't think it's likely to be useful to anyone but the "shell" command. Added:
NOTE: You almost certainly don't want to use this command directly. It is meant to be used internally. Use "arvados-client shell" instead.
- lib/config/export.go line 124: why is there a "Containers.ShellAccess" variable here, in addition to the specific ones for Admin and User? I don't see it used anywhere.
That's not a real config key, it's just an oddity with the way we check the export whitelist: in order to whitelist Containers.ShellAccess.User, we need to also whitelist Containers and Containers.ShellAccess.
17170-container-shell @ d106841c5b3e10f40c903923ea3f6121985d5502 -- developer-run-tests: #2299
Updated by Ward Vandewege almost 4 years ago
Tom Clegg wrote:
Ward Vandewege wrote:
error setting up tunnel: server did not provide a tunnel: Path not found (req-1l9riyl6lyahu17kpyaj) (HTTP 404)
I'm guessing this is the usual "permission denied = not found" thing (the container exists but you're not allowed to know it exists). I wonder if we should fix this back at the controller by having all of the "get/update/delete one object" methods return a message like "object with that UUID does not exist, or you lack permission to read it" with any 404. (I don't think it should be every client's job to explain that.)
Agree, that would be great.
This might also happen if controller is an old version that doesn't have the connect endpoint. Should we cater to that case?
Nah.
$ ./arvados-client connect-ssh ce8i5-dz642-xrck78umjn2xeri
error setting up tunnel: server did not provide a tunnel: container is running but gateway is not available (HTTP 503)Changed to "container is running but gateway is not available -- installation problem or feature not supported". I think this means either the container started before a-d-c was upgraded, or the cluster is using crunch-dispatch-slurm which doesn't support this.
OK! Thanks.
$ ./arvados-client connect-ssh ce8i5-dz642-5whvuvuqnqb88r6
error setting up tunnel: server did not provide a tunnel: gateway is not available, container is locked (HTTP 503)Changed to "container is not running yet (state is "Locked")" (if Queued/Locked)
...or "container has ended (state is "Cancelled")" (if Complete/Cancelled)
Great.
- About the `connect-ssh` subcommand for arvados-client: ss that expected to be useful in a more general way to an end user? If so, we should probably add some more information about that (how can it be used?). In any case, should we add a line like "You almost certainly don't want to use this command directly, use `$0 shell` instead" to the help text?
Good point. I don't think it's likely to be useful to anyone but the "shell" command. Added:
[...]
Excellent, thanks.
- lib/config/export.go line 124: why is there a "Containers.ShellAccess" variable here, in addition to the specific ones for Admin and User? I don't see it used anywhere.
That's not a real config key, it's just an oddity with the way we check the export whitelist: in order to whitelist Containers.ShellAccess.User, we need to also whitelist Containers and Containers.ShellAccess.
Ah, thanks for explaining.
17170-container-shell @ d106841c5b3e10f40c903923ea3f6121985d5502 -- developer-run-tests: #2299
LGTM! Exciting.
Updated by Tom Clegg almost 4 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|f30c8ed35e3e1ad7cb3cb51fc6d83f56a04ae8de.
Updated by Nico César almost 4 years ago
- Related to Bug #17398: [crunch-dispatch-local] [crunch-run] error starting gateway server: missing port in address added