Bug #12246
Closed
[Crunch] Better crunch-run error when command not found
100%
Description
If a container specifies a command that cannot be found, or a script whose #! interpreter cannot be found, the error is very cryptic. crunch-run should provide a better error message.
Updated by Tom Morris over 7 years ago
- Tracker changed from Story to Bug
- Target version set to 2017-09-27 Sprint
At a minimum, the error message should include a quoted and escaped version of the program that it attempted to run and did not find.
Updated by Tom Clegg over 7 years ago
- Category set to Crunch
- Status changed from New to In Progress
- Assigned To changed from Peter Amstutz to Tom Clegg
Updated by Tom Clegg over 7 years ago
When the command doesn't exist, the error message isn't bad:
$ arv container_request create --container-request '{"command":["foobar"],"container_image":"arvados/jobs","output_path":"/out","state":"Committed","runtime_constraints":{"vcpus":1,"ram":1000000},"priority":1,"mounts":{"/out":{"kind":"tmp","capacity":1000000}}}'
↓
2017-09-20T20:39:14.037657193Z exec: "foobar": executable file not found in $PATH
2017-09-20T20:39:15.067555258Z could not start container: Error response from daemon: Cannot start container 58099cd76c834f3dc2a4fb76c8028f049ae6d4fdf0ec373e1f2cfea030670c2d: [8] System error: exec: "foobar": executable file not found in $PATH
2017-09-20T20:39:15.067632751Z Cancelled
2017-09-20T20:39:15.581835Z Container 9tee4-dz642-oukm4tdpxpl67dx was cancelled
Updated by Tom Clegg over 7 years ago
Using a docker image with "#!/bin/durrgh" in /bin/fail, the message is more obscure:
$ arv container_request create --container-request '{"command":["/bin/fail"],"container_image":"fail","output_path":"/out","state":"Committed","runtime_constraints":{"vcpus":1,"ram":1000000},"priority":1,"mounts":{"/out":{"kind":"tmp","capacity":1000000}}}'
2017-09-20T20:50:25.549086754Z Starting Docker container id '41f26cbc43bcc1280f4323efb1830a394ba8660c9d1c2b564ba42bf7f7694845'
2017-09-20T20:50:29.091406142Z could not start container: Error response from daemon: Cannot start container 41f26cbc43bcc1280f4323efb1830a394ba8660c9d1c2b564ba42bf7f7694845: [8] System error: no such file or directory
2017-09-20T20:50:29.091468975Z Cancelled
2017-09-20T20:50:29.636949Z Container 9tee4-dz642-bi9yzjmqqlqrnj9 was cancelled
Also tried "#!/bin/durrgh\r\ndurrgh\r\n", with the same result.
Updated by Peter Amstutz over 7 years ago
Tom Clegg wrote:
So if container startup fails, we should make sure to report the command being invoked...
Maybe the most helpful thing to add to the "could not start container" error is a hint about a known cause of that error:
[...]
..., and if it is located on a keep mount, see about reading the first line of the file to report that path and check for the Windows newline issue.
I feel like predicting which file(s) will be executed by a given command array will be hard to get right (e.g., what will $PATH be inside the container?), and even getting it right sometimes might not be worth the trouble...
Here's my proposed behavior if startup fails (for any reason, since trying to sniff out exact error seems like an exercise in frustration):
- Report the first item in the command array
- If the first item in the command array is located on a keep mount, check whether its first two bytes are #!. If so, read the first line, report it, and check it for a Windows newline.
Updated by Tom Clegg over 7 years ago
The panic was a runc bug, fixed here: https://github.com/opencontainers/runc/pull/1117
Inside runc the panic was then converted to an error that includes a stack trace. So the effect of the runc fix is just to reduce the message from "error + stack trace" to just "error".
Updated by Tom Clegg over 7 years ago
I'm still not going to take apart $PATH and transform paths and symlinks to figure out what exec() would do in the container. That kind of fix will just come with its own bugs, etc.
Agree with note-2 and note-10 that reporting the first item in the command array seems helpful. When the command itself doesn't exist, the missing command is already mentioned (twice!) in the error message so adding it a third time doesn't seem compelling. But where bash gives a "bad interpreter" error mentioning the bad interpreter, docker is somewhat coy.
$ /tmp/bogus
-bash: /tmp/bogus: /bin/nooooo: bad interpreter: No such file or directory
$ docker run -it --rm 1b044b40475d /bin/bogus
standard_init_linux.go:178: exec user process caused "no such file or directory"
So in this case we add a hint.
fmt.Sprintf(" (perhaps command %q is missing, or has a missing #! interpreter, or was saved in DOS mode with cr-lf chars?)", runner.Container.Command[0])
12246-command-not-found @ deb14a7264ed4a07d154504991447c3be8413db7
Updated by Tom Clegg over 7 years ago
Just to clarify about the runc panic stack trace: it seems the stack trace is not a crash, it's just a verbose error message from docker. It does include the "no such file or directory" string, so if you use a pre-bugfixed docker, you'll benefit from this new "suggest checking #!" feature, although the suggestion will be a bit harder to see above the giant wall of stack trace.
Updated by Ward Vandewege over 7 years ago
- Subject changed from Better crunch-run error when command not found to [Crunch] Better crunch-run error when command not found
Updated by Peter Amstutz over 7 years ago
It runs together on a very long line, which makes it hard to read. Could the "advice" come after the error message on a separate line?
2017-09-27T16:45:00.811779771Z crunch-run Starting Docker container id '316d454ad4bf0864a2daaa0357201a2e27382158469c33010fb5c2900708500c'
2017-09-27T16:45:00.990318516Z stderr container_linux.go:247: starting container process caused "exec: \"/does/not/exists\": stat /does/not/exists: no such file or directory"
2017-09-27T16:45:01.136591766Z crunch-run could not start container (perhaps command "/does/not/exists" is missing, or has a missing #! interpreter, or was saved in DOS mode with cr-lf chars?): Error response from daemon: oci runtime error: container_linux.go:247: starting container process caused "exec: \"/does/not/exists\": stat /does/not/exists: no such file or directory"
2017-09-27T16:45:01.136607933Z crunch-run Cancelled
Updated by Anonymous over 7 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:91143ef549e065ebdfb0138a031fc1fbd65cb527.
Updated by Peter Amstutz over 7 years ago
12246-better-advice:
2017-09-27T18:17:07.782746483Z stderr container_linux.go:247: starting container process caused "exec: \"/does/not/exists\": stat /does/not/exists: no such file or directory"
2017-09-27T18:17:07.930777304Z crunch-run could not start container: Error response from daemon: oci runtime error: container_linux.go:247: starting container process caused "exec: \"/does/not/exists\": stat /does/not/exists: no such file or directory"
2017-09-27T18:17:07.930777304Z crunch-run Possible causes: command "/does/not/exists" is missing, the interpreter given in #! is missing, or script has Windows line endings.
2017-09-27T18:17:07.930798102Z crunch-run Cancelled