Bug #4889
Status: Closed
[Crunch] Logging with a progress bar deadlocks on output
Added by Bryan Cosca about 10 years ago. Updated about 10 years ago.
% Done: 100%
Description
I ran the cufflinks tool (http://cole-trapnell-lab.github.io/cufflinks/manual/), which logs its own progress by drawing a loading bar and updating it on the order of milliseconds(?). I ran this job: qr1hi-8i9sb-ajgkcn4cuk4cntm, and it last logged just before the loading bar would have appeared. The job then ran indefinitely and I had to cancel it. (The same job ran in 20 minutes here: qr1hi-8i9sb-pmu3ffh19dny625, using a --quiet flag.)
Possible solution: use Popen.communicate() instead of subprocess.call. From https://docs.python.org/2/library/subprocess.html#subprocess.call: "Do not use stdout=PIPE or stderr=PIPE with this function as that can deadlock based on the child process output volume. Use Popen with the communicate() method when you need pipes."
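A minimal sketch of the suggested pattern (the cufflinks arguments are placeholders, not the job's real command line):

import subprocess

# Deadlock-prone: subprocess.call() with a PIPE never drains the pipe
# while the child runs, so a chatty child blocks once the buffer fills:
#   subprocess.call(['cufflinks', 'reads.bam'], stderr=subprocess.PIPE)

# Suggested alternative: Popen plus communicate(), which keeps reading
# stdout and stderr until the child exits.
pipe = subprocess.Popen(['cufflinks', 'reads.bam'],
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
(stdoutdata, stderrdata) = pipe.communicate()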
Files
logging_addendum_2.gif (247 KB), Bryan Cosca, 12/30/2014 04:40 PM
logging_addendum_3.gif (245 KB), Bryan Cosca, 12/30/2014 04:47 PM
Updated by Bryan Cosca about 10 years ago
- File logging_addendum_3.gif added
- Description updated (diff)
Updated by Bryan Cosca about 10 years ago
- Subject changed from Logging with a progress bar to Cufflinks: Logging with a progress bar
Updated by Tim Pierce about 10 years ago
- Subject changed from Cufflinks: Logging with a progress bar to [Crunch] Logging with a progress bar deadlocks on output
- Description updated (diff)
- Category set to Crunch
- Target version set to Bug Triage
Updated by Tim Pierce about 10 years ago
- Target version changed from Bug Triage to 2015-01-28 Sprint
Updated by Tim Pierce about 10 years ago
- Assigned To set to Tim Pierce
- Story points set to 1.0
Updated by Tim Pierce about 10 years ago
So subprocess.Popen.communicate() has the interesting effect that it's really only intended to communicate with a single subprocess, e.g.

pipe = subprocess.Popen(['/bin/foo', '-a', '--lots-of-output'], stdin=subprocess.PIPE)
(stdoutdata, stderrdata) = pipe.communicate()

The communicate() method sends whatever input data is necessary to the subprocess over the pipe and then collects output and error data, returning only when the process has terminated. If the output of this pipe goes to another pipe, you're out of luck: communicate() doesn't really know how to manage multiple pipes at once. See http://stackoverflow.com/questions/6341451/piping-together-several-subprocesses
(However, if the first process in the pipeline doesn't require anything on standard input, we may be able to get away with just calling .communicate() on the last process in the pipeline. Let's try; a sketch follows.)
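A sketch of that idea, with hypothetical commands produce and consume standing in for the real pipeline stages:

import subprocess

# Two-stage pipeline: only the last process is driven with communicate().
p1 = subprocess.Popen(['produce', '--lots-of-output'], stdout=subprocess.PIPE)
p2 = subprocess.Popen(['consume'], stdin=p1.stdout, stdout=subprocess.PIPE)
p1.stdout.close()                            # so p1 gets SIGPIPE if p2 exits early
(stdoutdata, stderrdata) = p2.communicate()  # drains p2 until the pipeline ends
p1.wait()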
One of the answers to that Stack Overflow question suggests using https://docs.python.org/2/library/pipes.html, which includes this reassuring note:

Because the module uses /bin/sh command lines, a POSIX or compatible shell for os.system() and os.popen() is required.
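For the record, usage looks like this (adapted from the example in the Python 2 pipes documentation, not from our code):

import pipes

t = pipes.Template()
t.append('tr a-z A-Z', '--')    # '--': this stage reads stdin, writes stdout
f = t.open('pipefile', 'w')     # writes here are piped through the template
f.write('hello world')
f.close()
print(open('pipefile').read())  # 'HELLO WORLD'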
Updated by Tom Clegg about 10 years ago
Hm, are we using communicate() someplace where this progress bar could be going? source:crunch_scripts/run-command#L350 seems to be using Popen already. And if we were running commands with communicate(), we wouldn't be accustomed to seeing their stderr until they exited, right?
The crunchstat messages show CPU usage dropping off suddenly at about 22 minutes in, followed by the disappearance of the task's docker container ("reading stats from /sys/fs/cgroup/memory/memory.stat"). Is it possible that docker died of line buffering? Or that docker died of unrelated causes, and re-running the job with the progress bar would have worked just as well? It also seems possible it's crunchstat's fault (see CopyPipeToChan), but the container disappearance seems like a clue worth investigating. "Lots of stderr with no \n" seems like an easy crunch script to write & add to our Pathological Scripts repository.
In the meantime, another workaround suggestion: cufflinks has a --verbose flag, for "log-friendly verbose processing". Possibly more appropriate than --quiet here.
Updated by Tim Pierce about 10 years ago
Have not been able to reproduce this problem by running a script generating 70MB of output without a newline as a local job:
#! /usr/bin/env python
# Test #4889: "Logging with a progress bar deadlocks on output"
import sys
for i in range(0,70000000):
    sys.stdout.write('x')
... run with this job definition:
{ "repository": "", "script": "test_4889.py", "script_version": "/home/twp/test4889", "script_parameters": {}, "runtime_constraints": { "docker_image": "bcosc/lobster" } }
When this script is run as a local job, the job terminates within a few minutes (with a "failed" status) and does not deadlock.
I have attempted to rerun the original failing job on qr1hi, but reproducing that will require fixing #5006 first.
Updated by Tim Pierce about 10 years ago
The original job continues to deadlock in production, e.g. qr1hi-8i9sb-6ja93cyxd7mjm1r
I will try reproducing with a garbage-generating dummy script run on a compute node and not locally.
Updated by Tim Pierce about 10 years ago
Job qr1hi-8i9sb-7b7seh3jgk8zoe1 doesn't deadlock either (Python script generating 16MiB of output, no newlines, in the same Docker image as used for the original job)
Updated by Tim Pierce about 10 years ago
- Status changed from New to In Progress
Updated by Tim Pierce about 10 years ago
Successfully reproduced in qr1hi-d1hrv-dftjw06d6rkjc2w. The problem is that the process deadlocks on the stderr buffer, not on stdout:
(pysdk)hitchcock:/home/twp/twp-qr1hi/crunch_scripts% git show 57c73203:crunch_scripts/test_4889.py
#! /usr/bin/env python
# Test #4889: "Logging with a progress bar deadlocks on output"
#
# Theory: the crunch job deadlocks because it waits for the job
# to terminate before reading output, and certain jobs fill up
# the output line buffer by producing huge amounts of data
# without a newline.
import sys
for i in range(0,16000000):
    sys.stderr.write('x')
Updated by Tim Pierce about 10 years ago
And this looks likely to be the culprit: source:services/crunchstat/crunchstat.go@5870334b03237f04df6de35193f78b838f7ac7dc#L38

crunchstat collects stderr output from the container with CopyPipeToChan, which reads stderr data with a bufio.Scanner that defaults to reading records line-by-line.

The documentation for bufio.NewScanner claims that it will stop scanning upon encountering "a token too large for the buffer," which suggests that it should actually terminate in this case rather than hang. I'll see if I can reproduce on the command line, or whether I can eliminate the issue by using a bufio.Reader.
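If that theory holds, the mechanism should reproduce outside crunchstat entirely. A minimal Python sketch of the same failure mode (a reader that stops draining the pipe partway through, the way a scanner that hits an oversized token would):

import subprocess, sys

child = subprocess.Popen(
    [sys.executable, '-c',
     "import sys\nfor i in range(0,16000000): sys.stderr.write('x')"],
    stderr=subprocess.PIPE)
child.stderr.read(1024)  # read a little, then give up, like the scanner
child.wait()             # hangs: the child is blocked writing to a full pipe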
Updated by Tim Pierce about 10 years ago
Ready for review at 4c731f759.
It's a workaround, not a comprehensive fix: close the input pipe before returning, and if bufio.Scanner.Scan() stops with a non-nil error, report that to the log channel. Includes a unit test.
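The effect of closing the pipe can be modeled the same way (a Python sketch of the mechanism, not the Go code under review): once the read end is closed, the child's blocked write fails with EPIPE instead of waiting forever, so the wait returns:

import subprocess, sys

child = subprocess.Popen(
    [sys.executable, '-c',
     "import sys\nfor i in range(0,16000000): sys.stderr.write('x')"],
    stderr=subprocess.PIPE)
child.stderr.read(1024)
child.stderr.close()  # the workaround: close the pipe before waiting
child.wait()          # returns; the child dies on a broken pipe, no deadlock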
Updated by Tom Clegg about 10 years ago
This fixes the "stderr goes silent" problem and (probably?) kills the child by closing the pipe, which is definitely better than deadlocking the child.
I think the error message should be prefixed with "crunchstat: " like crunchstat's other logs, so it's more obvious (to someone who isn't already familiar with crunchstat internals) who's saying it. Suggest:

...("crunchstat: line-buffering error: %s", s.Err())
Other than that, lgtm
Updated by Anonymous about 10 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:38e27663cf656f0c9c443a2715f249afe39a8bfb.