Bug #4889
Status: Closed
[Crunch] Logging with a progress bar deadlocks on output
Added by Bryan Cosca about 10 years ago. Updated about 10 years ago.
% Done: 100%
Description
I ran the cufflinks tool (http://cole-trapnell-lab.github.io/cufflinks/manual/), which logs its own progress by drawing a loading bar and updating it on the order of milliseconds(?). I ran this job: qr1hi-8i9sb-ajgkcn4cuk4cntm, and it last logged just before the loading bar would have appeared. The job then ran indefinitely and I had to cancel it. (The same job ran in 20 minutes here: qr1hi-8i9sb-pmu3ffh19dny625, using a --quiet flag.)
Possible solution: use Popen.communicate() instead of subprocess.call. From https://docs.python.org/2/library/subprocess.html#subprocess.call: "Do not use stdout=PIPE or stderr=PIPE with this function as that can deadlock based on the child process output volume. Use Popen with the communicate() method when you need pipes."
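A minimal sketch of the suggested pattern (the cufflinks arguments are placeholders, not the job's real command line):

import subprocess

# Deadlock-prone: subprocess.call() with a PIPE never drains the pipe
# while the child runs, so a chatty child blocks once the buffer fills:
#   subprocess.call(['cufflinks', 'reads.bam'], stderr=subprocess.PIPE)

# Suggested alternative: Popen plus communicate(), which keeps reading
# stdout and stderr until the child exits.
pipe = subprocess.Popen(['cufflinks', 'reads.bam'],
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
(stdoutdata, stderrdata) = pipe.communicate()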
Files
logging_addendum_2.gif (247 KB), Bryan Cosca, 12/30/2014 04:40 PM
logging_addendum_3.gif (245 KB), Bryan Cosca, 12/30/2014 04:47 PM
Updated by Bryan Cosca about 10 years ago
- File logging_addendum_3.gif added
- Description updated (diff)
Updated by Bryan Cosca about 10 years ago
- Subject changed from Logging with a progress bar to Cufflinks: Logging with a progress bar
Updated by Tim Pierce about 10 years ago
- Subject changed from Cufflinks: Logging with a progress bar to [Crunch] Logging with a progress bar deadlocks on output
- Description updated (diff)
- Category set to Crunch
- Target version set to Bug Triage
Updated by Tim Pierce about 10 years ago
- Target version changed from Bug Triage to 2015-01-28 Sprint
Updated by Tim Pierce about 10 years ago
- Assigned To set to Tim Pierce
- Story points set to 1.0
Updated by Tim Pierce about 10 years ago
So subprocess.Popen.communicate() has the interesting effect that it's really only intended to communicate with a single subprocess, e.g.

pipe = subprocess.Popen(['/bin/foo', '-a', '--lots-of-output'], stdin=subprocess.PIPE)
(stdoutdata, stderrdata) = pipe.communicate()

The communicate() method sends whatever input data is necessary to the subprocess over the pipe and then collects output and error data, returning only when the process has terminated. If the output of this pipe goes to another pipe, you're out of luck: communicate() doesn't really know how to manage multiple pipes at once. See http://stackoverflow.com/questions/6341451/piping-together-several-subprocesses
(However, if the first process in the pipeline doesn't require anything on standard input, we may be able to get away with just calling .communicate() on the last process in the pipeline. Let's try; a sketch follows.)
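A sketch of that idea, with hypothetical commands produce and consume standing in for the real pipeline stages:

import subprocess

# Two-stage pipeline: only the last process is driven with communicate().
p1 = subprocess.Popen(['produce', '--lots-of-output'], stdout=subprocess.PIPE)
p2 = subprocess.Popen(['consume'], stdin=p1.stdout, stdout=subprocess.PIPE)
p1.stdout.close()                            # so p1 gets SIGPIPE if p2 exits early
(stdoutdata, stderrdata) = p2.communicate()  # drains p2 until the pipeline ends
p1.wait()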
One of the answers to that Stack Overflow question suggests using https://docs.python.org/2/library/pipes.html, which includes this reassuring note:

Because the module uses /bin/sh command lines, a POSIX or compatible shell for os.system() and os.popen() is required.
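For the record, usage looks like this (adapted from the example in the Python 2 pipes documentation, not from our code):

import pipes

t = pipes.Template()
t.append('tr a-z A-Z', '--')    # '--': this stage reads stdin, writes stdout
f = t.open('pipefile', 'w')     # writes here are piped through the template
f.write('hello world')
f.close()
print(open('pipefile').read())  # 'HELLO WORLD'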
Updated by Tom Clegg about 10 years ago
Hm, are we using communicate() someplace where this progress bar could be going? source:crunch_scripts/run-command#L350 seems to be using Popen already. And if we were running commands with communicate(), we wouldn't be accustomed to seeing their stderr until they exited, right?
The crunchstat messages show CPU usage dropping off suddenly at about 22 minutes in, followed by the disappearance of the task's docker container ("reading stats from /sys/fs/cgroup/memory/memory.stat"). Is it possible that docker died of line buffering? Or that docker died of unrelated causes, and re-running the job with the progress bar would have worked just as well? It also seems possible it's crunchstat's fault (see CopyPipeToChan), but the container disappearance seems like a clue worth investigating. "Lots of stderr with no \n" seems like an easy crunch script to write & add to our Pathological Scripts repository.
In the meantime, another workaround suggestion: cufflinks has a --verbose flag, for "log-friendly verbose processing". Possibly more appropriate than --quiet here.
Updated by Tim Pierce about 10 years ago
Have not been able to reproduce this problem by running a script generating 70MB of output without a newline as a local job:
#! /usr/bin/env python
# Test #4889: "Logging with a progress bar deadlocks on output"
import sys
for i in range(0,70000000):
    sys.stdout.write('x')
... run with this job definition:
{ "repository": "", "script": "test_4889.py", "script_version": "/home/twp/test4889", "script_parameters": {}, "runtime_constraints": { "docker_image": "bcosc/lobster" } }
When this script is run as a local job, the job terminates within a few minutes (with a "failed" status) and does not deadlock.
I have attempted to rerun the original failing job on qr1hi, but reproducing that will require fixing #5006 first.
Updated by Tim Pierce about 10 years ago
The original job continues to deadlock in production, e.g. qr1hi-8i9sb-6ja93cyxd7mjm1r
I will try reproducing with a garbage-generating dummy script run on a compute node and not locally.
Updated by Tim Pierce about 10 years ago
Job qr1hi-8i9sb-7b7seh3jgk8zoe1 doesn't deadlock either (Python script generating 16MiB of output, no newlines, in the same Docker image as used for the original job)
Updated by Tim Pierce about 10 years ago
- Status changed from New to In Progress
Updated by Tim Pierce about 10 years ago
Successfully reproduced in qr1hi-d1hrv-dftjw06d6rkjc2w. The problem is that the process deadlocks on the stderr buffer, not on stdout:
(pysdk)hitchcock:/home/twp/twp-qr1hi/crunch_scripts% git show 57c73203:crunch_scripts/test_4889.py
#! /usr/bin/env python
# Test #4889: "Logging with a progress bar deadlocks on output"
#
# Theory: the crunch job deadlocks because it waits for the job
# to terminate before reading output, and certain jobs fill up
# the output line buffer by producing huge amounts of data
# without a newline.
import sys
for i in range(0,16000000):
    sys.stderr.write('x')
Updated by Tim Pierce about 10 years ago
And this looks likely to be the culprit: source:services/crunchstat/crunchstat.go@5870334b03237f04df6de35193f78b838f7ac7dc#L38

crunchstat collects stderr output from the container with CopyPipeToChan, which reads stderr data with a bufio.Scanner that defaults to reading records line-by-line.

The documentation for bufio.NewScanner claims that it will stop scanning upon encountering "a token too large for the buffer," which suggests that it should actually terminate in this case rather than hang. I'll see if I can reproduce on the command line, or whether I can eliminate the issue by using a bufio.Reader.
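If that theory holds, the mechanism should reproduce outside crunchstat entirely. A minimal Python sketch of the same failure mode (a reader that stops draining the pipe partway through, the way a scanner that hits an oversized token would):

import subprocess, sys

child = subprocess.Popen(
    [sys.executable, '-c',
     "import sys\nfor i in range(0,16000000): sys.stderr.write('x')"],
    stderr=subprocess.PIPE)
child.stderr.read(1024)  # read a little, then give up, like the scanner
child.wait()             # hangs: the child is blocked writing to a full pipe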
Updated by Tim Pierce about 10 years ago
Ready for review at 4c731f759.
It's a workaround, not a comprehensive fix: close the input pipe before returning, and if bufio.Scanner.Scan() stops with a non-nil error, report that to the log channel. Includes a unit test.
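The effect of closing the pipe can be modeled the same way (a Python sketch of the mechanism, not the Go code under review): once the read end is closed, the child's blocked write fails with EPIPE instead of waiting forever, so the wait returns:

import subprocess, sys

child = subprocess.Popen(
    [sys.executable, '-c',
     "import sys\nfor i in range(0,16000000): sys.stderr.write('x')"],
    stderr=subprocess.PIPE)
child.stderr.read(1024)
child.stderr.close()  # the workaround: close the pipe before waiting
child.wait()          # returns; the child dies on a broken pipe, no deadlock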
Updated by Tom Clegg about 10 years ago
This fixes the "stderr goes silent" problem and (probably?) kills the child by closing the pipe, which is definitely better than deadlocking the child.
I think the error message should be prefixed with "crunchstat: " like crunchstat's other logs, so it's more obvious (to someone who isn't already familiar with crunchstat internals) who's saying it. Suggest:

...("crunchstat: line-buffering error: %s", s.Err())
Other than that, lgtm
Updated by Anonymous about 10 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:38e27663cf656f0c9c443a2715f249afe39a8bfb.