Bug #7263

closed

[Crunch] crunch-job does not cancel jobs consistently

Added by Peter Amstutz over 9 years ago. Updated almost 9 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Start date: 01/23/2016
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: 0.5

Description

Original bug report

c97qk-8i9sb-rjmjzrz54gf61os on c97qk was cancelled, but crunch-job doesn't notice and lets the job continue running.

We also saw this recently with a job on su92l.

Diagnosis I

The main loop of crunch-job has this code:

    my $gotsome
        = readfrompipes ()
        + reapchildren ();
    if (!$gotsome)
    {
      check_refresh_wanted();
      check_squeue();
      update_progress_stats();
      select (undef, undef, undef, 0.1);
    }

check_refresh_wanted is the function that notices when the job has been cancelled, so that the child processes get signaled accordingly. It only runs when $gotsome is false. If the cancel comes in while we're in this loop, we're unlikely to notice it as long as we keep receiving data from the child processes, which seems pretty likely now that we have crunchstat in the mix.

This diagnosis is not a 100% sure thing, because the pipes we're reading from are opened in nonblocking mode. So they would have to be very busy to keep avoiding this branch. But it's at least still possible, particularly since this problem seems to strike jobs that have many parallel tasks.
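To make the nonblocking caveat concrete, here is a minimal standalone sketch (not taken from crunch-job). It shows that a nonblocking sysread returns immediately with EAGAIN when nothing is buffered, so the loop only counts progress when data is actually waiting. The helper name try_read and the pipe set up here are hypothetical and exist only for illustration.

```perl
use strict;
use warnings;
use IO::Handle;
use Errno qw(EAGAIN EWOULDBLOCK);

# Hypothetical helper: one nonblocking read attempt.
# Returns the number of bytes read, 0 at EOF, or -1 if no data is waiting.
sub try_read {
    my ($fh) = @_;
    my $n = sysread($fh, my $buf, 8192);
    return $n if defined $n;
    return -1 if $! == EAGAIN || $! == EWOULDBLOCK;
    die "sysread failed: $!";
}

pipe(my $reader, my $writer) or die "pipe: $!";
$reader->blocking(0);            # same mode the ticket says the pipes use

my $idle = try_read($reader);    # nothing written yet
syswrite($writer, "hello") or die "syswrite: $!";
my $busy = try_read($reader);    # five bytes are now waiting
print "idle=$idle busy=$busy\n";
```

An idle pipe therefore lets the loop fall through to the housekeeping branch; only a pipe that has data on every single pass can keep that branch from running.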

Diagnosis II

See Fix II below.

Fix I

Suggest checking refresh_trigger (and the other things) every ~two seconds, even during times when readfrompipes() and reapchildren() are keeping us 100% busy.

-    if (!$gotsome) {
+    if (!$gotsome || $latest_refresh + 2 < scalar time) {

After this is merged and deployed, test whether you can successfully cancel massively parallel jobs. If it works consistently, I think we can mark this ticket resolved. Otherwise, we'll need to do another pass on this.
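The effect of the changed condition can be sketched with a small simulation. This is only a model under stated assumptions: the clock is faked so the run is deterministic, the pipes are assumed to be saturated on every pass, and $latest_refresh is assumed to be updated whenever the housekeeping branch runs. The function count_housekeeping and its parameters are hypothetical; only $gotsome and $latest_refresh come from the ticket.

```perl
use strict;
use warnings;

# Simulate a main loop in which the children always have output waiting.
# $use_timer selects the proposed condition; without it, the original
# "if (!$gotsome)" condition is modelled.
sub count_housekeeping {
    my ($use_timer, $iterations, $secs_per_iter) = @_;
    my $now = 0;                 # fake clock standing in for "scalar time"
    my $latest_refresh = 0;      # assumed to be set when housekeeping runs
    my $runs = 0;
    for (1 .. $iterations) {
        my $gotsome = 1;         # pipes are saturated on every pass
        $now += $secs_per_iter;
        if (!$gotsome || ($use_timer && $latest_refresh + 2 < $now)) {
            $runs++;             # check_refresh_wanted() etc. would run here
            $latest_refresh = $now;
        }
    }
    return $runs;
}

my $before = count_housekeeping(0, 30, 1);
my $after  = count_housekeeping(1, 30, 1);
print "housekeeping runs: before=$before after=$after\n";
```

In this model the original condition never reaches the housekeeping branch, while the timed condition reaches it at a steady rate regardless of how busy the pipes are.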

Fix II

In source:sdk/cli/bin/crunch-job, do not stay in this loop forever:

    while (0 < sysread ($reader{$job}, $buf, 8192))
    {
      print STDERR $buf if $ENV{CRUNCH_DEBUG};
      $jobstep[$job]->{stderr_at} = time;
      $jobstep[$job]->{stderr} .= $buf;
      preprocess_stderr ($job);
      if (length ($jobstep[$job]->{stderr}) > 16384)
      {
        substr ($jobstep[$job]->{stderr}, 0, 8192) = "";
      }
      $gotsome = 1;
    }

  • Consider raising 8192 -- perhaps 65536.
  • Consider changing that "while" to an "if".
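One way to picture the "do not stay in this loop forever" idea is a read that gives up after a fixed number of chunks per pass. The following is a hedged standalone sketch, not the change that was merged: read_bounded, its parameters, and the small chunk sizes are hypothetical and chosen only to keep the demonstration short. With a limit of one chunk it behaves like the "if" variant suggested above.

```perl
use strict;
use warnings;

# Hypothetical helper: read at most $max_chunks chunks of $chunk_size bytes
# from $fh, then return so the caller can get on with its other work.
# Returns the bytes read during this pass ("" at EOF, on error, or if idle).
sub read_bounded {
    my ($fh, $max_chunks, $chunk_size) = @_;
    my $data = "";
    for (1 .. $max_chunks) {
        my $n = sysread($fh, my $buf, $chunk_size);
        last unless $n;          # undef (error or no data) or 0 (EOF)
        $data .= $buf;
    }
    return $data;
}

pipe(my $reader, my $writer) or die "pipe: $!";
syswrite($writer, "x" x 4096) or die "syswrite: $!";
close($writer);

my @passes;
while (length(my $chunk = read_bounded($reader, 3, 512))) {
    push @passes, length $chunk; # cancel checks could run between passes
}
print "bytes per pass: @passes\n";
```

The point of the design is that the amount of work done per pass has an upper bound, so a chatty child cannot hold the main loop inside the read indefinitely.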

Subtasks: 2 (0 open, 2 closed)

Task #8261: Review 7263-better-busy-behavior (Resolved, Peter Amstutz, 01/23/2016)
Task #8459: Review last commit 0f7709af on wtsi-hgi:hgi/7263-even-better-busy-behavior (Resolved, Tom Clegg, 02/17/2016)
