Bug #11097

closed

[API] Reuse containers even when multiple matching containers exist with differing outputs

Added by Tom Clegg almost 8 years ago. Updated almost 8 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: API
Target version:
Start date: 02/13/2017
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: 0.5

Description

Background

Sometimes, running the same container twice on the same inputs can result in two successes with two different outputs. This can mean a number of things, including:
  • undetected failure in one or both cases, perhaps resulting in bogus output
  • both outputs are correct, but have non-meaningful differences (like an "output produced at {timestamp}" comment in an output file)

The second case is common in practice.

Currently, the API server disables the container re-use logic entirely when it detects that two re-use candidates produced different outputs. This causes the following undesirable pattern:
  1. Run container "X" as part of a workflow w1
  2. Re-use container "X" automatically in subsequent workflows w2..w5, saving time
  3. Run workflow w4 with re-use disabled, e.g., to get runtime stats or verify reproducibility -- this runs container "X1" which is identical to "X" but produces different (but still correct) output
  4. Run workflows w5..w9 with re-use enabled
  5. Oops, even when re-running workflow w5, container "X" is not eligible for reuse ever again, because "X1" exists.
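
The pattern above can be reproduced with a small standalone sketch. This is plain Ruby, not the Arvados model code: the `Candidate` struct, its fields, and `find_reusable_current` are hypothetical names, and the choice of which candidate to return when all outputs agree is an assumption made only for illustration.

```ruby
# Hypothetical stand-in for a completed container record (not the Arvados model).
Candidate = Struct.new(:uuid, :output, :finished_at)

# Sketch of the rule described above: refuse re-use unless every
# matching candidate produced the same output.
def find_reusable_current(candidates)
  distinct_outputs = candidates.map(&:output).uniq
  return nil if distinct_outputs.length != 1
  # Assumption: when all outputs agree, pick the oldest candidate.
  candidates.min_by(&:finished_at)
end

x  = Candidate.new("X",  "pdh-out1", 1)
x1 = Candidate.new("X1", "pdh-out2", 2)

only_x  = find_reusable_current([x])      # "X" alone is eligible for re-use
with_x1 = find_reusable_current([x, x1])  # once "X1" exists, nothing is eligible
```

In this sketch, the mere existence of "X1" with a different output makes the lookup return `nil`, which corresponds to step 5.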

Desired behavior

Use the oldest matching container whose output and log collections exist, aren't trashed, and are readable by the current user.
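
A minimal sketch of this selection rule, again in standalone Ruby rather than the actual API server code. The `Collection` and `Container` structs, their field names, and the helpers `available?` and `find_reusable` are all hypothetical; permissions are reduced to a simple list of user names.

```ruby
# Hypothetical stand-ins; field names are illustrative, not the Arvados schema.
Collection = Struct.new(:pdh, :trashed, :readable_by)
Container  = Struct.new(:uuid, :output, :log, :finished_at)

# True when a collection with this portable data hash exists, is not
# trashed, and is readable by the given user.
def available?(collections, pdh, user)
  collections.any? { |c| c.pdh == pdh && !c.trashed && c.readable_by.include?(user) }
end

# Oldest matching container whose output and log are both available.
# Returns nil when no container qualifies.
def find_reusable(containers, collections, user)
  containers
    .select { |c| available?(collections, c.output, user) && available?(collections, c.log, user) }
    .min_by(&:finished_at)
end

collections = [
  Collection.new("out1", false, ["alice"]),
  Collection.new("log1", false, ["alice"]),
  Collection.new("out2", false, ["alice"]),
  Collection.new("log2", false, ["alice"]),
]
containers = [
  Container.new("X1", "out2", "log2", 2),
  Container.new("X",  "out1", "log1", 1),
]

chosen = find_reusable(containers, collections, "alice")  # oldest candidate: "X"

# Trashing the output collection of "X" removes it from consideration.
collections[0].trashed = true
chosen_after_trash = find_reusable(containers, collections, "alice")  # now "X1"
```

Differing outputs no longer matter here: only availability and age decide which container is chosen, so a later "X1" does not displace "X" unless the output or log of "X" becomes unavailable.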

If we used the newest matching container, we would have the following problem:
  1. Run container X, producing out1
  2. Run workflows w1..w9 that reuse X and do a lot of downstream work on out1
  3. Re-run workflows w1..w9 → lots of reused containers
  4. Re-run container X1, producing out2
  5. Re-run workflows w1..w9 → arvados chooses X1 now, so all downstream work has to be redone

Using the oldest matching container fixes the problems given above, while admitting the converse problem:
  1. Run container "X"
  2. Notice that container "X" exited 0 but produced bogus output because of a bug in the container process or Arvados itself
  3. Run container again with re-use disabled: "X1" produces correct output
  4. Run a workflow that makes use of this container
  5. Oops, the workflow gets the bogus "X" output instead of the newer "X1" output

This is the lesser evil in that re-running the same container -- i.e., without fixing the underlying problem that allowed it to exit 0 with bogus output -- is not a viable solution anyway.

Implementation

Disable this check in source:services/api/app/models/container.rb

    if outputs.count.count != 1
      Rails.logger.debug("Found #{outputs.count.length} different outputs")

Subtasks 2 (0 open, 2 closed)

Task #11140: Update tests | Resolved | Tom Clegg | 02/13/2017
Task #11111: Review 11097-reuse-impure | Resolved | Radhika Chippada | 02/13/2017

Actions #1

Updated by Tom Clegg almost 8 years ago

  • Description updated (diff)
Actions #2

Updated by Tom Clegg almost 8 years ago

  • Description updated (diff)
Actions #3

Updated by Tom Clegg almost 8 years ago

  • Description updated (diff)
Actions #4

Updated by Tom Clegg almost 8 years ago

11097-reuse-impure @ 264ffa31bae106bb6c36643e13186289b6cd0e18

...fails a few tests -- but perhaps only because it changes the behavior as intended.

Actions #5

Updated by Tom Clegg almost 8 years ago

  • Status changed from New to In Progress
Actions #6

Updated by Tom Clegg almost 8 years ago

  • Target version set to 2017-02-15 sprint
Actions #7

Updated by Tom Clegg almost 8 years ago

  • Target version changed from 2017-02-15 sprint to Arvados Future Sprints
Actions #8

Updated by Tom Clegg almost 8 years ago

  • Target version changed from Arvados Future Sprints to 2017-03-01 sprint
Actions #9

Updated by Tom Clegg almost 8 years ago

  • Assigned To set to Tom Clegg
Actions #10

Updated by Tom Clegg almost 8 years ago

  • Description updated (diff)
Actions #11

Updated by Tom Clegg almost 8 years ago

Actions #12

Updated by Radhika Chippada almost 8 years ago

  • I think moving “select_readable_pdh” to the line above the declaration of “candidates” at line 85 would help improve readability since the rest of the clauses are building on "candidates"
  • We talked about potentially removing output or log on the oldest completed container, if it is not desirable that it be reused. However, it appears that the output or log on a container in completed state can no longer be updated. So how can this be done? Do you mean that either one of these be removed from keep? Do we need to add a blurb about this also in the above documentation?
Actions #13

Updated by Tom Clegg almost 8 years ago

Radhika Chippada wrote:

  • I think moving “select_readable_pdh” to the line above the declaration of “candidates” at line 85 would help improve readability since the rest of the clauses are building on "candidates"

Indeed, rearranged this.

Updated, thanks.

  • We talked about potentially removing output or log on the oldest completed container, if it is not desirable that it be reused. However, it appears that the output or log on a container in completed state can no longer be updated. So how can this be done? Do you mean that either one of these be removed from keep? Do we need to add a blurb about this also in the above documentation?

Yes, trashing the output or log collection would accomplish this. I added to the docs "...whose log and output collection are still available". Documenting the "poking re-use in the eye" procedure seems worthwhile too but it's more of a workflow trick than API documentation -- e.g., you could make use of that information even if you only use Workbench and don't know what an API is. Wiki?

802af81e13dd11a7f2d9796a2ada8faf3b722477

Actions #14

Updated by Radhika Chippada almost 8 years ago

Yes, trashing the output or log collection would accomplish this ... "poking re-use in the eye" procedure seems worthwhile too but it's more of a workflow trick than API documentation -- e.g., you could make use of that information even if you only use Workbench and don't know what an API is. Wiki?

I'd imagine someone would ask how to do this in no time. So, please add a note wherever you think appropriate. Thanks.

LGTM

Actions #15

Updated by Tom Clegg almost 8 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 50 to 100

Applied in changeset arvados|commit:0c529ed05805507b4d2c903b9587e9b61cec5ee6.
