Bug #11590
closedjob reuse / re-use logic does not log decisions
100%
Description
It would be helpful if job reuse logic produced some log output to help understand why job output was or was not reused.
For example, we might have entries such as:
"no candidates for job reuse"
"no candidates for job reuse after filtering"
"job reuse candidates had inconsistent output, running a new job"
"job reuse candidate found, using existing output from xxxxx-xxxxx-xxxxxxxxxxx"
Ideally this might be included in crunch-dispatch logs (or even in some cases in job logs?), but as the decisions seem to be made by the API server, they could at least be logged there (i.e. in production.log).
Updated by Tom Clegg over 7 years ago
For crunch1 and crunch2: add a config flag "log_reuse_decisions". If true, in the "find reusable" code, build the ActiveRelation one where() at a time, and log a count() after each one.
For crunch1 only: add a "reuse_job_if_outputs_differ" flag, default false. If true, return "chosen" as soon as it's chosen, instead of waiting for a conflicting-output job.
Updated by Tom Clegg over 7 years ago
- Status changed from New to In Progress
Example logs
find_reusable: have 29 readable jobs in a reusable state find_reusable: have 2 candidates after filtering on script_parameters_digest 03a43a7d84f7fb022467b876c2950acd find_reusable: have 2 candidates after filtering on !nondeterministic find_reusable: have 2 candidates after filtering on repo, script, and custom filters [["repository", "=", "active/foo"], ["script", "=", "hash"], ["docker_image_locator", "=", nil]] find_reusable: job zzzzz-8i9sb-aceg2bnq7jt7kon output fa7aeb5140e2848d39b416daeef4ffc5+45 can be reused; continuing search in case others disagree find_reusable: job zzzzz-8i9sb-nj8ioxnrvjtyk2b output aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa+1 disagrees; forgetting about zzzzz-8i9sb-aceg2bnq7jt7kon and ignoring any other finished jobs (see reuse_job_if_outputs_differ in application.default.yml) find_reusable: done, nothing suitable
Updated by Tom Clegg over 7 years ago
# Include details about job reuse decisions in the server log. This
# causes additional database queries to run, so it should not be
# enabled unless you expect to examine the resulting logs for
# troubleshooting purposes.
log_reuse_decisions: false
# Control job reuse behavior when two completed jobs match the
# search criteria and have different outputs.
#
# If true, in case of a conflict, reuse the earliest job (this is
# similar to container reuse behavior).
#
# If false, in case of a conflict, do not reuse any completed job,
# but do reuse an already-running job if available (this is the
# original job reuse behavior, and is still the default).
reuse_job_if_outputs_differ: false
Updated by Tom Clegg over 7 years ago
11590-log-reuse @ fb5d279228a7adcafd858d9e137accd8010bb382
Updated by Radhika Chippada over 7 years ago
- reuselog => log_reuse_decision or log_reuse_plan?
- The following seems like it could be an extremely confusing log message. This is logging the total number of jobs that are Complete or Queued or Running (which could be very large number) without any consideration at all of any of the search criteria given, of which all of them can be completely unrelated. It doesn't seem like we should be logging this. Otherwise, please consider rewording something like "there are 'n' number of jobs readable that are either complete or queued or running, and will check if any of them can be candidates for reuse ..."
candidates = Job.readable_by(current_user).where( 'state = ? or (owner_uuid = ? and state in (?))', Job::Complete, current_user.uuid, [Job::Queued, Job::Running]) reuselog { "have #{candidates.count} readable jobs in a reusable state" }
- "job #{j.uuid} is reusable, but unfinished, so not better than #{incomplete_job.uuid}; ignoring" => another unfinished and reusable job #{j.uuid}, but not better than ... ?
- "job #{j.uuid} output #{j.output} ignored, see above" => "job #{j.uuid} WITH output ..."
- "job #{j.uuid} output #{j.output} disagrees ... " => "job #{j.uuid} WITH output ..."
- " ... continuing search in case others disagree" => continuing search in case conflicting jobs and outputs exist?
- "job #{j.uuid} output #{j.output} unavailable to user; now refusing to reuse any finished job" => this a little confusing. should we say due to the Rails.configuration.reuse_job_if_outputs_differ setting, we will not be reusing any of the completed jobs?
- "done, #{j.uuid} was selected" and "done, nothing suitable" => can be a bit more verbose. something like " done searching for a reusable job and found / not fount ..."
- Rails.configuration.reuse_job_if_outputs_differ => reuse_job_EVEN_WHEN_outputs_differ ?
Updated by Tom Clegg over 7 years ago
Updated log messages:
find_reusable: starting with 34 jobs readable by current user zzzzz-tpzed-xurymjxw79nv3jz find_reusable: have 29 candidates after filtering on job state (either state=Complete, or state=Queued/Running and submitted by current user) find_reusable: have 4 candidates after filtering on script_parameters_digest a5f03bbfb8ba88a2efe4a7852671605b find_reusable: have 3 candidates after filtering on !nondeterministic find_reusable: have 1 candidates after filtering on repo, script, and custom filters [["repository", "=", "active/foo"], ["script", "=", "hash"], ["script_version", "=", "4fe459abe02d9b365932b8f5dc419439ab4e2577"], ["docker_image_locator", "=", nil]] find_reusable: job zzzzz-8i9sb-cjs4pklxxjykqqq with output ea10d51bcf88862dbcc36eb292017dfd+45 can be reused; continuing search in case other candidates have different outputs find_reusable: done, zzzzz-8i9sb-cjs4pklxxjykqqq was selected
find_reusable: starting with 34 jobs readable by current user zzzzz-tpzed-xurymjxw79nv3jz find_reusable: have 29 candidates after filtering on job state (either state=Complete, or state=Queued/Running and submitted by current user) find_reusable: have 3 candidates after filtering on script_parameters_digest 174dd339d44f2b259fadbab7ebdb8df9 find_reusable: have 3 candidates after filtering on !nondeterministic find_reusable: have 2 candidates after filtering on repo, script, and custom filters [["repository", "=", "active/foo"], ["script", "=", "hash"], ["script_version", "=", "4fe459abe02d9b365932b8f5dc419439ab4e2577"], ["docker_image_locator", "=", nil]] find_reusable: job zzzzz-8i9sb-ahd7cie8jah9qui with output ea10d51bcf88862dbcc36eb292017dfd+45 can be reused; continuing search in case other candidates have different outputs find_reusable: job zzzzz-8i9sb-cjs4pklxxjykppp output disagrees; forgetting about zzzzz-8i9sb-ahd7cie8jah9qui and ignoring any other finished jobs (see reuse_job_if_outputs_differ in application.default.yml) find_reusable: done, nothing suitable
Updated by Radhika Chippada over 7 years ago
have 29 candidates after filtering on job state (either state=Complete, or state=Queued/Running and submitted by current user)
I am not sure if it is clear enough that the "submitted by current user" is applicable only to state=Queued/Running or state=Complete as well, without reading it multiple times. Can we say (either state=Complete or submitted by current user in state=Queued/Running to be clear?
LGTM
Updated by Tom Clegg over 7 years ago
updated to
"have #{candidates.count} candidates after filtering on job state ((state=Complete) or (state=Queued/Running and (submitted by current user)))"
Updated by Tom Clegg over 7 years ago
Container reuse logging samples
find_reusable: starting with 14 container records in database find_reusable: have 14 candidates after filtering on command ["echo", "hello"] find_reusable: have 14 candidates after filtering on cwd "test" find_reusable: have 2 candidates after filtering on environment {"var"=>"complete"} find_reusable: have 2 candidates after filtering on output_path "test" find_reusable: have 2 candidates after filtering on container_image "9ae44d5792468c58bcf85ce7353c7027+124" find_reusable: have 2 candidates after filtering on mounts {"test"=>{"kind"=>"json"}} find_reusable: have 2 candidates after filtering on runtime_constraints {"keep_cache_ram"=>268435456, "ram"=>12000000000, "vcpus"=>4} find_reusable: have 2 with state=Complete, exit_code=0 find_reusable: have 2 with log readable by current user zzzzz-tpzed-000000000000000 find_reusable: have 2 with output readable by current user zzzzz-tpzed-000000000000000 find_reusable: done, reusing completed container zzzzz-dz642-cwpvv44a5ys2pkj find_reusable: starting with 14 container records in database find_reusable: have 14 candidates after filtering on command ["echo", "hello"] find_reusable: have 14 candidates after filtering on cwd "test" find_reusable: have 2 candidates after filtering on environment {"var"=>"failed_vs_running"} find_reusable: have 2 candidates after filtering on output_path "test" find_reusable: have 2 candidates after filtering on container_image "9ae44d5792468c58bcf85ce7353c7027+124" find_reusable: have 2 candidates after filtering on mounts {"test"=>{"kind"=>"json"}} find_reusable: have 2 candidates after filtering on runtime_constraints {"ram"=>12000000000, "vcpus"=>4} find_reusable: have 0 with state=Complete, exit_code=0 find_reusable: have 0 with log readable by current user zzzzz-tpzed-000000000000000 find_reusable: have 0 with output readable by current user zzzzz-tpzed-000000000000000 find_reusable: done, reusing container zzzzz-dz642-dl6lxlw5l3m61z3 with state=Running find_reusable: starting with 12 container records in database find_reusable: have 0 candidates after filtering on command ["echo", "foo"] find_reusable: have 0 candidates after filtering on cwd "." find_reusable: have 0 candidates after filtering on environment {} find_reusable: have 0 candidates after filtering on output_path "/tmp" find_reusable: have 0 candidates after filtering on container_image "fa3c1a9cb6783f85f2ecda037e07b8c3+167" find_reusable: have 0 candidates after filtering on mounts {} find_reusable: have 0 candidates after filtering on runtime_constraints {"keep_cache_ram"=>268435456, "vcpus"=>1, "ram"=>1} find_reusable: have 0 with state=Complete, exit_code=0 find_reusable: have 0 with log readable by current user zzzzz-tpzed-000000000000000 find_reusable: have 0 with output readable by current user zzzzz-tpzed-000000000000000 find_reusable: have no containers in Running state find_reusable: have no containers in Locked or Queued state find_reusable: done, no reusable container found
11590-log-reuse @ d95ccc6bbbf94e957b939ca9d9aea71cda345944
Updated by Radhika Chippada over 7 years ago
- Do we want to log the entire command and mounts? Would the log messages become too big and unreadable? Can we instead print first few lines of these attributes?
- “find_reusable: have 2 with log readable by current user zzzzz-tpzed-000000000000000” => here is the current user going to be an actual end user (not the system user)?
- "have #{candidates.count} candidates after filtering on job state ((state=Complete) or (state=Queued/Running and (submitted by current user)))" => "have #{candidates.count} candidates after filtering on job state ((state = Complete) or (state = Queued/Running that are submitted by #{current_user.uuid}
(submitted by current user)))” might be easier to read and understand
LGTM
Updated by Tom Clegg over 7 years ago
Radhika Chippada wrote:
- Do we want to log the entire command and mounts? Would the log messages become too big and unreadable? Can we instead print first few lines of these attributes?
Punting on this -- this logging is only enabled while debugging, so limiting detail might do as much harm as good.
- “find_reusable: have 2 with log readable by current user zzzzz-tpzed-000000000000000” => here is the current user going to be an actual end user (not the system user)?
Updated this message to reflect the fact that permissions don't come into it (current_user is always system_user at this point).
Updated by Tom Clegg over 7 years ago
- Status changed from In Progress to Resolved
- % Done changed from 67 to 100
Applied in changeset arvados|commit:a9b15a2f041744c8fac8c0b12fa8b61d1a7292f8.