Bug #6598
closed
[Crunch] Fix crunch-job's update_progress_stats post-5717
Added by Brett Smith over 9 years ago.
Updated over 9 years ago.
Estimated time:
(Total: 0.00 h)
Description
crunch-job's update_progress_stats function updates the job's tasks summary. It assumes that the number of running jobs is the total number of SLURM slots available to the job, minus the number of slots unused (because they're free or being held due to node failures). After #5717, this math is no longer accurate: when few tasks exist at a level, crunch-job may use only a limited number of slots at that level. The math counts the slots left out at that level as running jobs, but they're not.
Update the function to calculate a new "running" number based on a more accurate measure, like maybe scalar(keys(%proc)).
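To make the arithmetic concrete, here is a minimal sketch in Python (not crunch-job's actual Perl) contrasting the two ways of deriving "running". All function and variable names are hypothetical, and the scenario (8 allocated slots, a level limited to 1 slot) is an assumption chosen only to illustrate the miscount, not taken from the crunch-job source.

```python
# Illustrative sketch only: crunch-job is written in Perl, and every name
# below is hypothetical rather than taken from its source.

def running_from_slots(total_slots, free_slots, held_slots):
    """Slot-based estimate: any slot that is not free or held counts as running."""
    return total_slots - free_slots - held_slots

def running_from_procs(proc):
    """Process-based count: the number of tracked children,
    analogous to scalar(keys(%proc)) in Perl."""
    return len(proc)

# Assumed scenario: 8 slots are allocated to the job, but the current level
# has a single task, so only 1 slot is made available at this level.
total_slots = 8
free_slots = 1   # the one slot this level may use, not yet occupied
held_slots = 0
proc = {}        # no child process has been started yet

old_before = running_from_slots(total_slots, free_slots, held_slots)  # 7
new_before = running_from_procs(proc)                                 # 0

# The single task starts: its slot is no longer free, and one child is tracked.
proc[17941] = {"slot": 0}
free_slots = 0

old_after = running_from_slots(total_slots, free_slots, held_slots)   # 8
new_after = running_from_procs(proc)                                  # 1
```

Under these assumptions the slot-based estimate reports 7 and then 8 running tasks although at most one child exists, while counting tracked processes reports 0 and then 1.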
- Target version changed from 2015-08-19 sprint to 2015-08-05 sprint
- Assigned To set to Tom Clegg
- Status changed from New to In Progress
Tested a76d715 on 4xphq.
Before:
https://workbench.4xphq.arvadosapi.com/collections/58f9d7718475a24e87f73e413b1477d9+85/4xphq-8i9sb-04u5bq3yrmkvhrs.log.txt
2015-07-31_16:58:49 4xphq-8i9sb-04u5bq3yrmkvhrs 14689 start level 0 with 1 slots
2015-07-31_16:58:50 4xphq-8i9sb-04u5bq3yrmkvhrs 14689 status: 0 done, 7 running, 1 todo
2015-07-31_16:58:50 4xphq-8i9sb-04u5bq3yrmkvhrs 14689 0 job_task 4xphq-ot0gb-m6kfhcshuenxvir
2015-07-31_16:58:50 4xphq-8i9sb-04u5bq3yrmkvhrs 14689 0 child 17941 started on compute1.1
2015-07-31_16:58:50 4xphq-8i9sb-04u5bq3yrmkvhrs 14689 0 stderr starting: ['srun','--nodelist= .....
2015-07-31_16:58:51 4xphq-8i9sb-04u5bq3yrmkvhrs 14689 status: 0 done, 8 running, 0 todo
2015-07-31_16:58:51 4xphq-8i9sb-04u5bq3yrmkvhrs 14689 0 stderr Running [docker.io run .....
After:
https://workbench.4xphq.arvadosapi.com/collections/403a43f6261ca34a0a84d0dc6b153dea+85/4xphq-8i9sb-58u6wekhujgxur9.log.txt
2015-07-31_17:08:48 4xphq-8i9sb-58u6wekhujgxur9 7793 start level 0 with 1 slots
2015-07-31_17:08:49 4xphq-8i9sb-58u6wekhujgxur9 7793 status: 0 done, 0 running, 1 todo
2015-07-31_17:08:49 4xphq-8i9sb-58u6wekhujgxur9 7793 0 job_task 4xphq-ot0gb-z37h8tgwgggqo7p
2015-07-31_17:08:49 4xphq-8i9sb-58u6wekhujgxur9 7793 0 child 8189 started on compute1.1
2015-07-31_17:08:49 4xphq-8i9sb-58u6wekhujgxur9 7793 0 stderr starting: ['srun','--nodelist= ......
2015-07-31_17:08:49 4xphq-8i9sb-58u6wekhujgxur9 7793 status: 0 done, 1 running, 0 todo
2015-07-31_17:08:49 4xphq-8i9sb-58u6wekhujgxur9 7793 0 stderr Running [docker.io run .......
a76d715 is good to merge. Thank you.
- Status changed from In Progress to Resolved
- % Done changed from 50 to 100
Applied in changeset arvados|commit:6988f4d44d2f8f7fc4aa2c381334c44d3133cf31.