Bug #8164 (Closed)
Inconsistent CPU percentage units in job log graph
Description
CPU percentages in the job graphs appear to be broken for me when tasks use more than 100% of a single core. For example:
```
Task 23 cpu: 2.6% (300.0002 seconds 308.3100 user 5.0400 sys)
Task 23 net:eth0: 0
...
Task 23 cpu: 98.4% (300.0001 seconds 289.8500 user 5.4800 sys)
Task 23 net:eth0: 0
```
By my calculations, the latter percentage is correct: (289.85 + 5.48) / 300.0001 = 98.4%. But the former should be (308.31 + 5.04) / 300.0002 = 104.4%. In this particular job, most tasks spend most of their time at around 105% utilisation, and they appear as flat lines at around 2.5% on the job graph, except for occasional moments when they jump up near 100%. The fact that the Y-axis of the graph changes meaning from timepoint to timepoint and from data series to data series makes these graphs pretty much useless.
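To spell out the arithmetic behind those two samples (a small check of mine, not part of the logs; `cpuPercent` is just an illustrative name):

```javascript
// CPU utilisation as a percentage of one core: (user + sys) / elapsed.
function cpuPercent(user, sys, elapsed) {
  return 100 * (user + sys) / elapsed;
}

// The two samples from the log excerpt above:
console.log(cpuPercent(308.31, 5.04, 300.0002).toFixed(1)); // "104.4", shown as 2.6%
console.log(cpuPercent(289.85, 5.48, 300.0001).toFixed(1)); // "98.4", shown as 98.4%
```

So the sample above 100% is scaled down by the core count, while the sample below 100% is not, which is exactly the inconsistency visible in the graph.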
I think I have found the explanation for this (https://github.com/curoverse/arvados/blob/master/apps/workbench/app/assets/javascripts/job_log_graph.js#L44-53):
```
// special calculation for cpus
if( /-cpu$/.test(series) ) {
    // divide the stat by the number of cpus unless the time count is less than the interval length
    if( dsum.toFixed(1) > dt.toFixed(1) ) {
        var cpuCountMatch = intervalMatch[1].match(/(\d+) cpus/);
        if( cpuCountMatch ) {
            datum = datum / cpuCountMatch[1];
        }
    }
}
```
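A side note of mine, beyond the original report: `toFixed()` returns a string, so the `>` in that guard is a lexicographic comparison, not a numeric one. It happens to agree with the numeric comparison for the sample values here, but it breaks once the operands have different digit counts:

```javascript
// toFixed() returns strings, so ">" compares character-by-character.
console.log((313.4).toFixed(1) > (300.0).toFixed(1));  // true (agrees with 313.4 > 300.0)
console.log((1000.0).toFixed(1) > (300.0).toFixed(1)); // false: "1000.0" < "300.0" lexicographically
console.log(1000.0 > 300.0);                           // true numerically
```

So even the condition deciding *when* to divide is unreliable, which is another reason to drop it rather than tweak it.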
This appears to have been introduced in November 2014 by: https://github.com/curoverse/arvados/commit/7ab6b64c5fa3b958752ecb22751630b6e1016bd1
This change does not make sense to me and I think it should be reverted.
I think the y-axis of the job graph should either be in units of single-core utilisation (meaning for an n-core machine it will potentially range up to n*100%), or it should always be in units of percentage of all available cores (so that it stays in the 0-100% range).
In the future with crunch v2 there may be a third option, which would be to express it as a percentage of the cores actually allocated to the task, but at the moment that is probably not a good idea, although you could approximate it by taking (cores / max_tasks_per_node).
I would suggest the clearest option for end-users would be to not correct for the number of cores at all, but in that case you would need to allow the graph to rescale itself appropriately (i.e. you'd need to revert 7ab6b64c5fa3b958752ecb22751630b6e1016bd1 but also revert the behaviour introduced in https://github.com/curoverse/arvados/commit/e9ccda58ac1b7334cfeee8ab23dd37d9bf3f534d).
The easiest fix to implement would probably be to always divide by the number of cores, so that the percentage given is a percentage of all cores available on the node. That should be a simple matter of reverting 7ab6b64c5fa3b958752ecb22751630b6e1016bd1. However, I'd like to stress that for our cluster, in which every node has 40 cores, that provides a terrible graph-viewing experience: a set of tasks each running against ~1 core appear as indistinguishable lines that vary between 1.3-2.6% (to indicate 50-100% of one core utilised).
Updated by Joshua Randall about 10 years ago
- Subject changed from CPU percentage in job log graph to Inconsistent CPU percentage units in job log graph