Bug #4882
closed
[Crunch] crunchstat reports surprising CPU usage when container appears and disappears
Added by Bryan Cosca about 10 years ago.
Updated over 9 years ago.
Estimated time:
(Total: 0.00 h)
Description
Background¶
source:services/crunchstat/crunchstat.go has a list of locations where cgroup accounting files are likely to be found. Historically this was needed because the location varies from one kernel/system to another. Each time crunchstat reads stats, it tries each location in turn until it finds one that exists. When the target container is up, this works fine, but the last two locations give accounting information for the host instead of the container. Often, crunchstat tries to read accounting data before the container has come up or after the container has been destroyed (in the interval before it gets notified that the docker process has exited). Naturally, the transition between host stats and container stats results in reporting wild deltas.
The transition from one stat file to another causes an "error" log message which is easy to misinterpret as an error affecting the job, or explaining some later job failure -- see #5523.
Proposed fix¶
One or more of:
- Don't report host-level stats if container-level stats have ever been reported.
- Don't report host-level stats ever.
- When the previous sample point and the current sample point come from different accounting files, don't print a delta.
Bug report¶
Originally reported as "Log graph goes out of bounds":
about 5/6th of the way through, you can see the yellow point go off the screen. Suggestion: maybe the graph should scale in real time... Hmm, I'm not sure because then the graph would keep scaling and would probably get annoying. It would still be useful to at least see that point though.
Files
Related issues
1 (1 open — 0 closed)
- Target version set to Bug Triage
- Subject changed from Log graph goes out of bounds to [Workbench] Log graph goes out of bounds
- Category set to Workbench
The job in question is qr1hi-8i9sb-ajgkcn4cuk4cntm. At 2014-12-29_21:48:34, where previous CPU user values were 10 or below, suddenly it skyrocketed to over 4000. After that, samples were below 1 for the remainder of the job.
crunchstat apparently didn't miss a sample. The logs show them being reported every 10 seconds consistently.
I wouldn't rule out the possibility that this is a bug in the underlying data—either in crunchstat, or the records it reads.
- Subject changed from [Workbench] Log graph goes out of bounds to [Workbench] crunchstat reports surprising CPU usage when container appears and disappears
- Description updated (diff)
- Category changed from Workbench to Crunch
- Target version changed from Bug Triage to Arvados Future Sprints
- Subject changed from [Workbench] crunchstat reports surprising CPU usage when container appears and disappears to [Crunch] crunchstat reports surprising CPU usage when container appears and disappears
- Description updated (diff)
- Description updated (diff)
- Target version changed from Arvados Future Sprints to 2015-05-20 sprint
- Status changed from New to In Progress
It seems likely that there will be spurious "stats not available" messages during container startup and teardown, it would be nice to suppress those since that is an expected condition. What do you think about adding a counter or timer so that it only prints the notice after several consecutive failed attempts to read the stats?
- Assigned To set to Tom Clegg
Peter Amstutz wrote:
It seems likely that there will be spurious "stats not available" messages during container startup and teardown, it would be nice to suppress those since that is an expected condition. What do you think about adding a counter or timer so that it only prints the notice after several consecutive failed attempts to read the stats?
That seems reasonable. "Several" should probably be two: neither of those races lasts more than a few milliseconds unless something's wrong. It would be ideal to handle the startup race by waiting a short time and repolling, to avoid missing the first interval of data. Added this to #5523 ("verbose logs sound too scary") rather than blocking this ("stats are incorrect").
- Status changed from In Progress to Resolved
Applied in changeset arvados|commit:31d8ece649c63b008ed79a930a5b237ec795ff22.
Also available in: Atom
PDF