Story #7901
[Crunch] Script to report maximum resource utilization from a job log
Status: Closed (100% done)
Description
Use case: a bioinformatician is cost-optimizing a functional pipeline. They want to know whether the individual jobs in the pipeline are making efficient use of the compute node(s) available to them.
Provide a script in source:tools/crunchstat-summary/bin/crunchstat-summary to help answer that question. It should take a job UUID and/or job log filename as an argument, parse the crunchstat lines in the corresponding log, and report the maximum point-in-time reading for each of the following resources:
- CPU used
  - Report a single top-style figure like "1600%" (meaning full utilization of 16 cores)
  - Don't bother with separate user/sys reporting; figuring out a good way to report those figures at maximum utilization will be a separate story if needed
- RAM used
- Swap used
- Net throughput for each net device mentioned
- IO throughput for each blk device mentioned
Essentially, print the maximum value of every statistic ever printed by crunchstat. The max value before the "-- interval" delimiter will indicate maximum total usage by any one task. The max value of stats that come after the "-- interval" delimiter will indicate maximum throughput / CPU load.
When selecting the max value of an interval/throughput value, note the actual interval too. For example:

    crunchstat: cpu 33.0700 user 3.9600 sys 8 cpus -- interval 10.0003 seconds 33.0700 user 3.9600 sys
    crunchstat: cpu 66.1500 user 7.9600 sys 8 cpus -- interval 11.0003 seconds 33.0800 user 4.0000 sys
    ---
    max cpu 66.1400 user 3.9600 sys 8 cpus
    max cpu interval 11.003 seconds 37.0800 user+sys = 337%
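To make that arithmetic concrete, here is a minimal Python sketch (an illustration, not the shipped crunchstat-summary code) of turning the interval portion of a crunchstat cpu line into a top-style percentage:

    # Compute a top-style CPU percentage from one interval reading.
    line = ('crunchstat: cpu 66.1500 user 7.9600 sys 8 cpus '
            '-- interval 11.0003 seconds 33.0800 user 4.0000 sys')
    interval = line.split('-- interval ')[1].split()
    seconds = float(interval[0])                          # 11.0003
    user, sys_ = float(interval[2]), float(interval[4])   # 33.0800, 4.0000
    print('{:.0f}%'.format(100 * (user + sys_) / seconds))  # -> 337%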
It's understood that the script can't report whether the underlying tool would run just as well with fewer resources. It's only reporting whether the tool is using the resources it has.
Some example crunchstat output:

    crunchstat: mem 57286656 cache 0 swap 27 pgmajfault 1510322176 rss
    crunchstat: cpu 33.0700 user 3.9600 sys 8 cpus -- interval 10.0003 seconds 29.8600 user 3.4900 sys
    crunchstat: blkio:202:16 0 write 3665920 read -- interval 10.0003 seconds 0 write 0 read
    crunchstat: net:eth0 3301 tx 153030 rx -- interval 10.0003 seconds 0 tx 0 rx
Simplifying restrictions
No need (yet) to publish a package. Do add enough setup.py stuff to run tests, though.
If the intervals vary a lot, the max figure reported won't necessarily be the max rate. But we don't expect intervals to vary enough to matter, so we'll live with the discrepancy.
Updated by Tom Clegg over 9 years ago
7901-crunchstat-summary @ 1d4f047 and arvados-dev|575f751
    $ crunchstat-summary --job 4xphq-8i9sb-jq0ekny1xou3zoh | expand -t12
    category    metric      max         max_rate
    net:eth0    rx          1754364530  41658344.87
    net:eth0    tx          38837956    920817.97
    mem         rss         349814784   -
    mem         cache       1678139392  -
    mem         swap        0           -
    mem         pgmajfault  0           -
    keepcalls   put         0           0.00
    keepcalls   get         0           0.00
    keepcache   miss        0           0.00
    keepcache   hit         0           0.00
    blkio:0:0   write       0           0.00
    blkio:0:0   read        0           0.00
    net:keep0   rx          0           0.00
    net:keep0   tx          0           0.00
    cpu         sys         1.92        0.04
    cpu         user        3.83        0.09
    cpu         cpus        8           -
    fuseops     write       0           0.00
    fuseops     read        0           0.00
Or, equivalently,
    $ zcat ~/arvados/tools/crunchstat-summary/tests/logfile_20151204190335.txt.gz | crunchstat-summary
or
    $ crunchstat-summary --log-file ~/arvados/tools/crunchstat-summary/tests/logfile_20151204190335.txt.gz
Updated by Tom Clegg over 9 years ago
another example
    $ crunchstat-summary --job qr1hi-8i9sb-qwr5epuobcre09m | expand -t12
    category    metric      max          max_rate
    mem         cache       15702978560  -
    mem         pgmajfault  315          -
    mem         rss         10732777472  -
    cpu         sys         152.38       0.11
    cpu         user        29305.28     2.55
    cpu         cpus        8            -
    net:eth0    rx          44485200     649806.51
    net:eth0    tx          6898753050   74613976.17
Updated by Brett Smith over 9 years ago
Reviewing 1d4f047.
I assume the answer is no, but just being extra-safe: I don't recognize the collection.py Crunch script from anywhere. Should we be concerned about the PDHs that appear in those logs, privacy-wise? If so, we might need to aggressively garbage collect the commit from the Git server. I double-checked that all the API tokens have been expired, and they have, so that's reassuring.
In Summarizer._logdata, you slipped into Ruby mode and raised bare strings a few times. That's a bug:
    >>> raise "foo"
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: exceptions must be old-style classes or derived from BaseException, not str
I suggest ValueError as the most Pythonic exception for both these cases, but I'm not wedded to it.
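For instance (function name and message text hypothetical), the fix looks like:

    # Raise an exception class instead of a bare string.
    def logdata(filename):
        if not filename.endswith('.txt.gz'):
            raise ValueError('expected a .txt.gz log file: {!r}'.format(filename))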
Nothing else is critical or would block a merge, but a few readability suggestions:
There are a few single-letter variables lurking around: j and c in Summarizer._logdata, m in Summarizer.run, and s in the test method. They're all innocuous now, but code grows…
Suggest str.endswith('.ext') over str[-4:] == '.ext'. This comes up in Summarizer.run, Summarizer._logdata, and the test method. Not writing the length yourself is DRYer.
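A quick illustration of the difference, using a filename from the test data above:

    name = 'logfile_20151204190335.txt.gz'
    assert name[-7:] == '.txt.gz'    # suffix length counted by hand
    assert name.endswith('.txt.gz')  # same check, no magic number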
Anything you can do with printf-style strings, you can also do with format strings, if you like: {!r} to write an object's repr, as in Summarizer.run; {:.2f} to specify precision for a float.
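For example (values hypothetical), both conversions in one format string:

    >>> '{!r} finished in {:.2f} seconds'.format('task0', 1.239)
    "'task0' finished in 1.24 seconds"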
There are some unused imports that could be cleaned up: itertools and os in summarizer.py, and fnmatch in the tests. (Did you want to import glob and make your loop for fnm in glob.glob(os.path.join(dirname, '*.txt.gz'))?)
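A self-contained version of that suggested loop (the directory name is hypothetical):

    import glob
    import os

    dirname = 'tests'  # hypothetical directory holding the *.txt.gz fixtures
    for fnm in glob.glob(os.path.join(dirname, '*.txt.gz')):
        print(fnm)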
Props for for val, stat in zip(words[::2], words[1::2]); that was a fun line of code to read.
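For readers following along, that line pairs each value with the stat name that follows it. A minimal sketch, assuming the "crunchstat: mem" prefix has already been stripped from the example line above:

    words = '57286656 cache 0 swap 27 pgmajfault 1510322176 rss'.split()
    for val, stat in zip(words[::2], words[1::2]):
        print('{} = {}'.format(stat, val))
    # cache = 57286656
    # swap = 0
    # pgmajfault = 27
    # rss = 1510322176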
Thanks.
Updated by Tom Clegg over 9 years ago
Brett Smith wrote:
> I assume the answer is no, but just being extra-safe: I don't recognize the collection.py Crunch script from anywhere. Should we be concerned about the PDHs that appear in those logs, privacy-wise? If so, we might need to aggressively garbage collect the commit from the Git server. I double-checked that all the API tokens have been expired, and they have, so that's reassuring.
This is from the Keep performance tests on 4xphq, from #7780. The input is LobSTR reference data. (Sorry, probably could have saved you some time by mentioning where those examples came from.)
> In Summarizer._logdata, you slipped into Ruby mode and raised bare strings a few times. That's a bug:
Oops, fixed. ValueError it is.
> There are a few single-letter variables lurking around: j and c in Summarizer._logdata, m in Summarizer.run, and s in the test method. They're all innocuous now, but code grows…
fixed → job, collection, summarizer.
> Suggest str.endswith('.ext') over str[-4:] == '.ext'. This comes up in Summarizer.run, Summarizer._logdata, and the test method. Not writing the length yourself is DRYer.
Ah yes, that's better.
> Anything you can do with printf-style strings, you can also do with format strings, if you like: {!r} to write an object's repr, as in Summarizer.run; {:.2f} to specify precision for a float.
Thanks. Updated these so we're just using format() everywhere.
> There are some unused imports that could be cleaned up: itertools and os in summarizer.py, and fnmatch in the tests. (Did you want to import glob and make your loop for fnm in glob.glob(os.path.join(dirname, '*.txt.gz'))?)
Yes! I started out hoping to use fnmatch and didn't go back and clean this up. glob() is better.
> Props for for val, stat in zip(words[::2], words[1::2]); that was a fun line of code to read.
I merely selected it from various approaches on Stack Exchange... but yes, I agree. :)
rebased, now at 40158e8
Updated by Tom Clegg over 9 years ago
7901-human-summary @ cf4c50a
    category    metric      task_max    task_max_rate  job_total
    blkio:0:0   read        0           0.00           0
    blkio:0:0   write       0           0.00           0
    cpu         cpus        8           -              -
    cpu         sys         1.92        0.04           1.92
    cpu         user        3.83        0.09           3.83
    cpu         user+sys    5.75        0.13           5.75
    fuseops     read        0           0.00           0
    fuseops     write       0           0.00           0
    keepcache   hit         0           0.00           0
    keepcache   miss        0           0.00           0
    keepcalls   get         0           0.00           0
    keepcalls   put         0           0.00           0
    mem         cache       1678139392  -              -
    mem         pgmajfault  0           -              0
    mem         rss         349814784   -              -
    mem         swap        0           -              -
    net:eth0    rx          1754364530  41658344.87    1754364530
    net:eth0    tx          38837956    920817.97      38837956
    net:eth0    tx+rx       1793202486  42579162.83    1793202486
    net:keep0   rx          0           0.00           0
    net:keep0   tx          0           0.00           0
    net:keep0   tx+rx       0           0.00           0
    time        elapsed     80          -              80
    # Max CPU time spent by a single task: 5.75s
    # Max CPU usage in a single interval: 13.00%
    # Overall CPU usage: 7.19%
    # Max memory used by a single task: 0.35GB
    # Max network traffic in a single task: 1.79GB
    # Max network speed in a single interval: 42.58MB/s
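As a sanity check, the bottom-line figures follow directly from the table; for example:

    print('{:.2f}%'.format(100 * 5.75 / 80))       # Overall CPU usage: 7.19%
    print('{:.2f}GB'.format(349814784 / 1e9))      # Max memory used: 0.35GB
    print('{:.2f}MB/s'.format(42579162.83 / 1e6))  # Max network speed: 42.58MB/s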
Updated by Brett Smith over 9 years ago
Reviewing 7901-human-summary @ cf4c50a.
This is good to merge. One thing that might improve readability is using collections.defaultdict for some of the deep dictionaries. This can save you from doing lots of key in dict and dict.setdefault guarding. I think the two variable declarations that would pay off most (with the right imports) are:
    task_stats = collections.defaultdict(functools.partial(collections.defaultdict, dict))
    job_tot = collections.defaultdict(functools.partial(collections.defaultdict, int))
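A small sketch of what that buys (stat names taken from the report above):

    import collections
    import functools

    task_stats = collections.defaultdict(functools.partial(collections.defaultdict, dict))
    job_tot = collections.defaultdict(functools.partial(collections.defaultdict, int))

    # Missing keys spring into existence on first access, so no
    # `key in dict` checks or dict.setdefault calls are needed:
    task_stats['task0']['cpu']['user'] = 3.83
    job_tot['cpu']['user'] += task_stats['task0']['cpu']['user']
    print(job_tot['cpu']['user'])  # 3.83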
And then this is borderline trivial, but in the for args in… loop in _report_gen, using tuples for the args instead of lists would be a little more Pythonic. There's a long-standing convention that tuples are good for fixed-length, heterogeneous sequences, while lists are good for variable-length, homogeneous sequences.
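In other words (row values hypothetical, loosely modeled on the report above):

    # A tuple per row: fixed length, mixed types; the list holds the
    # homogeneous sequence of rows.
    rows = [('cpu', 'user+sys', 5.75), ('net:eth0', 'tx+rx', 1793202486)]
    for args in rows:
        category, metric, value = args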
Thanks.
Updated by Tom Clegg over 9 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|commit:0d66f5f5c5173f0faad6318a9ac87d11964e5748.