Project

General

Profile

Actions

Story #14645

closed

[arv-mount] Report FUSE operation latency

Added by Peter Amstutz about 6 years ago. Updated over 5 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Eric Biagiotti
Category:
-
Target version:
Start date:
03/14/2019
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
3.0
Release relationship:
Auto

Description

Record request time for FUSE operations in arv-mount.

Determine if prometheus_client Python library is useful to aggregate counters/averages.

Report total time spent in FUSE operations and last interval in crunchstat format.


Subtasks 1 (0 open1 closed)

Task #14663: Review 14645-fuse-operations-reportingResolvedEric Biagiotti03/14/2019

Actions

Related issues 1 (1 open0 closed)

Blocks Arvados - Story #14952: [crunchstat-summary] Add arv-mount FUSE ops timeNew

Actions
Actions #1

Updated by Peter Amstutz about 6 years ago

  • Status changed from New to In Progress
Actions #2

Updated by Peter Amstutz about 6 years ago

  • Subject changed from [arv-mount] Report GET/POST latency to [arv-mount] Report GET/PUT latency
  • Description updated (diff)
  • Status changed from In Progress to New
Actions #3

Updated by Peter Amstutz about 6 years ago

  • Description updated (diff)
Actions #4

Updated by Peter Amstutz about 6 years ago

  • Description updated (diff)
  • Subject changed from [arv-mount] Report GET/PUT latency to [arv-mount] Report FUSE operation latency
Actions #5

Updated by Peter Amstutz about 6 years ago

  • Description updated (diff)
Actions #6

Updated by Peter Amstutz about 6 years ago

  • Story points set to 1.0
Actions #7

Updated by Peter Amstutz about 6 years ago

  • Target version changed from To Be Groomed to Arvados Future Sprints
Actions #8

Updated by Tom Morris almost 6 years ago

  • Target version changed from Arvados Future Sprints to 2019-01-16 Sprint
Actions #9

Updated by Lucas Di Pentima almost 6 years ago

  • Assigned To set to Lucas Di Pentima
Actions #10

Updated by Lucas Di Pentima almost 6 years ago

  • Target version changed from 2019-01-16 Sprint to 2019-01-30 Sprint
Actions #11

Updated by Tom Morris almost 6 years ago

  • Assigned To changed from Lucas Di Pentima to Eric Biagiotti
Actions #12

Updated by Eric Biagiotti almost 6 years ago

  • Target version changed from 2019-01-30 Sprint to 2019-02-13 Sprint
Actions #13

Updated by Eric Biagiotti almost 6 years ago

  • Story points changed from 1.0 to 3.0
Actions #14

Updated by Eric Biagiotti almost 6 years ago

  • Target version changed from 2019-02-13 Sprint to 2019-02-27 Sprint
Actions #15

Updated by Eric Biagiotti almost 6 years ago

  • Target version changed from 2019-02-27 Sprint to 2019-03-13 Sprint
Actions #16

Updated by Tom Morris almost 6 years ago

  • Release set to 15
Actions #17

Updated by Eric Biagiotti almost 6 years ago

Is our goal to make use of prometheus' ability to serve stats, or just its aggregation functionality? Is the requirement to provide this data in crunchstat format to improve output or do we already aggregate data based on this output format?

Actions #18

Updated by Peter Amstutz almost 6 years ago

Eric Biagiotti wrote:

Is our goal to make use of prometheus' ability to serve stats, or just its aggregation functionality?

Just the aggregation functionality, if it saves us any effort (someday we might want to export the metrics to a prometheus server, but that's not this story).

Is the requirement to provide this data in crunchstat format to improve output or do we already aggregate data based on this output format?

arv-mount already produces crunchstat format logs and the crunchstat-summary tool already knows how to consume crunchstat logs, so this store is just adding timing metrics.

Actions #19

Updated by Eric Biagiotti almost 6 years ago

  • Target version changed from 2019-03-13 Sprint to 2019-03-27 Sprint
Actions #20

Updated by Eric Biagiotti almost 6 years ago

  • Status changed from New to In Progress
Actions #21

Updated by Eric Biagiotti almost 6 years ago

  • Related to Story #14952: [crunchstat-summary] Add arv-mount FUSE ops time added
Actions #22

Updated by Eric Biagiotti almost 6 years ago

  • Related to deleted (Story #14952: [crunchstat-summary] Add arv-mount FUSE ops time)
Actions #23

Updated by Eric Biagiotti almost 6 years ago

  • Blocks Story #14952: [crunchstat-summary] Add arv-mount FUSE ops time added
Actions #24

Updated by Eric Biagiotti almost 6 years ago

Ready for review at 97432e0e9d32a41c48a5298a0dbfd70b310e20e7

  • Added fuse operation timing using prometheus_client. Added a single Summary metric and individual labels for each operation. This should make querying for total fuse ops timing easier if we use prometheus to serve stats, while also making it possible to look at individual ops.
  • Made arvmount crunchstat writing more generic. It can now accommodate different number of stats and types. (i.e. just a single float stat like ops time)
Potential improvements:
  • Increase granularity. Wrap the specific llfuse calls, not the operator function?
  • In __init__ update ops counters (write_ops_counter, read_ops_counter) to use prometheus instead of arvados.keep.Counter?
To Test:
  • Set-up environment, ARVADOS_API_TOKEN, etc.
  • Set-up a virtualenv
  • $ (cd arvados/services/fuse/ && python setup.py install)
  • $ mkdir test-dir
  • $ arv-mount --crunchstat-interval=1 --foreground --read-write --mount-tmp=tmp test-dir
  • In another terminal, cd into the mounted directory and play around. You should see fuseopstime increasing.

Unit Tests: https://ci.curoverse.com/view/Developer/job/developer-run-tests/1125/

Actions #25

Updated by Peter Amstutz almost 6 years ago

crunchstat: fuseopstime 117.273453 seconds -- interval 3.0000 seconds 0.000000 seconds

Format isn't quite right. The format should be:

crunchstat: <category> <value0> <name0> (<value1> <name1>...) --- interval N seconds <deltavalue0> <name0> (<deltavalue1> <name1>...)" 

(I don't know why the name comes after the stat value)

I think we want to log all the individual stats. Some operations interact with the API server, some interact with keep servers, some are cached, some are not. So to diagnose performance problems we need to be able to distinguish where it is spending its time.

I noticed that a single very long operation won't be recorded until it completes. When graphed, the stats may be very spiky -- in other words an operation that takes 40 seconds won't show up as four 10-second intervals of 10 seconds of operation time each, it will show up as 40 seconds of operation time in one interval. I don't know if that can be helped.

Actions #26

Updated by Eric Biagiotti almost 6 years ago

Latest at 0e0cbc43912e4bd8ad3fda616e552380d61c3f8d
Unit tests: https://ci.curoverse.com/view/Developer/job/developer-run-tests/1138/

Peter Amstutz wrote:

[...]

Format isn't quite right. The format should be:

Fixed

I think we want to log all the individual stats. Some operations interact with the API server, some interact with keep servers, some are cached, some are not. So to diagnose performance problems we need to be able to distinguish where it is spending its time.

Done and added individual op counts also.

I noticed that a single very long operation won't be recorded until it completes. When graphed, the stats may be very spiky -- in other words an operation that takes 40 seconds won't show up as four 10-second intervals of 10 seconds of operation time each, it will show up as 40 seconds of operation time in one interval. I don't know if that can be helped.

I'll add a feature issue for this so we have it documented.

FYI, most of daa4ee15fdf51a573f96b74c9fd01747b63bbeb0 was changed in the next commit. Keep that in mind if you are going commit by commit.

Also, here is an example of how the prometheus Summary metric is organized. It might be useful for following the logic.

Metric(arvmount_fuse_operations_seconds, 
Time spent during FUSE operations, summary, , [
Sample(name='arvmount_fuse_operations_seconds_count', labels={'op': u'write'}, value=0.0, ...), 
Sample(name='arvmount_fuse_operations_seconds_sum', labels={'op': u'write'}, value=0.0, ...), 
Sample(name='arvmount_fuse_operations_seconds_created', labels={'op': u'write'}, value=0.0, ...), 
Sample(name='arvmount_fuse_operations_seconds_count', labels={'op': u'read'}, value=0.0, ...), 
Sample(name='arvmount_fuse_operations_seconds_sum', labels={'op': u'read'}, value=0.0, ...), 
Sample(name='arvmount_fuse_operations_seconds_created', labels={'op': u'read'}, ...)])]\n" 
Actions #27

Updated by Peter Amstutz almost 6 years ago

The output format is not quite right, the 1st position is the resource being measured, followed by the resource's metrics. So instead of this:

crunchstat: fuseop:count 0 open -- interval 10.0000 seconds 0 open
crunchstat: fuseop:time 0.000000 open -- interval 10.0000 seconds 0.000000 open

It should be something like this:

crunchstat: fuseop:open 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time
Actions #28

Updated by Eric Biagiotti almost 6 years ago

Peter Amstutz wrote:

The output format is not quite right, the 1st position is the resource being measured, followed by the resource's metrics. So instead of this:

[...]

It should be something like this:

[...]

Fixed in fb8c51a195f2d9603cb129d8c3a21dc5d700aab3
Unit tests: https://ci.curoverse.com/view/Developer/job/developer-run-tests/1140/

crunchstat: fuseop:flush 10 count 0.000100 time -- interval 1.0000 seconds 10 count 0.000100 time
crunchstat: fuseop:on_event 0 count 0.000000 time -- interval 1.0000 seconds 0 count 0.000000 time
crunchstat: fuseop:read 5 count 0.000292 time -- interval 1.0000 seconds 5 count 0.000292 time
crunchstat: fuseop:write 5 count 0.013155 time -- interval 1.0000 seconds 5 count 0.013155 time
...

I also manually added a few fuseop:flush lines to the crunchstat-summary test logs and ran the tests. They obviously break, but the new lines are being processed correctly. A few more decimal places would be nice, but that's for the follow up story #14952.

--- 9,16 ----
  cpu    sys    1.92    0.04    1.92
  cpu    user    3.83    0.09    3.83
  cpu    user+sys    5.75    0.13    5.75
+ fuseop:flush    count    10    10.00    10
+ fuseop:flush    time    0.00    0.00    0.00
  fuseops    read    0    0    0
  fuseops    write    0    0    0
  keepcache    hit    0    0    0

Actions #29

Updated by Eric Biagiotti almost 6 years ago

Merged master into this branch to get some unit test fixes at 215d15bc03a38be1965a5d9df21417a3f7eae032
Units tests: https://ci.curoverse.com/view/Developer/job/developer-run-tests/1143/

Actions #30

Updated by Eric Biagiotti almost 6 years ago

  • Status changed from In Progress to Resolved
Actions

Also available in: Atom PDF