Feature #23245


Go FUSE driver reports crunchstats

Added by Brett Smith 5 months ago. Updated about 2 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
FUSE
Target version:
Story points:
-
Release relationship:
Auto

Description

Have the Go FUSE driver write statistics to stderr in the same way that the Python FUSE driver does. It is invoked with the same command line switch (--crunchstat-interval=N prints stats every N seconds). Statistics are written in the exact same format as the Python FUSE driver, which looks like this (every --crunchstat-interval seconds, on stderr):

crunchstat: keepcalls 0 put 3753 get -- interval 10.0000 seconds 0 put 0 get
crunchstat: net:keep0 0 tx 484065792 rx -- interval 10.0000 seconds 0 tx 0 rx
crunchstat: keepcache 3700 hit 8 miss -- interval 10.0000 seconds 0 hit 0 miss
crunchstat: blkio:0:0 0 write 484065792 read -- interval 10.0000 seconds 0 write 0 read
crunchstat: fuseops 0 write 3694 read -- interval 10.0000 seconds 0 write 0 read
crunchstat: fuseop:read 3694 count 8.274071 time -- interval 10.0000 seconds 0 count 0.000000 time
crunchstat: fuseop:write 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time
crunchstat: fuseop:destroy 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time
crunchstat: fuseop:on_event 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time
crunchstat: fuseop:getattr 12 count 0.000357 time -- interval 10.0000 seconds 0 count 0.000000 time
crunchstat: fuseop:setattr 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time
crunchstat: fuseop:lookup 10 count 0.066938 time -- interval 10.0000 seconds 0 count 0.000000 time
crunchstat: fuseop:forget 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time
crunchstat: fuseop:open 1 count 0.000043 time -- interval 10.0000 seconds 0 count 0.000000 time
crunchstat: fuseop:release 2 count 0.000208 time -- interval 10.0000 seconds 0 count 0.000000 time
crunchstat: fuseop:opendir 1 count 0.000027 time -- interval 10.0000 seconds 0 count 0.000000 time
crunchstat: fuseop:readdir 2 count 0.000004 time -- interval 10.0000 seconds 0 count 0.000000 time
crunchstat: fuseop:statfs 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time
crunchstat: fuseop:create 1 count 0.000523 time -- interval 10.0000 seconds 0 count 0.000000 time
crunchstat: fuseop:mkdir 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time
crunchstat: fuseop:unlink 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time
crunchstat: fuseop:rmdir 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time
crunchstat: fuseop:rename 0 count 0.000000 time -- interval 10.0000 seconds 0 count 0.000000 time
crunchstat: fuseop:flush 6 count 0.000083 time -- interval 10.0000 seconds 0 count 0.000000 time

The ordering of stats is not significant.

Meaning/source of stats:
  • fuseop:* reports calls and cumulative elapsed time for each fuse API call
  • blkio:0:0 reports traffic at the fuse layer
  • net:keep0, keepcalls, and keepcache report back-end data traffic (keep), which will require instrumenting keepclient as well; these are not included here and will be added in #23284
  • fuseops appears to be an additional spelling of fuseop:read and fuseop:write
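For illustration, a fuseop line in the format above could be rendered with a sketch like this (the helper name fuseopLine is hypothetical, not part of the driver):

```go
package main

import "fmt"

// fuseopLine (hypothetical helper) renders one "crunchstat: fuseop:<op> ..."
// line in the format shown above: cumulative totals, then the delta since
// the previous report over the given interval.
func fuseopLine(op string, totalCount int, totalTime float64, interval float64, deltaCount int, deltaTime float64) string {
	return fmt.Sprintf("crunchstat: fuseop:%s %d count %.6f time -- interval %.4f seconds %d count %.6f time",
		op, totalCount, totalTime, interval, deltaCount, deltaTime)
}

func main() {
	// Reproduces the fuseop:read sample line from the description.
	fmt.Println(fuseopLine("read", 3694, 8.274071, 10, 0, 0))
}
```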

Implementation:

We should use prometheus to collect metrics like
  • arvados_fuse_io_bytes{op="write"} (and "read")
  • arvados_fuse_ops{op="write"} (and "read" and ...)
  • arvados_fuse_op_seconds{op="write"}

This leaves the option to also export fuse and keepclient metrics via normal prometheus API in future.

For now, we can use Gather (source:sdk/go/httpserver/metrics.go has an example of the first part of this) and report them in crunchstat format.

  • keepFS should add fields
    Registry      *prometheus.Registry
    StatsInterval time.Duration
    StatsOutput   io.Writer
    
    mBytes        *prometheus.CounterVec
    mSeconds      *prometheus.CounterVec
    mOps          *prometheus.CounterVec
    
  • keepFS.Init should
    • create the CounterVecs and register them (similar to source:lib/dispatchcloud/worker/pool.go)
    • (if StatsInterval > 0 && StatsOutput != nil) start a ticker, and start a goroutine to call fs.writeStats() on each tick
    • populate each vec with all of the applicable op values (otherwise prometheus won't start reporting "chmod" stats until the first time chmod is called) -- see example in newInstrumentedInstanceSet in source:lib/dispatchcloud/driver.go
  • keepFS should get a Destroy method to stop the stats goroutine (fuse library will call this at shutdown)
  • keepfs_test.go should get a TearDownTest method that calls Destroy
  • (*mountCommand)RunCommand should create a prometheus.NewRegistry() and pass it to keepFS (Registry field) and keepclient.KeepClient (TBD)

Something like this would let us add metrics to every op with one line:

func (fs *keepFS) Chmod(path string, mode uint32) (errc int) {
  defer fs.reportMetrics("chmod", time.Now(), nil)
  // ...
}

func (fs *keepFS) Write(path string, buf []byte, ofst int64, fh uint64) (n int) {
  defer fs.reportMetrics("write", time.Now(), &n)
  // ...
}

func (fs *keepFS) reportMetrics(op string, t0 time.Time, bytes *int) {
  fs.mOps.WithLabelValues(op).Inc()
  fs.mSeconds.WithLabelValues(op).Add(time.Since(t0).Seconds())
  if bytes != nil {
    fs.mBytes.WithLabelValues(op).Add(float64(*bytes))
  }
}

Subtasks 1 (0 open, 1 closed)

Task #23249: Review 23245-fuse-logging (Resolved, Lisa Knox, 12/16/2025)

Related issues 4 (2 open, 2 closed)

Related to Arvados - Feature #23308: Go FUSE driver crunchstat should report client-side traffic (need to add to keepclient metrics) (Resolved)
Related to Arvados Epics - Feature #23333: Go FUSE Driver Phase 1: crunch-run uses arvados-client mount (In Progress)
Precedes Arvados - Feature #23284: Keepclient reports backend keep metrics (Resolved, Tom Clegg)
Precedes Arvados - Feature #23332: Go FUSE driver reports crunchstats net:keep0, keepcalls, keepcache (In Progress, Tom Clegg)
#1

Updated by Brett Smith 5 months ago

  • Subtask #23249 added
#2

Updated by Tom Clegg 5 months ago

  • Description updated (diff)
#3

Updated by Tom Clegg 5 months ago

  • Precedes Feature #23284: Keepclient reports backend keep metrics added
#4

Updated by Tom Clegg 5 months ago

  • Description updated (diff)
#5

Updated by Lisa Knox 4 months ago

It was decided in discussion that the first iteration of this ticket can be to have a string printed to stderr at the interval passed with -crunchstat-interval.

23245-go-fuse-crunchstat @ 19fb7325d5a973b11e5a0b3aae8ec093edc68ef5

developer-run-tests: #4934

  • passing -crunchstat-interval n now logs the word "tick" every n seconds with the INFO flag
  • the default interval is 0, which disables logging
  • passing < 0 throws an error
  • an assertion was added to lib/mount/command_test.go
#6

Updated by Tom Clegg 4 months ago · Edited

  • Status changed from New to In Progress
  • Description updated (diff)

I think we should move the logging code into keepFS, so RunCommand() parses the command line flag and passes the interval to keepFS in a new StatsInterval time.Duration field.

As written, ticker.Stop() in repeatEvery() never gets called, because for range ticker.C runs forever. (In fact, even Stop doesn't close the channel, so for range ticker.C will wait forever regardless of who calls Stop. Stop is no longer needed for the sake of garbage collection either, so we actually don't need to call it at all.)

One way to do this would be
  • keepFS adds a field done chan struct{}
  • (*keepFS)Init() starts a timer, creates the done channel, and runs a goroutine
    fs.done = make(chan struct{})
    go func() {
            if fs.StatsInterval <= 0 {
                    return
            }
            ticker := time.NewTicker(fs.StatsInterval)
            for {
                    select {
                    case <-fs.done:
                            return
                    case <-ticker.C:
                            fs.writeStats()
                    }
            }
    }()
    
  • (*keepFS)Destroy() calls close(fs.done) to stop stats (it doesn't need to check whether stats are active, because Init created the channel regardless)

I dislike sleeping 1.2 seconds in the test. We can avoid that by making -crunchstat-interval a float, and testing it with 0.001.

The test could also test that logging stopped as expected: after unmount, sleep 0.01, check stderr.Len(), wait another 2x interval, check stderr.Len() again. In principle this is inherently racy unless we add more coordination -- logStats() might start before done is closed, and produce output after Unmount returns -- but practically I think "sleep 0.01" will be good enough.

#7

Updated by Brett Smith 4 months ago

  • Target version changed from Development 2025-11-12 to Development 2025-11-26
#8

Updated by Lisa Knox 4 months ago

23245-go-fuse-crunchstat @ cb4eb8a27e69d9e821c3267ed6cf2a77224fe039

developer-run-tests: #4937

I think we should move the logging code into keepFS...

done

As written, ticker.Stop() in repeatEvery() never gets called...

Implemented the Destroy method to close the done channel

I dislike sleeping 1.2 seconds in the test...

Changed the command line parse to coerce input to float64 and adjust test to only sleep for 20ms. Going down to 0.0001s as suggested introduces some flakiness

The test could also test that logging stopped as expected...

Fixed, but kinda moot because if the logs keep printing, that means the channel hasn't closed and the test will run forever.

#10

Updated by Tom Clegg 4 months ago

Looks good, just a couple of things:

Help message "-crunch-stat-interval must be non-negative" should be "-crunchstat-interval"

fs_test.go should get a TearDownTest method that calls s.fs.Destroy() (rather than rely on the fact that currently Destroy isn't needed as long as statsInterval <= 0)

Going down to 0.0001s as suggested introduces some flakiness

Hm, curious what kind of flakiness? I can see that making the "logging stopped" check flaky without the "after unmount, sleep 0.01, check stderr.Len(), ..." part, but I tried 0.0001s here and it passed reliably so I'm wondering.

#11

Updated by Lisa Knox 4 months ago · Edited

23245-go-fuse-crunchstat @ f855137bdb31fa403ca0ea9c2df5118af8019513

developer-run-tests: #4941

All items addressed.

In testing, I found that when I set the interval at 0.0001, even if I set the sleep timer to 0, it somehow always prints a few logs. Therefore, it made no difference whether I set the sleep to 0.0002 or 0.000000001; the test always passed. Raising both the interval and the sleep timer gave me more rational behavior, so I went with it.

#12

Updated by Tom Clegg 4 months ago

Gotcha. IOW, if the interval is too small, the test can't detect whether there's an interval before the first log entry, right? But we aren't actually checking this, and we probably shouldn't, because it would be fine (perhaps preferable) to log zeros right away before the first interval. I don't want to get hung up on this one but in general it's a good investment to do "sleep tiny intervals until the thing happens or {timeout} passes" rather than "sleep {timeout}, then check that the thing has happened". "Confirm logging has stopped" is the only place we can't do that.

In Go you can use 0 as a float64, you don't need to say 0.0.

I don't think we need to fix this but now that I've thought of it I have to say it: if you ask for -crunchstat-interval 0.0000000000000001 the non-negative check passes, but then we truncate it to a duration of 0 nanoseconds, so there's no ticker. And if you ask for -crunchstat-interval NaN you get the same thing, because NaN < 0 is also false.

LGTM, thanks!

#13

Updated by Tom Clegg 4 months ago

  • Related to Feature #23308: Go FUSE driver crunchstat should report client-side traffic (need to add to keepclient metrics) added
#14

Updated by Lisa Knox 4 months ago · Edited

23245-fuse-logging @ 88a1c5eb8ab5f485b08f8a608ee362f7484e6121 (Note the changed branch name)

developer-run-tests: #4954

  • ✅ All agreed upon points are implemented / addressed.
    • Logging for FUSE operations is in place.
    • Tests written
  • ✅ Anything not implemented (discovered or discussed during work) has a follow-up story.
    • Adding Keep metrics to the logs (per #23284) should immediately follow this iteration.
  • ✅ Code is tested and passing, both automated and manual, what manual testing was done is described
  • ✅ The tested code incorporates recent main branch changes.
  • ❌ New or changed UI/UX and has gotten feedback from stakeholders.
    • I copied the formatting from the example on the ticket, but I suspect we don't actually want it to look exactly like that.
  • n/a Documentation has been updated.
  • ✅ Behaves appropriately at the intended scale (describe intended scale).
  • ✅ Considered backwards and forwards compatibility issues between client and server.
    • no change
  • ✅ Follows our coding standards and GUI style guidelines.
Notes:
  • I elected to move the command test out of the main command test to eliminate excess logging during tests
  • The only reason logrus is used for logging instead of fmt.Println is because I could not figure out how to check the stdout in a test and I think it's probably ok either way.
  • I saw the Namespace & Subsystem fields in #23284 but I'm not sure if or how they should be applied here. As in, I can't find where the values are consumed, so it's not clear why they are there.

keepfs_test.go should get a TearDownTest method that calls Destroy

  • keepfs_test.go doesn't exist, but fs.go already has a teardown method that calls destroy, and I think that's what this line means?
#15

Updated by Tom Clegg 4 months ago

I elected to move the command test out of the main command test to eliminate excess logging during tests

Having a separate test specifically for crunchstat logging is good, yes. We could also add c.Check(stderr.String(), check.Equals, "") to TestMount().

how to check the stdout in a test

I think we should add a StatsOutput io.Writer field to keepFS, and write stats to it with fmt.Fprintf().
  • We don't really want the structured log formatting stuff logrus provides -- we just want to write literal text in our own format.
  • Test cases can provide a bytes.Buffer, mountCommand can provide stderr
  • Maybe someday mountCommand will have an option like -crunchstat-fd=3 to send to a different file descriptor instead of combining with stderr

Namespace & Subsystem

They get prepended to the metric name, so instead of

       fs.mBytes = prometheus.NewCounterVec(prometheus.CounterOpts{
               Name: "arvados_fuse_bytes",
               Help: "Bytes read/written by the FUSE filesystem",
       }, []string{"fuseop"})

we can say

       fs.mBytes = prometheus.NewCounterVec(prometheus.CounterOpts{
               Namespace: "arvados",
               Subsystem: "fuse",
               Name:      "bytes",
               Help:      "Bytes read/written by the FUSE filesystem",
       }, []string{"fuseop"})

The resulting metrics are identical but the latter is how we spell it elsewhere.

fs.go already has a teardown method that calls destroy

Right, I meant fs_test.go, and we already added the TearDown method in the previous branch.

In the keepFS struct definition:

type keepFS struct {
        fuse.FileSystemBase
        Client     *arvados.Client
        KeepClient *keepclient.KeepClient
        ReadOnly   bool
        Uid        int
        Gid        int
        Logger     logrus.FieldLogger

        root          arvados.CustomFileSystem
        open          map[uint64]*sharedFile
        lastFH        uint64
        statsInterval time.Duration
        done          chan struct{}
        Registry      *prometheus.Registry
        mBytes        *prometheus.CounterVec
        mOps          *prometheus.CounterVec
        mSeconds      *prometheus.CounterVec

I think statsInterval should be StatsInterval and move up to the exported/capitalized section. That section has fields that get set by the caller. The second section has fields that are managed by keepFS itself. (Since keepFS itself is private to this package, the caller is by definition also in this package, so the language doesn't actually prevent the private fields from being accessed, but it's a helpful distinction to make anyway.)

In Init, we are assuming the initial previousMetrics will have all zeroes. So we could just use a nil map (var previousMetrics map[string]float64) instead of calling gatherMetrics.

In gatherMetrics:

        metricFamilies, err := reg.Gather()
        if err == nil {
                for _, mf := range metricFamilies {
                ...

should be

        metricFamilies, err := reg.Gather()
        if err != nil {
                return nil
        }
        for _, mf := range metricFamilies {

or even

        metricFamilies, _ := reg.Gather() // ignore errors
        for _, mf := range metricFamilies {

(since it's safe to read/iterate over a nil slice/map)

FormatMetrics() should be private, i.e., formatMetrics(). Or better, (*keepFS)writeMetrics(), and it writes directly to fs.StatsOutput instead of building up a slice of strings for the caller to write.

I did a bit of manual testing and the output looks plausible.

The blkio:0:0 entries aren't implemented yet.

I think the test case would be more manageable with a "get the line for stat X" closure, like

statline := func(stat string) string {
  for _, line := range lines {
    if strings.HasPrefix(line, "crunchstat: "+stat+" ") {
      return line
    }
  }
  return "" 
}

c.Check(statline("fuseop:read"), Equals, "...")
c.Check(statline("fuseop:write"), Equals, "...")

That would also avoid this pattern

        for _, line := range lines {
                if strings.HasPrefix(line, "crunchstat: fuseop:read ") {
                        c.Check(line, Equals, "crunchstat: fuseop:read 8 count 0.250000 time -- interval 1.0000 seconds 5 count 0.150000 time")

...which passes when there is no fuseop:read line at all.

#16

Updated by Tom Clegg 4 months ago

Tom Clegg wrote in #note-15:

FormatMetrics() should be private, i.e., formatMetrics(). Or better, (*keepFS)writeMetrics(), and it writes directly to fs.CrunchstatOutput instead of building up a slice of strings for the caller to write.

On second thought I take back the part about making it a *keepfs method. writeMetrics(out io.Writer, ...) would be tidier for testing.

#17

Updated by Brett Smith 4 months ago

Lisa Knox wrote in #note-14:

  • I copied the formatting from the example on the ticket, but I suspect we don't actually want it to look exactly like that.

We do because crunchstat-summary consumes this format. source:tools/crunchstat-summary/crunchstat_summary/summarizer.py#L86

#18

Updated by Lisa Knox 4 months ago

23245-fuse-logging @ 38e47b672aba2f2b959d5c840f0c403ac59a98ac

developer-run-tests: #4959

I think we should add a StatsOutput io.Writer field to keepFS...

Done, and it was surprisingly straightforward to do.

I think statsInterval should be StatsInterval and move up to the exported/capitalized section...

Thank you for this, I had wrongly assumed that the different sections were arranged semantically by their functions.

In Init, we are assuming the initial previousMetrics will have all zeroes. So we could just use a nil map (var previousMetrics map[string]float64) instead of calling gatherMetrics...

I had assumed it would panic because that feels like typical Go behavior.

The blkio:0:0 entries aren't implemented yet.

Done, plus keepcache, keepcalls, and net:keep0, the last of which I don't really know how to test. It looks like the stats are initiated and registered properly, and that I'm collecting them properly, but I have yet to figure out how it's triggered.

I think the test case would be more manageable with a "get the line for stat X" closure...

Done, it's much cleaner now.

On second thought I take back the part about making it a *keepfs method. writeMetrics(out io.Writer, ...) would be tidier for testing.

I was planning on pushing back on this for that reason exactly.

#19

Updated by Brett Smith 4 months ago

  • Precedes Feature #23332: Go FUSE driver reports crunchstats net:keep0, keepcalls, keepcache added
#20

Updated by Brett Smith 4 months ago

  • Related to Feature #23333: Go FUSE Driver Phase 1: crunch-run uses arvados-client mount added
#21

Updated by Brett Smith 4 months ago

  • Target version changed from Development 2025-11-26 to Development 2025-12-10
#22

Updated by Brett Smith 4 months ago

Lisa Knox wrote in #note-18:

23245-fuse-logging @ 38e47b672aba2f2b959d5c840f0c403ac59a98ac

Going over Tom's comments and this branch, I agree you've addressed most everything, except for one note below.

In Init, we are assuming the initial previousMetrics will have all zeroes. So we could just use a nil map (var previousMetrics map[string]float64) instead of calling gatherMetrics...

I had assumed it would panic because that feels like typical Go behavior.

Specifically for slices and maps, reads from nil are treated the same as an empty slice/map, and idiomatic Go often leans on this fact.
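A minimal demonstration of that point (reads and iteration on nil maps and slices are safe; only writes to a nil map panic):

```go
package main

import "fmt"

func main() {
	// Reading a key from a nil map yields the element type's zero value,
	// exactly like reading a missing key from a non-nil map.
	var previousMetrics map[string]float64
	fmt.Println(previousMetrics["fuseop:read"]) // 0

	// Ranging over a nil map (or slice) is also safe: zero iterations.
	for k, v := range previousMetrics {
		fmt.Println(k, v) // never reached
	}

	var lines []string
	fmt.Println(len(lines)) // 0
}
```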

The blkio:0:0 entries aren't implemented yet.

Done, plus keepcache, keepcalls, and net:keep0, the last of which I don't really know how to test. It looks like the stats are initiated and registered properly, and that I'm collecting them properly, but I have yet to figure out how it's triggered.

Unfortunately this is beyond me too. Personally I think I would be comfortable merging the branch with the current testing, and then seeing if Tom has any follow-up ideas for a test.

On second thought I take back the part about making it a *keepfs method. writeMetrics(out io.Writer, ...) would be tidier for testing.

I was planning on pushing back on this for that reason exactly.

If I'm following right, it seems like there was some miscommunication here. Tom took back the part about making it a *keepfs method, yet that's exactly what the latest changes do. Am I missing something?

One style note, to avoid pyramiding, in gatherMetrics instead of this:

metricName := mf.GetName()
if len(metric.GetLabel()) > 0 {
    // … do the rest…
}

it would be nicer written like:

metricName := mf.GetName()
if len(metric.GetLabel()) == 0 {
    continue
}
// … do the rest…

Thanks.

#23

Updated by Lisa Knox 4 months ago · Edited

23245-fuse-logging @ 8faaebd3553d42e4e0f9b29ec3454f04adc6fc77

developer-run-tests: #4963

Style note taken

If I'm following right, it seems like there was some miscommunication here. Tom took back the part about making it a *keepfs method, yet that's exactly what the latest changes do. Am I missing something?

No, I misunderstood the comment, or rather, the retraction. I returned it to a public function. The part I wanted to push back on is the idea of passing the writer in and having FormatMetrics itself write to stdout. We already test that the logging happens, so it makes sense to me to independently test that FormatMetrics formats the metrics, which I like because FormatMetrics is a somewhat complicated function.

Actually, looking at it again, what we would ideally do is to test that each individual operation populates the correct values in the correct line in the logs. For example, mount the fs, then create a directory, then check that mkdir was called in the log lines. As far as I can tell, that is not what happens in the python FUSE driver tests. I don't know how feasible or worthwhile that is, though.

#24

Updated by Brett Smith 4 months ago

Lisa Knox wrote in #note-23:

No, I misunderstood the comment, or rather, the retraction. I returned it to a public function. The part I wanted to push back on is the idea of passing the writer in and having FormatMetrics itself write to stdout. We already test that the logging happens, so it makes sense to me to independently test that FormatMetrics formats the metrics, which I like because FormatMetrics is a somewhat complicated function.

So, I do think the way Tom retracted and revised his comment was, er, pretty terse and a little confusing. But I understand the suggestion is to replace the original:

func formatMetrics(currentMetrics, previousMetrics map[string]float64, intervalSeconds float64) []string

with:

func writeMetrics(out io.Writer, currentMetrics, previousMetrics map[string]float64, intervalSeconds float64) error

Then instead of building a slice, the function writes strings directly to out. If any write returns an error, it can immediately return that back.

Tests can pass in a strings.Builder so they don't have to try to intercept stdout writes or any other magic like that. Then, to me at least, writing checks against the builder seems about the same effort as writing checks against the slice. (If it's still helpful to have a slice, it's easy enough to make one with strings.Split(builder.String(), "\n").)

Actually, looking at it again, what we would ideally do is to test that each individual operation populates the correct values in the correct line in the logs. For example, mount the fs, then create a directory, then check that mkdir was called in the log lines. As far as I can tell, that is not what happens in the python FUSE driver tests. I don't know how feasible or worthwhile that is, though.

I think where I'm at with this right now is, it's a good idea and I'm interested in doing it, but given where this ticket is and that there's other active development on the Go FUSE driver right now, I would rather have it be a separate ticket.

I think it would be reasonable to set up a single mount, then do one operation, check the logs, do a different operation, check the logs, etc. This would reduce overhead significantly. It would add some risk that one bad operation ruins all the following tests, but overall I think it's a reasonable trade-off. Does that seem doable?

#25

Updated by Tom Clegg 4 months ago

IMO
  • FormatMetrics should be formatMetrics. We aren't planning for anyone outside this package to use it. (I meant to retract only the part about making it a method of *keepFS, not the part about making it private.)
  • formatMetrics should take an io.Writer argument instead of returning a slice of strings, because it's idiomatic Go and it's more efficient. Tests should write to a bytes.Buffer (or strings.Builder) and split on newline.

test that each individual operation populates the correct values in the correct line in the logs

I think it would be reasonable to set up a single mount, then do one operation, check the logs, do a different operations, check the logs, etc. This would reduce overhead significantly.

We should be testing this in fs_test.go, which doesn't actually set up a fuse mount, so overhead isn't an issue.

This test doesn't need to check logs (which is super inconvenient because logs are produced asynchronously on a timer, and is redundant because we already test the logging parts in TestFormatMetrics and TestCrunchstatLogger). It just needs to call gatherMetrics and check the returned map.

That said, I'm not sure we need to prioritize an exhaustive version of this test. We could start with a simple version that writes and reads back a file and confirms the metrics have non-zero entries for bytes written, cache hits, and open/write/read ops. That will catch the most likely bugs. Maybe this will even use our new feature to mount a temp collection -- s.fs.root.MountTmp("tmp") -- which should be more convenient/tidy than our usual approach of explicitly creating a test collection or using a test fixture.

#26

Updated by Tom Clegg 4 months ago

We don't really need the t0 variable, we could just do this, right?

-       t0 := time.Now()
-       defer fs.reportMetrics("open", t0, nil)
+       defer fs.reportMetrics("open", time.Now(), nil)

For read and write, I think it would be more correct to increment the operation count even when the number of bytes read/written is zero:

-       t0 := time.Now()
-       defer func() {
-               if n > 0 {
-                       fs.reportMetrics("read", t0, &n)
-               }
-       }()
+       defer fs.reportMetrics("read", time.Now(), &n)

Thanks for adding the missing defer fs.debugPanics() in Fsyncdir.

#27

Updated by Lisa Knox 3 months ago

23245-fuse-logging @ e1f41c11b1cbe637895df4ef7c4567db4ddd326f

developer-run-tests: #4967

  • ✅ All agreed upon points are implemented / addressed.
    • formatMetrics is now writeMetrics and writes directly to an io.Writer
    • I chose to keep the existing writeMetrics functionality of compiling all the strings and then writing them out, as this allows me to write the error check only once.
    • Added a test for gatherMetrics()
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • We need to nail down how exhaustive the tests need to be and either include them here or on a separate ticket.
  • ✅ Code is tested and passing, both automated and manual, what manual testing was done is described
  • ✅ The tested code incorporates recent main branch changes.
  • n/a New or changed UI/UX and has gotten feedback from stakeholders.
  • n/a Documentation has been updated.
  • ✅ Behaves appropriately at the intended scale (describe intended scale).
  • ✅ Considered backwards and forwards compatibility issues between client and server.
    • no change
  • ✅ Follows our coding standards and GUI style guidelines.
#28

Updated by Brett Smith 3 months ago

  • Target version changed from Development 2025-12-10 to Development 2026-01-06
#29

Updated by Tom Clegg 3 months ago

Lisa Knox wrote in #note-27:

  • I chose to keep the existing writeMetrics functionality of compiling all the strings and then writing them out, as this allows me to write the error check only once.

Since the caller is ignoring the returned error anyway, I think we should just call fmt.Fprintf(out, ...) and not bother checking/returning errors from that.

  • We need to nail down how exhaustive the tests need to be and either include them here or on a separate ticket.

I think the minimum we should have on this branch is a check for a non-zero statistic in TestCrunchstatLogger. Something like: write a 2KiB file before the 20 ms sleep, and check logs for crunchstat: blkio:0:0 2048 write and crunchstat: fuseop:open 1 count instead of just crunchstat.

In TestWriteMetrics, we could sort the written lines and then compare to a pre-sorted slice of expected lines using DeepEquals. I think the way that gets reported in a failure will be easier to make sense of.

Gocheck comes with a NotNil checker that you can use in place of Not(IsNil).

The trailing tab here looks like maybe a debugging thing that wasn't meant to be in the final version?

       lines = append(lines, fmt.Sprintf("crunchstat: net:keep0 %.0f tx %.0f rx -- interval %.4f seconds %.0f tx %.0f rx       ",

       c.Check(lines1[0], Matches, "crunchstat: net:keep0 1024 tx 2048 rx -- interval 1.0000 seconds 1024 tx 2048 rx.*")
       c.Check(lines1[1], Equals, "crunchstat: keepcalls 5 put 10 get -- interval 1.0000 seconds 5 put 10 get")
#30

Updated by Lisa Knox 3 months ago · Edited

23245-fuse-logging @ 8d91b8ab2263d7087bbf05986816b3693c345c78

developer-run-tests: #4980

Since the caller is ignoring the returned error anyway, I think we should just call fmt.Fprintf(out, ...) and not bother checking/returning errors from that.

Done

I think the minimum we should have on this branch is a check for a non-zero statistic in TestCrunchstatLogger. Something like: write a 2KiB file before the 20 ms sleep, and check logs for crunchstat: blkio:0:0 2048 write and crunchstat: fuseop:open 1 count instead of just crunchstat.

Done

In TestWriteMetrics, we could sort the written lines and then compare to a pre-sorted slice of expected lines using DeepEquals.

This makes it much simpler, thank you.

Gocheck comes with a NotNil checker...
The trailing tab here...

Done and done

#31

Updated by Tom Clegg 3 months ago

LGTM, thanks. (test failure is just #23142)

#32

Updated by Lisa Knox 3 months ago

  • Status changed from In Progress to Resolved
#33

Updated by Brett Smith about 2 months ago

  • Release set to 84