Feature #17574: keep-balance updates API server with correct storage_classes_confirmed - Arvados

After a balancing iteration, keep-balance updates replication_confirmed, replication_confirmed_at, storage_classes_confirmed, storage_classes_confirmed_at fields in the collections table.

replication_confirmed is clamped at replication_desired (or, if that's null, DefaultReplication) -- avoids excessive updates when block replication increases temporarily due to rendezvous shuffling or higher-replication collections using the same blocks
the ..._confirmed_at timestamps are updated only when the ..._confirmed fields change -- avoids excessive updates when nothing meaningful is changing
all updates are done directly to the database rather than going through controller/railsAPI -- much faster, and avoids messing with the modified_at/updated_at timestamps and writing a huge number of rows in the logs table (note this means no websocket notification event for these updates either)
now that keep-balance has database access, it seemed like a good time to update the "fetch all collections" code to use one long postgresql select query instead of getting pages from controller/railsAPI and going through some agony to avoid missing any
- with that change, the bottleneck changed from "get collections" to "parse manifests"
- profiling revealed the manifest parsing code spent 70% of its time allocating/copying/GCing strings; replacing the string operations with byte-slice operations made it waay faster
- testing on su92l (766K rows) "get all collections" took 2000s before, 90s after this change; now the bottleneck is getting block lists from keepstores (250s)
applying the updates is a bit slow: about 40 updates/s on a 2-core machine, 120 updates/s on an 8-core machine. They are committed in batches of 100 updates per transaction; doing them individually with autocommit was about half as fast
there's a configurable limit (Collections.BalanceUpdateLimit) for # updates per iteration, default 100K. A site with >100K collections will need multiple passes to apply all the updates after upgrading to this version -- and thanks to the "get all collections" speedup, even when it's doing 100K updates this version will still complete a cycle faster than the old version

Actions

Copy link

Updated by Peter Amstutz over 3 years ago

By 40 updates/s do you mean 40 collections per second or 40 batches of 100 (4000 collections/s) ?

Actions

Copy link

Updated by Lucas Di Pentima over 3 years ago

At file balance.go:L436 we use all the available cores to parse collection manifests and I'm assuming keep-balance is usually run in the same node as other services (ie: keepproxy, as recommended by the docs), should we leave some cores for other off-process purposes? or maybe add a config knob so that the admin can tune it?
I've manually tested it on an arvbox instance and tried adding an inexistent storage class to a collection, after running keep-balance this class was added to the _confirmed field seemingly without any warning or error message on the logs. Even removing the legal storage class and leaving only the fake one didn't produce any errors, is this an expected behavior?
Each run keep-balance produces a replication level distribution chart, do you think it would be useful to do the same for storage classes?

Actions

Copy link

Updated by Tom Clegg over 3 years ago

Peter Amstutz wrote:

By 40 updates/s do you mean 40 collections per second or 40 batches of 100 (4000 collections/s) ?

40 collections per second.

Actions

Copy link

Updated by Tom Clegg over 3 years ago

Lucas Di Pentima wrote:

At file balance.go:L436 we use all the available cores to parse collection manifests and I'm assuming keep-balance is usually run in the same node as other services (ie: keepproxy, as recommended by the docs), should we leave some cores for other off-process purposes? or maybe add a config knob so that the admin can tune it?

Leaving CPUs for other processes seems like a good idea. But I kinda think "nice" in the systemd unit file would be a better way...

I've manually tested it on an arvbox instance and tried adding an inexistent storage class to a collection, after running keep-balance this class was added to the _confirmed field seemingly without any warning or error message on the logs. Even removing the legal storage class and leaving only the fake one didn't produce any errors, is this an expected behavior?

Hm, that sounds like a bug, or at least misleading. I'm guessing it left you with replication_confirmed==0, storage_classes_confirmed==["bogus"]? In that case we should probably change that to replication_confirmed==0, storage_classes_confirmed==[] or null.

Each run keep-balance produces a replication level distribution chart, do you think it would be useful to do the same for storage classes?

Probably, but it's not obvious (at least to me) what would be the most useful report... maybe we need a more general version of #15758 (trash reporting) where we can get ideas together. IOW not inclined to do it as part of this issue.

Actions

Copy link

Updated by Lucas Di Pentima over 3 years ago

Tom Clegg wrote:

I've manually tested it on an arvbox instance and tried adding an inexistent storage class to a collection, after running keep-balance this class was added to the _confirmed field seemingly without any warning or error message on the logs. Even removing the legal storage class and leaving only the fake one didn't produce any errors, is this an expected behavior?

Hm, that sounds like a bug, or at least misleading. I'm guessing it left you with replication_confirmed==0, storage_classes_confirmed==["bogus"]? In that case we should probably change that to replication_confirmed==0, storage_classes_confirmed==[] or null.

Yes, that was what happened, replication_confirmed was left at 0, makes sense to leave storage_classes_confirmed as empy list or null.

Actions

Copy link

#10