Bug #19414
closedkeep-balance panic: concurrent map read and map write
100%
Description
It looks like this can happen when there is an "unachievable" block (referenced by a collection, but not returned by any keepstore index).
fatal error: concurrent map read and map write goroutine 33154 [running]: runtime.throw({0x9ca24b, 0x0}) /usr/local/go/src/runtime/panic.go:1198 +0x71 fp=0xc125a09768 sp=0xc125a09738 pc=0x437151 runtime.mapaccess1_faststr(0x927060, 0xc125a09870, {0xc106a34bd0, 0x29}) /usr/local/go/src/runtime/map_faststr.go:21 +0x3a5 fp=0xc125a097d0 sp=0xc125a09768 pc=0x414245 main.(*BlockStateMap).get(...) /arvados/services/keep-balance/block_state.go:91 main.(*BlockStateMap).GetConfirmedReplication(0xc42ba0c330, {0xc86d7ee000, 0x9f8bc, 0x1010101}, {0xc0f75aeb00, 0x1, 0xc320947560}) /arvados/services/keep-balance/block_state.go:154 +0x217 fp=0xc125a09b08 sp=0xc125a097d0 pc=0x8be617 main.(*Balancer).updateCollections.func2() /arvados/services/keep-balance/collection.go:187 +0x41a fp=0xc125a09fa8 sp=0xc125a09b08 pc=0x8c099a main.goSendErr.func1() /arvados/services/keep-balance/collection.go:256 +0x26 fp=0xc125a09fe0 sp=0xc125a09fa8 pc=0x8c1866 runtime.goexit() /usr/local/go/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc125a09fe8 sp=0xc125a09fe0 pc=0x46a1c1 created by main.goSendErr /arvados/services/keep-balance/collection.go:255 +0x77 goroutine 33104 [runnable]: main.(*BlockStateMap).GetConfirmedReplication(0xc42ba0c330, {0xc867b66000, 0x947a5, 0xc012cf1500}, {0xc0f0864880, 0x1, 0x1}) /arvados/services/keep-balance/block_state.go:161 +0x49b main.(*Balancer).updateCollections.func2() /arvados/services/keep-balance/collection.go:187 +0x41a main.goSendErr.func1() /arvados/services/keep-balance/collection.go:256 +0x26 created by main.goSendErr /arvados/services/keep-balance/collection.go:255 +0x77 goroutine 33157 [runnable]: main.(*BlockStateMap).GetConfirmedReplication(0xc42ba0c330, {0xc8614be000, 0x9e192, 0x1}, {0xc0e7fd1840, 0x1, 0x4146f9}) /arvados/services/keep-balance/block_state.go:172 +0x2ec main.(*Balancer).updateCollections.func2() /arvados/services/keep-balance/collection.go:187 +0x41a main.goSendErr.func1() /arvados/services/keep-balance/collection.go:256 +0x26 created by main.goSendErr /arvados/services/keep-balance/collection.go:255 +0x77
Files
Updated by Tom Clegg over 2 years ago
19414-keep-balance-panic @ 58266ff0dc0420cd99c4cb024476115a3dd9b5e7 -- developer-run-tests: #3270
(wb1 failed)
Updated by Lucas Di Pentima over 2 years ago
Although I understand the fix, I don't get the "...when NumCPU > 2" comment, why is that necessary for the bug to happen?
LGTM, thanks!
Updated by Tom Clegg over 2 years ago
when NumCPU > 2
GetConfirmedReplication() is called from a goroutine that starts 1x per iteration of this loop in source:services/keep-balance/collection.go:
// Use about 1 goroutine per 2 CPUs. Based on experiments with // a 2-core host, using more concurrent database // calls/transactions makes this process slower, not faster. for i := 0; i < runtime.NumCPU()+1/2; i++ {
...which, now that I'm paying attention, means NumCPU+0, but was surely meant to be
for i := 0; i < (runtime.NumCPU()+1)/2; i++ {
...which means that we have only a single goroutine calling GetConfirmedReplication, and therefore no opportunity for a concurrent map read/write, on a single-core machine (or a 2-core machine, after fixing the parens).
Updated by Tom Clegg over 2 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados-private:commit:arvados|69158dc93fdfec57279ba227f872f3a7c01c4e78.
Updated by Tom Clegg over 2 years ago
- File keep-balance keep-balance added
19414-backport-2.3.2 @ c3268207eeb57686c29a0f846bba1a7e6f135622