Bug #13513
Updated by Ward Vandewege over 6 years ago
After the merge of 9918-index-timeouts, I'm observing that keep-balance hangs (?) on ComputeChangeSets:

<pre>
May 22 14:06:37 dhhck.arvadosapi.com keep-balance[11166]: 2018/05/22 14:06:37 dhhck-bi6l4-pkwwh8mhe0qgmu6 (keep2.dhhck.arvadosapi.com:25107, s3): done
May 22 14:08:40 dhhck.arvadosapi.com keep-balance[11166]: 2018/05/22 14:08:40 zzzzz-ivpuk-v2udip63fnkdyxf (s3:///dhhck-keep-0) on dhhck-bi6l4-oynapdlh4hzydcf (keep0.dhhck.arvadosapi.com:25107, s3): add 1043919 replicas to map
May 22 14:08:40 dhhck.arvadosapi.com keep-balance[11166]: 2018/05/22 14:08:40 zzzzz-ivpuk-v2udip63fnkdyxf (s3:///dhhck-keep-0) on dhhck-bi6l4-oynapdlh4hzydcf (keep0.dhhck.arvadosapi.com:25107, s3): done
May 22 14:08:40 dhhck.arvadosapi.com keep-balance[11166]: 2018/05/22 14:08:40 dhhck-bi6l4-oynapdlh4hzydcf (keep0.dhhck.arvadosapi.com:25107, s3): done
May 22 14:08:40 dhhck.arvadosapi.com keep-balance[11166]: 2018/05/22 14:08:40 GetCurrentState: took 10m6.992266703s
May 22 14:08:40 dhhck.arvadosapi.com keep-balance[11166]: 2018/05/22 14:08:40 ComputeChangeSets: start
</pre>

I stopped it after ~42 minutes.

<pre>
May 22 14:50:02 dhhck.arvadosapi.com systemd[1]: Stopping Arvados Keep Balance...
May 22 14:50:02 dhhck.arvadosapi.com systemd[1]: Stopped Arvados Keep Balance.
</pre>

Command line:

<pre>
/usr/bin/keep-balance -commit-trash
</pre>

I also tried with -commit-pull enabled, and the behavior was unchanged.
Config file:

<pre>
# cat /etc/arvados/keep-balance/keep-balance.yml
###################################################################
# THIS FILE IS MANAGED BY PUPPET -- CHANGES WILL BE OVERWRITTEN #
###################################################################
Client:
  APIHost: dhhck.arvadosapi.com:443
  AuthToken: STRIPPED
  Insecure: false
KeepServiceTypes:
  - s3
RunPeriod: 14400s
CollectionBatchSize: 100000
CollectionBuffers: 1000
</pre>

Bisecting:

|0.1.20180322172032.41e612b59-1|(with extra patch to increase timeout to 20 minutes)|OK|
|1.1.4.20180403215323-1|(with extra patch to increase timeout to 20 minutes)|OK|
|1.1.4.20180420195921-1|(with extra patch to increase timeout to 20 minutes)|OK|
|1.1.4.20180426154228-1|(with extra patch to increase timeout to 20 minutes)|OK|
|1.1.4.20180426193406-1|(with extra patch to increase timeout to 20 minutes)|HANGS|
|1.1.4.20180510200716-1|(with extra patch to increase timeout to 20 minutes)|HANGS|
|1.1.4.20180518195015-1||HANGS|

So, it looks like the problem was introduced between version 1.1.4.20180426154228-1 (commit:fcfbbddf572db32008fcdc7d0750a13b8d6f3b1c) and version 1.1.4.20180426193406-1 (commit:932e3d6e9a899cc662ea3934b79057d39cd88fed).
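
For anyone repeating the bisection with more builds, the way the regression window is read off the results can be sketched in a few lines of Python. This is only an illustration: `regression_window` is a hypothetical helper, not part of Arvados or keep-balance, and it assumes the builds are listed oldest first and that every OK build precedes every hanging build.

```python
# Bisect results copied from this report, oldest build first.
# True means the run completed ("OK"), False means it hung ("HANGS").
RESULTS = [
    ("0.1.20180322172032.41e612b59-1", True),
    ("1.1.4.20180403215323-1", True),
    ("1.1.4.20180420195921-1", True),
    ("1.1.4.20180426154228-1", True),
    ("1.1.4.20180426193406-1", False),
    ("1.1.4.20180510200716-1", False),
    ("1.1.4.20180518195015-1", False),
]


def regression_window(results):
    """Return (last_good, first_bad) for an ordered list of (version, ok) pairs.

    Hypothetical helper for illustration only. Raises ValueError when the
    results do not form a clean good-then-bad sequence.
    """
    # Index of the first build that hung, or None if every build was OK.
    first_bad_index = next(
        (i for i, (_, ok) in enumerate(results) if not ok), None
    )
    if first_bad_index is None or first_bad_index == 0:
        raise ValueError("need at least one good build followed by a bad build")
    # A good build after the first bad one means the results are not monotonic.
    if any(ok for _, ok in results[first_bad_index:]):
        raise ValueError("results are not monotonic; bisection does not apply")
    return results[first_bad_index - 1][0], results[first_bad_index][0]


last_good, first_bad = regression_window(RESULTS)
print(f"last good: {last_good}")
print(f"first bad: {first_bad}")
```

The monotonicity check matters here because a hang that only shows up intermittently would produce an OK result after a HANGS result, and the window reported above would then be unreliable.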