Bug #18346
closed
Login federation: request storm overwhelming login cluster rails api server
100%
Description
A customer has seen this behavior in two different scenarios:
a) when a user used an old token that had been issued by a local cluster before the migration to a login federation. Both the local cluster and the login cluster were on Arvados 2.2.2.
b) when a big workflow was run on a 2.3.0 cluster with the login cluster on 2.2.2.
Case b) appears to be a 2.3 regression: the workflow that triggered the outage was a re-run of one that did not cause problems on Arvados 2.2.x (or possibly an older version; that's not clear).
The requests that end up at the login cluster api server have a specific request parameter pattern (include_trash=true&select=[uuid]). They seem to be user and collection requests.
The collection requests seem to be for log collections (presumably the ones the workflow steps are writing to).
The requests all get a 401 response from the login cluster api server, but this does not appear to impede the running of the big workflow on the local cluster.
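To confirm the pattern described above from request logs, something like the following sketch could be used. It is not part of Arvados. It assumes the log has already been parsed into (URL, status) pairs and that request paths look like /arvados/v1/<resource>/<uuid>. The function names and the sample entries are hypothetical, and the select value is matched both as written in this ticket ([uuid]) and in JSON-encoded form (["uuid"]).

```python
from collections import Counter
from urllib.parse import parse_qs, urlsplit


def matches_storm_pattern(url: str) -> bool:
    """True if the query string has include_trash=true and select=[uuid]."""
    query = parse_qs(urlsplit(url).query)
    if "true" not in query.get("include_trash", []):
        return False
    # Accept both [uuid] (as written in the ticket) and ["uuid"] (JSON form).
    selects = {v.replace('"', "").replace(" ", "") for v in query.get("select", [])}
    return "[uuid]" in selects


def resource_kind(url: str) -> str:
    """Return the resource name, assuming paths like /arvados/v1/<resource>/..."""
    parts = [p for p in urlsplit(url).path.split("/") if p]
    if len(parts) >= 3 and parts[0] == "arvados" and parts[1] == "v1":
        return parts[2]
    return "other"


def summarize(entries):
    """Count 401 responses that match the pattern, grouped by resource kind.

    entries: iterable of (url, status) pairs, already extracted from a log.
    """
    counts = Counter()
    for url, status in entries:
        if status == 401 and matches_storm_pattern(url):
            counts[resource_kind(url)] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample entries, for illustration only.
    sample = [
        ("/arvados/v1/users/zzzzz-tpzed-000000000000000"
         "?include_trash=true&select=%5B%22uuid%22%5D", 401),
        ("/arvados/v1/collections/zzzzz-4zz18-000000000000000"
         "?include_trash=true&select=[uuid]", 401),
        ("/arvados/v1/users/current", 401),
    ]
    print(summarize(sample))
```

Grouping by resource kind would show whether the 401 responses really are limited to user and collection requests, as suspected above.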
The customer implemented a workaround: greatly increasing the number of passenger workers on the login cluster api server made it able to handle many more concurrent requests (and return a 401 for them), which avoids the overload death spiral when clients retry.
Updated by Ward Vandewege about 3 years ago
- Subject changed from Request storm overwhelming federation to Request storm overwhelming login federation
Updated by Ward Vandewege about 3 years ago
- Subject changed from Request storm overwhelming login federation to Login federation: request storm overwhelming login cluster rails api server
Updated by Peter Amstutz about 3 years ago
- Target version changed from 2021-11-10 sprint to 2021-11-24 sprint
Updated by Tom Clegg almost 3 years ago
- Related to Bug #18887: [federation] wb1 fiddlesticks in login federation added