Bug #18346: Login federation: request storm overwhelming login cluster rails api server - Arvados

For recording findings related to https://support.curii.com/rt/Ticket/Display.html?id=254 

 A customer has seen this behavior in 2 different scenarios: 

 a) when a user used an old token that was issued by a local cluster prior to the migration to a login federation. Both the local cluster and the login cluster were running Arvados 2.2.2. 
 b) when a big workflow is run on a 2.3.0 cluster with the login cluster on 2.2.2 

 The b) case appears to be a 2.3 regression: the workflow that triggered the outage is a re-run of one that did not cause problems on Arvados 2.2.x (or possibly an older version; that is not clear). 

 The requests that end up at the login cluster api server have a specific request parameter pattern (include_trash=true&select=[uuid]). They seem to be user and collection requests. 

 The collection requests seem to be for log collections (presumably the ones the workflow steps are writing to). 
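
 To help pick these requests out of an API server request log, here is a minimal Python sketch that tests whether a request URL carries the parameter pattern described above. The function name and the example paths are hypothetical, and the sketch assumes the `select` value may appear either literally as `[uuid]` (as written in this ticket) or as a URL-encoded JSON array; it has not been run against the customer's actual logs.

```python
import json
from urllib.parse import urlsplit, parse_qs


def matches_storm_pattern(url):
    """Return True if the URL has include_trash=true and selects only uuid.

    Hypothetical helper: the parameter names come from this ticket,
    everything else is an illustrative assumption.
    """
    query = parse_qs(urlsplit(url).query)

    # include_trash must be present and set to "true".
    if query.get("include_trash", [""])[0].lower() != "true":
        return False

    raw_select = query.get("select", [""])[0]
    try:
        # Case 1: select is a JSON array, e.g. ["uuid"].
        fields = json.loads(raw_select)
    except ValueError:
        # Case 2: select is written without quotes, e.g. [uuid].
        fields = [f.strip() for f in raw_select.strip("[]").split(",")]

    return fields == ["uuid"]
```

 Usage, with placeholder paths: `matches_storm_pattern("/example/users/x?include_trash=true&select=[uuid]")` returns `True`, while the same path with `select=[uuid,name]` or without `include_trash=true` returns `False`.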

 The requests all get a 401 response from the login cluster api server, but this does not appear to impede the running of the big workflow on the local cluster. 

 The customer implemented a workaround: greatly increasing the number of passenger workers on the login cluster api server made it able to handle many more concurrent requests (and return a 401 for them), which avoids the overload death spiral when clients retry.
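
 The reasoning behind the workaround can be illustrated with a toy model, sketched below in Python. It is not a measurement of the customer's system: the function name, the tick-based timing, and all numbers are hypothetical. It assumes that every request the server cannot answer within one tick is retried on the next tick, on top of the new requests arriving then.

```python
def simulate_backlog(arrivals_per_tick, capacity_per_tick, ticks):
    """Return the number of pending requests (new plus retried) at each tick.

    Toy model with hypothetical parameters: requests that are not served
    in a tick are assumed to time out and be retried on the next tick.
    """
    pending = 0
    history = []
    for _ in range(ticks):
        pending += arrivals_per_tick               # new requests arrive
        history.append(pending)
        served = min(pending, capacity_per_tick)   # server answers what it can
        pending -= served                          # the rest are retried next tick
    return history


# Capacity below the arrival rate: the backlog grows every tick.
print(simulate_backlog(arrivals_per_tick=100, capacity_per_tick=80, ticks=5))
# Capacity above the arrival rate: the load stays flat.
print(simulate_backlog(arrivals_per_tick=100, capacity_per_tick=120, ticks=5))
```

 In this model the first call yields `[100, 120, 140, 160, 180]` and the second yields `[100, 100, 100, 100, 100]`. Once the server can answer every request promptly, even if only with a 401, there is nothing left over to be retried, which matches the effect the customer saw after adding workers.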
