Story #13697
closed
Prevent the API server and database from continuing to serve requests to clients after timeout
Added by Joshua Randall over 6 years ago.
Updated about 3 years ago.
Estimated time:
(Total: 0.00 h)
Release relationship:
Auto
Description
A typical nginx configuration for the API server is to time out requests after 5m (300s). However, passenger/rails currently do not notice that the client has given up on the request and continue to process it. On some occasions (when things go wrong in other ways) we have had requests running for hours, holding locks that prevent other parts of the system from working, and in the worst cases, effecting a DoS on the entire API server by consuming all available passenger workers in the process.
There is no point in processing any part of a request for longer than the request timeout. An easy win to prevent the above scenario would be to set a postgres statement timeout to the same length as the nginx gateway timeout.
Our system has been running with a 300s statement_timeout for a few weeks without issue: https://github.com/wtsi-hgi/arvados/commit/d9728e17148db53caf1f16fce032448c3d5c1432
Probably the value used for the timeout should come from config rather than being hard-coded, so that admins can configure it appropriately when non-standard nginx configuration is used.
- Target version set to To Be Groomed
- Target version changed from To Be Groomed to Arvados Future Sprints
- Target version deleted (
Arvados Future Sprints)
I am once again running into problems with API server resource starvation due to long-running requests causing normal clients with reasonable timeouts and retry behaviour essentially denial-of-service attacking the server (with, in this case, large collection creation requests).
IMHO it is not possible to run a stable Arvados API server under heavy load without applying a patch like the one that was implemented at Sanger to introduce statement timeouts in the database to mitigate this issue (see: https://github.com/wtsi-hgi/hgi-systems/blob/master/ansible/roles/arvados-master/files/hgi-integration-1.1.4.20180723133344-8.patch). Apologies that the OP for this issue did not include the patch!
Definitions of "heavy load" may vary depending on the capabilities of the API server and database hosts (and passenger settings such as the number of workers), but based on prior experience, the sorts of things that may take a long time (longer than the timeout) and lead to client-retry DoS attacks are:
- attempts to create very large collections (I think the thing that takes a long time is validating manifest block signatures)
- attempts to read very large collections with signed manifests (unsigned manifests are fast)
- attempts to work with container requests with large inline input documents
- certain queries against the log table, when it has grown very large
- Target version set to 2021-09-29 sprint
- Assigned To set to Tom Clegg
- Status changed from New to In Progress
13697-database-timeout @ fdd3dd304439a7912db1c6ef257cf512ec7b3b52 -- developer-run-tests: #2707
- Set the database statement and lock timeouts to RequestTimeout (similar to the contributed patch, but implemented in the "db connection checkout" hook instead of the ArvadosModel initializer which runs each time an object is created/loaded)
- In Go services that use lib/service (including controller and keepstore, but not yet keepproxy or keep-web), the request context is cancelled if the handler doesn't finish or hijack the connection before reaching RequestTimeout.
- When keep-web gets migrated to use lib/service, this won't cut off long/slow downloads (the webdav handler doesn't interrupt copy-bytes-to-ResponseWriter when context cancels) but we might (?) need some more special handling so large/slow uploads are allowed to take longer than RequestTimeout.
- Remove httpserver.HandlerWithContext in favor of the BaseContext feature already provided by http.Server.
Tom Clegg wrote:
13697-database-timeout @ fdd3dd304439a7912db1c6ef257cf512ec7b3b52 -- developer-run-tests: #2707
- Set the database statement and lock timeouts to RequestTimeout (similar to the contributed patch, but implemented in the "db connection checkout" hook instead of the ArvadosModel initializer which runs each time an object is created/loaded)
- In Go services that use lib/service (including controller and keepstore, but not yet keepproxy or keep-web), the request context is cancelled if the handler doesn't finish or hijack the connection before reaching RequestTimeout.
- When keep-web gets migrated to use lib/service, this won't cut off long/slow downloads (the webdav handler doesn't interrupt copy-bytes-to-ResponseWriter when context cancels) but we might (?) need some more special handling so large/slow uploads are allowed to take longer than RequestTimeout.
For this, would it be worth putting a note along these lines (with a reference to this ticket) in the webdav source for when we refactor it to use lib/service? That way it won't be overlooked.
- Remove httpserver.HandlerWithContext in favor of the BaseContext feature already provided by http.Server.
Nice!
I see a singularity related test failure at developer-run-tests-remainder: #2814 /consoleFull, is that unrelated?
LGTM otherwise!
Ward Vandewege wrote:
I see a singularity related test failure at developer-run-tests-remainder: #2814 /consoleFull, is that unrelated?
Hm, the same failure happened to an #8363 branch as well, and went away on the next attempt.
executor_test.go:166:
c.Check(code, Equals, expectCode)
... obtained int = 151
... expected int = 0
151 is SIGURG. This seems to be a singularity bug, reported in 3.5.2 and fixed April 2020. The Go runtime (starting at/before 1.14) sends itself SIGURG to preempt goroutines, and these are forwarded to child processes by default.
Perhaps the fix is to update arvados-server install
from singularity 3.5.2 to 3.7.4 (source:tools/compute-images/scripts/base.sh already uses 3.7.4).
Added #18184
Tom Clegg wrote:
Ward Vandewege wrote:
I see a singularity related test failure at developer-run-tests-remainder: #2814 /consoleFull, is that unrelated?
Hm, the same failure happened to an #8363 branch as well, and went away on the next attempt.
[...]
151 is SIGURG. This seems to be a singularity bug, reported in 3.5.2 and fixed April 2020. The Go runtime (starting at/before 1.14) sends itself SIGURG to preempt goroutines, and these are forwarded to child processes by default.
Perhaps the fix is to update arvados-server install
from singularity 3.5.2 to 3.7.4 (source:tools/compute-images/scripts/base.sh already uses 3.7.4).
Good sleuthing! Yes please upgrade to 3.7.4, that is the minimum version we are intending to support.
- Related to Bug #18184: Flaky test due to bug in old version of singularity added
- Status changed from In Progress to Resolved
Also available in: Atom
PDF