Bug #13199
postgres deadlock detected during node ping (Closed)
Description
We regularly have errors such as this in our postgres logs:
2018-03-08 07:14:02 UTC [51147-51] arvados@arvados_api_production ERROR: deadlock detected
2018-03-08 07:14:02 UTC [51147-52] arvados@arvados_api_production DETAIL: Process 51147 waits for ExclusiveLock on tuple (16,2) of relation 16544 of database 16385; blocked by process 38926.
Process 38926 waits for ShareLock on transaction 16631317; blocked by process 50888.
Process 50888 waits for ShareLock on transaction 16631449; blocked by process 38998.
Process 38998 waits for ExclusiveLock on tuple (16,2) of relation 16544 of database 16385; blocked by process 51147.
Process 51147: UPDATE "nodes" SET "ip_address" = $1, "updated_at" = $2, "modified_at" = $3 WHERE "nodes"."id" = $4
Process 38926: UPDATE "nodes" SET "ip_address" = $1, "updated_at" = $2, "modified_at" = $3 WHERE "nodes"."id" = $4
Process 50888: UPDATE "nodes" SET "ip_address" = $1, "updated_at" = $2, "modified_at" = $3 WHERE "nodes"."id" = $4
Process 38998: UPDATE "nodes" SET "ip_address" = $1, "updated_at" = $2, "modified_at" = $3 WHERE "nodes"."id" = $4
2018-03-08 07:14:02 UTC [51147-53] arvados@arvados_api_production HINT: See server log for query details.
2018-03-08 07:14:02 UTC [51147-54] arvados@arvados_api_production STATEMENT: UPDATE "nodes" SET "ip_address" = $1, "updated_at" = $2, "modified_at" = $3 WHERE "nodes"."id" = $4
2018-03-08 07:14:02 UTC [50197-103] arvados@arvados_api_production ERROR: deadlock detected
2018-03-08 07:14:02 UTC [50197-104] arvados@arvados_api_production DETAIL: Process 50197 waits for ExclusiveLock on tuple (16,2) of relation 16544 of database 16385; blocked by process 38926.
Process 38926 waits for ShareLock on transaction 16631317; blocked by process 50888.
Process 50888 waits for ShareLock on transaction 16631449; blocked by process 38998.
Process 38998 waits for ExclusiveLock on tuple (16,2) of relation 16544 of database 16385; blocked by process 50197.
Process 50197: UPDATE "nodes" SET "ip_address" = $1, "updated_at" = $2, "modified_at" = $3 WHERE "nodes"."id" = $4
Process 38926: UPDATE "nodes" SET "ip_address" = $1, "updated_at" = $2, "modified_at" = $3 WHERE "nodes"."id" = $4
Process 50888: UPDATE "nodes" SET "ip_address" = $1, "updated_at" = $2, "modified_at" = $3 WHERE "nodes"."id" = $4
Process 38998: UPDATE "nodes" SET "ip_address" = $1, "updated_at" = $2, "modified_at" = $3 WHERE "nodes"."id" = $4
2018-03-08 07:14:02 UTC [50197-105] arvados@arvados_api_production HINT: See server log for query details.
2018-03-08 07:14:02 UTC [50197-106] arvados@arvados_api_production STATEMENT: UPDATE "nodes" SET "ip_address" = $1, "updated_at" = $2, "modified_at" = $3 WHERE "nodes"."id" = $4
These appear to align with the arvados-node-ping cron jobs:
Mar 8 07:14:02 arvados-compute-node-ncucu-005 arvados-node-ping: Errors: [
Mar 8 07:14:02 arvados-compute-node-ncucu-005 arvados-node-ping: "#<PG::TRDeadlockDetected: ERROR: deadlock detected\nDETAIL: Process 38956 waits for ExclusiveLock on tuple (16,2) of relation 16544 of database 16385; blocked by process 38926.\nProcess 38926 waits for ShareLock on transaction 16631317; blocked by process 50888.\nProcess 50888 waits for ShareLock on transaction 16631449; blocked by process 38998.\nProcess 38998 waits for ExclusiveLock on tuple (16,2) of relation 16544 of database 16385; blocked by process 38956.\nHINT: See server log for query details.\n>"
Mar 8 07:14:02 arvados-compute-node-ncucu-005 arvados-node-ping: ]
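For context, the cycle above is four concurrent ping requests all updating the same nodes row. A minimal sketch of one possible server-side mitigation, assuming the handler is Rails/ActiveRecord and that pings for the same node can simply be serialized, would be to take the row lock explicitly before writing. The class and method names below are illustrative, not the actual Arvados code:

# Hypothetical sketch only, not the actual Arvados ping handler.
# Taking the row lock up front makes competing pings for the same node
# queue on a single SELECT ... FOR UPDATE instead of forming a lock cycle.
class Node < ActiveRecord::Base
  def ping(new_ip)
    with_lock do                    # BEGIN; reload row with SELECT ... FOR UPDATE
      self.ip_address = new_ip
      self.modified_at = Time.now
      save!                         # UPDATE runs while we hold the row lock
    end                             # COMMIT releases the lock
  end
end

An alternative under the same assumptions would be to catch PG::TRDeadlockDetected and retry the ping, since the update is idempotent.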
Updated by Tom Morris almost 8 years ago
What is the frequency of these errors? Is it roughly once a week, 10% of the pings, or ...?
Updated by Joshua Randall almost 8 years ago
We have just experienced an API server outage that appears to be related to this bug: all of our nginx workers got stuck in a node-ping deadlock and sat there for 480s without handling any other requests (we are not sure where the 480s timeout came from, but they seem to have given up at that point). Passenger then "handled" a few requests from the queue, although they were all so old that doing so was pointless, and since our node pings happen every two minutes, by that time every compute node had at least three more node pings in the request queue, so the workers soon came across them and deadlocked again.
I'm going to decrease the frequency of node pings to 10m in an attempt to work around this issue. Perhaps the ping interval should not be set lower than the request timeout, to avoid triggering locking bugs. I could also add locking on the client side to avoid running multiple node pings at once; a sketch of that follows.
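As a sketch of that client-side lock (the lock-file path and wrapper are illustrative, not shipped code, and it assumes the cron job can shell out to the ping command), flock with LOCK_NB would make an overlapping run exit instead of queueing another deadlock-prone request:

#!/usr/bin/env ruby
# Hypothetical wrapper around the cron-driven node ping, not shipped code.
# A run that finds the lock already held exits immediately, so a slow or
# deadlocked ping cannot pile up concurrent requests from the same node.
LOCKFILE = "/var/lock/arvados-node-ping.lock"   # assumed path

File.open(LOCKFILE, File::RDWR | File::CREAT, 0644) do |f|
  unless f.flock(File::LOCK_EX | File::LOCK_NB) # non-blocking: never queue
    warn "arvados-node-ping already running; skipping this interval"
    exit 0
  end
  exit 1 unless system("arvados-node-ping")     # assumed command invocation
end                                             # lock released when the file closes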