Bug #13924
closedTimeout when getting collection by portable data hash
100%
Description
All of our workflows are failing because attempts to fetch certain collections by portable data hash take longer than the (now raised to 10m) timeout.
This appears to be entirely because of the database query time:
{"method":"GET","path":"/arvados/v1/collections/3184b10f225a48f1c79cbb29dc2ab9c7%2B333031","format":"json","controller":"Arvados::V1::CollectionsController","action":"show","status":422,"duration":600051.83,"view":2.99,"db":600028.24,"request_id":"req-m410ox3fssr8bj84uayd","client_ipaddr":"10.101.0.160","client_auth":"eglyx-gj3su-jiyd09zx73wfs1w","params":{"alt":"json"},"@timestamp":"2018-07-26T15:53:10.953384161Z","@version":"1","message":"[422] GET /arvados/v1/collections/3184b10f225a48f1c79cbb29dc2ab9c7%2B333031 (Arvados::V1::CollectionsController#show)"}
The db transaction is (from debug logging):
User Load (16.9ms) SELECT "users".* FROM "users" WHERE (redirect_to_user_uuid is null) AND "users"."id" IN (1) ContainerRequest Load (34.4ms) SELECT (octet_length(container_requests.mounts)) as read_length FROM "container_requests" WHERE (NOT EXISTS(SELECT 1 FROM materialized_permission_view WHERE trashed = 1 AND (container_requests.owner_uuid = target_uuid)) ) AND ((container_requests.uuid in ('eglyx-xvhdp-qxf113lsdqne8s3','eglyx-xvhdp-u63ooxa2kiho9gq','eglyx-xvhdp-a48hnrbqm5w2xtv','eglyx-xvhdp-sge8hvtds9gd1d9','eglyx-xvhdp-16q264tt9juw9s6','eglyx-xvhdp-cotew5i0g5u1uns','eglyx-xvhdp-k2x7jcifa95apkv','eglyx-xvhdp-mcm4c0hrtg06de7','eglyx-xvhdp-hrrk5wtlsxqauil','eglyx-xvhdp-n56l02h8ae95uhx','eglyx-xvhdp-e6qezgekf0qcn5v'))) ORDER BY container_requests.modified_at desc, container_requests.uuid LIMIT 100 OFFSET 0 ApiClient Load (16.1ms) SELECT "api_clients".* FROM "api_clients" WHERE "api_clients"."id" IN (0) Container Load (24.0ms) SELECT "containers"."uuid", "containers"."priority" FROM "containers" WHERE (auth_uuid='eglyx-gj3su-kdkqdibyw00vmpe') ORDER BY "containers"."id" ASC LIMIT 1 ApiClientAuthorization Load (22.4ms) SELECT "api_client_authorizations".* FROM "api_client_authorizations" WHERE (api_token='lyoqbyam7w6dioc8a882k5ew2i2y8r0amdtzcbvnhk0jjud6a6' and (expires_at is null or expires_at > CURRENT_TIMESTAMP)) ORDER BY "api_client_authorizations"."id" ASC LIMIT 1 ContainerRequest Load (41.5ms) SELECT "container_requests".* FROM "container_requests" WHERE (NOT EXISTS(SELECT 1 FROM materialized_permission_view WHERE trashed = 1 AND (container_requests.owner_uuid = target_uuid)) ) AND ((container_requests.uuid in ('eglyx-xvhdp-qxf113lsdqne8s3','eglyx-xvhdp-u63ooxa2kiho9gq','eglyx-xvhdp-a48hnrbqm5w2xtv','eglyx-xvhdp-sge8hvtds9gd1d9','eglyx-xvhdp-16q264tt9juw9s6','eglyx-xvhdp-cotew5i0g5u1uns','eglyx-xvhdp-k2x7jcifa95apkv','eglyx-xvhdp-mcm4c0hrtg06de7','eglyx-xvhdp-hrrk5wtlsxqauil','eglyx-xvhdp-n56l02h8ae95uhx','eglyx-xvhdp-e6qezgekf0qcn5v'))) ORDER BY container_requests.modified_at desc, container_requests.uuid LIMIT 100 OFFSET 0 User Load (14.7ms) SELECT "users".* FROM "users" WHERE (redirect_to_user_uuid is null) AND "users"."id" IN (1) Collection Load (600041.0ms) SELECT "collections".* FROM "collections" WHERE (NOT EXISTS(SELECT 1 FROM materialized_permission_view WHERE trashed = 1 AND (collections.owner_uuid = target_uuid)) AND collections.is_trashed = false) AND "collections"."portable_data_hash" = $1 ORDER BY trash_at desc LIMIT 1 [["portable_data_hash", "12d154802d181fcd1a5cae44ec8b3f82+19533"]]
As you can see, the failure is due to the "Collection Load" query taking longer than the 10m timeout.
I tried running this manually in the database with an explain analyze:
arvados_api_production=# explain analyze SELECT "collections".* FROM "collections" WHERE (NOT EXISTS(SELECT 1 FROM materialized_permission_view WHERE trashed = 1 AND (collections.owner _uuid = target_uuid)) AND collections.is_trashed = false) AND "collections"."portable_data_hash" = '12d154802d181fcd1a5cae44ec8b3f82+19533' ORDER BY trash_at desc LIMIT 1; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=0.43..515.68 rows=1 width=1282) (actual time=1187700.538..1187700.538 rows=1 loops=1) -> Nested Loop Anti Join (cost=0.43..1686403.08 rows=3273 width=1282) (actual time=1187700.536..1187700.536 rows=1 loops=1) Join Filter: ((collections.owner_uuid)::text = (materialized_permission_view.target_uuid)::text) -> Index Scan Backward using index_collections_on_trash_at on collections (cost=0.43..1686347.05 rows=3637 width=1282) (actual time=1187700.506..1187700.506 rows=1 loops=1) Filter: ((NOT is_trashed) AND ((portable_data_hash)::text = '12d154802d181fcd1a5cae44ec8b3f82+19533'::text)) Rows Removed by Filter: 4117580 -> Materialize (cost=0.00..1.48 rows=1 width=82) (actual time=0.024..0.024 rows=0 loops=1) -> Seq Scan on materialized_permission_view (cost=0.00..1.48 rows=1 width=82) (actual time=0.020..0.020 rows=0 loops=1) Filter: (trashed = 1) Rows Removed by Filter: 38 Planning time: 0.434 ms Execution time: 1187700.604 ms (12 rows)
I then tried without the subselect, but that wasn't the problem:
arvados_api_production=# explain analyze SELECT "collections".* FROM "collections" WHERE (collections.is_trashed = false) AND "collections"."portable_data_hash" = '12d154802d181fcd1a5cae44ec8b3f82+19533' ORDER BY trash_at desc LIMIT 1; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=0.43..463.76 rows=1 width=1282) (actual time=1099196.070..1099196.070 rows=1 loops=1) -> Index Scan Backward using index_collections_on_trash_at on collections (cost=0.43..1685103.05 rows=3637 width=1282) (actual time=1099196.068..1099196.068 rows=1 loops=1) Filter: ((NOT is_trashed) AND ((portable_data_hash)::text = '12d154802d181fcd1a5cae44ec8b3f82+19533'::text)) Rows Removed by Filter: 4073842 Planning time: 0.275 ms Execution time: 1099196.114 ms (6 rows)
Without the 'order by' clause the same query is ~2000000x faster (0.567ms vs 1099s):
arvados_api_production=# explain analyze SELECT "collections".* FROM "collections" WHERE (collections.is_trashed = false) AND "collections"."portable_data_hash" = '12d154802d181fcd1a5cae44ec8b3f82+19533' LIMIT 1; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=0.56..3.05 rows=1 width=1282) (actual time=0.528..0.529 rows=1 loops=1) -> Index Scan using index_collections_on_portable_data_hash on collections (cost=0.56..9047.93 rows=3637 width=1282) (actual time=0.527..0.527 rows=1 loops=1) Index Cond: ((portable_data_hash)::text = '12d154802d181fcd1a5cae44ec8b3f82+19533'::text) Filter: (NOT is_trashed) Planning time: 0.283 ms Execution time: 0.567 ms (6 rows)
Alternatively, if order is important, keeping the 'order by trash_at desc' but adding an extra sort column (uuid) improves on the original query by 250x:
arvados_api_production=# explain analyze SELECT "collections".* FROM "collections" WHERE (collections.is_trashed = false) AND "collections"."portable_data_hash" = '12d154802d181fcd1a5cae44ec8b3f82+19533' ORDER BY trash_at desc, uuid LIMIT 1; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=8759.36..8759.36 rows=1 width=1282) (actual time=4281.976..4281.977 rows=1 loops=1) -> Sort (cost=8759.36..8768.45 rows=3637 width=1282) (actual time=4281.975..4281.975 rows=1 loops=1) Sort Key: trash_at DESC, uuid Sort Method: top-N heapsort Memory: 26kB -> Bitmap Heap Scan on collections (cost=118.87..8741.17 rows=3637 width=1282) (actual time=23.846..4277.658 rows=6460 loops=1) Recheck Cond: ((portable_data_hash)::text = '12d154802d181fcd1a5cae44ec8b3f82+19533'::text) Filter: (NOT is_trashed) Heap Blocks: exact=6024 -> Bitmap Index Scan on index_collections_on_portable_data_hash (cost=0.00..117.97 rows=4454 width=0) (actual time=22.723..22.723 rows=6460 loops=1) Index Cond: ((portable_data_hash)::text = '12d154802d181fcd1a5cae44ec8b3f82+19533'::text) Planning time: 0.304 ms Execution time: 4282.079 ms (12 rows)
autovacuum is running, but I'll try doing an explicit `vacuum analyze collections` and see if that helps the original query.