Bug #13924

Timeout when getting collection by portable data hash

Added by Joshua Randall over 6 years ago. Updated over 6 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: API
Target version:
Start date: 08/20/2018
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -
Release:
Release relationship: Auto

Description

All of our workflows are failing because attempts to fetch certain collections by portable data hash take longer than the request timeout (which we have now raised to 10 minutes).

This appears to be entirely database query time (the request log shows db=600028ms of the 600051ms total):

{"method":"GET","path":"/arvados/v1/collections/3184b10f225a48f1c79cbb29dc2ab9c7%2B333031","format":"json","controller":"Arvados::V1::CollectionsController","action":"show","status":422,"duration":600051.83,"view":2.99,"db":600028.24,"request_id":"req-m410ox3fssr8bj84uayd","client_ipaddr":"10.101.0.160","client_auth":"eglyx-gj3su-jiyd09zx73wfs1w","params":{"alt":"json"},"@timestamp":"2018-07-26T15:53:10.953384161Z","@version":"1","message":"[422] GET /arvados/v1/collections/3184b10f225a48f1c79cbb29dc2ab9c7%2B333031 (Arvados::V1::CollectionsController#show)"}

The database queries for this request are (from debug logging):

 User Load (16.9ms)  SELECT "users".* FROM "users" WHERE (redirect_to_user_uuid is null) AND "users"."id" IN (1)
  ContainerRequest Load (34.4ms)  SELECT  (octet_length(container_requests.mounts)) as read_length FROM "container_requests" WHERE (NOT EXISTS(SELECT 1 FROM materialized_permission_view WHERE trashed = 1 AND (container_requests.owner_uuid = target_uuid)) ) AND ((container_requests.uuid in ('eglyx-xvhdp-qxf113lsdqne8s3','eglyx-xvhdp-u63ooxa2kiho9gq','eglyx-xvhdp-a48hnrbqm5w2xtv','eglyx-xvhdp-sge8hvtds9gd1d9','eglyx-xvhdp-16q264tt9juw9s6','eglyx-xvhdp-cotew5i0g5u1uns','eglyx-xvhdp-k2x7jcifa95apkv','eglyx-xvhdp-mcm4c0hrtg06de7','eglyx-xvhdp-hrrk5wtlsxqauil','eglyx-xvhdp-n56l02h8ae95uhx','eglyx-xvhdp-e6qezgekf0qcn5v')))  ORDER BY container_requests.modified_at desc, container_requests.uuid LIMIT 100 OFFSET 0
  ApiClient Load (16.1ms)  SELECT "api_clients".* FROM "api_clients" WHERE "api_clients"."id" IN (0)
  Container Load (24.0ms)  SELECT  "containers"."uuid", "containers"."priority" FROM "containers" WHERE (auth_uuid='eglyx-gj3su-kdkqdibyw00vmpe')  ORDER BY "containers"."id" ASC LIMIT 1
  ApiClientAuthorization Load (22.4ms)  SELECT  "api_client_authorizations".* FROM "api_client_authorizations" WHERE (api_token='lyoqbyam7w6dioc8a882k5ew2i2y8r0amdtzcbvnhk0jjud6a6' and (expires_at is null or expires_at > CURRENT_TIMESTAMP))  ORDER BY "api_client_authorizations"."id" ASC LIMIT 1
  ContainerRequest Load (41.5ms)  SELECT  "container_requests".* FROM "container_requests" WHERE (NOT EXISTS(SELECT 1 FROM materialized_permission_view WHERE trashed = 1 AND (container_requests.owner_uuid = target_uuid)) ) AND ((container_requests.uuid in ('eglyx-xvhdp-qxf113lsdqne8s3','eglyx-xvhdp-u63ooxa2kiho9gq','eglyx-xvhdp-a48hnrbqm5w2xtv','eglyx-xvhdp-sge8hvtds9gd1d9','eglyx-xvhdp-16q264tt9juw9s6','eglyx-xvhdp-cotew5i0g5u1uns','eglyx-xvhdp-k2x7jcifa95apkv','eglyx-xvhdp-mcm4c0hrtg06de7','eglyx-xvhdp-hrrk5wtlsxqauil','eglyx-xvhdp-n56l02h8ae95uhx','eglyx-xvhdp-e6qezgekf0qcn5v')))  ORDER BY container_requests.modified_at desc, container_requests.uuid LIMIT 100 OFFSET 0
  User Load (14.7ms)  SELECT "users".* FROM "users" WHERE (redirect_to_user_uuid is null) AND "users"."id" IN (1)
  Collection Load (600041.0ms)  SELECT  "collections".* FROM "collections" WHERE (NOT EXISTS(SELECT 1 FROM materialized_permission_view WHERE trashed = 1 AND (collections.owner_uuid = target_uuid)) AND collections.is_trashed = false) AND "collections"."portable_data_hash" = $1  ORDER BY trash_at desc LIMIT 1  [["portable_data_hash", "12d154802d181fcd1a5cae44ec8b3f82+19533"]]

As you can see, the failure is due to the "Collection Load" query taking longer than the 10m timeout.

I tried running this query manually in the database with EXPLAIN ANALYZE:

arvados_api_production=# explain analyze SELECT  "collections".* FROM "collections" WHERE (NOT EXISTS(SELECT 1 FROM materialized_permission_view WHERE trashed = 1 AND (collections.owner_uuid = target_uuid)) AND collections.is_trashed = false) AND "collections"."portable_data_hash" = '12d154802d181fcd1a5cae44ec8b3f82+19533'  ORDER BY trash_at desc LIMIT 1;
                                                                                       QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..515.68 rows=1 width=1282) (actual time=1187700.538..1187700.538 rows=1 loops=1)
   ->  Nested Loop Anti Join  (cost=0.43..1686403.08 rows=3273 width=1282) (actual time=1187700.536..1187700.536 rows=1 loops=1)
         Join Filter: ((collections.owner_uuid)::text = (materialized_permission_view.target_uuid)::text)
         ->  Index Scan Backward using index_collections_on_trash_at on collections  (cost=0.43..1686347.05 rows=3637 width=1282) (actual time=1187700.506..1187700.506 rows=1 loops=1)
               Filter: ((NOT is_trashed) AND ((portable_data_hash)::text = '12d154802d181fcd1a5cae44ec8b3f82+19533'::text))
               Rows Removed by Filter: 4117580
         ->  Materialize  (cost=0.00..1.48 rows=1 width=82) (actual time=0.024..0.024 rows=0 loops=1)
               ->  Seq Scan on materialized_permission_view  (cost=0.00..1.48 rows=1 width=82) (actual time=0.020..0.020 rows=0 loops=1)
                     Filter: (trashed = 1)
                     Rows Removed by Filter: 38
 Planning time: 0.434 ms
 Execution time: 1187700.604 ms
(12 rows)
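
The plan shows the planner walking index_collections_on_trash_at backward (to satisfy the 'order by trash_at desc limit 1') and filtering each row for the portable_data_hash match, discarding ~4.1M rows before finding one. For reference, the indexes it has to choose from can be listed with a standard catalog query (nothing Arvados-specific here):

  -- List all indexes on the collections table:
  SELECT indexname, indexdef
    FROM pg_indexes
   WHERE tablename = 'collections';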

I then tried without the subselect, but that wasn't the problem:

arvados_api_production=# explain analyze SELECT  "collections".* FROM "collections" WHERE (collections.is_trashed = false) AND "collections"."portable_data_hash" = '12d154802d181fcd1a5cae44ec8b3f82+19533'  ORDER BY trash_at desc LIMIT 1;
                                                                                    QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..463.76 rows=1 width=1282) (actual time=1099196.070..1099196.070 rows=1 loops=1)
   ->  Index Scan Backward using index_collections_on_trash_at on collections  (cost=0.43..1685103.05 rows=3637 width=1282) (actual time=1099196.068..1099196.068 rows=1 loops=1)
         Filter: ((NOT is_trashed) AND ((portable_data_hash)::text = '12d154802d181fcd1a5cae44ec8b3f82+19533'::text))
         Rows Removed by Filter: 4073842
 Planning time: 0.275 ms
 Execution time: 1099196.114 ms
(6 rows)

Without the 'order by' clause the same query is ~2000000x faster (0.567ms vs 1099s):

arvados_api_production=# explain analyze SELECT  "collections".* FROM "collections" WHERE (collections.is_trashed = false) AND "collections"."portable_data_hash" = '12d154802d181fcd1a5cae44ec8b3f82+19533' LIMIT 1;
                                                                             QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.56..3.05 rows=1 width=1282) (actual time=0.528..0.529 rows=1 loops=1)
   ->  Index Scan using index_collections_on_portable_data_hash on collections  (cost=0.56..9047.93 rows=3637 width=1282) (actual time=0.527..0.527 rows=1 loops=1)
         Index Cond: ((portable_data_hash)::text = '12d154802d181fcd1a5cae44ec8b3f82+19533'::text)
         Filter: (NOT is_trashed)
 Planning time: 0.283 ms
 Execution time: 0.567 ms
(6 rows)

Alternatively, if the ordering is important, keeping the 'order by trash_at desc' but adding an extra sort column (uuid) improves on the original query by ~250x. With the tiebreaker column, no single index matches the sort order, so the planner fetches the ~6460 matching rows via the portable_data_hash index and does a top-N sort on them, instead of walking the trash_at index backward:

arvados_api_production=# explain analyze SELECT  "collections".* FROM "collections" WHERE (collections.is_trashed = false) AND "collections"."portable_data_hash" = '12d154802d181fcd1a5cae44ec8b3f82+19533'  ORDER BY trash_at desc, uuid LIMIT 1;
                                                                              QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=8759.36..8759.36 rows=1 width=1282) (actual time=4281.976..4281.977 rows=1 loops=1)
   ->  Sort  (cost=8759.36..8768.45 rows=3637 width=1282) (actual time=4281.975..4281.975 rows=1 loops=1)
         Sort Key: trash_at DESC, uuid
         Sort Method: top-N heapsort  Memory: 26kB
         ->  Bitmap Heap Scan on collections  (cost=118.87..8741.17 rows=3637 width=1282) (actual time=23.846..4277.658 rows=6460 loops=1)
               Recheck Cond: ((portable_data_hash)::text = '12d154802d181fcd1a5cae44ec8b3f82+19533'::text)
               Filter: (NOT is_trashed)
               Heap Blocks: exact=6024
               ->  Bitmap Index Scan on index_collections_on_portable_data_hash  (cost=0.00..117.97 rows=4454 width=0) (actual time=22.723..22.723 rows=6460 loops=1)
                     Index Cond: ((portable_data_hash)::text = '12d154802d181fcd1a5cae44ec8b3f82+19533'::text)
 Planning time: 0.304 ms
 Execution time: 4282.079 ms
(12 rows)
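
If the ordering really does need to stay, a composite index covering both the filter column and the sort column might let the planner read matching rows already in trash_at order. This is just a sketch of the idea, not something I've tried here, and the index name is made up:

  -- Hypothetical index: finds rows by portable_data_hash and keeps them
  -- ordered by trash_at, so 'ORDER BY trash_at desc LIMIT 1' could stop
  -- after the first matching row instead of scanning the whole table.
  CREATE INDEX CONCURRENTLY index_collections_on_pdh_trash_at
      ON collections (portable_data_hash, trash_at DESC);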

autovacuum is running, but I'll try doing an explicit `vacuum analyze collections` and see if that helps the original query.
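
Roughly what I plan to run, plus a check of when autovacuum last touched the table (pg_stat_user_tables is standard PostgreSQL):

  -- When was the table last vacuumed/analyzed, and how many dead tuples remain?
  SELECT last_vacuum, last_autovacuum, last_analyze, last_autoanalyze, n_dead_tup
    FROM pg_stat_user_tables
   WHERE relname = 'collections';

  -- Then the explicit vacuum/analyze:
  VACUUM ANALYZE collections;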


Subtasks 1 (0 open, 1 closed)

Task #14048: Review 13924-collections-index (Resolved, Peter Amstutz, 08/20/2018)

Related issues 1 (1 open, 0 closed)

Related to Arvados - Story #14004: [Controller] Keep database in tune (New)
