Bug #8289
closed[API] Avoid making queries unnecessarily slow by adding superfluous "order by" columns
100%
Description
As an example, when the Python SDK's websocket fallback PollClient starts up, it does an API request with order="id asc". API server translates that to a PostgreSQL query like this:
arvados_production=# SELECT "logs".* FROM "logs" WHERE ((logs.event_type in ('create','update','delete')) AND (logs.id > '3464274')) ORDER BY logs.id desc, logs.modified_at desc, logs.uuid LIMIT 1 OFFSET 0; arvados_production=# explain analyze SELECT "logs".* FROM "logs" WHERE ((logs.event_type in ('create','update','delete')) AND (logs.id > '3464274')) ORDER BY logs.id desc, logs.modified_at desc, logs.uuid LIMIT 1 OFFSET 0; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=822081.89..822081.89 rows=1 width=1290) (actual time=4007.367..4007.368 rows=1 loops=1) -> Sort (cost=822081.89..823218.65 rows=454703 width=1290) (actual time=4007.364..4007.364 rows=1 loops=1) Sort Key: id, modified_at, uuid Sort Method: top-N heapsort Memory: 27kB -> Bitmap Heap Scan on logs (cost=69925.24..819808.37 rows=454703 width=1290) (actual time=531.924..3086.689 rows=869534 loops=1) Recheck Cond: ((id > 3464274) AND ((event_type)::text = ANY ('{create,update,delete}'::text[]))) -> BitmapAnd (cost=69925.24..69925.24 rows=454703 width=0) (actual time=530.935..530.935 rows=0 loops=1) -> Bitmap Index Scan on logs_pkey (cost=0.00..29223.51 rows=1433428 width=0) (actual time=249.940..249.940 rows=1954308 loops=1) Index Cond: (id > 3464274) -> Bitmap Index Scan on index_logs_on_event_type (cost=0.00..40474.13 rows=1341921 width=0) (actual time=270.632..270.632 rows=1735755 loops=1) Index Cond: ((event_type)::text = ANY ('{create,update,delete}'::text[])) Total runtime: 4007.417 ms
"id" is a primary key, so the "logs.modified_at desc, logs.uuid" part of the "order by" clause can't possibly make a difference to the results. If we remove them, the query looks like this:
arvados_production=# explain analyze SELECT "logs".* FROM "logs" WHERE ((logs.event_type in ('create','update','delete')) AND (logs.id > '3464274')) ORDER BY logs.id desc LIMIT 1 OFFSET 0; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=0.00..8.67 rows=1 width=1290) (actual time=0.049..0.050 rows=1 loops=1) -> Index Scan Backward using logs_pkey on logs (cost=0.00..3943415.80 rows=454703 width=1290) (actual time=0.046..0.046 rows=1 loops=1) Index Cond: (id > 3464274) Filter: ((event_type)::text = ANY ('{create,update,delete}'::text[])) Total runtime: 0.078 ms
Updated by Tom Clegg almost 9 years ago
- Description updated (diff)
- Category set to API
Updated by Brett Smith almost 9 years ago
- Target version set to Arvados Future Sprints
Updated by Brett Smith almost 9 years ago
- Target version changed from Arvados Future Sprints to 2016-02-17 Sprint
Updated by Radhika Chippada almost 9 years ago
Review comments:
I experimented with a few more combinations of order in the query_test and I am not sure if we need more tweaking here.
- If I gave order: ['id asc', 'modified_at asc’], I am ending up with 'logs.id asc, logs.modified_at asc’. Do we want to get rid off the modified_at in this case to ensure clients do not send meaningless yet costly queries?
- If I gave order: ['modified_at asc'], I am getting ‘logs.modified_at asc, logs.modified_at desc, logs.uuid’
- If I used order: ['event_type asc', 'modified_at asc’], I am getting "logs.event_type asc", "logs.modified_at asc", "logs.modified_at desc", "logs.uuid”. In this and the previous case, shouldn’t we get rid off the modified_at before adding the full default list? Or better yet, add only those from the default list that are not specified?
Updated by Tom Clegg almost 9 years ago
Good suggestions, thanks. Updated code and added tests; now at 6f1d385
Updated by Radhika Chippada almost 9 years ago
This is much better. Can we update the comment a bit more. Something like: if id and updated_at are given, updated_at is truncated because after the unique id column ordering, no need to order on updated_at. On the other hand, if log_type and id are given ordering on type and then id is what is happening. And then if I used log_type, id, updated_at, we will use only log_type and id and truncate updated_at since there is no point in using it.
With or without it LGTM. Thanks.
Updated by Tom Clegg almost 9 years ago
- Status changed from New to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:a13c14cfc7fabe4f4da48edd57a086d2d8953a03.