Query plan, nested EXISTS

From: Matt Daw <matt(at)shotgunsoftware(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Query plan, nested EXISTS
Date: 2012-09-28 21:04:04
Message-ID: CAA2LLOFzQDqnpvxxf1EAUzKXrJx5VUN3P2jgon3zTNzdtpGZQQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Howdy, I've been debugging a client's slow query today and I'm curious
about the query plan. It's picking a plan that hashes lots of rows from the
versions table (on v9.0.10)...

EXPLAIN ANALYZE
SELECT COUNT(*) FROM notes a WHERE
a.project_id = 114 AND
EXISTS (
SELECT 1 FROM note_links b
WHERE
b.note_id = a.id AND
b.entity_type = 'Version' AND
EXISTS (
SELECT 1 FROM versions c
WHERE
c.id = b.entity_id AND
c.code ILIKE '%comp%' AND
c.retirement_date IS NULL
) AND
b.retirement_date IS NULL
)

QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=833177.30..833177.31 rows=1 width=0) (actual
time=10806.416..10806.416 rows=1 loops=1)
-> Hash Semi Join (cost=747004.15..833154.86 rows=8977 width=0)
(actual time=10709.343..10806.344 rows=894 loops=1)
Hash Cond: (a.id = b.note_id)
-> Index Scan using notes_retirement_date_project on notes a
(cost=0.00..66725.10 rows=12469 width=4) (actual time=12.213..71.199
rows=12469 loops=1)
Index Cond: (project_id = 114)
-> Hash (cost=723749.35..723749.35 rows=1417424 width=4) (actual
time=10696.192..10696.192 rows=227261 loops=1)
Buckets: 65536 Batches: 4 Memory Usage: 2016kB
-> Hash Semi Join (cost=620007.75..723749.35 rows=1417424
width=4) (actual time=8953.460..10645.714 rows=227261 loops=1)
Hash Cond: (b.entity_id = c.id)
-> Seq Scan on note_links b (cost=0.00..71849.56
rows=1417424 width=8) (actual time=0.075..628.183 rows=1509795 loops=1)
Filter: ((retirement_date IS NULL) AND
((entity_type)::text = 'Version'::text))
-> Hash (cost=616863.62..616863.62 rows=251530
width=4) (actual time=8953.327..8953.327 rows=300115 loops=1)
Buckets: 32768 Batches: 1 Memory Usage: 10551kB
-> Seq Scan on versions c
(cost=0.00..616863.62 rows=251530 width=4) (actual time=176.590..8873.588
rows=300115 loops=1)
Filter: ((retirement_date IS NULL) AND
((code)::text ~~* '%comp%'::text))
Total runtime: 10810.479 ms
(16 rows)

However, I can trick it into a better plan by adding LIMIT 1 into the inner
EXISTS:

EXPLAIN ANALYZE
SELECT COUNT(*) FROM notes a WHERE
a.project_id = 114 AND
EXISTS (
SELECT 1 FROM note_links b
WHERE
b.note_id = a.id AND
b.entity_type = 'Version' AND
EXISTS (
SELECT 1 FROM versions c
WHERE
c.id = b.entity_id AND
c.code ILIKE '%comp%' AND
c.retirement_date IS NULL
LIMIT 1
) AND
b.retirement_date IS NULL
)

QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=372820.37..372820.38 rows=1 width=0) (actual
time=139.430..139.430 rows=1 loops=1)
-> Nested Loop Semi Join (cost=0.00..372809.15 rows=4488 width=0)
(actual time=9.735..139.333 rows=894 loops=1)
-> Index Scan using notes_retirement_date_project on notes a
(cost=0.00..66725.10 rows=12469 width=4) (actual time=9.699..67.263
rows=12469 loops=1)
Index Cond: (project_id = 114)
-> Index Scan using note_links_note on note_links b
(cost=0.00..24.54 rows=1 width=4) (actual time=0.006..0.006 rows=0
loops=12469)
Index Cond: (b.note_id = a.id)
Filter: ((b.retirement_date IS NULL) AND
((b.entity_type)::text = 'Version'::text) AND (SubPlan 1))
SubPlan 1
-> Limit (cost=0.00..9.04 rows=1 width=0) (actual
time=0.003..0.003 rows=0 loops=11794)
-> Index Scan using versions_pkey on versions c
(cost=0.00..9.04 rows=1 width=0) (actual time=0.003..0.003 rows=0
loops=11794)
Index Cond: (id = $0)
Filter: ((retirement_date IS NULL) AND
((code)::text ~~* '%comp%'::text))
Total runtime: 139.465 ms
(13 rows)

Unfortunately, a couple other queries I tested got slower by adding the
LIMIT so I don't think that's going to be a good workaround. It doesn't
appear to be related to ILIKE, because I tried a straight equals against
another un-indexed column of versions and still get a slow plan (and adding
the LIMIT to this one made it fast too):

EXPLAIN ANALYZE
SELECT COUNT(*) FROM notes a WHERE
a.project_id = 114 AND
EXISTS (
SELECT 1 FROM note_links b
WHERE
b.note_id = a.id AND
b.entity_type = 'Version' AND
EXISTS (
SELECT 1 FROM versions c
WHERE
c.id = b.entity_id AND
c.sg_status_list = 'ip' AND
c.retirement_date IS NULL
) AND
b.retirement_date IS NULL
)

QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=821544.18..821544.19 rows=1 width=0) (actual
time=5046.492..5046.492 rows=1 loops=1)
-> Hash Semi Join (cost=735371.03..821521.73 rows=8977 width=0)
(actual time=4941.968..5045.968 rows=7116 loops=1)
Hash Cond: (a.id = b.note_id)
-> Index Scan using notes_retirement_date_project on notes a
(cost=0.00..66725.10 rows=12469 width=4) (actual time=9.639..68.751
rows=12469 loops=1)
Index Cond: (project_id = 114)
-> Hash (cost=712116.23..712116.23 rows=1417424 width=4) (actual
time=4931.956..4931.956 rows=297401 loops=1)
Buckets: 65536 Batches: 4 Memory Usage: 2633kB
-> Hash Join (cost=620484.32..712116.23 rows=1417424
width=4) (actual time=3362.472..4864.816 rows=297401 loops=1)
Hash Cond: (b.entity_id = c.id)
-> Seq Scan on note_links b (cost=0.00..71849.56
rows=1417424 width=8) (actual time=0.079..622.277 rows=1509795 loops=1)
Filter: ((retirement_date IS NULL) AND
((entity_type)::text = 'Version'::text))
-> Hash (cost=618673.97..618673.97 rows=144828
width=4) (actual time=3362.337..3362.337 rows=155834 loops=1)
Buckets: 16384 Batches: 1 Memory Usage: 5479kB
-> HashAggregate (cost=617225.69..618673.97
rows=144828 width=4) (actual time=3289.861..3335.344 rows=155834 loops=1)
-> Seq Scan on versions c
(cost=0.00..616863.62 rows=144828 width=4) (actual time=217.080..3133.870
rows=155834 loops=1)
Filter: ((retirement_date IS NULL)
AND ((sg_status_list)::text = 'ip'::text))
Total runtime: 5051.414 ms
(17 rows)

Does anything come to mind that would help me debug why this plan is being
chosen? Thanks!

Matt

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Shaun Thomas 2012-09-28 21:37:40 Re: Possible Performance Regression with Transitive Comparisons vs. Constants
Previous Message Tom Lane 2012-09-28 20:35:25 Re: Possible Performance Regression with Transitive Comparisons vs. Constants