Re: issue with query optimizer when joining two partitioned tables

From: Anish Kejariwal <anishkej(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: issue with query optimizer when joining two partitioned tables
Date: 2011-07-10 15:33:59
Message-ID: CAOpcnr-xdAUq7+aeqOSYYK=HdOUY6L+H7dF=8O3qP+7Td9th1Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Thanks Tom and Heikki! I really appreciate your help.

I went ahead and loaded all the data. In the icream table, I now have ~175
partitions, each with 4041866 records.

The data finished loading 12 hours ago, and I then ran the same query I gave
you guys, and it took 25 seconds since it used the wrong execution plan as
expected.

HashAggregate (cost=27680.90..28045.81 rows=11228 width=41) (actual
time=24769.190..24817.618 rows=11028 loops=1)
-> Hash Join (cost=1304.04..18901.88 rows=292634 width=41) (actual
time=3938.965..24688.718 rows=11028 loops=1)
Hash Cond: (rc.widget_id = r.widget_id)
-> Append (cost=0.00..12110.95 rows=292634 width=24) (actual
time=2854.925..22887.638 rows=309579 loops=1)
-> Seq Scan on icecream rc (cost=0.00..25.60 rows=1
width=24) (actual time=0.003..0.003 rows=0 loops=1)
Filter: ((widgetset_id = 5) AND (dataset_id = 283))
-> Index Scan using icecream_part_283_widgetset_id_idx on
icecream_part_283 rc (cost=0.00..12085.35 rows=292633 width=24) (actual
time=2854.915..2
1784.769 rows=309579 loops=1)
Index Cond: (widgetset_id = 5)
Filter: (dataset_id = 283)
-> Hash (cost=1163.69..1163.69 rows=11228 width=21) (actual
time=1083.704..1083.704 rows=11028 loops=1)
Buckets: 2048 Batches: 1 Memory Usage: 604kB
-> Append (cost=4.28..1163.69 rows=11228 width=21) (actual
time=528.216..1066.659 rows=11028 loops=1)
-> Bitmap Heap Scan on widget r (cost=4.28..12.75
rows=1 width=48) (actual time=528.017..528.017 rows=0 loops=1)
Recheck Cond: (widgetset_id = 5)
Filter: (widget_type_id = 4)
-> Bitmap Index Scan on widget_widgetset_id_idx
(cost=0.00..4.28 rows=4 width=0) (actual time=527.995..527.995 rows=0
loops=1)
Index Cond: (widgetset_id = 5)
-> Index Scan using
widget_part_5_widget_widget_type_id_idx on widget_part_5 r
(cost=0.00..1150.94 rows=11227 width=21) (actual time=0.191..512.847 rows=1
1028 loops=1)
Index Cond: (widget_type_id = 4)
Filter: (widgetset_id = 5)
Total runtime: 24844.016 ms
(21 rows)

I then changed my enable* options to force it to use a nested loop:
set enable_mergejoin=off;
set enable_hashjoin=off;
set enable_nestloop = on;

HashAggregate (cost=460004.79..460369.70 rows=11228 width=41) (actual
time=298.014..341.822 rows=11028 loops=1)
-> Nested Loop (cost=4.28..338747.85 rows=4041898 width=41) (actual
time=0.175..248.529 rows=11028 loops=1)
Join Filter: (r.widget_id = rc.widget_id)
-> Append (cost=4.28..1163.69 rows=11228 width=21) (actual
time=0.053..42.532 rows=11028 loops=1)
-> Bitmap Heap Scan on widget r (cost=4.28..12.75 rows=1
width=48) (actual time=0.014..0.014 rows=0 loops=1)
Recheck Cond: (widgetset_id = 5)
Filter: (widget_type_id = 4)
-> Bitmap Index Scan on widget_widgetset_id_idx
(cost=0.00..4.28 rows=4 width=0) (actual time=0.008..0.008 rows=0 loops=1)
Index Cond: (widgetset_id = 5)
-> Index Scan using widget_part_5_widget_widget_type_id_idx
on widget_part_5 r (cost=0.00..1150.94 rows=11227 width=21) (actual
time=0.032..18.410 rows=11028 lo
ops=1)
Index Cond: (widget_type_id = 4)
Filter: (widgetset_id = 5)
-> Append (cost=0.00..29.99 rows=6 width=24) (actual
time=0.009..0.012 rows=1 loops=11028)
-> Seq Scan on icecream rc (cost=0.00..23.00 rows=5
width=24) (actual time=0.001..0.001 rows=0 loops=11028)
Filter: (rc.dataset_id = 283)
-> Index Scan using icecream_part_283_widget_id_idx on
icecream_part_283 rc (cost=0.00..6.99 rows=1 width=24) (actual
time=0.004..0.006 rows=1 loo
ps=11028)
Index Cond: (rc.widget_id = r.widget_id)
Filter: (rc.dataset_id = 283)
Total runtime: 361.180 ms
(19 rows)

The query was nice and fast as expected.

I then restored the enable* options back to default. The query was slow
again and taking around ~19 seconds. So, this gives us some information
about whether autoanalyze is running in the background:
-I don't think waiting is making a difference. I loaded the data 12 hours
ago, and then ran the query and the query was very slow. I then ran the
query again but for a different dataset and it was also slow. It was only
once I changed the enable* parameters could I get my expected performance.
-I would have tried select * from ice-cream again, but with 176 partitions
this query is no longer feasible. But, I do agree that select * from
icecream is causing the statistics to be updated.

So, then I followed Tom's suggestion to defeat the empty-table heuristic.

select relpages from pg_class where relname = 'icecream';

relpages
----------
0
(1 row)

Ok, so the planner thinks that the parent table is empty. I then ran:
update pg_class set relpages = 1 where relname = 'icecream';

HashAggregate (cost=201199.27..201564.18 rows=11228 width=41) (actual
time=277.195..304.620 rows=11028 loops=1)
-> Nested Loop (cost=4.28..79942.45 rows=4041894 width=41) (actual
time=0.227..231.181 rows=11028 loops=1)
Join Filter: (r.widget_id = rc.widget_id)
-> Append (cost=4.28..1163.69 rows=11228 width=21) (actual
time=0.125..40.834 rows=11028 loops=1)
-> Bitmap Heap Scan on widget r (cost=4.28..12.75 rows=1
width=48) (actual time=0.022..0.022 rows=0 loops=1)
Recheck Cond: (widgetset_id = 5)
Filter: (widget_type_id = 4)
-> Bitmap Index Scan on widget_widgetset_id_idx
(cost=0.00..4.28 rows=4 width=0) (actual time=0.019..0.019 rows=0 loops=1)
Index Cond: (widgetset_id = 5)
-> Index Scan using widget_part_5_widget_widget_type_id_idx
on widget_part_5 r (cost=0.00..1150.94 rows=11227 width=21) (actual
time=0.100..18.964 rows=11028 lo
ops=1)
Index Cond: (widget_type_id = 4)
Filter: (widgetset_id = 5)
-> Append (cost=0.00..6.99 rows=2 width=24) (actual
time=0.008..0.012 rows=1 loops=11028)
-> Seq Scan on icecream rc (cost=0.00..0.00 rows=1
width=24) (actual time=0.001..0.001 rows=0 loops=11028)
Filter: (rc.dataset_id = 283)
-> Index Scan using icecream_part_283_widget_id_idx on
icecream_part_283 rc (cost=0.00..6.99 rows=1 width=24) (actual
time=0.004..0.006 rows=1 loo
ps=11028)
Index Cond: (rc.widget_id = r.widget_id)
Filter: (rc.dataset_id = 283)
Total runtime: 318.634 ms
(19 rows)

Wow! that fixes it. Thanks you so much!!!! I've been struggling with this
issue for 2-3 days. (Also, in the past, I've seen inconsistent performance
with this query, which may be the result of the planner sometimes choosing
the wrong plan, but I'll chase that down later).

Tom said: But maybe we should reconsider the heuristic for tables that are
members of inheritance trees --- particularly parents of inheritance trees.

I agree. I think postgres should get updated to take this into account. I
shouldn't have to set the relpages to 1 for all the empty parent tables that
I have partitioned. Should I file this as a bug/enhancement?

Also, do I need to worry about about autoanalyze/autovacuum setting back
relpages to zero for the parent icecream table?

thanks!!!
Anish

On Sat, Jul 9, 2011 at 10:43 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> > On 09.07.2011 00:36, Anish Kejariwal wrote:
> >> My guess as to what happened:
> >> -because the icecream parent table has zero records, the query optimizer
> >> chooses the incorrect execution plan
> >> -when I do select * from icecream, the optimizer now knows how many
> records
> >> are really in the icecream table, by knowing that the icecream table has
> >> partitions.
>
> > "select * from icecream" won't have any direct effect on the
> > optimization of subsequent queries. What probably happened is that
> > autoanalyze ran in the background while you ran that select, and
> > analyzed some of the partitions. Simply waiting a while would've had the
> > same effect.
>
> Yeah. Also, the reason that a manual vacuum on icecream changes things
> yet again is that in 9.0 and up, we have a notion of summary stats
> across the whole inheritance tree, but autoanalyze hasn't been taught to
> gather those. The manual command on the parent table does gather them,
> though.
>
> So what's happening here is that we suddenly have an accurate idea of
> the size of the join product as a result of having inheritance summary
> stats to estimate with, and that drives the estimated cost of the merge
> or hash join down out of the stratosphere. The estimated cost of the
> nestloop goes down a lot too, but not as much.
>
> I experimented with a similar case here, and it seems like a lot of the
> remaining error in the nestloop estimate comes from this:
>
> >> -> Seq Scan on icecream rc (cost=0.00..23.00 rows=5 width=24) (actual
> time=0.002..0.002 rows=0 loops=1)
> >> Filter: (dataset_id = 281)
>
> The indexscan on the nonempty child partition is estimated at less than
> 10 cost units, so this is a *large* fraction of what the planner sees as
> the per-outer-row cost of a nestloop. And with more than 11000 rows on
> the other side of the join, that discourages it from using the nestloop.
> In reality of course this takes negligible time compared to examining
> the child partition.
>
> Now why is the seqscan cost estimate so large, when actually the parent
> icecream table is totally empty? It's because the planner has been
> taught to never believe that an empty table is empty. If memory serves,
> it's really estimating on an assumption that the table contains 10 pages
> and some corresponding number of rows. This is a reasonable defensive
> posture when dealing with ordinary tables, I think, since most likely
> if the catalogs say the table is empty that's just a leftover from when
> it was created. But maybe we should reconsider the heuristic for tables
> that are members of inheritance trees --- particularly parents of
> inheritance trees.
>
> I was able to defeat the empty-table heuristic here by doing
>
> update pg_class set relpages = 1 where relname = 'icecream';
>
> and then I started getting much more realistic estimates in my test
> case. (It still wanted to use a merge join initially, but after
> knocking down random_page_cost it went to the nestloop.) It would
> be interesting to see what sorts of results Anish gets with that.
>
> regards, tom lane
>

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Jeremy Harris 2011-07-10 15:46:25 Re: issue with query optimizer when joining two partitioned tables
Previous Message Radhya sahal 2011-07-10 13:08:01 query total time im milliseconds