Re: issue with query optimizer when joining two partitioned tables

From: Anish Kejariwal <anishkej(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: issue with query optimizer when joining two partitioned tables
Date: 2011-07-08 21:36:48
Message-ID: CAOpcnr_nx2OAoyjJknUdeJsvFame=_K0bfKgzTjVq4G+=7yxZA@mail.gmail.com
Lists: pgsql-performance

I have run into an issue where the query optimizer chooses the wrong
execution plan when I'm trying to join two large tables that have been
partitioned. I would really appreciate it if someone could help me out with
this. I don't know whether I've found a bug in the optimizer, or whether
there is some parameter/option I need to set in postgres. Below, I've
included my execution plans. I'm using postgres 9.0.3, and I'm running this
on a pretty beefy Linux server.

My two tables:
-widget: has 4041866 records, and is broken up into 4 partitions (no records
are in the parent table).
-icecream: I'm starting with zero records, but since there could be
billions of icecream records, I will partition this table as well and will
not have any records in the parent table.

So I then create my first partition in the icecream table, and load
4041866 records into it.
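
For reference, the partitions use the usual 9.0-style inheritance setup.
Here's a rough sketch of what the first icecream partition looks like -- the
column types and the exact check constraint here are from memory, so treat
them as approximate:

create table icecream (
    dataset_id   integer not null,
    widgetset_id integer not null,
    widget_id    integer not null,
    cost         double precision
);

-- one child per dataset; the parent itself stays empty
create table icecream_part_281 (
    check (dataset_id = 281)
) inherits (icecream);

create index icecream_part_281_widget_id_idx on icecream_part_281 (widget_id);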

Here is the query I'm using to join the two tables:
explain analyze
SELECT
r.widget_id,
r.widget_type_id,
avg(rc.cost)::double precision cost_avg
FROM
widget r,
icecream rc
WHERE
r.widget_type_id = 4
and r.widgetset_id = 5
AND r.widget_id = rc.widget_id
and rc.dataset_id = 281
group by r.widget_id,r.chromosome, r.start_pos, r.end_pos,r.widget_type_id
;

Here is the corresponding execution plan:

QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=147262.20..147299.12 rows=1136 width=41) (actual
time=31876.290..31904.880 rows=11028 loops=1)
-> Merge Join (cost=95574.83..112841.79 rows=1147347 width=41) (actual
time=31130.870..31832.922 rows=11028 loops=1)
Merge Cond: (r.widget_id = rc.widget_id)
-> Sort (cost=1913.89..1942.27 rows=11352 width=21) (actual
time=56.818..68.701 rows=11028 loops=1)
Sort Key: r.widget_id
Sort Method: quicksort Memory: 1246kB
-> Append (cost=4.28..1149.30 rows=11352 width=21) (actual
time=0.139..40.513 rows=11028 loops=1)
-> Bitmap Heap Scan on widget r (cost=4.28..12.75
rows=1 width=48) (actual time=0.030..0.030 rows=0 loops=1)
Recheck Cond: (widgetset_id = 5)
Filter: (widget_type_id = 4)
-> Bitmap Index Scan on widget_widgetset_id_idx
(cost=0.00..4.28 rows=4 width=0) (actual time=0.023..0.023 rows=0 loops=1)
Index Cond: (widgetset_id = 5)
-> Index Scan using
widget_part_5_widget_widget_type_id_idx on widget_part_5 r
(cost=0.00..1136.55 rows=11351 width=21) (actual time=0.106..18.489
rows=11028 loops=1)
Index Cond: (widget_type_id = 4)
Filter: (widgetset_id = 5)
-> Sort (cost=93660.94..93711.47 rows=20214 width=24) (actual
time=29730.522..30766.354 rows=946140 loops=1)
Sort Key: rc.widget_id
Sort Method: external sort Disk: 165952kB
-> Append (cost=0.00..92215.33 rows=20214 width=24) (actual
time=0.057..13731.204 rows=4041866 loops=1)
-> Seq Scan on icecream rc (cost=0.00..23.00 rows=5
width=24) (actual time=0.002..0.002 rows=0 loops=1)
Filter: (dataset_id = 281)
-> Seq Scan on icecream_part_281 rc
(cost=0.00..92192.33 rows=20209 width=24) (actual time=0.051..5427.730
rows=4041866 loops=1)
Filter: (dataset_id = 281)
Total runtime: 33182.945 ms
(24 rows)

The query is doing a merge join and taking 33 seconds, but it should take
less than a second. So I then run: select * from icecream;

Now, when I run the same query again, I get a different and correct
execution plan (nested loop), and the query takes less than 1 second as I
would expect.

QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=7223611.41..7223648.33 rows=1136 width=41) (actual
time=392.822..420.166 rows=11028 loops=1)
-> Nested Loop (cost=4.28..341195.22 rows=229413873 width=41) (actual
time=0.231..331.800 rows=11028 loops=1)
Join Filter: (r.widget_id = rc.widget_id)
-> Append (cost=4.28..1149.30 rows=11352 width=21) (actual
time=0.051..50.181 rows=11028 loops=1)
-> Bitmap Heap Scan on widget r (cost=4.28..12.75 rows=1
width=48) (actual time=0.013..0.013 rows=0 loops=1)
Recheck Cond: (widgetset_id = 5)
Filter: (widget_type_id = 4)
-> Bitmap Index Scan on widget_widgetset_id_idx
(cost=0.00..4.28 rows=4 width=0) (actual time=0.007..0.007 rows=0 loops=1)
Index Cond: (widgetset_id = 5)
-> Index Scan using widget_part_5_widget_widget_type_id_idx
on widget_part_5 r (cost=0.00..1136.55 rows=11351 width=21) (actual
time=0.033..21.254 rows=11028 loops=1)
Index Cond: (widget_type_id = 4)
Filter: (widgetset_id = 5)
-> Append (cost=0.00..29.88 rows=6 width=24) (actual
time=0.014..0.018 rows=1 loops=11028)
-> Seq Scan on icecream rc (cost=0.00..23.00 rows=5
width=24) (actual time=0.001..0.001 rows=0 loops=11028)
Filter: (rc.dataset_id = 281)
-> Index Scan using icecream_part_281_widget_id_idx on
icecream_part_281 rc (cost=0.00..6.88 rows=1 width=24) (actual
time=0.009..0.010 rows=1 loops=11028)
Index Cond: (rc.widget_id = r.widget_id)
Filter: (rc.dataset_id = 281)
Total runtime: 431.935 ms
(19 rows)

My guess as to what happened:
-because the icecream parent table has zero records, the query optimizer
chooses the incorrect execution plan
-when I do select * from icecream, the optimizer now knows how many records
are really in the icecream table, by knowing that the icecream table has
partitions.

Next, if I run vacuum analyze on the parent table, I again get a wrong/slow
execution plan (this time it uses the hash join). Again, I think this is
because the parent table itself has zero records.
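
For clarity, the command I ran on the parent was simply a manual one along
these lines:

vacuum analyze icecream;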

QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=191926.03..191962.95 rows=1136 width=41) (actual
time=28967.567..28994.395 rows=11028 loops=1)
-> Hash Join (cost=166424.79..191585.47 rows=11352 width=41) (actual
time=28539.196..28917.830 rows=11028 loops=1)
Hash Cond: (r.widget_id = rc.widget_id)
-> Append (cost=4.28..1149.30 rows=11352 width=21) (actual
time=0.054..54.068 rows=11028 loops=1)
-> Bitmap Heap Scan on widget r (cost=4.28..12.75 rows=1
width=48) (actual time=0.013..0.013 rows=0 loops=1)
Recheck Cond: (widgetset_id = 5)
Filter: (widget_type_id = 4)
-> Bitmap Index Scan on widget_widgetset_id_idx
(cost=0.00..4.28 rows=4 width=0) (actual time=0.007..0.007 rows=0 loops=1)
Index Cond: (widgetset_id = 5)
-> Index Scan using widget_part_5_widget_widget_type_id_idx
on widget_part_5 r (cost=0.00..1136.55 rows=11351 width=21) (actual
time=0.035..22.419 rows=11028 loops=1)
Index Cond: (widget_type_id = 4)
Filter: (widgetset_id = 5)
-> Hash (cost=92214.73..92214.73 rows=4041823 width=24) (actual
time=28438.419..28438.419 rows=4041866 loops=1)
Buckets: 524288 Batches: 2 Memory Usage: 118449kB
-> Append (cost=0.00..92214.73 rows=4041823 width=24)
(actual time=0.020..14896.908 rows=4041866 loops=1)
-> Seq Scan on icecream rc (cost=0.00..23.00 rows=5
width=24) (actual time=0.002..0.002 rows=0 loops=1)
Filter: (dataset_id = 281)
-> Seq Scan on icecream_part_281 rc
(cost=0.00..92191.73 rows=4041818 width=24) (actual time=0.012..5718.592
rows=4041866 loops=1)
Filter: (dataset_id = 281)
Total runtime: 29007.937 ms
(20 rows)

select * from icecream does not fix this issue.

I could of course disable hash join and merge join to force postgres to use
a nested loop, but my system is often joining these two tables, and I'd
rather not have to set this in every single place.
set enable_mergejoin=off;
set enable_hashjoin=off;
set enable_nestloop = on;
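
If I had to go that route, I'd at least scope the overrides to a single
transaction instead of changing them globally -- something like this sketch,
with the real query in place of the comment:

begin;
set local enable_mergejoin = off;
set local enable_hashjoin = off;
-- the join query from above goes here
commit;

But I'd much rather the planner just picked the right plan on its own.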

thanks in advance!!!

Anish


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Anish Kejariwal <anishkej(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: issue with query optimizer when joining two partitioned tables
Date: 2011-07-09 07:54:08
Message-ID: 4E180920.8040806@enterprisedb.com
Lists: pgsql-performance

On 09.07.2011 00:36, Anish Kejariwal wrote:
> My guess as to what happened:
> -because the icecream parent table has zero records, the query optimizer
> chooses the incorrect execution plan
> -when I do select * from icecream, the optimizer now knows how many records
> are really in the icecream table, by knowing that the icecream table has
> partitions.

"select * from icecream" won't have any direct effect on the
optimization of subsequent queries. What probably happened is that
autoanalyze ran in the background while you ran that select, and
analyzed some of the partitions. Simply waiting a while would've had the
same effect.
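
If you want to confirm that, the analyze timestamps in pg_stat_user_tables
should show it -- roughly something like this (adjust the pattern to your
partition names):

select relname, last_analyze, last_autoanalyze
from pg_stat_user_tables
where relname like 'icecream%';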

> Next, if I run vacuum analyze on the parent table, I again get a wrong/slow
> execution plan (this time it uses the hash join). Again, I think this is
> because the parent table itself has zero records.
>
>
>
> QUERY PLAN
>
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> HashAggregate (cost=191926.03..191962.95 rows=1136 width=41) (actual
> time=28967.567..28994.395 rows=11028 loops=1)
> -> Hash Join (cost=166424.79..191585.47 rows=11352 width=41) (actual
> time=28539.196..28917.830 rows=11028 loops=1)
> Hash Cond: (r.widget_id = rc.widget_id)
> -> Append (cost=4.28..1149.30 rows=11352 width=21) (actual
> time=0.054..54.068 rows=11028 loops=1)
> -> Bitmap Heap Scan on widget r (cost=4.28..12.75 rows=1
> width=48) (actual time=0.013..0.013 rows=0 loops=1)
> Recheck Cond: (widgetset_id = 5)
> Filter: (widget_type_id = 4)
> -> Bitmap Index Scan on widget_widgetset_id_idx
> (cost=0.00..4.28 rows=4 width=0) (actual time=0.007..0.007 rows=0 loops=1)
> Index Cond: (widgetset_id = 5)
> -> Index Scan using widget_part_5_widget_widget_type_id_idx
> on widget_part_5 r (cost=0.00..1136.55 rows=11351 width=21) (actual
> time=0.035..22.419 rows=11028 loops=1)
> Index Cond: (widget_type_id = 4)
> Filter: (widgetset_id = 5)
> -> Hash (cost=92214.73..92214.73 rows=4041823 width=24) (actual
> time=28438.419..28438.419 rows=4041866 loops=1)
> Buckets: 524288 Batches: 2 Memory Usage: 118449kB
> -> Append (cost=0.00..92214.73 rows=4041823 width=24)
> (actual time=0.020..14896.908 rows=4041866 loops=1)
> -> Seq Scan on icecream rc (cost=0.00..23.00 rows=5
> width=24) (actual time=0.002..0.002 rows=0 loops=1)
> Filter: (dataset_id = 281)
> -> Seq Scan on icecream_part_281 rc
> (cost=0.00..92191.73 rows=4041818 width=24) (actual time=0.012..5718.592
> rows=4041866 loops=1)
> Filter: (dataset_id = 281)
> Total runtime: 29007.937 ms
> (20 rows)

The cost estimates in the above slow plan are pretty accurate, so I
suspect the cost estimates for the fast plan are not, or the planner
would choose that.

> I could of course disable hash join and merge join to force postgres to use
> a nested loop, but my system is often joining these two tables, and I'd
> rather not have to set this in every single place.
> set enable_mergejoin=off;
> set enable_hashjoin=off;
> set enable_nestloop = on;

Can you do explain analyze with these settings? That might give us a
clue on where it's going wrong.

Also, I suspect that when you load more data into icecream, the planner
might start to pick the faster plan, because the seqscan on icecream
will start to look more expensive compared to the index scan and nested
loop join in the faster plan.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Anish Kejariwal <anishkej(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: issue with query optimizer when joining two partitioned tables
Date: 2011-07-09 17:43:50
Message-ID: 7806.1310233430@sss.pgh.pa.us
Lists: pgsql-performance

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> On 09.07.2011 00:36, Anish Kejariwal wrote:
>> My guess as to what happened:
>> -because the icecream parent table has zero records, the query optimizer
>> chooses the incorrect execution plan
>> -when I do select * from icecream, the optimizer now knows how many records
>> are really in the icecream table, by knowing that the icecream table has
>> partitions.

> "select * from icecream" won't have any direct effect on the
> optimization of subsequent queries. What probably happened is that
> autoanalyze ran in the background while you ran that select, and
> analyzed some of the partitions. Simply waiting a while would've had the
> same effect.

Yeah. Also, the reason that a manual vacuum on icecream changes things
yet again is that in 9.0 and up, we have a notion of summary stats
across the whole inheritance tree, but autoanalyze hasn't been taught to
gather those. The manual command on the parent table does gather them,
though.
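
You can tell whether those inheritance-wide stats exist by looking for rows
with inherited = true in pg_stats, roughly like so:

select tablename, attname, inherited, n_distinct
from pg_stats
where tablename = 'icecream';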

So what's happening here is that we suddenly have an accurate idea of
the size of the join product as a result of having inheritance summary
stats to estimate with, and that drives the estimated cost of the merge
or hash join down out of the stratosphere. The estimated cost of the
nestloop goes down a lot too, but not as much.

I experimented with a similar case here, and it seems like a lot of the
remaining error in the nestloop estimate comes from this:

>> -> Seq Scan on icecream rc (cost=0.00..23.00 rows=5 width=24) (actual time=0.002..0.002 rows=0 loops=1)
>> Filter: (dataset_id = 281)

The indexscan on the nonempty child partition is estimated at less than
10 cost units, so this is a *large* fraction of what the planner sees as
the per-outer-row cost of a nestloop. And with more than 11000 rows on
the other side of the join, that discourages it from using the nestloop.
In reality of course this takes negligible time compared to examining
the child partition.

Now why is the seqscan cost estimate so large, when actually the parent
icecream table is totally empty? It's because the planner has been
taught to never believe that an empty table is empty. If memory serves,
it's really estimating on an assumption that the table contains 10 pages
and some corresponding number of rows. This is a reasonable defensive
posture when dealing with ordinary tables, I think, since most likely
if the catalogs say the table is empty that's just a leftover from when
it was created. But maybe we should reconsider the heuristic for tables
that are members of inheritance trees --- particularly parents of
inheritance trees.

I was able to defeat the empty-table heuristic here by doing

update pg_class set relpages = 1 where relname = 'icecream';

and then I started getting much more realistic estimates in my test
case. (It still wanted to use a merge join initially, but after
knocking down random_page_cost it went to the nestloop.) It would
be interesting to see what sorts of results Anish gets with that.
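
(By "knocking down random_page_cost" I just mean something along the lines of

set random_page_cost = 2;

in the session -- the exact value isn't the point here.)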

regards, tom lane


From: Anish Kejariwal <anishkej(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: issue with query optimizer when joining two partitioned tables
Date: 2011-07-10 15:33:59
Message-ID: CAOpcnr-xdAUq7+aeqOSYYK=HdOUY6L+H7dF=8O3qP+7Td9th1Q@mail.gmail.com
Lists: pgsql-performance

Thanks Tom and Heikki! I really appreciate your help.

I went ahead and loaded all the data. In the icecream table, I now have ~175
partitions, each with 4041866 records.

The data finished loading 12 hours ago. I then ran the same query I gave you
guys, and as expected it used the wrong execution plan and took 25 seconds.

HashAggregate (cost=27680.90..28045.81 rows=11228 width=41) (actual
time=24769.190..24817.618 rows=11028 loops=1)
-> Hash Join (cost=1304.04..18901.88 rows=292634 width=41) (actual
time=3938.965..24688.718 rows=11028 loops=1)
Hash Cond: (rc.widget_id = r.widget_id)
-> Append (cost=0.00..12110.95 rows=292634 width=24) (actual
time=2854.925..22887.638 rows=309579 loops=1)
-> Seq Scan on icecream rc (cost=0.00..25.60 rows=1
width=24) (actual time=0.003..0.003 rows=0 loops=1)
Filter: ((widgetset_id = 5) AND (dataset_id = 283))
-> Index Scan using icecream_part_283_widgetset_id_idx on
icecream_part_283 rc (cost=0.00..12085.35 rows=292633 width=24) (actual
time=2854.915..21784.769 rows=309579 loops=1)
Index Cond: (widgetset_id = 5)
Filter: (dataset_id = 283)
-> Hash (cost=1163.69..1163.69 rows=11228 width=21) (actual
time=1083.704..1083.704 rows=11028 loops=1)
Buckets: 2048 Batches: 1 Memory Usage: 604kB
-> Append (cost=4.28..1163.69 rows=11228 width=21) (actual
time=528.216..1066.659 rows=11028 loops=1)
-> Bitmap Heap Scan on widget r (cost=4.28..12.75
rows=1 width=48) (actual time=528.017..528.017 rows=0 loops=1)
Recheck Cond: (widgetset_id = 5)
Filter: (widget_type_id = 4)
-> Bitmap Index Scan on widget_widgetset_id_idx
(cost=0.00..4.28 rows=4 width=0) (actual time=527.995..527.995 rows=0
loops=1)
Index Cond: (widgetset_id = 5)
-> Index Scan using
widget_part_5_widget_widget_type_id_idx on widget_part_5 r
(cost=0.00..1150.94 rows=11227 width=21) (actual time=0.191..512.847
rows=11028 loops=1)
Index Cond: (widget_type_id = 4)
Filter: (widgetset_id = 5)
Total runtime: 24844.016 ms
(21 rows)

I then changed my enable* options to force it to use a nested loop:
set enable_mergejoin=off;
set enable_hashjoin=off;
set enable_nestloop = on;

HashAggregate (cost=460004.79..460369.70 rows=11228 width=41) (actual
time=298.014..341.822 rows=11028 loops=1)
-> Nested Loop (cost=4.28..338747.85 rows=4041898 width=41) (actual
time=0.175..248.529 rows=11028 loops=1)
Join Filter: (r.widget_id = rc.widget_id)
-> Append (cost=4.28..1163.69 rows=11228 width=21) (actual
time=0.053..42.532 rows=11028 loops=1)
-> Bitmap Heap Scan on widget r (cost=4.28..12.75 rows=1
width=48) (actual time=0.014..0.014 rows=0 loops=1)
Recheck Cond: (widgetset_id = 5)
Filter: (widget_type_id = 4)
-> Bitmap Index Scan on widget_widgetset_id_idx
(cost=0.00..4.28 rows=4 width=0) (actual time=0.008..0.008 rows=0 loops=1)
Index Cond: (widgetset_id = 5)
-> Index Scan using widget_part_5_widget_widget_type_id_idx
on widget_part_5 r (cost=0.00..1150.94 rows=11227 width=21) (actual
time=0.032..18.410 rows=11028 loops=1)
Index Cond: (widget_type_id = 4)
Filter: (widgetset_id = 5)
-> Append (cost=0.00..29.99 rows=6 width=24) (actual
time=0.009..0.012 rows=1 loops=11028)
-> Seq Scan on icecream rc (cost=0.00..23.00 rows=5
width=24) (actual time=0.001..0.001 rows=0 loops=11028)
Filter: (rc.dataset_id = 283)
-> Index Scan using icecream_part_283_widget_id_idx on
icecream_part_283 rc (cost=0.00..6.99 rows=1 width=24) (actual
time=0.004..0.006 rows=1 loops=11028)
Index Cond: (rc.widget_id = r.widget_id)
Filter: (rc.dataset_id = 283)
Total runtime: 361.180 ms
(19 rows)

The query was nice and fast as expected.

I then restored the enable* options back to their defaults. The query was
slow again, taking around 19 seconds. So this gives us some information
about whether autoanalyze is running in the background:
-I don't think waiting makes a difference. I loaded the data 12 hours ago
and then ran the query, and it was very slow. I then ran the query again
for a different dataset and it was also slow. It was only once I changed
the enable* parameters that I got my expected performance.
-I would have tried select * from icecream again, but with 176 partitions
that query is no longer feasible. But I do agree that select * from
icecream is what caused the statistics to be updated.

So, then I followed Tom's suggestion to defeat the empty-table heuristic.

select relpages from pg_class where relname = 'icecream';

relpages
----------
0
(1 row)

Ok, so the planner thinks that the parent table is empty. I then ran the
following, and re-ran the original query with the enable* options back at
their defaults:
update pg_class set relpages = 1 where relname = 'icecream';

HashAggregate (cost=201199.27..201564.18 rows=11228 width=41) (actual
time=277.195..304.620 rows=11028 loops=1)
-> Nested Loop (cost=4.28..79942.45 rows=4041894 width=41) (actual
time=0.227..231.181 rows=11028 loops=1)
Join Filter: (r.widget_id = rc.widget_id)
-> Append (cost=4.28..1163.69 rows=11228 width=21) (actual
time=0.125..40.834 rows=11028 loops=1)
-> Bitmap Heap Scan on widget r (cost=4.28..12.75 rows=1
width=48) (actual time=0.022..0.022 rows=0 loops=1)
Recheck Cond: (widgetset_id = 5)
Filter: (widget_type_id = 4)
-> Bitmap Index Scan on widget_widgetset_id_idx
(cost=0.00..4.28 rows=4 width=0) (actual time=0.019..0.019 rows=0 loops=1)
Index Cond: (widgetset_id = 5)
-> Index Scan using widget_part_5_widget_widget_type_id_idx
on widget_part_5 r (cost=0.00..1150.94 rows=11227 width=21) (actual
time=0.100..18.964 rows=11028 loops=1)
Index Cond: (widget_type_id = 4)
Filter: (widgetset_id = 5)
-> Append (cost=0.00..6.99 rows=2 width=24) (actual
time=0.008..0.012 rows=1 loops=11028)
-> Seq Scan on icecream rc (cost=0.00..0.00 rows=1
width=24) (actual time=0.001..0.001 rows=0 loops=11028)
Filter: (rc.dataset_id = 283)
-> Index Scan using icecream_part_283_widget_id_idx on
icecream_part_283 rc (cost=0.00..6.99 rows=1 width=24) (actual
time=0.004..0.006 rows=1 loops=11028)
Index Cond: (rc.widget_id = r.widget_id)
Filter: (rc.dataset_id = 283)
Total runtime: 318.634 ms
(19 rows)

Wow! That fixes it. Thank you so much!!!! I've been struggling with this
issue for 2-3 days. (Also, in the past I've seen inconsistent performance
with this query, which may be the result of the planner sometimes choosing
the wrong plan, but I'll chase that down later.)

Tom said: But maybe we should reconsider the heuristic for tables that are
members of inheritance trees --- particularly parents of inheritance trees.

I agree. I think postgres should get updated to take this into account. I
shouldn't have to set the relpages to 1 for all the empty parent tables that
I have partitioned. Should I file this as a bug/enhancement?

Also, do I need to worry about autoanalyze/autovacuum setting relpages back
to zero for the parent icecream table?
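
For now I'm planning to just recheck the catalogs periodically to see if
that happens -- something like:

select c.relname, c.relpages, s.last_autovacuum, s.last_autoanalyze
from pg_class c
join pg_stat_user_tables s on s.relid = c.oid
where c.relname = 'icecream';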

thanks!!!
Anish

On Sat, Jul 9, 2011 at 10:43 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> > On 09.07.2011 00:36, Anish Kejariwal wrote:
> >> My guess as to what happened:
> >> -because the icecream parent table has zero records, the query optimizer
> >> chooses the incorrect execution plan
> >> -when I do select * from icecream, the optimizer now knows how many
> records
> >> are really in the icecream table, by knowing that the icecream table has
> >> partitions.
>
> > "select * from icecream" won't have any direct effect on the
> > optimization of subsequent queries. What probably happened is that
> > autoanalyze ran in the background while you ran that select, and
> > analyzed some of the partitions. Simply waiting a while would've had the
> > same effect.
>
> Yeah. Also, the reason that a manual vacuum on icecream changes things
> yet again is that in 9.0 and up, we have a notion of summary stats
> across the whole inheritance tree, but autoanalyze hasn't been taught to
> gather those. The manual command on the parent table does gather them,
> though.
>
> So what's happening here is that we suddenly have an accurate idea of
> the size of the join product as a result of having inheritance summary
> stats to estimate with, and that drives the estimated cost of the merge
> or hash join down out of the stratosphere. The estimated cost of the
> nestloop goes down a lot too, but not as much.
>
> I experimented with a similar case here, and it seems like a lot of the
> remaining error in the nestloop estimate comes from this:
>
> >> -> Seq Scan on icecream rc (cost=0.00..23.00 rows=5 width=24) (actual
> time=0.002..0.002 rows=0 loops=1)
> >> Filter: (dataset_id = 281)
>
> The indexscan on the nonempty child partition is estimated at less than
> 10 cost units, so this is a *large* fraction of what the planner sees as
> the per-outer-row cost of a nestloop. And with more than 11000 rows on
> the other side of the join, that discourages it from using the nestloop.
> In reality of course this takes negligible time compared to examining
> the child partition.
>
> Now why is the seqscan cost estimate so large, when actually the parent
> icecream table is totally empty? It's because the planner has been
> taught to never believe that an empty table is empty. If memory serves,
> it's really estimating on an assumption that the table contains 10 pages
> and some corresponding number of rows. This is a reasonable defensive
> posture when dealing with ordinary tables, I think, since most likely
> if the catalogs say the table is empty that's just a leftover from when
> it was created. But maybe we should reconsider the heuristic for tables
> that are members of inheritance trees --- particularly parents of
> inheritance trees.
>
> I was able to defeat the empty-table heuristic here by doing
>
> update pg_class set relpages = 1 where relname = 'icecream';
>
> and then I started getting much more realistic estimates in my test
> case. (It still wanted to use a merge join initially, but after
> knocking down random_page_cost it went to the nestloop.) It would
> be interesting to see what sorts of results Anish gets with that.
>
> regards, tom lane
>


From: Jeremy Harris <jgh(at)wizmail(dot)org>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: issue with query optimizer when joining two partitioned tables
Date: 2011-07-10 15:46:25
Message-ID: 4E19C951.1050109@wizmail.org
Lists: pgsql-performance

On 2011-07-09 18:43, Tom Lane wrote:
> Heikki Linnakangas<heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>> On 09.07.2011 00:36, Anish Kejariwal wrote:
>>> My guess as to what happened:
>>> -because the icecream parent table has zero records, the query optimizer
>>> chooses the incorrect execution plan
>>> -when I do select * from icecream, the optimizer now knows how many records
>>> are really in the icecream table, by knowing that the icecream table has
>>> partitions.
>
>> "select * from icecream" won't have any direct effect on the
>> optimization of subsequent queries. What probably happened is that
>> autoanalyze ran in the background while you ran that select, and
>> analyzed some of the partitions. Simply waiting a while would've had the
>> same effect.
>
> Yeah. Also, the reason that a manual vacuum on icecream changes things
> yet again is that in 9.0 and up, we have a notion of summary stats
> across the whole inheritance tree, but autoanalyze hasn't been taught to
> gather those. The manual command on the parent table does gather them,
> though.

Is stats-gathering significantly more expensive than an FTS? Could an FTS
update stats as a matter of course (or perhaps only if enough of the table has changed)?
--
Jeremy