From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: -HEAD planner issue wrt hash_joins on dbt3 ?
Date: 2006-09-11 17:54:48
Message-ID: 4505A2E8.7000509@kaltenbrunner.cc
Lists: pgsql-hackers

Hi!

I'm about to do some benchmarking on -HEAD on some hardware I have
available, and it seems I'm hitting a rather weird issue that causes the
OSDL DBT-3 benchmark to run very slowly, eating CPU time for hours ...

it seems that the issue is caused by the following query:
(in case it gets linewrapped:
http://www.kaltenbrunner.cc/files/dbt3_with_hashjoin.txt)

select n_name, sum(l_extendedprice * (1 - l_discount)) as revenue
from customer, orders, lineitem, supplier, nation, region
where c_custkey = o_custkey
  and l_orderkey = o_orderkey
  and l_suppkey = s_suppkey
  and c_nationkey = s_nationkey
  and s_nationkey = n_nationkey
  and n_regionkey = r_regionkey
  and r_name = 'AFRICA'
  and o_orderdate >= date '1993-01-01'
  and o_orderdate < date '1993-01-01' + interval '1 year'
group by n_name
order by revenue desc;

that results in the following plan on my box:

Sort  (cost=2543391.75..2543391.81 rows=25 width=37)
  Sort Key: sum((lineitem.l_extendedprice * (1::double precision - lineitem.l_discount)))
  ->  HashAggregate  (cost=2543390.73..2543391.17 rows=25 width=37)
        ->  Hash Join  (cost=440864.81..2543027.40 rows=72666 width=37)
              Hash Cond: ((orders.o_custkey = customer.c_custkey) AND (supplier.s_nationkey = customer.c_nationkey))
              ->  Hash Join  (cost=377714.59..2415568.01 rows=1816643 width=49)
                    Hash Cond: (lineitem.l_orderkey = orders.o_orderkey)
                    ->  Nested Loop  (cost=13.65..1719683.85 rows=12000672 width=49)
                          ->  Merge Join  (cost=0.00..10248.66 rows=20000 width=41)
                                Merge Cond: (nation.n_nationkey = supplier.s_nationkey)
                                ->  Nested Loop  (cost=0.00..19.19 rows=5 width=33)
                                      ->  Index Scan using pk_nation on nation  (cost=0.00..9.38 rows=25 width=37)
                                      ->  Index Scan using pk_region on region  (cost=0.00..0.38 rows=1 width=4)
                                            Index Cond: (nation.n_regionkey = region.r_regionkey)
                                            Filter: (r_name = 'AFRICA'::bpchar)
                                ->  Index Scan using i_s_nationkey on supplier  (cost=0.00..9779.46 rows=100000 width=8)
                          ->  Bitmap Heap Scan on lineitem  (cost=13.65..77.16 rows=665 width=16)
                                Recheck Cond: (lineitem.l_suppkey = supplier.s_suppkey)
                                ->  Bitmap Index Scan on i_l_suppkey  (cost=0.00..13.65 rows=665 width=0)
                                      Index Cond: (lineitem.l_suppkey = supplier.s_suppkey)
                    ->  Hash  (cost=372023.51..372023.51 rows=2270971 width=8)
                          ->  Bitmap Heap Scan on orders  (cost=41391.94..372023.51 rows=2270971 width=8)
                                Recheck Cond: ((o_orderdate >= '1993-01-01'::date) AND (o_orderdate < '1994-01-01 00:00:00'::timestamp without time zone))
                                ->  Bitmap Index Scan on i_o_orderdate  (cost=0.00..41391.94 rows=2270971 width=0)
                                      Index Cond: ((o_orderdate >= '1993-01-01'::date) AND (o_orderdate < '1994-01-01 00:00:00'::timestamp without time zone))
              ->  Hash  (cost=55647.15..55647.15 rows=1500615 width=8)
                    ->  Seq Scan on customer  (cost=0.00..55647.15 rows=1500615 width=8)
(27 rows)

so it really thinks that doing hashes with gigantic amounts of data is
a good idea generally - this seems to be independent of work_mem - the
plan looks the same with 1MB vs 126MB (which I had during the run).
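
(for reference, a sketch of the session - not the exact commands I used,
but along these lines:

  SET work_mem = '1MB';
  EXPLAIN <the query from above>;    -- same plan
  SET work_mem = '126MB';
  EXPLAIN <the query from above>;    -- same plan
)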

the profile of the backend eating the CPU looks similar to:

26351 27.9047 ExecScanHashBucket
8239 8.7248 hash_seq_search
6984 7.3958 hash_search_with_hash_value

setting enable_hashjoin to off results in a runtime of about 2.5 minutes:

(http://www.kaltenbrunner.cc/files/dbt3_without_hashjoin.txt)

Sort  (cost=3700257.38..3700257.45 rows=25 width=37) (actual time=286820.962..286820.968 rows=5 loops=1)
  Sort Key: sum((lineitem.l_extendedprice * (1::double precision - lineitem.l_discount)))
  ->  HashAggregate  (cost=3700256.37..3700256.80 rows=25 width=37) (actual time=286820.932..286820.941 rows=5 loops=1)
        ->  Nested Loop  (cost=730956.43..3699893.04 rows=72666 width=37) (actual time=43551.767..286488.555 rows=72441 loops=1)
              Join Filter: (customer.c_nationkey = supplier.s_nationkey)
              ->  Merge Join  (cost=730956.43..3624153.73 rows=1816643 width=49) (actual time=43281.710..257082.739 rows=1822547 loops=1)
                    Merge Cond: (lineitem.l_orderkey = orders.o_orderkey)
                    ->  Index Scan using i_l_orderkey on lineitem  (cost=0.00..2715943.34 rows=60003360 width=16) (actual time=32.868..123668.380 rows=59991868 loops=1)
                    ->  Sort  (cost=730956.43..732091.92 rows=454194 width=41) (actual time=43248.797..45754.223 rows=1822547 loops=1)
                          Sort Key: orders.o_orderkey
                          ->  Merge Join  (cost=670885.68..688278.21 rows=454194 width=41) (actual time=34469.359..42050.059 rows=455262 loops=1)
                                Merge Cond: (customer.c_custkey = orders.o_custkey)
                                ->  Sort  (cost=59105.79..59856.10 rows=300123 width=41) (actual time=8113.826..8491.532 rows=299493 loops=1)
                                      Sort Key: customer.c_custkey
                                      ->  Nested Loop  (cost=781.13..31801.81 rows=300123 width=41) (actual time=107.537..7461.355 rows=299493 loops=1)
                                            ->  Nested Loop  (cost=1.06..11.00 rows=5 width=33) (actual time=0.030..0.296 rows=5 loops=1)
                                                  Join Filter: (nation.n_regionkey = region.r_regionkey)
                                                  ->  Index Scan using pk_nation on nation  (cost=0.00..9.38 rows=25 width=37) (actual time=0.007..0.063 rows=25 loops=1)
                                                  ->  Materialize  (cost=1.06..1.07 rows=1 width=4) (actual time=0.002..0.004 rows=1 loops=25)
                                                        ->  Seq Scan on region  (cost=0.00..1.06 rows=1 width=4) (actual time=0.009..0.018 rows=1 loops=1)
                                                              Filter: (r_name = 'AFRICA'::bpchar)
                                            ->  Bitmap Heap Scan on customer  (cost=780.07..5607.85 rows=60025 width=8) (actual time=61.150..1331.466 rows=59899 loops=5)
                                                  Recheck Cond: (nation.n_nationkey = customer.c_nationkey)
                                                  ->  Bitmap Index Scan on i_c_nationkey  (cost=0.00..780.07 rows=60025 width=0) (actual time=44.637..44.637 rows=59899 loops=5)
                                                        Index Cond: (nation.n_nationkey = customer.c_nationkey)
                                ->  Sort  (cost=611779.89..617457.31 rows=2270971 width=8) (actual time=26355.515..29471.963 rows=2276859 loops=1)
                                      Sort Key: orders.o_custkey
                                      ->  Bitmap Heap Scan on orders  (cost=41391.94..372023.51 rows=2270971 width=8) (actual time=1630.604..16266.102 rows=2276859 loops=1)
                                            Recheck Cond: ((o_orderdate >= '1993-01-01'::date) AND (o_orderdate < '1994-01-01 00:00:00'::timestamp without time zone))
                                            ->  Bitmap Index Scan on i_o_orderdate  (cost=0.00..41391.94 rows=2270971 width=0) (actual time=1352.037..1352.037 rows=2276859 loops=1)
                                                  Index Cond: ((o_orderdate >= '1993-01-01'::date) AND (o_orderdate < '1994-01-01 00:00:00'::timestamp without time zone))
              ->  Index Scan using pk_supplier on supplier  (cost=0.00..0.03 rows=1 width=8) (actual time=0.010..0.012 rows=1 loops=1822547)
                    Index Cond: (lineitem.l_suppkey = supplier.s_suppkey)
Total runtime: 286984.386 ms
(34 rows)

(about 120s of that seems to be EXPLAIN ANALYZE overhead here)

fwiw the box in question is a dual 2.6GHz Opteron with 8GB of RAM - WAL
is on the onboard BBWC SmartArray (RAID 10 on 4 disks) and the data is
on a 14-disk Linux software RAID 10, running Debian Sarge/AMD64 with
kernel 2.6.17.7. The DBT-3 database got initialized with a scale factor
of 10 (running with just 1 works fine).

Stefan


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: -HEAD planner issue wrt hash_joins on dbt3 ?
Date: 2006-09-11 19:04:46
Message-ID: 5482.1158001486@sss.pgh.pa.us
Lists: pgsql-hackers

Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
> I'm about to do some benchmarking on -HEAD on some hardware I have
> available, and it seems I'm hitting a rather weird issue that causes the
> OSDL DBT-3 benchmark to run very slowly, eating CPU time for hours ...

Could we see the actual EXPLAIN ANALYZE results for the slow plan?
I'm unconvinced by your "hash join is bad" analysis, especially in
the cases where you're giving it lots of work_mem. I think it's got
something to do with the different join orders. The rowcount estimates
in the fast plan all seem pretty good, but I'm betting something is
wrong with some of them in the slow case.

regards, tom lane


From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: -HEAD planner issue wrt hash_joins on dbt3 ?
Date: 2006-09-11 19:10:53
Message-ID: 4505B4BD.8070702@kaltenbrunner.cc
Lists: pgsql-hackers

Tom Lane wrote:
> Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
>> I'm about to do some benchmarking on -HEAD on some hardware I have
>> available, and it seems I'm hitting a rather weird issue that causes the
>> OSDL DBT-3 benchmark to run very slowly, eating CPU time for hours ...
>
> Could we see the actual EXPLAIN ANALYZE results for the slow plan?
> I'm unconvinced by your "hash join is bad" analysis, especially in
> the cases where you're giving it lots of work_mem. I think it's got
> something to do with the different join orders. The rowcount estimates
> in the fast plan all seem pretty good, but I'm betting something is
> wrong with some of them in the slow case.

will do - but that will take a while - the slow one runs for 12h or so
even without explain analyze overhead ...

Stefan


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: -HEAD planner issue wrt hash_joins on dbt3 ?
Date: 2006-09-11 19:32:45
Message-ID: 5701.1158003165@sss.pgh.pa.us
Lists: pgsql-hackers

Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
> btw - the "hashjoin is bad" was more or less based on the observation
> that nearly all of the cpu is burned in hash-related functions in the
> profile (when profiling over a longer period of time those accumulate
> even more % of the time than in the short profile I included in the
> original report)

[ shrug... ] Two out of the three functions you mentioned are not used
by hash join, and anyway the other plan probably has a comparable
execution density in sort-related functions; does that make it bad?

It's possible that the large time for ExecScanHashBucket has something
to do with skewed usage of the hash buckets due to an unfortunate data
distribution, but that's theorizing far in advance of the data.

regards, tom lane


From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: -HEAD planner issue wrt hash_joins on dbt3 ?
Date: 2006-09-12 10:43:18
Message-ID: 45068F46.9070603@kaltenbrunner.cc
Lists: pgsql-hackers

Tom Lane wrote:
> Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
>> btw - the "hashjoin is bad" was more or less based on the observation
>> that nearly all of the cpu is burned in hash-related functions in the
>> profile (when profiling over a longer period of time those accumulate
>> even more % of the time than in the short profile I included in the
>> original report)
>
> [ shrug... ] Two out of the three functions you mentioned are not used
> by hash join, and anyway the other plan probably has a comparable
> execution density in sort-related functions; does that make it bad?

hmm sorry for that - I should have checked the source before I made that
assumption :-(

>
> It's possible that the large time for ExecScanHashBucket has something
> to do with skewed usage of the hash buckets due to an unfortunate data
> distribution, but that's theorizing far in advance of the data.

http://www.kaltenbrunner.cc/files/4/

has preliminary data from the dbt3/scale-10 run I did, which seems to
imply we have at least 4 queries in there that take an excessive amount
of time (query 5 is the one I started the complaint with).
However those results have to be taken with a grain of salt since there
is an apparent bug in the dbt3 code which seems to rely on
add_missing_from=on (as can be seen in some of the error logs of the
database), and towards the end of the throughput run I did some of the
EXPLAIN ANALYZEs for the report (those are the small 100% spikes in the
graph, due to the box using the second CPU to run them).
I will redo those tests later this week though ...

Stefan


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: -HEAD planner issue wrt hash_joins on dbt3 ?
Date: 2006-09-13 00:20:59
Message-ID: 3850.1158106859@sss.pgh.pa.us
Lists: pgsql-hackers

Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
> Tom Lane wrote:
>> Could we see the actual EXPLAIN ANALYZE results for the slow plan?

> http://www.kaltenbrunner.cc/files/dbt3_explain_analyze.txt

Well, indeed it seems that the hash join is just an innocent bystander:
the bulk of the runtime (all but about 120 sec in fact) is spent here:

->  Nested Loop  (cost=13.65..1719683.85 rows=12000672 width=49) (actual time=60.325..24923860.713 rows=11897899 loops=1)
      ->  Merge Join  (cost=0.00..10248.66 rows=20000 width=41) (actual time=16.654..2578.060 rows=19837 loops=1)
            ...
      ->  Bitmap Heap Scan on lineitem  (cost=13.65..77.16 rows=665 width=16) (actual time=13.492..1254.535 rows=600 loops=19837)
            Recheck Cond: (lineitem.l_suppkey = supplier.s_suppkey)
            ->  Bitmap Index Scan on i_l_suppkey  (cost=0.00..13.65 rows=665 width=0) (actual time=10.662..10.662 rows=600 loops=19837)
                  Index Cond: (lineitem.l_suppkey = supplier.s_suppkey)

I suppose that the profile result you showed was taken during the
startup transient where it was computing the hashtables that this loop's
results are joined to ... but that's not where the problem is. The
problem is repeating that bitmap scan on lineitem for nearly 20000
different l_suppkeys.

Apparently we've made the planner a bit too optimistic about the savings
that can be expected from repeated indexscans occurring on the inside of
a join. The other plan uses a different join order and doesn't try to
join lineitem until it's got orders.o_orderkey, whereupon it does a
mergejoin against an indexscan on lineitem:

-> Index Scan using i_l_orderkey on lineitem (cost=0.00..2715943.34 rows=60003360 width=16) (actual time=32.868..123668.380 rows=59991868 loops=1)

The runtimes for the remainders of the plans are roughly comparable, so
it's the cost of joining lineitem that is hurting here.

Is lineitem sorted (or nearly sorted) by l_orderkey? Part of the
problem could be overestimating the cost of this indexscan.

What are the physical sizes of lineitem and its indexes, and how do
those compare to your RAM? What are you using for planner settings
(particularly effective_cache_size)?

regards, tom lane


From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: -HEAD planner issue wrt hash_joins on dbt3 ?
Date: 2006-09-13 07:35:30
Message-ID: 4507B4C2.9060208@kaltenbrunner.cc
Lists: pgsql-hackers

Tom Lane wrote:
> Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
>> Tom Lane wrote:
>>> Could we see the actual EXPLAIN ANALYZE results for the slow plan?
>
>> http://www.kaltenbrunner.cc/files/dbt3_explain_analyze.txt
>
> Well, indeed it seems that the hash join is just an innocent bystander:
> the bulk of the runtime (all but about 120 sec in fact) is spent here:
>
> ->  Nested Loop  (cost=13.65..1719683.85 rows=12000672 width=49) (actual time=60.325..24923860.713 rows=11897899 loops=1)
>       ->  Merge Join  (cost=0.00..10248.66 rows=20000 width=41) (actual time=16.654..2578.060 rows=19837 loops=1)
>             ...
>       ->  Bitmap Heap Scan on lineitem  (cost=13.65..77.16 rows=665 width=16) (actual time=13.492..1254.535 rows=600 loops=19837)
>             Recheck Cond: (lineitem.l_suppkey = supplier.s_suppkey)
>             ->  Bitmap Index Scan on i_l_suppkey  (cost=0.00..13.65 rows=665 width=0) (actual time=10.662..10.662 rows=600 loops=19837)
>                   Index Cond: (lineitem.l_suppkey = supplier.s_suppkey)
>
> I suppose that the profile result you showed was taken during the
> startup transient where it was computing the hashtables that this loop's
> results are joined to ... but that's not where the problem is. The
> problem is repeating that bitmap scan on lineitem for nearly 20000
> different l_suppkeys.

possible - I actually took them over a longer period of time

>
> Apparently we've made the planner a bit too optimistic about the savings
> that can be expected from repeated indexscans occurring on the inside of
> a join. The other plan uses a different join order and doesn't try to
> join lineitem until it's got orders.o_orderkey, whereupon it does a
> mergejoin against an indexscan on lineitem:
>
> -> Index Scan using i_l_orderkey on lineitem (cost=0.00..2715943.34 rows=60003360 width=16) (actual time=32.868..123668.380 rows=59991868 loops=1)
>
> The runtimes for the remainders of the plans are roughly comparable, so
> it's the cost of joining lineitem that is hurting here.
>
> Is lineitem sorted (or nearly sorted) by l_orderkey? Part of the
> problem could be overestimating the cost of this indexscan.
>
> What are the physical sizes of lineitem and its indexes, and how do
> those compare to your RAM? What are you using for planner settings
> (particularly effective_cache_size)?

ouch - you are right (as usual) here.
effective_cache_size was set to 10GB (my fault for copying over the conf
from a 16GB box) during the run - lowering it by just a few megabytes(!)
or to a more realistic 6GB results in the following MUCH better plan:

http://www.kaltenbrunner.cc/files/dbt3_explain_analyze2.txt

as for the relation sizes:

dbt3=# select pg_relation_size('lineitem');
pg_relation_size
------------------
10832764928
(1 row)

dbt3=# select pg_total_relation_size('lineitem');
pg_total_relation_size
------------------------
22960259072
(1 row)

there are nine btree indexes on lineitem, all between 1.1GB and 1.4GB in
size.
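
(in case somebody wants to compare - one way to list those index sizes,
nothing dbt3-specific about it:

  SELECT c.relname, pg_relation_size(c.oid) AS bytes
  FROM pg_class c
  JOIN pg_index i ON i.indexrelid = c.oid
  WHERE i.indrelid = 'lineitem'::regclass
  ORDER BY 2 DESC;
)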

Stefan


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: -HEAD planner issue wrt hash_joins on dbt3 ?
Date: 2006-09-13 14:47:09
Message-ID: 11530.1158158829@sss.pgh.pa.us
Lists: pgsql-hackers

Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
> Tom Lane wrote:
>> Apparently we've made the planner a bit too optimistic about the savings
>> that can be expected from repeated indexscans occurring on the inside of
>> a join.

> effective_cache_size was set to 10GB (my fault for copying over the conf
> from a 16GB box) during the run - lowering it by just a few megabytes(!)
> or to a more realistic 6GB results in the following MUCH better plan:
> http://www.kaltenbrunner.cc/files/dbt3_explain_analyze2.txt

Interesting. It used to be that effective_cache_size wasn't all that
critical... what I think this report is showing is that with the 8.2
changes to try to account for caching effects in repeated indexscans,
we've turned that into a pretty significant parameter.

It'd be nice not to have to depend on the DBA to give us a good number
for this setting. But I don't know of any portable ways to find out
how much RAM is in the box, let alone what fraction of it we should
assume is available per-query.

regards, tom lane


From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: -HEAD planner issue wrt hash_joins on dbt3 ?
Date: 2006-09-13 15:11:28
Message-ID: 45081FA0.6050704@kaltenbrunner.cc
Lists: pgsql-hackers

Tom Lane wrote:
> Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
>> Tom Lane wrote:
>>> Apparently we've made the planner a bit too optimistic about the savings
>>> that can be expected from repeated indexscans occurring on the inside of
>>> a join.
>
>> effective_cache_size was set to 10GB (my fault for copying over the conf
>> from a 16GB box) during the run - lowering it by just a few megabytes(!)
>> or to a more realistic 6GB results in the following MUCH better plan:
>> http://www.kaltenbrunner.cc/files/dbt3_explain_analyze2.txt
>
> Interesting. It used to be that effective_cache_size wasn't all that
> critical... what I think this report is showing is that with the 8.2
> changes to try to account for caching effects in repeated indexscans,
> we've turned that into a pretty significant parameter.

yes, I'm a bit worried about that too - it has been a bit of
"conventional wisdom" that setting effective_cache_size optimistically
will never hurt, and that it encourages PostgreSQL to sometimes get a
better plan by favouring index scans.

>
> It'd be nice not to have to depend on the DBA to give us a good number
> for this setting. But I don't know of any portable ways to find out
> how much RAM is in the box, let alone what fraction of it we should
> assume is available per-query.

well, there are really a number of things for which the DBA had better
give the database accurate information - though in that case we might go
from "too much won't hurt" to "too much will hurt" ...

Stefan


From: David Fetter <david(at)fetter(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: -HEAD planner issue wrt hash_joins on dbt3 ?
Date: 2006-09-13 15:30:01
Message-ID: 20060913153001.GB9228@fetter.org
Lists: pgsql-hackers

On Wed, Sep 13, 2006 at 10:47:09AM -0400, Tom Lane wrote:
> Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
> > Tom Lane wrote:
> >> Apparently we've made the planner a bit too optimistic about the savings
> >> that can be expected from repeated indexscans occurring on the inside of
> >> a join.
>
> > effective_cache_size was set to 10GB (my fault for copying over the conf
> > from a 16GB box) during the run - lowering it by just a few megabytes(!)
> > or to a more realistic 6GB results in the following MUCH better plan:
> > http://www.kaltenbrunner.cc/files/dbt3_explain_analyze2.txt
>
> Interesting. It used to be that effective_cache_size wasn't all
> that critical... what I think this report is showing is that with
> the 8.2 changes to try to account for caching effects in repeated
> indexscans, we've turned that into a pretty significant parameter.
>
> It'd be nice not to have to depend on the DBA to give us a good
> number for this setting. But I don't know of any portable ways to
> find out how much RAM is in the box, let alone what fraction of it
> we should assume is available per-query.

That's fairly straightforward, if a little crude. We ask the DBA and
provide some tools for estimating and tuning same. :)

Cheers,
D
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
phone: +1 415 235 3778 AIM: dfetter666
Skype: davidfetter

Remember to vote!


From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: -HEAD planner issue wrt hash_joins on dbt3 ?
Date: 2006-09-17 06:42:51
Message-ID: 450CEE6B.1090005@kaltenbrunner.cc
Lists: pgsql-hackers

[already sent a variant of that yesterday but it doesn't look like it
made it to the list]

Tom Lane wrote:
> Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
>> Tom Lane wrote:
>>> Apparently we've made the planner a bit too optimistic about the savings
>>> that can be expected from repeated indexscans occurring on the inside of
>>> a join.
>
>> effective_cache_size was set to 10GB (my fault for copying over the conf
>> from a 16GB box) during the run - lowering it by just a few megabytes(!)
>> or to a more realistic 6GB results in the following MUCH better plan:
>> http://www.kaltenbrunner.cc/files/dbt3_explain_analyze2.txt
>
> Interesting. It used to be that effective_cache_size wasn't all that
> critical... what I think this report is showing is that with the 8.2
> changes to try to account for caching effects in repeated indexscans,
> we've turned that into a pretty significant parameter.

took me a while due to hardware issues on my testbox - but there are new
results (with 6GB for effective_cache_size) up at:

http://www.kaltenbrunner.cc/files/5/

there are still a few issues with the validity of the run, like the rf
tests not actually being done right - but lowering effective_cache_size
gave a dramatic speedup on Q5, Q7 and Q8.

that is the explain for the 4h+ Q9:

http://www.kaltenbrunner.cc/files/analyze_q9.txt

increasing the statistics target up to 1000 does not seem to change
the plan, btw.
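
(to be precise, what I mean by raising the target is the usual
per-column knob - p_name on part is just an example column here:

  ALTER TABLE part ALTER COLUMN p_name SET STATISTICS 1000;
  ANALYZE part;
)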

disabling nested loops leads to the following (4 times faster) plan:

http://www.kaltenbrunner.cc/files/analyze_q9_no_nest.txt

since the hash joins in there look rather slow (inappropriate hashtable
setup due to the wrong estimates?) I disabled hash joins too:

http://www.kaltenbrunner.cc/files/analyze_q9_no_nest_no_hashjoin.txt

and amazingly this plan is by far the fastest one in runtime (15min vs
4.5h ...) except that the planner thinks it is 20 times more expensive ...
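
(for anybody trying to reproduce this, the toggles are just the standard
planner switches - roughly:

  SET enable_nestloop = off;
  EXPLAIN ANALYZE <query 9>;
  SET enable_hashjoin = off;
  EXPLAIN ANALYZE <query 9>;
)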

Stefan


From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: -HEAD planner issue wrt hash_joins on dbt3 ?
Date: 2006-09-17 10:52:55
Message-ID: 450D2907.3070307@kaltenbrunner.cc
Lists: pgsql-hackers

Stefan Kaltenbrunner wrote:
> [already sent a variant of that yesterday but it doesn't look like it
> made it to the list]
>
> Tom Lane wrote:
>> Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
>>> Tom Lane wrote:
>>>> Apparently we've made the planner a bit too optimistic about the savings
>>>> that can be expected from repeated indexscans occurring on the inside of
>>>> a join.
>>> effective_cache_size was set to 10GB (my fault for copying over the conf
>>> from a 16GB box) during the run - lowering it by just a few megabytes(!)
>>> or to a more realistic 6GB results in the following MUCH better plan:
>>> http://www.kaltenbrunner.cc/files/dbt3_explain_analyze2.txt
>> Interesting. It used to be that effective_cache_size wasn't all that
>> critical... what I think this report is showing is that with the 8.2
>> changes to try to account for caching effects in repeated indexscans,
>> we've turned that into a pretty significant parameter.
>
> took me a while due to hardware issues on my testbox - but there are new
> results (with 6GB for effective_cache_size) up at:
>
> http://www.kaltenbrunner.cc/files/5/
>
> there are still a few issues with the validity of the run, like the rf
> tests not actually being done right - but lowering effective_cache_size
> gave a dramatic speedup on Q5, Q7 and Q8.
>
> that is the explain for the 4h+ Q9:
>
> http://www.kaltenbrunner.cc/files/analyze_q9.txt
>
> increasing the statistics target up to 1000 does not seem to change
> the plan, btw.
>
> disabling nested loops leads to the following (4 times faster) plan:
>
> http://www.kaltenbrunner.cc/files/analyze_q9_no_nest.txt
>
> since the hash joins in there look rather slow (inappropriate hashtable
> setup due to the wrong estimates?) I disabled hash joins too:
>
> http://www.kaltenbrunner.cc/files/analyze_q9_no_nest_no_hashjoin.txt
>
> and amazingly this plan is by far the fastest one in runtime (15min vs
> 4.5h ...) except that the planner thinks it is 20 times more expensive ...

some additional numbers (the first one is with default settings, the
second with enable_nestloop = 'off', the third with enable_nestloop =
'off' and enable_hashjoin = 'off'):

http://www.kaltenbrunner.cc/files/analyze_q7.txt

here we get a 3x speedup from disabling nested loops, and a 2x speedup
(over the original plan) with both nested loops and hash joins disabled.

http://www.kaltenbrunner.cc/files/analyze_q20.txt

here we get a 180x(!) speedup with both planner options disabled ...

it is worth mentioning that for both queries the estimated costs
relative to each other look quite reasonable as soon as enable_nestloop
= 'off' (i.e. 5042928 vs 10715247 with 344 sec vs 514 sec for Q7, and
101441851 vs 101445468 with 10 sec vs 11 sec).

Stefan


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: -HEAD planner issue wrt hash_joins on dbt3 ?
Date: 2006-09-17 19:26:56
Message-ID: 23958.1158521216@sss.pgh.pa.us
Lists: pgsql-hackers

Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
> that is the explain for the 4h+ Q9:
> http://www.kaltenbrunner.cc/files/analyze_q9.txt

The big problem there seems to be the drastic misestimation of the
number of rows matching the p_name ~~ '%ghost%' condition. What does
pg_stats have for the p_name column?
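
(Something like this would show it - adjust schema qualification as
needed:

  SELECT null_frac, n_distinct, most_common_vals,
         most_common_freqs, histogram_bounds
  FROM pg_stats
  WHERE tablename = 'part' AND attname = 'p_name';
)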

regards, tom lane


From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: -HEAD planner issue wrt hash_joins on dbt3 ?
Date: 2006-09-17 19:39:36
Message-ID: 450DA478.9080605@kaltenbrunner.cc
Lists: pgsql-hackers

Tom Lane wrote:
> Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
>> that is the explain for the 4h+ Q9:
>> http://www.kaltenbrunner.cc/files/analyze_q9.txt
>
> The big problem there seems to be the drastic misestimation of the
> number of rows matching the p_name ~~ '%ghost%' condition. What does
> pg_stats have for the p_name column?

http://www.kaltenbrunner.cc/files/pg_stat_p_name.txt

Stefan


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: -HEAD planner issue wrt hash_joins on dbt3 ?
Date: 2006-09-17 19:48:50
Message-ID: 24072.1158522530@sss.pgh.pa.us
Lists: pgsql-hackers

Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
> some additional numbers (the first one is with default settings, the
> second with enable_nestloop = 'off', the third with enable_nestloop =
> 'off' and enable_hashjoin = 'off'):

> http://www.kaltenbrunner.cc/files/analyze_q7.txt

I'm inclined to think you still have effective_cache_size set too high;
or at least that the planner is being too optimistic about how much
cache space is actually available to each indexscan.

With the code as it currently stands, effective_cache_size has some of
the same properties as work_mem: the planner effectively assumes that
that much space is available to *each* indexscan, and so you'd need to
de-rate the setting based on the complexity of queries and the number of
concurrent sessions.

I'm not sure what we could do about the concurrent-sessions issue, but
we could make some sort of attack on the query complexity issue by
pro-rating the effective_cache_size among all the tables used by a
query.

> http://www.kaltenbrunner.cc/files/analyze_q20.txt
> here we get a 180x(!) speedup with both planner options disabled ...

There's something awfully bogus about that one --- how is it that the
aggregate subplan, with the exact same plan and same number of
executions in all three cases, has an actual runtime 200x more in the
first case?

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: -HEAD planner issue wrt hash_joins on dbt3 ?
Date: 2006-09-17 20:00:44
Message-ID: 24158.1158523244@sss.pgh.pa.us
Lists: pgsql-hackers

Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
> Tom Lane wrote:
>> The big problem there seems to be the drastic misestimation of the
>> number of rows matching the p_name ~~ '%ghost%' condition. What does
>> pg_stats have for the p_name column?

> http://www.kaltenbrunner.cc/files/pg_stat_p_name.txt

Hmm ... pattern_sel already applies the operator directly to the
most_common_vals, but in this situation those aren't common enough
to help much. With such an extensive histogram it is awfully tempting
to assume that the histogram members are a representative sample, and
take the selectivity as being the fraction of histogram entries that
match the pattern. Maybe drop the first and last histogram entries
on the grounds they're probably outliers. Thoughts? What would be a
reasonable minimum histogram size to enable using this approach instead
of the guess-on-the-basis-of-the-pattern code?
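
(To illustrate the idea in SQL - only a sketch, using an unnest-style
array function the backend doesn't provide today, and leaving out the
dropping of the first/last bound entries; the real change would live in
the pattern-selectivity code in selfuncs.c:

  SELECT avg((b LIKE '%ghost%')::int) AS histogram_sel
  FROM (SELECT unnest(histogram_bounds::text::text[]) AS b
        FROM pg_stats
        WHERE tablename = 'part' AND attname = 'p_name') h;
)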

regards, tom lane


From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: -HEAD planner issue wrt hash_joins on dbt3 ?
Date: 2006-09-17 20:05:53
Message-ID: 450DAAA1.10909@kaltenbrunner.cc
Lists: pgsql-hackers

Tom Lane wrote:
> Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
>> some additional numbers(first one is with default settings, second is
>> with enable_nestloop = 'off', third one is with enable_nestloop = 'off'
>> and enable_hashjoin='off'):
>
>> http://www.kaltenbrunner.cc/files/analyze_q7.txt
>
> I'm inclined to think you still have effective_cache_size set too high;
> or at least that the planner is being too optimistic about how much
> cache space is actually available to each indexscan.

I have long-term external monitoring on that server, and it indeed shows
that there was never less than about 5.8GB of buffer cache in use (or
more than 2.2GB used by other means). So 6GB might still be a bit on the
optimistic side, but it is not actually that far off from reality.
I will redo with lower settings - do you have any suggestions for that?

>
> With the code as it currently stands, effective_cache_size has some of
> the same properties as work_mem: the planner effectively assumes that
> that much space is available to *each* indexscan, and so you'd need to
> de-rate the setting based on the complexity of queries and the number of
> concurrent sessions.

concurrency is 1 here - there is never more than a single query running
in parallel in those tests.

>
> I'm not sure what we could do about the concurrent-sessions issue, but
> we could make some sort of attack on the query complexity issue by
> pro-rating the effective_cache_size among all the tables used by a
> query.

hmm, not sure I understand what you mean here :-(

>
>
>> http://www.kaltenbrunner.cc/files/analyze_q20.txt
>> here we have a 180x(!) speedup with both disabled planner options ...
>
> There's something awfully bogus about that one --- how is it that the
> aggregate subplan, with the exact same plan and same number of
> executions in all three cases, has an actual runtime 200x more in the
> first case?

hmm - good question. I will redo those in a bit ...

Stefan


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: -HEAD planner issue wrt hash_joins on dbt3 ?
Date: 2006-09-17 20:18:36
Message-ID: 24373.1158524316@sss.pgh.pa.us
Lists: pgsql-hackers

Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
> Tom Lane wrote:
>> I'm not sure what we could do about the concurrent-sessions issue, but
>> we could make some sort of attack on the query complexity issue by
>> pro-rating the effective_cache_size among all the tables used by a
>> query.

> hmm, not sure I understand what you mean here :-(

Per the comment for index_pages_fetched:

* We assume that effective_cache_size is the total number of buffer pages
* available for both table and index, and pro-rate that space between the
* table and index. (Ideally other_pages should include all the other
* tables and indexes used by the query too; but we don't have a good way
* to get that number here.)

A first-order approximation to this would be to add up the total sizes
of all the other tables used in the query. I am thinking of leaving out
other indexes, mainly because we can't tell at this level which other
indexes are actually gonna get used. This would tend to underestimate
by leaving out indexes, but not by a lot if you assume indexes are much
smaller than their tables. It would also be an overestimate because
tables that are not indexscanned concurrently with the one under
consideration probably shouldn't be counted anyway. So one might hope
these effects would more or less cancel out. Anyway it seems to be a
better idea than what we have now.
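
(As a concrete illustration of "total sizes of all the other tables",
for the Q5 tables that would be something like:

  SELECT sum(relpages) AS heap_pages
  FROM pg_class
  WHERE relname IN ('customer', 'orders', 'lineitem',
                    'supplier', 'nation', 'region');
)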

> I will redo with lower settings - do you have any suggestions for that?

Try reducing effective_cache_size to maybe a fourth of what it is now.
If that helps the thing pick better plans for these multi-table queries,
then we should try changing the other_pages calculation as above.
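
(i.e., if you are at 6GB now, something like

  SET effective_cache_size = '1536MB';

in the session before re-running the problem queries.)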

regards, tom lane


From: "Jim C(dot) Nasby" <jimn(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: relation cache statistics (was: -HEAD planner issue wrt hash_joins on dbt3 ?)
Date: 2006-09-18 04:08:16
Message-ID: 20060918040816.GT38854@enterprisedb.com
Lists: pgsql-hackers

On Sun, Sep 17, 2006 at 04:18:36PM -0400, Tom Lane wrote:
> * table and index. (Ideally other_pages should include all the other
> * tables and indexes used by the query too; but we don't have a good way
> * to get that number here.)
>
> A first-order approximation to this would be to add up the total sizes
> of all the other tables used in the query. I am thinking of leaving out
> other indexes, mainly because we can't tell at this level which other
> indexes are actually gonna get used. This would tend to underestimate
> by leaving out indexes, but not by a lot if you assume indexes are much
> smaller than their tables. It would also be an overestimate because
> tables that are not indexscanned concurrently with the one under
> consideration probably shouldn't be counted anyway. So one might hope
> these effects would more or less cancel out. Anyway it seems to be a
> better idea than what we have now.

I think it'd be better to attack this problem from the "other side";
namely looking at what's actually cached. Sadly, I don't think there's
any way to actually query the OS for info about what it has buffered,
but we can look at what's in shared_buffers and assume that it's a
reasonable proxy for the OS's cache. Something like...

relBufPages / shared_buffers * effective_cache_size

should give us a decent idea of what percentage of a relation will be in
cache somewhere. (relBufPages is the number of pages the relation in
question has in the buffer).

Of course, that raises the question of how to track how many pages are
in shared buffers for a relation. Given the criticality of locking
there, we probably don't want to update that info in real-time, but for
this application it's probably OK to just scan through the buffer every
X period of time (maybe after X number of pages read into the buffers).
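
(contrib/pg_buffercache already exposes enough to experiment with this -
e.g. a rough per-relation count of pages currently in shared_buffers:

  -- requires contrib/pg_buffercache to be installed
  SELECT c.relname, count(*) AS buffers
  FROM pg_buffercache b
  JOIN pg_class c ON c.relfilenode = b.relfilenode
  WHERE b.reldatabase = (SELECT oid FROM pg_database
                         WHERE datname = current_database())
  GROUP BY c.relname
  ORDER BY buffers DESC
  LIMIT 10;
)
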
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Jim C(dot) Nasby" <jimn(at)enterprisedb(dot)com>
Cc: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: relation cache statistics (was: -HEAD planner issue wrt hash_joins on dbt3 ?)
Date: 2006-09-18 04:20:10
Message-ID: 15285.1158553210@sss.pgh.pa.us
Lists: pgsql-hackers

"Jim C. Nasby" <jimn(at)enterprisedb(dot)com> writes:
> I think it'd be better to attack this problem from the "other side";
> namely looking at what's actually cached.

You can kiss goodbye to plan stability if you go that route... and
in any case I doubt the assumption that what's in shared buffers is
representative of what's in kernel cache.

regards, tom lane


From: "Jim C(dot) Nasby" <jimn(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: relation cache statistics (was: -HEAD planner issue wrt hash_joins on dbt3 ?)
Date: 2006-09-18 05:04:08
Message-ID: 20060918050408.GY38854@enterprisedb.com
Lists: pgsql-hackers

On Mon, Sep 18, 2006 at 12:20:10AM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jimn(at)enterprisedb(dot)com> writes:
> > I think it'd be better to attack this problem from the "other side";
> > namely looking at what's actually cached.
>
> You can kiss goodbye to plan stability if you go that route... and
> in any case I doubt the assumption that what's in shared buffers is
> representative of what's in kernel cache.

Well, there are two issues with caching:

1) Is something we're going to want actually in cache right now?
2) If we need to read something more than once (i.e. higher-level btree
pages), what are the odds it will still be in cache when we come around
to it the next time?

Once the caches are warmed up, looking at what's actually in them would
give you a very good probability for #1. I suspect that for large
relations, shared_buffers would also match the OS cache pretty well in
most cases, almost certainly better than whatever estimate we're using
now. But I'm not sure how useful that info is to the planner.

For #2 we'd have to know what kind of pressure the caches are under to
replace pages and have some kind of idea how frequently the system is
hitting them. The pg_statio info might be useful there, though
unfortunately in that case I think there's much less likely to be a good
correlation between the two.

If there was *some* way to track stats on page fetches that came out of
the OS cache, I suspect we could make great use of per-relation hit
rates to come up with better plans.
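
(We do already track per-relation hit rates for shared buffers, e.g.:

  SELECT relname, heap_blks_read, heap_blks_hit
  FROM pg_statio_user_tables
  ORDER BY heap_blks_read DESC
  LIMIT 10;

but of course that says nothing about reads satisfied from the OS cache,
which is exactly the missing piece.)
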
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: Matteo Beccati <php(at)beccati(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: -HEAD planner issue wrt hash_joins on dbt3 ?
Date: 2006-09-18 09:02:09
Message-ID: 450E6091.5010407@beccati.com
Lists: pgsql-hackers

Hi,

Tom Lane wrote:
> Hmm ... pattern_sel already applies the operator directly to the
> most_common_vals, but in this situation those aren't common enough
> to help much. With such an extensive histogram it is awfully tempting
> to assume that the histogram members are a representative sample, and
> take the selectivity as being the fraction of histogram entries that
> match the pattern. Maybe drop the first and last histogram entries
> on the grounds they're probably outliers. Thoughts? What would be a
> reasonable minimum histogram size to enable using this approach instead
> of the guess-on-the-basis-of-the-pattern code?

That's what I was suggesting here, for ltree operators and LIKE
respectively:

http://archives.postgresql.org/pgsql-patches/2006-05/msg00178.php
http://archives.postgresql.org/pgsql-performance/2006-01/msg00083.php

My original ltree patch was stripped of the histogram matching code and
I will need to re-patch 8.2 when deploying it to get decent performance
with a couple of queries, but it would be very nice to avoid it ;)

I cannot see anything bad about using something like this:

if (histogram is large/representative enough)
{
    recalculate_selectivity_matching_histogram_values()

    if (new_selectivity > old_selectivity)
        return new_selectivity
    else
        return old_selectivity
}

Best regards
--
Matteo Beccati
http://phpadsnew.com
http://phppgads.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Matteo Beccati <php(at)beccati(dot)com>
Cc: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: -HEAD planner issue wrt hash_joins on dbt3 ?
Date: 2006-09-18 13:51:19
Message-ID: 27013.1158587479@sss.pgh.pa.us
Lists: pgsql-hackers

Matteo Beccati <php(at)beccati(dot)com> writes:
> I cannot see anything bad about using something like this:
> if (histogram is large/representative enough)

Well, the question is exactly what is "large enough"? I feel a bit
uncomfortable about applying the idea to a histogram with only 10
entries (especially if we ignore two of 'em). With 100 or more,
it sounds all right. What's the breakpoint?

regards, tom lane


From: Matteo Beccati <php(at)beccati(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: -HEAD planner issue wrt hash_joins on dbt3 ?
Date: 2006-09-18 15:46:24
Message-ID: 450EBF50.7050402@beccati.com
Lists: pgsql-hackers

Tom Lane ha scritto:
> Matteo Beccati <php(at)beccati(dot)com> writes:
>> I cannot see anything bad about using something like this:
>> if (histogram is large/representative enough)
>
> Well, the question is exactly what is "large enough"? I feel a bit
> uncomfortable about applying the idea to a histogram with only 10
> entries (especially if we ignore two of 'em). With 100 or more,
> it sounds all right. What's the breakpoint?

Yes, I think 100-200 could be a good breakpoint. I don't actually know
what the current usage of SET STATISTICS is; I usually set it to 1000
for columns which need more precise selectivity.

The breakpoint could be set even higher (500?) so there is room to
increase statistics without enabling the histogram check, but I don't
feel very comfortable suggesting this kind of possibly
undocumented side effect...

Best regards
--
Matteo Beccati
http://phpadsnew.com
http://phppgads.com


From: Mark Cave-Ayland <mark(dot)cave-ayland(at)ilande(dot)co(dot)uk>
To: Matteo Beccati <php(at)beccati(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: -HEAD planner issue wrt hash_joins on dbt3 ?
Date: 2006-09-19 13:55:38
Message-ID: 1158674138.5800.15.camel@mca-desktop
Lists: pgsql-hackers

On Mon, 2006-09-18 at 17:46 +0200, Matteo Beccati wrote:
> Tom Lane ha scritto:
> > Matteo Beccati <php(at)beccati(dot)com> writes:
> >> I cannot see anything bad about using something like this:
> >> if (histogram is large/representative enough)
> >
> > Well, the question is exactly what is "large enough"? I feel a bit
> > uncomfortable about applying the idea to a histogram with only 10
> > entries (especially if we ignore two of 'em). With 100 or more,
> > it sounds all right. What's the breakpoint?
>
> Yes, I think 100-200 could be a good breakpoint. I don't actually know
> what the current usage of SET STATISTICS is; I usually set it to 1000
> for columns which need more precise selectivity.
>
> The breakpoint could be set even higher (500?) so there is room to
> increase statistics without enabling the histogram check, but I don't
> feel very comfortable suggesting this kind of possibly
> undocumented side effect...

Hi everyone,

You may be interested to have a look at the statistics collector for the
geometry type within PostGIS. In order to prevent very large or very
small geometries from ruining the statistics histogram and generating
incorrect query plans, we make the assumption that the column
distribution is likely to be close to normal, and then remove any
ANALYZE-collected geometries from the set that lie outside +/- 3.25
standard deviations from the mean before creating the final histogram
(removes just under 1% of the data from each end of an assumed normal
distribution). This works well and AFAIK we've only ever had one
reported case of an incorrect query plan being generated using this
method.
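
(In SQL terms, against a hypothetical one-column sample table, the
trimming rule is roughly:

  SELECT s.val
  FROM sample s,
       (SELECT avg(val) AS m, stddev(val) AS sd FROM sample) t
  WHERE s.val BETWEEN t.m - 3.25 * t.sd
                  AND t.m + 3.25 * t.sd;
)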

Kind regards,

Mark.


From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: -HEAD planner issue wrt hash_joins on dbt3 ?
Date: 2006-09-19 18:25:19
Message-ID: 4510360F.6020204@kaltenbrunner.cc
Lists: pgsql-hackers

Tom Lane wrote:
> Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
>> Tom Lane wrote:
>>> I'm not sure what we could do about the concurrent-sessions issue, but
>>> we could make some sort of attack on the query complexity issue by
>>> pro-rating the effective_cache_size among all the tables used by a
>>> query.
>
>> hmm, not sure I understand what you mean here :-(
>
> Per the comment for index_pages_fetched:
>
> * We assume that effective_cache_size is the total number of buffer pages
> * available for both table and index, and pro-rate that space between the
> * table and index. (Ideally other_pages should include all the other
> * tables and indexes used by the query too; but we don't have a good way
> * to get that number here.)
>
> A first-order approximation to this would be to add up the total sizes
> of all the other tables used in the query. I am thinking of leaving out
> other indexes, mainly because we can't tell at this level which other
> indexes are actually gonna get used. This would tend to underestimate
> by leaving out indexes, but not by a lot if you assume indexes are much
> smaller than their tables. It would also be an overestimate because
> tables that are not indexscanned concurrently with the one under
> consideration probably shouldn't be counted anyway. So one might hope
> these effects would more or less cancel out. Anyway it seems to be a
> better idea than what we have now.

aah - I think I understand that logic now - thanks for the reference to
the source :-)

>
>> I will redo with lower settings - do you have any suggestions for that ?
>
> Try reducing effective_cache_size to maybe a fourth of what it is now.
> If that helps the thing pick better plans for these multi-table queries,
> then we should try changing the other_pages calculation as above.

ok - the planner switches to a different plan at about 2.5GB of
effective_cache_size resulting in the following plan:

http://www.kaltenbrunner.cc/files/analyze_q7_1GB.txt (3 consecutive runs
- starting with cold caches)

with 6GB I get:

http://www.kaltenbrunner.cc/files/analyze_q7_6GB.txt (single run -
immediately after the above ones)

Stefan


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: -HEAD planner issue wrt hash_joins on dbt3 ?
Date: 2006-09-19 19:00:13
Message-ID: 15529.1158692413@sss.pgh.pa.us
Lists: pgsql-hackers

Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
> ok - the planner switches to a different plan at about 2.5GB of
> effective_cache_size resulting in the following [ much better ] plan:

OK, so it sounds like it'd be a good idea to try to pro-rate
effective_cache_size among all the tables in the query. I'll see
if I can get that in without kluging the code up too much.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Matteo Beccati <php(at)beccati(dot)com>
Cc: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: -HEAD planner issue wrt hash_joins on dbt3 ?
Date: 2006-09-20 19:52:10
Message-ID: 17841.1158781930@sss.pgh.pa.us
Lists: pgsql-hackers

Matteo Beccati <php(at)beccati(dot)com> writes:
> Tom Lane ha scritto:
>> Matteo Beccati <php(at)beccati(dot)com> writes:
>>> I cannot see anything bad about using something like this:
>>> if (histogram is large/representative enough)
>>
>> Well, the question is exactly what is "large enough"? I feel a bit
>> uncomfortable about applying the idea to a histogram with only 10
>> entries (especially if we ignore two of 'em). With 100 or more,
>> it sounds all right. What's the breakpoint?

> Yes, I think 100-200 could be a good breakpoint.

I've committed this change with (for now) 100 as the minimum histogram
size to use. Stefan, are you interested in retrying your benchmark?

regards, tom lane


From: Matteo Beccati <php(at)beccati(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: -HEAD planner issue wrt hash_joins on dbt3 ?
Date: 2006-09-20 23:25:28
Message-ID: 4511CDE8.2000400@beccati.com
Lists: pgsql-hackers

Hi,

Tom Lane wrote:
> I've committed this change with (for now) 100 as the minimum histogram
> size to use. Stefan, are you interested in retrying your benchmark?

A first try with ltree gave big improvements on my smaller data set: the
estimated row count is correct or off by only 1 row. I'm now restoring a
bigger database to get more reliable results.

I hope Stefan can confirm the improvement on dbt3 too.

Thanks Tom :)

Best regards
--
Matteo Beccati
http://phpadsnew.com
http://phppgads.com


From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Matteo Beccati <php(at)beccati(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: -HEAD planner issue wrt hash_joins on dbt3 ?
Date: 2006-09-22 16:56:00
Message-ID: 451415A0.2090608@kaltenbrunner.cc
Lists: pgsql-hackers

Tom Lane wrote:
> Matteo Beccati <php(at)beccati(dot)com> writes:
>> Tom Lane ha scritto:
>>> Matteo Beccati <php(at)beccati(dot)com> writes:
>>>> I cannot see anything bad about using something like this:
>>>> if (histogram is large/representative enough)
>>> Well, the question is exactly what is "large enough"? I feel a bit
>>> uncomfortable about applying the idea to a histogram with only 10
>>> entries (especially if we ignore two of 'em). With 100 or more,
>>> it sounds all right. What's the breakpoint?
>
>> Yes, I think 100-200 could be a good breakpoint.
>
> I've committed this change with (for now) 100 as the minimum histogram
> size to use. Stefan, are you interested in retrying your benchmark?

sure - but I'm having hardware (hard disk firmware) related issues on my
testbox which will take a few more days to be resolved ...

Stefan


From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: -HEAD planner issue wrt hash_joins on dbt3 ?
Date: 2006-09-25 16:51:08
Message-ID: 451808FC.1030804@kaltenbrunner.cc
Lists: pgsql-hackers

[RESENT in a modified version since the original reply seems to have been
lost by the list server, which seems to happen sometimes lately]

Tom Lane wrote:
> Matteo Beccati <php(at)beccati(dot)com> writes:
>> Tom Lane wrote:
>>> Matteo Beccati <php(at)beccati(dot)com> writes:
>>>> I cannot see anything bad about using something like that:
>>>> if (histogram is large/representative enough)
>>> Well, the question is exactly what is "large enough"? I feel a bit
>>> uncomfortable about applying the idea to a histogram with only 10
>>> entries (especially if we ignore two of 'em). With 100 or more,
>>> it sounds all right. What's the breakpoint?
>
>> Yes, I think 100-200 could be a good breakpoint.
>
> I've committed this change with (for now) 100 as the minimum histogram
> size to use. Stefan, are you interested in retrying your benchmark?

spent some time retesting that and got the following results (this is
the same box as before, but with a much slower disk setup and a freshly
initdb'd cluster):

http://www.kaltenbrunner.cc/files/analyze_q9_beta1.txt

all that is with 2GB of effective_cache_size (the plan does not change
with much smaller settings, btw), a statistics target of 1000, and the
following parameters:

default planner settings for the EXPLAIN and the first EXPLAIN ANALYZE -
then the same query with hash joins disabled, the next one with nested
loops disabled, and the fastest one with both nested loops and hash
joins disabled (all run in that order, so caching effects are possible).
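
That run order corresponds to toggling the planner GUCs per session,
roughly like this (each EXPLAIN ANALYZE of the query itself is elided):

  SET enable_hashjoin = off;   -- second EXPLAIN ANALYZE: no hash joins
  RESET enable_hashjoin;
  SET enable_nestloop = off;   -- third run: no nested loops
  SET enable_hashjoin = off;   -- fourth run: both disabled
  RESET ALL;                   -- back to the defaults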

in comparison to:

http://www.kaltenbrunner.cc/files/analyze_q9.txt

we got nearly a seven-figure speedup due to the latest changes (much
better estimates, at least) - however, the mergejoin+sort-only plan is
still faster.

the other troubling query is the following:

query: http://www.kaltenbrunner.cc/files/7/power1/db/plans/power_query21.txt

plans: http://www.kaltenbrunner.cc/files/analyze_q21_beta1.txt

(default, default, enable_nestloop=off, enable_nestloop=off and
enable_hashjoin=off)

despite having not-too-bad estimates for most of the key points in the
plan, the actual runtime of the chosen plan is quite disappointing.

Stefan


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc: Matteo Beccati <php(at)beccati(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: -HEAD planner issue wrt hash_joins on dbt3 ?
Date: 2006-09-25 22:58:59
Message-ID: 16426.1159225139@sss.pgh.pa.us
Lists: pgsql-hackers

Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
> http://www.kaltenbrunner.cc/files/analyze_q9_beta1.txt

The next problem seems to be the drastic misestimation of this join
size:

-> Nested Loop (cost=0.00..6872092.36 rows=135 width=28) (actual time=94.762..14429291.129 rows=3554044 loops=1)
-> Merge Join (cost=0.00..519542.74 rows=449804 width=16) (actual time=48.197..49636.006 rows=474008 loops=1)
Merge Cond: (part.p_partkey = partsupp.ps_partkey)
-> Index Scan using pk_part on part (cost=0.00..105830.22 rows=112447 width=4) (actual time=34.646..14381.644 rows=118502 loops=1)
Filter: ((p_name)::text ~~ '%ghost%'::text)
-> Index Scan using i_ps_partkey on partsupp (cost=0.00..388943.05 rows=8000278 width=12) (actual time=13.511..22659.364 rows=7999685 loops=1)
-> Index Scan using i_l_suppkey_partkey on lineitem (cost=0.00..14.11 rows=1 width=24) (actual time=4.415..30.310 rows=7 loops=474008)
Index Cond: ((partsupp.ps_partkey = lineitem.l_partkey) AND (partsupp.ps_suppkey = lineitem.l_suppkey))

With a factor-of-25000 error in that rowcount estimate, it's amazing the
plans aren't worse than they are.

It evidently thinks that most of the rows in the join of part and
partsupp won't have any matching rows in lineitem, whereas on average
there are about 7 matching rows apiece. So that's totally wacko, and
it's not immediately obvious why. Could we see the pg_stats entries for
part.p_partkey, partsupp.ps_partkey, partsupp.ps_suppkey,
lineitem.l_partkey, lineitem.l_suppkey?
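
One way to pull the scalar parts of those entries (the array columns
such as most_common_vals can be added as needed):

  SELECT tablename, attname, null_frac, n_distinct, correlation
    FROM pg_stats
   WHERE (tablename, attname) IN (('part', 'p_partkey'),
                                  ('partsupp', 'ps_partkey'),
                                  ('partsupp', 'ps_suppkey'),
                                  ('lineitem', 'l_partkey'),
                                  ('lineitem', 'l_suppkey'));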

regards, tom lane


From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Matteo Beccati <php(at)beccati(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: -HEAD planner issue wrt hash_joins on dbt3 ?
Date: 2006-09-26 06:06:45
Message-ID: 4518C375.1090003@kaltenbrunner.cc
Lists: pgsql-hackers

Tom Lane wrote:
> Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
>> http://www.kaltenbrunner.cc/files/analyze_q9_beta1.txt
>
> The next problem seems to be the drastic misestimation of this join
> size:
>
> -> Nested Loop (cost=0.00..6872092.36 rows=135 width=28) (actual time=94.762..14429291.129 rows=3554044 loops=1)
> -> Merge Join (cost=0.00..519542.74 rows=449804 width=16) (actual time=48.197..49636.006 rows=474008 loops=1)
> Merge Cond: (part.p_partkey = partsupp.ps_partkey)
> -> Index Scan using pk_part on part (cost=0.00..105830.22 rows=112447 width=4) (actual time=34.646..14381.644 rows=118502 loops=1)
> Filter: ((p_name)::text ~~ '%ghost%'::text)
> -> Index Scan using i_ps_partkey on partsupp (cost=0.00..388943.05 rows=8000278 width=12) (actual time=13.511..22659.364 rows=7999685 loops=1)
> -> Index Scan using i_l_suppkey_partkey on lineitem (cost=0.00..14.11 rows=1 width=24) (actual time=4.415..30.310 rows=7 loops=474008)
> Index Cond: ((partsupp.ps_partkey = lineitem.l_partkey) AND (partsupp.ps_suppkey = lineitem.l_suppkey))
>
> With a factor-of-25000 error in that rowcount estimate, it's amazing the
> plans aren't worse than they are.
>
> It evidently thinks that most of the rows in the join of part and
> partsupp won't have any matching rows in lineitem, whereas on average
> there are about 7 matching rows apiece. So that's totally wacko, and
> it's not immediately obvious why. Could we see the pg_stats entries for
> part.p_partkey, partsupp.ps_partkey, partsupp.ps_suppkey,
> lineitem.l_partkey, lineitem.l_suppkey?

http://www.kaltenbrunner.cc/files/dbt3_beta1_statistics.txt

Stefan


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc: Matteo Beccati <php(at)beccati(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: -HEAD planner issue wrt hash_joins on dbt3 ?
Date: 2006-09-26 21:53:28
Message-ID: 570.1159307608@sss.pgh.pa.us
Lists: pgsql-hackers

Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
> Tom Lane wrote:
>> It evidently thinks that most of the rows in the join of part and
>> partsupp won't have any matching rows in lineitem, whereas on average
>> there are about 7 matching rows apiece. So that's totally wacko, and
>> it's not immediately obvious why. Could we see the pg_stats entries for
>> part.p_partkey, partsupp.ps_partkey, partsupp.ps_suppkey,
>> lineitem.l_partkey, lineitem.l_suppkey?

> http://www.kaltenbrunner.cc/files/dbt3_beta1_statistics.txt

OK, so we have 2 million parts and 100000 suppliers, and ANALYZE doesn't
seem to have been too far off at estimating either of those numbers.
I think the problem is that there are not very many suppliers for any
particular part, and thus the condition "part match AND supplier match"
is really not much more selective than "part match" alone. The planner
is supposing that their selectivities are independent, which they
aren't.

Offhand I don't see any good way to fix this without multi-column
statistics, which is something that's certainly not happening for 8.2 :-(
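
How far from independent the two clauses really are is easy to check
directly against the data, e.g. by counting suppliers per part:

  -- if min equals max, every part has exactly that many suppliers
  SELECT min(cnt), max(cnt), avg(cnt)
    FROM (SELECT count(*) AS cnt
            FROM partsupp
           GROUP BY ps_partkey) AS per_part;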

regards, tom lane


From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Matteo Beccati <php(at)beccati(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: -HEAD planner issue wrt hash_joins on dbt3 ?
Date: 2006-09-27 05:58:39
Message-ID: 451A130F.7030400@kaltenbrunner.cc
Lists: pgsql-hackers

Tom Lane wrote:
> Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
>> Tom Lane wrote:
>>> It evidently thinks that most of the rows in the join of part and
>>> partsupp won't have any matching rows in lineitem, whereas on average
>>> there are about 7 matching rows apiece. So that's totally wacko, and
>>> it's not immediately obvious why. Could we see the pg_stats entries for
>>> part.p_partkey, partsupp.ps_partkey, partsupp.ps_suppkey,
>>> lineitem.l_partkey, lineitem.l_suppkey?
>
>> http://www.kaltenbrunner.cc/files/dbt3_beta1_statistics.txt
>
> OK, so we have 2 million parts and 100000 suppliers, and ANALYZE doesn't
> seem to have been too far off at estimating either of those numbers.
> I think the problem is that there are not very many suppliers for any
> particular part, and thus the condition "part match AND supplier match"
> is really not much more selective than "part match" alone. The planner
> is supposing that their selectivities are independent, which they
> aren't.

looks like there are exactly 4 suppliers for any given part, so that
does indeed seem to be the problem :-(

>
> Offhand I don't see any good way to fix this without multi-column
> statistics, which is something that's certainly not happening for 8.2 :-(

too bad - however, any idea on one of the other troubling queries (q21)
I mentioned in the mail I resent to the list (after the original one got
lost)?

http://archives.postgresql.org/pgsql-hackers/2006-09/msg02011.php

Stefan


From: Matteo Beccati <php(at)beccati(dot)com>
To: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: -HEAD planner issue wrt hash_joins on dbt3 ?
Date: 2006-09-27 22:23:24
Message-ID: 451AF9DC.3070903@beccati.com
Lists: pgsql-hackers

Stefan Kaltenbrunner wrote:
> too bad - however, any idea on one of the other troubling queries (q21)
> I mentioned in the mail I resent to the list (after the original one got
> lost)?
>
> http://archives.postgresql.org/pgsql-hackers/2006-09/msg02011.php

What happens if you increase statistics for l_orderkey?
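
In case it is only set globally so far, a per-column override would be
something like:

  ALTER TABLE lineitem ALTER COLUMN l_orderkey SET STATISTICS 1000;
  ANALYZE lineitem (l_orderkey);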

Best regards
--
Matteo Beccati
http://phpadsnew.com
http://phppgads.com


From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: Matteo Beccati <php(at)beccati(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: -HEAD planner issue wrt hash_joins on dbt3 ?
Date: 2006-09-28 16:16:09
Message-ID: 451BF549.3010607@kaltenbrunner.cc
Lists: pgsql-hackers

Matteo Beccati wrote:
> Stefan Kaltenbrunner wrote:
>> too bad - however, any idea on one of the other troubling queries (q21)
>> I mentioned in the mail I resent to the list (after the original one got
>> lost)?
>>
>> http://archives.postgresql.org/pgsql-hackers/2006-09/msg02011.php
>
> What happens if you increase statistics for l_orderkey?

the statistics target is already at 1000 ...

Stefan