From: postgresuser(at)yahoo(dot)com
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #6668: hashjoin cost problem
Date: 2012-05-31 00:24:49
Message-ID: E1SZtC5-00028E-DI@wrigleys.postgresql.org
Lists: pgsql-bugs
The following bug has been logged on the website:
Bug reference: 6668
Logged by: Postgres User
Email address: postgresuser(at)yahoo(dot)com
PostgreSQL version: 9.1.3
Operating system: Ubuntu
Description:
set work_mem = '1MB';
create table small(i) as select (g/1000) * 1000 from generate_series(1,10000) g;
create table large(i) as select generate_series(1,100000000);
vacuum;
vacuum analyze;
explain analyze select * from small inner join large using (i);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=3083103.00..3475328.00 rows=10000 width=4) (actual time=84079.899..84989.419 rows=9001 loops=1)
   Hash Cond: (small.i = large.i)
   ->  Seq Scan on small  (cost=0.00..145.00 rows=10000 width=4) (actual time=0.008..0.588 rows=10000 loops=1)
   ->  Hash  (cost=1442478.00..1442478.00 rows=100000000 width=4) (actual time=84079.741..84079.741 rows=100000000 loops=1)
         Buckets: 4096  Batches: 4096  Memory Usage: 853kB
         ->  Seq Scan on large  (cost=0.00..1442478.00 rows=100000000 width=4) (actual time=0.005..59011.443 rows=100000000 loops=1)
 Total runtime: 84990.270 ms
(7 rows)
It doesn't matter how big the big table is: for this data distribution, the
large table is always the one that gets hashed. Forcing the cost in one of the
cost_hashjoin calls to 0 with gdb makes the planner hash the smaller table
instead, with much better results (a sketch of such a gdb session follows the
plan):
explain analyze select * from small inner join large using (i);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=270.00..0.00 rows=10000 width=4) (actual time=14028.034..16510.598 rows=9001 loops=1)
   Hash Cond: (large.i = small.i)
   ->  Seq Scan on large  (cost=0.00..1442478.00 rows=100000000 width=4) (actual time=0.010..5893.344 rows=100000000 loops=1)
   ->  Hash  (cost=145.00..145.00 rows=10000 width=4) (actual time=3.854..3.854 rows=10000 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 352kB
         ->  Seq Scan on small  (cost=0.00..145.00 rows=10000 width=4) (actual time=0.005..1.585 rows=10000 loops=1)
 Total runtime: 16510.962 ms
(7 rows)
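For reference, the forcing was done along these lines (a hypothetical sketch,
not a verbatim session; cost_hashjoin and HashPath's jpath.path.total_cost are
real 9.1 names, but the exact steps depend on the build). The idea is to save
the HashPath pointer in a convenience variable, let the function compute the
real cost, then overwrite total_cost with 0 so that path wins:

$ gdb -p <backend pid>
(gdb) break cost_hashjoin
(gdb) continue
Breakpoint 1, cost_hashjoin (path=0x..., ...)
(gdb) set $hp = path
(gdb) finish
(gdb) set variable $hp->jpath.path.total_cost = 0
(gdb) continue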
Digging further in gdb, essentially all of the cost comes from this term in
cost_hashjoin:
run_cost += hash_qual_cost.per_tuple * outer_path_rows *
    clamp_row_est(inner_path_rows * innerbucketsize) * 0.5;
Here (outer_path_rows * clamp_row_est(inner_path_rows * innerbucketsize) * 0.5)
comes to about 50 billion, leading to a wild cost. The join output itself is
rightly estimated at 10000 rows, so 50 billion comparisons is obviously a bad
estimate.
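To make the arithmetic concrete, here is a minimal standalone sketch of that
term with this report's numbers plugged in. clamp_row_est is paraphrased from
costsize.c; the innerbucketsize of 0.1 is an assumption, back-derived from the
50-billion figure and plausible given that small has only ~11 distinct values:

#include <math.h>
#include <stdio.h>

static double
clamp_row_est(double nrows)
{
    /* costsize.c rounds to an integer and never estimates below 1 row */
    nrows = rint(nrows);
    return (nrows < 1.0) ? 1.0 : nrows;
}

int
main(void)
{
    double outer_path_rows = 100000000.0;  /* seq scan of "large" probes */
    double inner_path_rows = 10000.0;      /* "small" is the hashed side */
    double innerbucketsize = 0.1;          /* assumed: ~11 distinct values */

    double comparisons = outer_path_rows *
        clamp_row_est(inner_path_rows * innerbucketsize) * 0.5;

    /* prints 50000000000, i.e. the 50 billion quoted above */
    printf("estimated probe comparisons: %.0f\n", comparisons);
    return 0;
}

At the default cpu_operator_cost of 0.0025, that term alone adds roughly 125
million to the path's cost, dwarfing the ~3.5 million cost of hashing the
large table, which is why the planner picks the latter.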