Re: Hash or merge join instead of inner loop

Lists: pgsql-performance
From: "Jim C(dot) Nasby" <jim(at)nasby(dot)net>
To: pgsql-performance(at)postgresql(dot)org
Subject: Hash or merge join instead of inner loop
Date: 2003-06-09 20:40:09
Message-ID: 20030609204009.GJ40542@flake.decibel.org

I have a query that's causing pgsql to choose either a hash or merge join
depending on how I mess with the stats variables, but it won't choose a
nested loop, even though it's the fastest.

The cost estimate for the nested-loop index scans always seems to be way
too high at the upper end. Note that it's 0..3 in one case and 0..2 in
the other, yet the actual time is very low in both cases. Why is this? I
haven't been able to make much of a difference by changing the optimizer
variables.

This is on a Solaris machine, if that matters. tinput_data, locality,
and postal_code have 1300, 28000, and 43000 rows, respectively, and
locality and postal_code are very narrow tables (full definitions below).

usps=# explain analyze SELECT key, pc.locality_id, l.state_code::varchar FROM Tinput_data i, postal_code pc, locality l WHERE i.zip = pc.postal_code AND l.locality_id = pc.locality_id;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=940.20..1417.94 rows=1380 width=36) (actual time=1727.30..2363.91 rows=1380 loops=1)
   Merge Cond: ("outer".locality_id = "inner".locality_id)
   ->  Index Scan using locality_pkey on locality l  (cost=0.00..455.99 rows=27789 width=10) (actual time=0.62..495.39 rows=27632 loops=1)
   ->  Sort  (cost=940.20..940.55 rows=1380 width=26) (actual time=1725.53..1726.71 rows=1380 loops=1)
         Sort Key: pc.locality_id
         ->  Merge Join  (cost=42.00..933.00 rows=1380 width=26) (actual time=56.27..1684.67 rows=1380 loops=1)
               Merge Cond: ("outer".postal_code = "inner".zip)
               ->  Index Scan using postal_code_postal_code_key on postal_code pc  (cost=0.00..869.31 rows=42704 width=13) (actual time=10.05..1396.11 rows=42418 loops=1)
               ->  Sort  (cost=42.00..42.34 rows=1380 width=13) (actual time=39.63..40.97 rows=1380 loops=1)
                     Sort Key: i.zip
                     ->  Seq Scan on tinput_data i  (cost=0.00..34.80 rows=1380 width=13) (actual time=0.02..12.13 rows=1380 loops=1)
 Total runtime: 2367.50 msec
(12 rows)

usps=# set enable_mergejoin=0;
SET
usps=# set enable_hashjoin=0;
SET
usps=# explain analyze SELECT key, pc.locality_id, l.state_code::varchar FROM Tinput_data i, postal_code pc, locality l WHERE i.zip = pc.postal_code AND l.locality_id = pc.locality_id;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..6991.66 rows=1380 width=36) (actual time=0.22..231.00 rows=1380 loops=1)
   ->  Nested Loop  (cost=0.00..4203.23 rows=1380 width=26) (actual time=0.14..132.70 rows=1380 loops=1)
         ->  Seq Scan on tinput_data i  (cost=0.00..34.80 rows=1380 width=13) (actual time=0.02..17.41 rows=1380 loops=1)
         ->  Index Scan using postal_code_postal_code_key on postal_code pc  (cost=0.00..3.01 rows=1 width=13) (actual time=0.06..0.06 rows=1 loops=1380)
               Index Cond: ("outer".zip = pc.postal_code)
   ->  Index Scan using locality_pkey on locality l  (cost=0.00..2.01 rows=1 width=10) (actual time=0.05..0.05 rows=1 loops=1380)
         Index Cond: (l.locality_id = "outer".locality_id)
 Total runtime: 233.60 msec
(8 rows)

     Table "pg_temp_1.tinput_data"
      Column      |         Type          | Modifiers
------------------+-----------------------+-----------
 key              | integer               | not null
 firm             | character varying(40) |
 address          | integer               |
 address_v        | character varying(10) |
 odd_even         | character(1)          |
 street_name      | character varying(40) |
 street_metaphone | character varying(4)  |
 apartment        | integer               |
 apartment_v      | character varying(10) |
 apartment_label  | character varying(5)  |
 city             | character varying(40) |
 city_metaphone   | character varying(4)  |
 state            | character varying(40) |
 zip              | character varying(5)  |
Indexes: tinput_data_pkey primary key btree ("key")

usps=# \d postal_code
                            Table "public.postal_code"
     Column     |         Type          |                                Modifiers
----------------+-----------------------+--------------------------------------------------------------------------
 postal_code_id | integer               | not null default nextval('public.postal_code_postal_code_id_seq'::text)
 postal_code    | character varying(10) | not null
 locality_id    | integer               | not null
Indexes: postal_code_pkey primary key btree (postal_code_id),
         postal_code_postal_code_key unique btree (postal_code)

usps=# \d locality
                          Table "public.locality"
   Column    |         Type          |                             Modifiers
-------------+-----------------------+--------------------------------------------------------------------
 locality_id | integer               | not null default nextval('public.locality_locality_id_seq'::text)
 locality    | character varying(10) | not null
 state_code  | character(2)          | not null
Indexes: locality_pkey primary key btree (locality_id)
Foreign Key constraints: $1 FOREIGN KEY (state_code) REFERENCES state(state_code) ON UPDATE NO ACTION ON DELETE NO ACTION

--
Jim C. Nasby (aka Decibel!) jim(at)nasby(dot)net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: jim(at)nasby(dot)net
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hash or merge join instead of inner loop
Date: 2003-06-10 06:15:11
Message-ID: 274.1055225711@sss.pgh.pa.us

"Jim C. Nasby" <jim(at)nasby(dot)net> writes:
> I have a query that's causing pgsql to choose either a hash or merge join
> depending on how I mess with the stats variables, but it won't choose a
> nested loop, even though it's the fastest.

There's been some discussion about that before; you could check the
archives (now that they're up again ;-)). I believe that the planner
overestimates the cost of a nestloop with inner indexscan, because it
costs the indexscans as though each one is an independent ab-initio
index search. In reality, most of the upper btree levels will no doubt
stay in memory during such a query, and so this estimate charges many
more reads than really occur. Fixing this is on the todo list, but no
one's got to it yet. (It's not clear to me how to put the consideration
into the planner's cost algorithms in a clean way.)
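To illustrate with made-up numbers (this is not the planner's actual cost
code, just a sketch of the shape of the overcharge):

```python
# Sketch of why charging each inner indexscan as an independent
# root-to-leaf descent overestimates a nestloop: after the first probe,
# the upper btree levels stay cached. Numbers here are illustrative.

RANDOM_PAGE_COST = 4.0   # cost units for one random page read
BTREE_LEVELS = 3         # e.g. root + internal + leaf for a ~43k-row index

def naive_nestloop_index_cost(outer_rows):
    # Current model: every probe pays a full descent in random reads.
    return outer_rows * BTREE_LEVELS * RANDOM_PAGE_COST

def cached_upper_levels_cost(outer_rows):
    # Alternative: the first probe reads all levels; later probes find
    # the upper levels in cache and pay only for the leaf page.
    first = BTREE_LEVELS * RANDOM_PAGE_COST
    rest = (outer_rows - 1) * 1 * RANDOM_PAGE_COST
    return first + rest

naive = naive_nestloop_index_cost(1380)   # 1380 outer rows, as in the plan
cached = cached_upper_levels_cost(1380)
print(naive, cached)  # 16560.0 vs 5528.0: the naive model charges ~3x
```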

regards, tom lane


From: "Shridhar Daithankar" <shridhar_daithankar(at)persistent(dot)co(dot)in>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hash or merge join instead of inner loop
Date: 2003-06-10 08:56:17
Message-ID: 3EE5EA89.19260.8C8B669@localhost

On 10 Jun 2003 at 2:15, Tom Lane wrote:

> There's been some discussion about that before; you could check the
> archives (now that they're up again ;-)). I believe that the planner
> overestimates the cost of a nestloop with inner indexscan, because it
> costs the indexscans as though each one is an independent ab-initio
> index search. In reality, most of the upper btree levels will no doubt
> stay in memory during such a query, and so this estimate charges many
> more reads than really occur. Fixing this is on the todo list, but no
> one's got to it yet. (It's not clear to me how to put the consideration
> into the planner's cost algorithms in a clean way.)

Just being naïve here, but if the planner and executor accounted for
shared buffers plus the effective OS cache, even a boolean choice could
be a start.

Say a query needs 100MB of data according to the estimates; if shared
buffers plus the effective OS cache cover that, we can lower the cost.

Maybe we should have two config parameters for tuple cost: a disk-read
tuple cost and a memory-read tuple cost, the latter being 1/10th of the
former?
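Something like this (the parameter names and the 1/10 ratio are only
illustrative, not real config variables):

```python
# Sketch of the two-tuple-cost idea: blend a disk rate and a memory
# rate according to how much of the data the caches can hold.

disk_read_tuple_cost = 0.01
mem_read_tuple_cost = disk_read_tuple_cost / 10.0  # the proposed 1/10th

def tuple_cost(tuples, data_mb, cache_mb):
    # Fraction of the working set expected to be memory-resident,
    # given shared buffers plus the effective OS cache.
    hit_frac = min(1.0, cache_mb / data_mb) if data_mb else 1.0
    return tuples * (hit_frac * mem_read_tuple_cost
                     + (1.0 - hit_frac) * disk_read_tuple_cost)

# A query touching 100MB with 100MB of combined cache is costed
# entirely at the cheap memory rate; with no cache, at the disk rate.
fully_cached = tuple_cost(1_000_000, data_mb=100, cache_mb=100)
uncached = tuple_cost(1_000_000, data_mb=100, cache_mb=0)
print(fully_cached, uncached)  # 1000.0 vs 10000.0
```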

Bye
Shridhar

--
All new: Parts not interchangeable with previous model.


From: "Jim C(dot) Nasby" <jim(at)nasby(dot)net>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hash or merge join instead of inner loop
Date: 2003-06-10 19:42:07
Message-ID: 20030610194206.GK40542@flake.decibel.org

On Tue, Jun 10, 2003 at 02:15:11AM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jim(at)nasby(dot)net> writes:
> > I have a query that's causing pgsql to choose either a hash or merge join
> > depending on how I mess with the stats variables, but it won't choose a
> > nested loop, even though it's the fastest.
>
> There's been some discussion about that before; you could check the
> archives (now that they're up again ;-)). I believe that the planner
> overestimates the cost of a nestloop with inner indexscan, because it
> costs the indexscans as though each one is an independent ab-initio
> index search. In reality, most of the upper btree levels will no doubt
> stay in memory during such a query, and so this estimate charges many
> more reads than really occur. Fixing this is on the todo list, but no
> one's got to it yet. (It's not clear to me how to put the consideration
> into the planner's cost algorithms in a clean way.)

What about just ignoring all but the leaf pages? Unless you have a
really, really big index, I think this would probably work well, or at
least better than what we have right now.

I can't think of an elegant way to figure out hit percentages either.
Maybe as a ratio of how often an individual page at a given level of the
btree is likely to be hit? I.e., the root page will always be hit (it's
only one page); if the next level down has 10 pages, each one is 10%
likely to be in cache, and so on. Or maybe a better way to look at it is
how many pages sit underneath each page: if we figure there's a 0.1%
chance that a leaf page is in cache, and each page in the layer above
the leaves covers 100 leaf pages, then the odds of a page in that layer
being in cache are 10%.
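That heuristic works out to something like this (the fanout and the
leaf-level hit rate are made-up numbers, just to show the arithmetic):

```python
# Per-level cache-hit heuristic: the chance that a btree page is
# cached scales with how many leaf pages sit underneath it, capped
# at certainty for the root.

def level_hit_probability(leaf_hit_prob, fanout, levels_above_leaf):
    # Each level up covers `fanout` times more leaf pages, so it is
    # proportionally more likely to be re-touched and stay in cache.
    return min(1.0, leaf_hit_prob * fanout ** levels_above_leaf)

leaf = level_hit_probability(0.001, 100, 0)      # 0.1% for a leaf page
internal = level_hit_probability(0.001, 100, 1)  # 10% one level up
root = level_hit_probability(0.001, 100, 2)      # capped at 100%
print(leaf, internal, root)  # 0.001 0.1 1.0
```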

It might also be worth giving index pages a higher priority in the
internal buffer than table pages.
--
Jim C. Nasby (aka Decibel!) jim(at)nasby(dot)net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: jim(at)nasby(dot)net
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hash or merge join instead of inner loop
Date: 2003-06-10 20:56:45
Message-ID: 4799.1055278605@sss.pgh.pa.us

"Jim C. Nasby" <jim(at)nasby(dot)net> writes:
> On Tue, Jun 10, 2003 at 02:15:11AM -0400, Tom Lane wrote:
>> ... In reality, most of the upper btree levels will no doubt
>> stay in memory during such a query, and so this estimate charges many
>> more reads than really occur. Fixing this is on the todo list, but no
>> one's got to it yet. (It's not clear to me how to put the consideration
>> into the planner's cost algorithms in a clean way.)

> What about just ignoring all but the leaf pages?

IIRC, we already know what cost model we want to use. The problem is
that the planner's code structure makes it difficult for the indexscan
coster to know that the indexscan will be applied repeatedly rather than
just once. That's what has to be solved.

regards, tom lane