Quick Links

Re: merge join killing performance

Lists:	pgsql-hackerspgsql-performance

From:	Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
To:	pgsql-performance(at)postgresql(dot)org
Subject:	merge join killing performance
Date:	2010-05-19 00:17:33
Message-ID:	AANLkTikmwxQJIDfefCoK_SQN2DTwQs1naLwS-RW8PgRU@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-performance

Machine: 8 core AMD opteron 2.1GHz, 12 disk RAID-10, 2 disk pg_xlog,
RHEL 5.4 pg version 8.3.9 (upgrading soon to 8.3.11 or so)

This query:
SELECT sum(f.bytes) AS sum FROM files f INNER JOIN events ev ON f.eid
= ev.eid WHERE ev.orgid = 969677;

is choosing a merge join, which never returns from explain analyze (it
might after 10 or so minutes, but I'm not beating up my production
server over it)

Aggregate (cost=902.41..902.42 rows=1 width=4)
-> Merge Join (cost=869.97..902.40 rows=1 width=4)
Merge Cond: (f.eid = ev.eid)
-> Index Scan using files_eid_idx on files f
(cost=0.00..157830.39 rows=3769434 width=8)
-> Sort (cost=869.52..872.02 rows=1002 width=4)
Sort Key: ev.eid
-> Index Scan using events_orgid_idx on events ev
(cost=0.00..819.57 rows=1002 width=4)
Index Cond: (orgid = 969677)

If I turn off mergejoin it's fast:

explain analyze SELECT sum(f.bytes) AS sum FROM files f INNER JOIN
events ev ON f.eid = ev.eid WHERE ev.orgid = 969677;

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=3653.28..3653.29 rows=1 width=4) (actual
time=1.541..1.541 rows=1 loops=1)
-> Nested Loop (cost=0.00..3653.28 rows=1 width=4) (actual
time=1.537..1.537 rows=0 loops=1)
-> Index Scan using events_orgid_idx on events ev
(cost=0.00..819.57 rows=1002 width=4) (actual time=0.041..0.453
rows=185 loops=1)
Index Cond: (orgid = 969677)
-> Index Scan using files_eid_idx on files f
(cost=0.00..2.82 rows=1 width=8) (actual time=0.005..0.005 rows=0
loops=185)
Index Cond: (f.eid = ev.eid)
Total runtime: 1.637 ms

I've played around with random_page_cost. All the other things you'd
expect, like effective_cache_size are set rather large (it's a server
with 32Gig ram and a 12 disk RAID-10) and no setting of
random_page_cost forces it to choose the non-mergejoin plan.

Anybody with any ideas, I'm all ears.

From:	Matthew Wakeling <matthew(at)flymine(dot)org>
To:	Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Cc:	pgsql-performance(at)postgresql(dot)org
Subject:	Re: merge join killing performance
Date:	2010-05-19 03:00:18
Message-ID:	alpine.DEB.2.00.1005182254110.1867@aragorn.flymine.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-performance

On Tue, 18 May 2010, Scott Marlowe wrote:
> Aggregate (cost=902.41..902.42 rows=1 width=4)
> -> Merge Join (cost=869.97..902.40 rows=1 width=4)
> Merge Cond: (f.eid = ev.eid)
> -> Index Scan using files_eid_idx on files f
> (cost=0.00..157830.39 rows=3769434 width=8)

Okay, that's weird. How is the cost of the merge join only 902, when the
cost of one of the branches 157830, when there is no LIMIT?

Are the statistics up to date?

Matthew

--
As you approach the airport, you see a sign saying "Beware - low
flying airplanes". There's not a lot you can do about that. Take
your hat off? -- Michael Flanders

From:	Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
To:	Matthew Wakeling <matthew(at)flymine(dot)org>
Cc:	pgsql-performance(at)postgresql(dot)org
Subject:	Re: merge join killing performance
Date:	2010-05-19 03:06:25
Message-ID:	AANLkTinmK2qgKBQXpMZ0rDa66OsGH1MynmVwfkh1PZVy@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-performance

On Tue, May 18, 2010 at 9:00 PM, Matthew Wakeling <matthew(at)flymine(dot)org> wrote:
> On Tue, 18 May 2010, Scott Marlowe wrote:
>>
>> Aggregate (cost=902.41..902.42 rows=1 width=4)
>> -> Merge Join (cost=869.97..902.40 rows=1 width=4)
>> Merge Cond: (f.eid = ev.eid)
>> -> Index Scan using files_eid_idx on files f
>> (cost=0.00..157830.39 rows=3769434 width=8)
>
> Okay, that's weird. How is the cost of the merge join only 902, when the
> cost of one of the branches 157830, when there is no LIMIT?
>
> Are the statistics up to date?

Yep. The explain analyze shows it being close enough it should guess
right (I think) We have default stats target set to 200 and the table
is regularly analyzed by autovac, which now has much smaller settings
for threshold and % than default to handle these big tables.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Matthew Wakeling <matthew(at)flymine(dot)org>
Cc:	Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject:	Re: merge join killing performance
Date:	2010-05-19 16:53:03
Message-ID:	21497.1274287983@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-performance

Matthew Wakeling <matthew(at)flymine(dot)org> writes:
> On Tue, 18 May 2010, Scott Marlowe wrote:
>> Aggregate (cost=902.41..902.42 rows=1 width=4)
>> -> Merge Join (cost=869.97..902.40 rows=1 width=4)
>> Merge Cond: (f.eid = ev.eid)
>> -> Index Scan using files_eid_idx on files f
>> (cost=0.00..157830.39 rows=3769434 width=8)

> Okay, that's weird. How is the cost of the merge join only 902, when the
> cost of one of the branches 157830, when there is no LIMIT?

It's apparently estimating (wrongly) that the merge join won't have to
scan very much of "files" before it can stop because it finds an eid
value larger than any eid in the other table. So the issue here is an
inexact stats value for the max eid.

regards, tom lane

From:	Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Matthew Wakeling <matthew(at)flymine(dot)org>, pgsql-performance(at)postgresql(dot)org
Subject:	Re: merge join killing performance
Date:	2010-05-19 17:08:21
Message-ID:	AANLkTim6976amN_52jOUSGepq7N5aV2qy1jTBLSSJKbV@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-performance

On Wed, May 19, 2010 at 10:53 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Matthew Wakeling <matthew(at)flymine(dot)org> writes:
>> On Tue, 18 May 2010, Scott Marlowe wrote:
>>> Aggregate (cost=902.41..902.42 rows=1 width=4)
>>> -> Merge Join (cost=869.97..902.40 rows=1 width=4)
>>> Merge Cond: (f.eid = ev.eid)
>>> -> Index Scan using files_eid_idx on files f
>>> (cost=0.00..157830.39 rows=3769434 width=8)
>
>> Okay, that's weird. How is the cost of the merge join only 902, when the
>> cost of one of the branches 157830, when there is no LIMIT?
>
> It's apparently estimating (wrongly) that the merge join won't have to
> scan very much of "files" before it can stop because it finds an eid
> value larger than any eid in the other table. So the issue here is an
> inexact stats value for the max eid.

That's a big table. I'll try cranking up the stats target for that
column and see what happens. Thanks!

From:	Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Matthew Wakeling <matthew(at)flymine(dot)org>, pgsql-performance(at)postgresql(dot)org
Subject:	Re: merge join killing performance
Date:	2010-05-19 20:27:05
Message-ID:	AANLkTimfo4qryWUFj6sYRSlQrFZpgLaqB3__Vt0y2Rdl@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-performance

I changed stats target to 1000 for that field and still get the bad plan.

From:	Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Matthew Wakeling <matthew(at)flymine(dot)org>, pgsql-performance(at)postgresql(dot)org
Subject:	Re: merge join killing performance
Date:	2010-05-19 20:47:06
Message-ID:	AANLkTin18NoamzGnXHAo5VK6bmDe104ncq2gvsp7u0ug@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-performance

On Wed, May 19, 2010 at 2:27 PM, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com> wrote:
> On Wed, May 19, 2010 at 10:53 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Matthew Wakeling <matthew(at)flymine(dot)org> writes:
>>> On Tue, 18 May 2010, Scott Marlowe wrote:
>>>> Aggregate (cost=902.41..902.42 rows=1 width=4)
>>>> -> Merge Join (cost=869.97..902.40 rows=1 width=4)
>>>> Merge Cond: (f.eid = ev.eid)
>>>> -> Index Scan using files_eid_idx on files f
>>>> (cost=0.00..157830.39 rows=3769434 width=8)
>>
>>> Okay, that's weird. How is the cost of the merge join only 902, when the
>>> cost of one of the branches 157830, when there is no LIMIT?
>>
>> It's apparently estimating (wrongly) that the merge join won't have to
>> scan very much of "files" before it can stop because it finds an eid
>> value larger than any eid in the other table. So the issue here is an
>> inexact stats value for the max eid.
>
> I changed stats target to 1000 for that field and still get the bad plan.

And of course ran analyze across the table...

From:	Matthew Wakeling <matthew(at)flymine(dot)org>
To:	Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-performance(at)postgresql(dot)org
Subject:	Re: merge join killing performance
Date:	2010-05-20 01:46:29
Message-ID:	alpine.DEB.2.00.1005192143570.6090@aragorn.flymine.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-performance

On Wed, 19 May 2010, Scott Marlowe wrote:
>> It's apparently estimating (wrongly) that the merge join won't have to
>> scan very much of "files" before it can stop because it finds an eid
>> value larger than any eid in the other table. So the issue here is an
>> inexact stats value for the max eid.

I wandered if it could be something like that, but I rejected that idea,
as it obviously wasn't the real world case, and statistics should at least
get that right, if they are up to date.

> I changed stats target to 1000 for that field and still get the bad plan.

What do the stats say the max values are?

Matthew

--
Nog: Look! They've made me into an ensign!
O'Brien: I didn't know things were going so badly.
Nog: Frightening, isn't it?

From:	Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
To:	Matthew Wakeling <matthew(at)flymine(dot)org>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-performance(at)postgresql(dot)org
Subject:	Re: merge join killing performance
Date:	2010-05-20 02:04:18
Message-ID:	AANLkTilruHtb1cc-TwN_bPVypkhIQo7p8I0y7035xLZ-@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-performance

On Wed, May 19, 2010 at 7:46 PM, Matthew Wakeling <matthew(at)flymine(dot)org> wrote:
> On Wed, 19 May 2010, Scott Marlowe wrote:
>>>
>>> It's apparently estimating (wrongly) that the merge join won't have to
>>> scan very much of "files" before it can stop because it finds an eid
>>> value larger than any eid in the other table. So the issue here is an
>>> inexact stats value for the max eid.
>
> I wandered if it could be something like that, but I rejected that idea, as
> it obviously wasn't the real world case, and statistics should at least get
> that right, if they are up to date.
>
>> I changed stats target to 1000 for that field and still get the bad plan.
>
> What do the stats say the max values are?

5277063,5423043,13843899 (I think).

# select count(distinct eid) from files;
count
-------
365
(1 row)

# select count(*) from files;
count
---------
3793748

From:	Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
To:	Matthew Wakeling <matthew(at)flymine(dot)org>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-performance(at)postgresql(dot)org
Subject:	Re: merge join killing performance
Date:	2010-05-20 02:06:15
Message-ID:	AANLkTil5u_ZEPtdVfsl4-XZuieLcXvd9wn5G2jc6k4Eo@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-performance

On Wed, May 19, 2010 at 8:04 PM, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com> wrote:
> On Wed, May 19, 2010 at 7:46 PM, Matthew Wakeling <matthew(at)flymine(dot)org> wrote:
>> On Wed, 19 May 2010, Scott Marlowe wrote:
>>>>
>>>> It's apparently estimating (wrongly) that the merge join won't have to
>>>> scan very much of "files" before it can stop because it finds an eid
>>>> value larger than any eid in the other table. So the issue here is an
>>>> inexact stats value for the max eid.
>>
>> I wandered if it could be something like that, but I rejected that idea, as
>> it obviously wasn't the real world case, and statistics should at least get
>> that right, if they are up to date.
>>
>>> I changed stats target to 1000 for that field and still get the bad plan.
>>
>> What do the stats say the max values are?
>
> 5277063,5423043,13843899 (I think).
>
> # select count(distinct eid) from files;
> count
> -------
> 365
> (1 row)
>
> # select count(*) from files;
> count
> ---------
> 3793748

A followup. of those rows,

select count(*) from files where eid is null;
count
---------
3793215

are null.

From:	Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
To:	Matthew Wakeling <matthew(at)flymine(dot)org>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-performance(at)postgresql(dot)org
Subject:	Re: merge join killing performance
Date:	2010-05-20 02:07:39
Message-ID:	AANLkTik0NxYr5FAqlYOqseHhT7WaJv6uJuMvD7ScONnV@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-performance

On Wed, May 19, 2010 at 8:06 PM, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com> wrote:
> On Wed, May 19, 2010 at 8:04 PM, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com> wrote:
>> On Wed, May 19, 2010 at 7:46 PM, Matthew Wakeling <matthew(at)flymine(dot)org> wrote:
>>> On Wed, 19 May 2010, Scott Marlowe wrote:
>>>>>
>>>>> It's apparently estimating (wrongly) that the merge join won't have to
>>>>> scan very much of "files" before it can stop because it finds an eid
>>>>> value larger than any eid in the other table. So the issue here is an
>>>>> inexact stats value for the max eid.
>>>
>>> I wandered if it could be something like that, but I rejected that idea, as
>>> it obviously wasn't the real world case, and statistics should at least get
>>> that right, if they are up to date.
>>>
>>>> I changed stats target to 1000 for that field and still get the bad plan.
>>>
>>> What do the stats say the max values are?
>>
>> 5277063,5423043,13843899 (I think).
>>
>> # select count(distinct eid) from files;
>> count
>> -------
>> 365
>> (1 row)
>>
>> # select count(*) from files;
>> count
>> ---------
>> 3793748
>
> A followup. of those rows,
>
> select count(*) from files where eid is null;
> count
> ---------
> 3793215
>
> are null.

So, Tom, so you think it's possible that the planner isn't noticing
all those nulls and thinks it'll just take a row or two to get to the
value it needs to join on?

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Cc:	Matthew Wakeling <matthew(at)flymine(dot)org>, pgsql-performance(at)postgresql(dot)org
Subject:	Re: merge join killing performance
Date:	2010-05-20 14:28:12
Message-ID:	3829.1274365692@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-performance

Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com> writes:
> So, Tom, so you think it's possible that the planner isn't noticing
> all those nulls and thinks it'll just take a row or two to get to the
> value it needs to join on?

Could be. I don't have time right now to chase through the code, but
that sounds like a plausible theory.

regards, tom lane

From:	Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Matthew Wakeling <matthew(at)flymine(dot)org>, pgsql-performance(at)postgresql(dot)org
Subject:	Re: merge join killing performance
Date:	2010-05-20 14:35:37
Message-ID:	AANLkTikOp1XVFS1O8j9IiAz95Ihf5Ag9DQvpCSeGwA4G@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-performance

On Thu, May 20, 2010 at 8:28 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com> writes:
>> So, Tom, so you think it's possible that the planner isn't noticing
>> all those nulls and thinks it'll just take a row or two to get to the
>> value it needs to join on?
>
> Could be. I don't have time right now to chase through the code, but
> that sounds like a plausible theory.

K. I think I'll try an index on that field "where not null" and see
if that helps.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Cc:	Matthew Wakeling <matthew(at)flymine(dot)org>, pgsql-hackers(at)postgreSQL(dot)org
Subject:	mergejoin null handling (was Re: [PERFORM] merge join killing performance)
Date:	2010-05-25 20:06:49
Message-ID:	28813.1274818009@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-performance

I dug through this and have concluded that it's really an oversight in
the patch I wrote some years ago in response to this:
http://archives.postgresql.org/pgsql-performance/2005-05/msg00219.php

That patch taught nodeMergejoin that a row containing a NULL key can't
possibly match anything on the other side. However, its response to
observing a NULL is just to advance to the next row of that input.
What we should do, if the NULL is in the first merge column and the sort
order is nulls-high, is realize that every following row in that input
must also contain a NULL and so we can just terminate the mergejoin
immediately. The original patch works well for cases where there are
just a few nulls in one input and the important factor is to not read
all the rest of the other input --- but it fails to cover the case where
there are many nulls and the important factor is to not read all the
rest of the nulls. The problem can be demonstrated if you modify the
example given in the above-referenced message so that table t1 contains
lots of nulls rather than just a few: explain analyze will show that
all of t1 gets read by the mergejoin, and that's not necessary.

I'm inclined to think this is a performance bug and should be
back-patched, assuming the fix is simple (which I think it is, but
haven't coded/tested yet). It'd probably be reasonable to go back to
8.3; before that, sorting nulls high versus nulls low was pretty poorly
defined and so there'd be risk of breaking cases that gave the right
answers before.

Comments?

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Cc:	Matthew Wakeling <matthew(at)flymine(dot)org>, pgsql-performance(at)postgresql(dot)org
Subject:	Re: merge join killing performance
Date:	2010-05-28 01:16:26
Message-ID:	15062.1275009386@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-performance

I've committed a patch for this, if you're interested in testing that
it fixes your situation.

regards, tom lane

From:	Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Matthew Wakeling <matthew(at)flymine(dot)org>, pgsql-performance(at)postgresql(dot)org
Subject:	Re: merge join killing performance
Date:	2010-05-28 02:46:21
Message-ID:	AANLkTinBVZmcknJjIQGr_suUnlOoBE8uxG-vGfiA7KWQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-performance

On Thu, May 27, 2010 at 7:16 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com> writes:
>> So, Tom, so you think it's possible that the planner isn't noticing
>> all those nulls and thinks it'll just take a row or two to get to the
>> value it needs to join on?
>
> I've committed a patch for this, if you're interested in testing that
> it fixes your situation.

Cool, do we have a snapshot build somewhere or do I need to get all
the extra build bits like flex or yacc or bison or whatnot?

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Cc:	Matthew Wakeling <matthew(at)flymine(dot)org>, pgsql-performance(at)postgresql(dot)org
Subject:	Re: merge join killing performance
Date:	2010-05-28 02:56:12
Message-ID:	16317.1275015372@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-performance

Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com> writes:
> On Thu, May 27, 2010 at 7:16 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> I've committed a patch for this, if you're interested in testing that
>> it fixes your situation.

> Cool, do we have a snapshot build somewhere or do I need to get all
> the extra build bits like flex or yacc or bison or whatnot?

There's a nightly snapshot tarball of HEAD on the ftp server.
I don't believe there's any snapshots for back branches though.

Alternatively, you could grab the latest release tarball for whichever
branch you want and just apply that patch --- it should apply cleanly.

regards, tom lane