Quick Links

Re: Weird index or sort behaviour

Lists:	pgsql-performance

From:	Matthew Wakeling <matthew(at)flymine(dot)org>
To:	pgsql-performance(at)postgresql(dot)org
Subject:	Weird index or sort behaviour
Date:	2009-08-18 13:20:21
Message-ID:	alpine.DEB.2.00.0908181348560.19472@aragorn.flymine.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-performance

I'm seeing some interesting behaviour. I'm executing a query where I
perform a merge join between two copies of the same table, completely
symmetrically, and the two sides of the merge are sourced differently.

SELECT COUNT(*)
FROM
(SELECT DISTINCT
l1.objectid,
l1.id AS id1,
l1.intermine_start AS start1,
l1.intermine_end AS end1,
l2.id AS id2,
l2.intermine_start AS start2,
l2.intermine_end AS end2
FROM
locationbin8000 l1,
locationbin8000 l2
WHERE
l1.subjecttype = 'GeneFlankingRegion'
AND l2.subjecttype = 'GeneFlankingRegion'
AND l1.objectid = l2.objectid
AND l1.bin = l2.bin
) AS a
WHERE
start1 <= end2
AND start2 <= end1;

QUERY PLAN
---------------------------------------------------------
Aggregate
(cost=703459.72..703459.73 rows=1 width=0)
(actual time=43673.526..43673.527 rows=1 loops=1)
-> HashAggregate
(cost=657324.23..677828.89 rows=2050466 width=28)
(actual time=33741.380..42187.885 rows=17564726 loops=1)
-> Merge Join
(cost=130771.22..621441.07 rows=2050466 width=28)
(actual time=456.970..15292.997 rows=21463106 loops=1)
Merge Cond: ((l1.objectid = l2.objectid) AND (l1.bin = l2.bin))
Join Filter: ((l1.intermine_start <= l2.intermine_end) AND (l2.intermine_start <= l1.intermine_end))
-> Index Scan using locationbin8000__subjectobjectbin on locationbin8000 l1
(cost=0.00..72096.78 rows=670733 width=20)
(actual time=0.085..345.834 rows=664588 loops=1)
Index Cond: (subjecttype = 'GeneFlankingRegion'::text)
-> Sort
(cost=130771.22..132448.05 rows=670733 width=20)
(actual time=456.864..3182.638 rows=38231659 loops=1)
Sort Key: l2.objectid, l2.bin
Sort Method: quicksort Memory: 81690kB
-> Bitmap Heap Scan on locationbin8000 l2
(cost=12706.60..65859.76 rows=670733 width=20)
(actual time=107.259..271.026 rows=664588 loops=1)
Recheck Cond: (subjecttype = 'GeneFlankingRegion'::text)
-> Bitmap Index Scan on locationbin8000__subjecttypeid
(cost=0.00..12538.92 rows=670733 width=0)
(actual time=106.327..106.327 rows=664588 loops=1)
Index Cond: (subjecttype = 'GeneFlankingRegion'::text)
Total runtime: 44699.675 ms
(15 rows)

Here is the definition of the locationbin8000 table:

The table is clustered on the locationbin8000__subjectobjectbin index, and
has been analysed.

So you can see, the merge join requires two inputs both ordered by
(objectid, bin), which is readily supplied by the
locationbin8000__subjectobjectbin index, given that I am restricting the
subjecttype of both sides (to the same thing, I might add). Therefore, I
would expect the merge join to feed off two identical index scans. This is
what happens for one of the sides of the merge join, but not the other,
even though the sides are symmetrical.

Does anyone know why it isn't doing two index scans? Given that the cost
of the index scan is half that of the alternative, I'm surprised that it
uses this plan.

I'm using Postgres 8.4.0

Matthew

--
"Interwoven alignment preambles are not allowed."
If you have been so devious as to get this message, you will understand
it, and you deserve no sympathy. -- Knuth, in the TeXbook

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Matthew Wakeling <matthew(at)flymine(dot)org>
Cc:	pgsql-performance(at)postgresql(dot)org
Subject:	Re: Weird index or sort behaviour
Date:	2009-08-18 15:05:23
Message-ID:	8498.1250607923@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-performance

Matthew Wakeling <matthew(at)flymine(dot)org> writes:
> I'm seeing some interesting behaviour. I'm executing a query where I
> perform a merge join between two copies of the same table, completely
> symmetrically, and the two sides of the merge are sourced differently.

This is not as surprising as you think. A mergejoin is *not*
symmetrical between its two inputs: the inner side is subject to being
partially rewound and rescanned when the outer side is advanced to a new
row with the same merge key. This means there is a premium on cheap
rescan for the inner side that doesn't exist for the outer ... and a
sort node is cheaper to rescan than a generic indexscan. It's
impossible to tell from the data you provided whether the planner was
correct to pick a sort over an indexscan for the inner side, but the
fact that it did so is not prima facie evidence of a bug. You could
force choice of the other plan via enable_sort = off and then compare
estimated and actual runtimes to see if the planner got it right.

regards, tom lane

From:	Matthew Wakeling <matthew(at)flymine(dot)org>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-performance(at)postgresql(dot)org
Subject:	Re: Weird index or sort behaviour
Date:	2009-08-18 16:41:10
Message-ID:	alpine.DEB.2.00.0908181721320.19472@aragorn.flymine.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-performance

On Tue, 18 Aug 2009, Tom Lane wrote:
> Matthew Wakeling <matthew(at)flymine(dot)org> writes:
>> I'm seeing some interesting behaviour. I'm executing a query where I
>> perform a merge join between two copies of the same table, completely
>> symmetrically, and the two sides of the merge are sourced differently.
>
> This is not as surprising as you think. A mergejoin is *not*
> symmetrical between its two inputs: the inner side is subject to being
> partially rewound and rescanned when the outer side is advanced to a new
> row with the same merge key. This means there is a premium on cheap
> rescan for the inner side that doesn't exist for the outer ... and a
> sort node is cheaper to rescan than a generic indexscan.

Very clever. Yes, that is what is happening. I'm surprised that the system
doesn't buffer the inner side to avoid having to rescan each time, but
then I guess you would have problems if the buffer grew larger than
memory.

> It's impossible to tell from the data you provided whether the planner
> was correct to pick a sort over an indexscan for the inner side, but the
> fact that it did so is not prima facie evidence of a bug. You could
> force choice of the other plan via enable_sort = off and then compare
> estimated and actual runtimes to see if the planner got it right.

Yes, it does get an almost unmeasureable amount slower if I force sorts
off and nested loop (its next choice) off.

Matthew

--
$ rm core
Segmentation Fault (core dumped)

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Matthew Wakeling <matthew(at)flymine(dot)org>
Cc:	pgsql-performance(at)postgresql(dot)org
Subject:	Re: Weird index or sort behaviour
Date:	2009-08-18 16:49:37
Message-ID:	24785.1250614177@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-performance

Matthew Wakeling <matthew(at)flymine(dot)org> writes:
> Very clever. Yes, that is what is happening. I'm surprised that the system
> doesn't buffer the inner side to avoid having to rescan each time, but
> then I guess you would have problems if the buffer grew larger than
> memory.

Well, it does consider adding a Materialize node for that purpose,
but in this case it evidently thought a sort was cheaper.

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Matthew Wakeling <matthew(at)flymine(dot)org>
Cc:	pgsql-performance(at)postgresql(dot)org
Subject:	Re: Weird index or sort behaviour
Date:	2009-08-18 16:57:29
Message-ID:	24934.1250614649@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-performance

I wrote:
> Matthew Wakeling <matthew(at)flymine(dot)org> writes:
>> Very clever. Yes, that is what is happening. I'm surprised that the system
>> doesn't buffer the inner side to avoid having to rescan each time, but
>> then I guess you would have problems if the buffer grew larger than
>> memory.

> Well, it does consider adding a Materialize node for that purpose,
> but in this case it evidently thought a sort was cheaper.

Hmmm ... actually, after looking at the code, I notice that we only
consider adding a Materialize node to buffer an inner input that is a
Sort node. The idea was suggested by Greg Stark, if memory serves.
I wonder now if it'd be worthwhile to generalize that to consider
adding a Materialize above *any* inner mergejoin input.

regards, tom lane

From:	Greg Stark <gsstark(at)mit(dot)edu>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Matthew Wakeling <matthew(at)flymine(dot)org>, pgsql-performance(at)postgresql(dot)org
Subject:	Re: Weird index or sort behaviour
Date:	2009-08-18 17:44:21
Message-ID:	407d949e0908181044g3557ab5bw93886a06a6b374b4@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-performance

On Tue, Aug 18, 2009 at 5:57 PM, Tom Lane<tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Hmmm ... actually, after looking at the code, I notice that we only
> consider adding a Materialize node to buffer an inner input that is a
> Sort node. The idea was suggested by Greg Stark, if memory serves.
> I wonder now if it'd be worthwhile to generalize that to consider
> adding a Materialize above *any* inner mergejoin input.

If my recollection is right the reason we put the materialize above
the sort node has to do with Simon's deferred final merge pass
optimization. The materialize was a way to lazily build the final
merge as we do the merge but still have the ability to rewind.

I would be more curious in the poster's situation to turn off
enable_seqscan, enable_sort, and/or enable_nestloop see how the index
scan merge join plan runs. rewinding an index scan is more expensive
than rewinding a materialize node but would it really be so much
expensive that it's worth copying the entire table into temporary
space?

--
greg
http://mit.edu/~gsstark/resume.pdf

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Greg Stark <gsstark(at)mit(dot)edu>
Cc:	Matthew Wakeling <matthew(at)flymine(dot)org>, pgsql-performance(at)postgresql(dot)org
Subject:	Re: Weird index or sort behaviour
Date:	2009-08-18 17:57:11
Message-ID:	26185.1250618231@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-performance

Greg Stark <gsstark(at)mit(dot)edu> writes:
> If my recollection is right the reason we put the materialize above
> the sort node has to do with Simon's deferred final merge pass
> optimization. The materialize was a way to lazily build the final
> merge as we do the merge but still have the ability to rewind.

> I would be more curious in the poster's situation to turn off
> enable_seqscan, enable_sort, and/or enable_nestloop see how the index
> scan merge join plan runs. rewinding an index scan is more expensive
> than rewinding a materialize node but would it really be so much
> expensive that it's worth copying the entire table into temporary
> space?

Absolutely not, but remember that what we're expecting the Materialize
to do is buffer only as far back as the last Mark, so that it's unlikely
ever to spill to disk. It might well be a win to do that rather than
re-fetching from the indexscan. The incremental win compared to not
having the materialize would be small compared to what it is for a sort,
but it could still be worthwhile I think. In particular, in Matthew's
example the sort is being estimated at significantly higher cost than
the indexscan, which presumably means that we are estimating there will
be a *lot* of re-fetches, else we wouldn't have rejected the indexscan
on the inside. Inserting a materialize would make the re-fetches
cheaper. I'm fairly sure that this plan structure would cost out
cheaper than the sort according to cost_mergejoin's cost model. As
noted in the comments therein, that cost model is a bit oversimplified,
so it might not be cheaper in reality ... but we ought to try it.

regards, tom lane

From:	Matthew Wakeling <matthew(at)flymine(dot)org>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Greg Stark <gsstark(at)mit(dot)edu>, pgsql-performance(at)postgresql(dot)org
Subject:	Re: Weird index or sort behaviour
Date:	2009-08-18 18:40:18
Message-ID:	alpine.DEB.2.00.0908181901220.19472@aragorn.flymine.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-performance

On Tue, 18 Aug 2009, Tom Lane wrote:
>> I would be more curious in the poster's situation to turn off
>> enable_seqscan, enable_sort, and/or enable_nestloop see how the index
>> scan merge join plan runs.

Like this:

QUERY PLAN
-----------------------------------------------------------------------
Aggregate
(cost=2441719.92..2441719.93 rows=1 width=0)
(actual time=50087.537..50087.538 rows=1 loops=1)
-> HashAggregate
(cost=2397366.95..2417079.38 rows=1971243 width=28)
(actual time=40462.069..48634.713 rows=17564726 loops=1)
-> Merge Join
(cost=0.00..2362870.20 rows=1971243 width=28)
(actual time=0.095..22041.693 rows=21463106 loops=1)
Merge Cond: ((l1.objectid = l2.objectid) AND (l1.bin = l2.bin))
Join Filter: ((l1.intermine_start <= l2.intermine_end) AND (l2.intermine_start <= l1.intermine_end))
-> Index Scan using locationbin8000__subjectobjectbin on locationbin8000 l1
(cost=0.00..71635.23 rows=657430 width=20)
(actual time=0.056..170.857 rows=664588 loops=1)
Index Cond: (subjecttype = 'GeneFlankingRegion'::text)
-> Index Scan using locationbin8000__subjectobjectbin on locationbin8000 l2
(cost=0.00..71635.23 rows=657430 width=20)
(actual time=0.020..9594.466 rows=38231659 loops=1)
Index Cond: (l2.subjecttype = 'GeneFlankingRegion'::text)
Total runtime: 50864.569 ms
(10 rows)

>> rewinding an index scan is more expensive than rewinding a materialize
>> node but would it really be so much expensive that it's worth copying
>> the entire table into temporary space?
>
> Absolutely not, but remember that what we're expecting the Materialize
> to do is buffer only as far back as the last Mark, so that it's unlikely
> ever to spill to disk.

If that's how it works, then that sounds very promising indeed.

> In particular, in Matthew's example the sort is being estimated at
> significantly higher cost than the indexscan, which presumably means
> that we are estimating there will be a *lot* of re-fetches, else we
> wouldn't have rejected the indexscan on the inside.

select sum(c * c) / sum(c) from (select objectid, bin, count(*) AS c from
locationbin8000 where subjecttype = 'GeneFlankingRegion' GROUP BY
objectid, bin) as a;
?column?
---------------------
57.5270393085641029

So on average, we will be rewinding by 57 rows each time. A materialise
step really does sound like a win in this situation.

Matthew

--
Patron: "I am looking for a globe of the earth."
Librarian: "We have a table-top model over here."
Patron: "No, that's not good enough. Don't you have a life-size?"
Librarian: (pause) "Yes, but it's in use right now."

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Matthew Wakeling <matthew(at)flymine(dot)org>
Cc:	Greg Stark <gsstark(at)mit(dot)edu>, pgsql-performance(at)postgresql(dot)org
Subject:	Re: Weird index or sort behaviour
Date:	2009-08-18 19:09:52
Message-ID:	27347.1250622592@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-performance

Matthew Wakeling <matthew(at)flymine(dot)org> writes:
> -> Index Scan using locationbin8000__subjectobjectbin on locationbin8000 l1
> (cost=0.00..71635.23 rows=657430 width=20)
> (actual time=0.056..170.857 rows=664588 loops=1)
> Index Cond: (subjecttype = 'GeneFlankingRegion'::text)
> -> Index Scan using locationbin8000__subjectobjectbin on locationbin8000 l2
> (cost=0.00..71635.23 rows=657430 width=20)
> (actual time=0.020..9594.466 rows=38231659 loops=1)
> Index Cond: (l2.subjecttype = 'GeneFlankingRegion'::text)

> ... So on average, we will be rewinding by 57 rows each time.

As indeed is reflected in those actual rowcounts. (The estimated
counts and costs don't include re-fetching, but the actuals do.)

Even more interesting, the actual runtime is about 56x different too,
which implies that Matthew's re-fetches are not noticeably cheaper than
the original fetches. I'd be surprised if that were true in an
indexscan pulling from disk (you'd expect recently-touched rows to stay
cached for awhile). But it could easily be true if the whole table were
cached already. Matthew, how big is this table compared to your RAM?
Were you testing a case in which it'd be in cache?

regards, tom lane

From:	Matthew Wakeling <matthew(at)flymine(dot)org>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Greg Stark <gsstark(at)mit(dot)edu>, pgsql-performance(at)postgresql(dot)org
Subject:	Re: Weird index or sort behaviour
Date:	2009-08-19 11:00:32
Message-ID:	alpine.DEB.2.00.0908191155150.19472@aragorn.flymine.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-performance

On Tue, 18 Aug 2009, Tom Lane wrote:
>> -> Index Scan using locationbin8000__subjectobjectbin on locationbin8000 l1
>> (cost=0.00..71635.23 rows=657430 width=20)
>> (actual time=0.056..170.857 rows=664588 loops=1)
>> Index Cond: (subjecttype = 'GeneFlankingRegion'::text)
>> -> Index Scan using locationbin8000__subjectobjectbin on locationbin8000 l2
>> (cost=0.00..71635.23 rows=657430 width=20)
>> (actual time=0.020..9594.466 rows=38231659 loops=1)
>> Index Cond: (l2.subjecttype = 'GeneFlankingRegion'::text)
>
>> ... So on average, we will be rewinding by 57 rows each time.
>
> As indeed is reflected in those actual rowcounts. (The estimated
> counts and costs don't include re-fetching, but the actuals do.)
>
> Even more interesting, the actual runtime is about 56x different too,
> which implies that Matthew's re-fetches are not noticeably cheaper than
> the original fetches. I'd be surprised if that were true in an
> indexscan pulling from disk (you'd expect recently-touched rows to stay
> cached for awhile). But it could easily be true if the whole table were
> cached already. Matthew, how big is this table compared to your RAM?
> Were you testing a case in which it'd be in cache?

Oh, definitely. I have run this test so many times, it's all going to be
in the cache. Luckily, that's what we are looking at as a normal situation
in production. Also, since the table is clustered on that index, I would
expect the performance when it is out of cache to be fairly snappy anyway.

For reference, the table is 350 MB, the index is 238 MB, and the RAM in
the machine is 4GB (although it's my desktop so it'll have all sorts of
other rubbish using that up). Our servers have 16GB to 32GB of RAM, so no
problem there.

Matthew

--
I'm always interested when [cold callers] try to flog conservatories.
Anyone who can actually attach a conservatory to a fourth floor flat
stands a marginally better than average chance of winning my custom.
(Seen on Usenet)

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Matthew Wakeling <matthew(at)flymine(dot)org>
Cc:	Greg Stark <gsstark(at)mit(dot)edu>, pgsql-performance(at)postgresql(dot)org
Subject:	Re: Weird index or sort behaviour
Date:	2009-11-15 02:50:21
Message-ID:	9131.1258253421@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-performance

Matthew Wakeling <matthew(at)flymine(dot)org> writes:
> [ discussion about applying materialize to a mergejoin's inner indexscan ]

I have finally gotten round to doing something about this, and applied
the attached patch to CVS HEAD. Could you test it on your problem case
to see what happens? If it's not convenient to load your data into
8.5devel, I believe the patch would work all right in 8.4.x. (A quick
check shows that it applies except for one deletion hunk that has a
conflict due to a comment change; you could easily do that deletion
manually.)

regards, tom lane

Attachment	Content-Type	Size
mergejoin-materialize.patch	text/x-patch	19.7 KB

From:	Matthew Wakeling <matthew(at)flymine(dot)org>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Greg Stark <gsstark(at)mit(dot)edu>, pgsql-performance(at)postgresql(dot)org
Subject:	Re: Weird index or sort behaviour
Date:	2009-11-18 11:39:03
Message-ID:	alpine.DEB.2.00.0911181136450.684@aragorn.flymine.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-performance

On Sat, 14 Nov 2009, Tom Lane wrote:
> Matthew Wakeling <matthew(at)flymine(dot)org> writes:
>> [ discussion about applying materialize to a mergejoin's inner indexscan ]
>
> I have finally gotten round to doing something about this, and applied
> the attached patch to CVS HEAD. Could you test it on your problem case
> to see what happens? If it's not convenient to load your data into
> 8.5devel, I believe the patch would work all right in 8.4.x. (A quick
> check shows that it applies except for one deletion hunk that has a
> conflict due to a comment change; you could easily do that deletion
> manually.)

Um, cool. I'm going to have to look back at the archives to even work out
what query I was complaining about though. May take a little while to
test, but I'll get back to you. Thanks,

Matthew

--
The only secure computer is one that's unplugged, locked in a safe,
and buried 20 feet under the ground in a secret location...and i'm not
even too sure about that one. --Dennis Huges, FBI