Proposed patch to make mergejoin cost estimation more symmetric

Lists: pgsql-patches
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-patches(at)postgreSQL(dot)org
Subject: Proposed patch to make mergejoin cost estimation more symmetric
Date: 2007-12-07 00:19:27
Message-ID: 4810.1196986767@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-patches

There was some discussion earlier today
http://archives.postgresql.org/pgsql-performance/2007-12/msg00081.php
about how mergejoin cost estimation is not smart about the situation
where we will have to scan a lot of rows on one side of the join before
reaching the range of values found on the other side (and hence having
some chance of finding join pairs). Ideally, that effect should be
factored into the startup cost estimate for the join. This
consideration is the exact dual of one that we already do have code
for, namely that the merge can stop early if the range of values on
one side ends long before that of the other. So I looked into whether
that code couldn't be extended to handle this issue too. It turns out
that it can be, and it actually becomes more symmetrical because it's
considering both max and min values not just max.

Also, I found that there were a couple of new-in-8.3 bugs in that area
--- it's been extended to make estimates for descending-order
mergejoins, but it wasn't getting that quite right.

Since something needs to be done to that code anyway, I'm considering
applying the attached patch. It's a bit larger than I would normally
consider to be a good idea for an optional patch in late beta. But then
again I wouldn't have the slightest hesitation about back-patching a
change of this size after final release, so it seems attractive to put
it in now.

Aside from the patch, I have attached a test script that exercises merge
join planning for some simple cases, and the plans output by the script
in CVS HEAD with/without the patch. The cost estimates with the patch
are in line with expectation, the estimates without, not so much.
In particular, the existing bug can be seen at work here in that the
sixth and eighth test cases ("big join highm on b=h order by b desc" and
"big join high on b=h order by b desc") are given unreasonably small
cost estimates by the unpatched code. (Note: the two sets of numbers
vary a bit because they were working with nonidentical ANALYZE
statistics.)

Any objections to applying the patch?

regards, tom lane

Attachment Content-Type Size
unknown_filename text/plain 32.3 KB
unknown_filename text/plain 1.1 KB
unknown_filename text/plain 3.8 KB
unknown_filename text/plain 3.8 KB

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-patches(at)postgreSQL(dot)org
Subject: Re: Proposed patch to make mergejoin cost estimation more symmetric
Date: 2007-12-07 08:29:35
Message-ID: 1197016175.4255.466.camel@ebony.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-patches

On Thu, 2007-12-06 at 19:19 -0500, Tom Lane wrote:

> Since something needs to be done to that code anyway, I'm considering
> applying the attached patch. It's a bit larger than I would normally
> consider to be a good idea for an optional patch in late beta. But then
> again I wouldn't have the slightest hesitation about back-patching a
> change of this size after final release, so it seems attractive to put
> it in now.
>
> Aside from the patch, I have attached a test script that exercises merge
> join planning for some simple cases, and the plans output by the script
> in CVS HEAD with/without the patch. The cost estimates with the patch
> are in line with expectation, the estimates without, not so much.
> In particular, the existing bug can be seen at work here in that the
> sixth and eighth test cases ("big join highm on b=h order by b desc" and
> "big join high on b=h order by b desc") are given unreasonably small
> cost estimates by the unpatched code. (Note: the two sets of numbers
> vary a bit because they were working with nonidentical ANALYZE
> statistics.)

Looks good.

> Any objections to applying the patch?

None.

I do have a longer term issue that there is no information provided by
EXPLAIN to allow us to differentiate these two conditions. That makes it
harder to understand the basis of the plans and also gets everybody used
to seeing EXPLAINs that can't easily be explained, which leads to people
not reporting problems that exist. I doubt we can fix anything now, but
increased debugging/logging output is definitely required in some form.

> explain select * from big join highm on b=h order by b desc;
> QUERY PLAN
> ---------------------------------------------------------------------------------------
> Merge Join (cost=298.67..387.53 rows=1000 width=8)
> Merge Cond: (big.b = highm.h)
> -> Index Scan Backward using bigi on big (cost=0.00..3050.26 rows=100000 width=4)
> -> Index Scan Backward using highmi on highm (cost=0.00..43.25 rows=1000 width=4)
> (4 rows)

> explain select * from big join high on b=h order by b desc;
> QUERY PLAN
> ---------------------------------------------------------------------------------------
> Merge Join (cost=0.05..88.19 rows=1000 width=8)
> Merge Cond: (big.b = high.h)
> -> Index Scan Backward using bigi on big (cost=0.00..3050.26 rows=100000 width=4)
> -> Index Scan Backward using highi on high (cost=0.00..43.25 rows=1000 width=4)
> (4 rows)

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-patches(at)postgreSQL(dot)org>
Subject: Re: Proposed patch to make mergejoin cost estimation more symmetric
Date: 2007-12-07 09:15:54
Message-ID: 87bq92q305.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-patches

"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Any objections to applying the patch?

I like applying it.

You don't include any negative tests and corner cases as well; cases where the
new code shouldn't be disturbing the results such as a symmetric join or a
join against a single-row table. The comments make me think you ran them but
just didn't show them though.

What about a merge join against an empty table? I suppose there would just be
no statistics?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-patches(at)postgreSQL(dot)org
Subject: Re: Proposed patch to make mergejoin cost estimation more symmetric
Date: 2007-12-07 14:48:42
Message-ID: 19574.1197038922@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-patches

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> I do have a longer term issue that there is no information provided by
> EXPLAIN to allow us to differentiate these two conditions.

Sure there is: the new startup condition affects the estimated plan
startup cost (only) and the existing termination condition affects the
estimated total cost (only).

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: Proposed patch to make mergejoin cost estimation more symmetric
Date: 2007-12-07 14:55:12
Message-ID: 19673.1197039312@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-patches

Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> What about a merge join against an empty table? I suppose there would just be
> no statistics?

Yeah. The defenses against silly results in mergejoinscansel should be
enough to prevent it from doing anything really insane.

regards, tom lane


From: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-patches(at)postgreSQL(dot)org
Subject: Re: Proposed patch to make mergejoin cost estimation more symmetric
Date: 2007-12-07 14:57:49
Message-ID: 20071207145749.GC5192@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-patches

Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > I do have a longer term issue that there is no information provided by
> > EXPLAIN to allow us to differentiate these two conditions.
>
> Sure there is: the new startup condition affects the estimated plan
> startup cost (only) and the existing termination condition affects the
> estimated total cost (only).

Yes, but how can you tell by looking at the explain? I think the point
is that the "fraction that would be skipped" should be reported somehow.

--
Alvaro Herrera http://www.amazon.com/gp/registry/5ZYLFMCVHXC
Criptografía: Poderosa técnica algorítmica de codificación que es
empleada en la creación de manuales de computadores.


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-patches(at)postgreSQL(dot)org
Subject: Re: Proposed patch to make mergejoin cost estimation more symmetric
Date: 2007-12-07 15:14:45
Message-ID: 1197040485.4255.579.camel@ebony.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-patches

On Fri, 2007-12-07 at 09:48 -0500, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > I do have a longer term issue that there is no information provided by
> > EXPLAIN to allow us to differentiate these two conditions.
>
> Sure there is: the new startup condition affects the estimated plan
> startup cost (only) and the existing termination condition affects the
> estimated total cost (only).

I think we all accept that you're the master here. The question is how
will we know to report problems to you? If we all get used to the idea
that sometimes the numbers vary, then we're in the situation where we
have to say to people "its magic". That destroys any feedback loop you
have for problem reporting and then you get out of touch with what is
happening on the ground.

It's a long term issue, so I have no complaints about what you've done,
its just something to think about. I've been exactly here before in one
of my past lives and I don't want to reinvent that particular wheel.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-patches(at)postgreSQL(dot)org
Subject: Re: Proposed patch to make mergejoin cost estimation more symmetric
Date: 2007-12-07 15:15:35
Message-ID: 19957.1197040535@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-patches

Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> writes:
> Yes, but how can you tell by looking at the explain? I think the point
> is that the "fraction that would be skipped" should be reported somehow.

It is: it's directly reflected in the startup cost. Previously, a
mergejoin would always have startup cost equal to the sum of its
input startup costs (hence, zero in the cases of interest here).

regards, tom lane


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <pgsql-patches(at)postgreSQL(dot)org>
Subject: Re: Proposed patch to make mergejoin cost estimationmore symmetric
Date: 2007-12-08 06:57:34
Message-ID: 87tzmtoeqp.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-patches

"Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:

> I think we all accept that you're the master here. The question is how
> will we know to report problems to you?

Hm, I think I agree. The problem is where to draw the line. Ultimately
everything in the statistics tables and more could potentially be relevant.

The other problem is that currently our explain output is a bit of a hack.
It's just text we print out and we're limited in how much data we can put in
that without it being unwieldy.

When we get the table-based or xml-based or some other machine-readable
explain plan we could stuff a lot more data in there and have it be the UI's
responsibility to decide what data to display.

When that happens it would be nice to have the raw data used to generate the
cost estimations. At least the most important factors.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!