Parameterized-path cost comparisons need some work

Lists: pgsql-hackers
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Parameterized-path cost comparisons need some work
Date: 2012-02-28 22:35:14
Message-ID: 29248.1330468514@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I was experimenting today with a test case sent to me by somebody at Red
Hat. The details aren't too important, except that it involves an inner
join on the inside of a left join (where it cannot commute with the left
join). I can reproduce the behavior with a standard regression test
table, if I crank random_page_cost up a bit:

regression=# set random_page_cost TO 5;
SET
regression=# explain select * from tenk1 t1 left join (tenk1 t2 join tenk1 t3 on t2.thousand = t3.unique2) on t1.hundred = t2.hundred where t1.unique1 = 1;
QUERY PLAN
------------------------------------------------------------------------------------------------
Nested Loop Left Join (cost=0.00..507.16 rows=100 width=732)
-> Index Scan using tenk1_unique1 on tenk1 t1 (cost=0.00..10.28 rows=1 width=244)
Index Cond: (unique1 = 1)
-> Nested Loop (cost=0.00..495.63 rows=100 width=488)
-> Index Scan using tenk1_hundred on tenk1 t2 (cost=0.00..446.98 rows=100 width=244)
Index Cond: (t1.hundred = hundred)
-> Index Scan using tenk1_unique2 on tenk1 t3 (cost=0.00..0.48 rows=1 width=244)
Index Cond: (unique2 = t2.thousand)
(8 rows)

regression=# set random_page_cost TO 6;
SET
regression=# explain select * from tenk1 t1 left join (tenk1 t2 join tenk1 t3 on t2.thousand = t3.unique2) on t1.hundred = t2.hundred where t1.unique1 = 1;
QUERY PLAN
---------------------------------------------------------------------------------------------
Hash Right Join (cost=928.30..2573.80 rows=100 width=732)
Hash Cond: (t2.hundred = t1.hundred)
-> Hash Join (cost=916.00..2523.00 rows=10000 width=488)
Hash Cond: (t2.thousand = t3.unique2)
-> Seq Scan on tenk1 t2 (cost=0.00..458.00 rows=10000 width=244)
-> Hash (cost=458.00..458.00 rows=10000 width=244)
-> Seq Scan on tenk1 t3 (cost=0.00..458.00 rows=10000 width=244)
-> Hash (cost=12.29..12.29 rows=1 width=244)
-> Index Scan using tenk1_unique1 on tenk1 t1 (cost=0.00..12.29 rows=1 width=244)
Index Cond: (unique1 = 1)
(10 rows)

WTF? How did it suddenly fail to find the double-nested-loop plan, and
have to settle for a much worse plan?

The reason seems to be that for large enough random_page_cost, the
seqscan on t2 is costed as cheaper than an indexscan that selects a
significant fraction of the table. (Notice the t2 scans are nearly the
same cost in these two examples.) That causes add_path to decide that
the unparameterized seqscan is unconditionally better than the
parameterized indexscan, and throw out the latter. Now it is impossible
to form the parameterized nestloop subplan.

The flaw in this logic, of course, is that the seqscan might be cheaper
than the parameterized indexscan, but *it produces a whole lot more
rows*, meaning that any subsequent join will be a lot more expensive.
Previously add_path didn't have to worry about that, because all
ordinary paths for a given relation produce the same number of rows
(and we studiously kept things like inner indexscan paths out of
add_path's view of the world).

The most obvious thing to do about this is to consider that one path can
dominate another on cost only if it is both cheaper *and* produces no
more rows. But I'm concerned about the cost of inserting yet another
test condition into add_path, which is slow enough already. Has anyone
got an idea for a different formulation that would avoid that?

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Parameterized-path cost comparisons need some work
Date: 2012-02-29 08:18:05
Message-ID: CA+U5nM+3ZXfg0BCAim4tVz9=b3Y9qot_7J6L3riCV85gdsLYSA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Feb 28, 2012 at 10:35 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> The flaw in this logic, of course, is that the seqscan might be cheaper
> than the parameterized indexscan, but *it produces a whole lot more
> rows*, meaning that any subsequent join will be a lot more expensive.
> Previously add_path didn't have to worry about that, because all
> ordinary paths for a given relation produce the same number of rows
> (and we studiously kept things like inner indexscan paths out of
> add_path's view of the world).
>
> The most obvious thing to do about this is to consider that one path can
> dominate another on cost only if it is both cheaper *and* produces no
> more rows.  But I'm concerned about the cost of inserting yet another
> test condition into add_path, which is slow enough already.  Has anyone
> got an idea for a different formulation that would avoid that?

It seems clear that we shouldn't be making that decision at that point.

It would be better to default towards processing fewer rows initially
and then swoop in later to improve decision making on larger plans.
Can't we save the SeqScan costs at every node, then re-add SeqScan
plans as a post-processing step iff the index/nestd loops plans appear
costly? So have an additional post processing step that only cuts in
with larger plans.

Seqscan plans are bad for many reasons, such as pushing data out of
cache, making the result more sensitive to growing data volumes or
selectivity mistakes as well as producing confusing stats for people
trying to add the right indexes.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
To: 'Tom Lane' <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Parameterized-path cost comparisons need some work
Date: 2012-02-29 09:28:04
Message-ID: 003601ccf6c4$7117a660$5346f320$%kapila@huawei.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>The most obvious thing to do about this is to consider that one path can
>dominate another on cost only if it is both cheaper *and* produces no
>more rows. But I'm concerned about the cost of inserting yet another
>test condition into add_path, which is slow enough already. Has anyone
>got an idea for a different formulation that would avoid that?

Will it discard the seq. scan path if the number of rows is more and cost is
less or will it keep both paths and then later based on cost estimation for
join it discards one of them?
Can it be like if seq. scan has low cost and number of rows is only greater
by certain thresh-hold, only then seq. scan will be kept else index scan
will dominate.

-----Original Message-----
From: pgsql-hackers-owner(at)postgresql(dot)org
[mailto:pgsql-hackers-owner(at)postgresql(dot)org] On Behalf Of Tom Lane
Sent: Wednesday, February 29, 2012 4:05 AM
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: [HACKERS] Parameterized-path cost comparisons need some work

I was experimenting today with a test case sent to me by somebody at Red
Hat. The details aren't too important, except that it involves an inner
join on the inside of a left join (where it cannot commute with the left
join). I can reproduce the behavior with a standard regression test
table, if I crank random_page_cost up a bit:

regression=# set random_page_cost TO 5;
SET
regression=# explain select * from tenk1 t1 left join (tenk1 t2 join tenk1
t3 on t2.thousand = t3.unique2) on t1.hundred = t2.hundred where t1.unique1
= 1;
QUERY PLAN

----------------------------------------------------------------------------
--------------------
Nested Loop Left Join (cost=0.00..507.16 rows=100 width=732)
-> Index Scan using tenk1_unique1 on tenk1 t1 (cost=0.00..10.28 rows=1
width=244)
Index Cond: (unique1 = 1)
-> Nested Loop (cost=0.00..495.63 rows=100 width=488)
-> Index Scan using tenk1_hundred on tenk1 t2 (cost=0.00..446.98
rows=100 width=244)
Index Cond: (t1.hundred = hundred)
-> Index Scan using tenk1_unique2 on tenk1 t3 (cost=0.00..0.48
rows=1 width=244)
Index Cond: (unique2 = t2.thousand)
(8 rows)

regression=# set random_page_cost TO 6;
SET
regression=# explain select * from tenk1 t1 left join (tenk1 t2 join tenk1
t3 on t2.thousand = t3.unique2) on t1.hundred = t2.hundred where t1.unique1
= 1;
QUERY PLAN

----------------------------------------------------------------------------
-----------------
Hash Right Join (cost=928.30..2573.80 rows=100 width=732)
Hash Cond: (t2.hundred = t1.hundred)
-> Hash Join (cost=916.00..2523.00 rows=10000 width=488)
Hash Cond: (t2.thousand = t3.unique2)
-> Seq Scan on tenk1 t2 (cost=0.00..458.00 rows=10000 width=244)
-> Hash (cost=458.00..458.00 rows=10000 width=244)
-> Seq Scan on tenk1 t3 (cost=0.00..458.00 rows=10000
width=244)
-> Hash (cost=12.29..12.29 rows=1 width=244)
-> Index Scan using tenk1_unique1 on tenk1 t1 (cost=0.00..12.29
rows=1 width=244)
Index Cond: (unique1 = 1)
(10 rows)

WTF? How did it suddenly fail to find the double-nested-loop plan, and
have to settle for a much worse plan?

The reason seems to be that for large enough random_page_cost, the
seqscan on t2 is costed as cheaper than an indexscan that selects a
significant fraction of the table. (Notice the t2 scans are nearly the
same cost in these two examples.) That causes add_path to decide that
the unparameterized seqscan is unconditionally better than the
parameterized indexscan, and throw out the latter. Now it is impossible
to form the parameterized nestloop subplan.

The flaw in this logic, of course, is that the seqscan might be cheaper
than the parameterized indexscan, but *it produces a whole lot more
rows*, meaning that any subsequent join will be a lot more expensive.
Previously add_path didn't have to worry about that, because all
ordinary paths for a given relation produce the same number of rows
(and we studiously kept things like inner indexscan paths out of
add_path's view of the world).

The most obvious thing to do about this is to consider that one path can
dominate another on cost only if it is both cheaper *and* produces no
more rows. But I'm concerned about the cost of inserting yet another
test condition into add_path, which is slow enough already. Has anyone
got an idea for a different formulation that would avoid that?

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Parameterized-path cost comparisons need some work
Date: 2012-02-29 18:12:08
Message-ID: CA+Tgmob8OGcdA1vYzFJP-SWoiyN16rE-i6wUVSALTSF=XQQw5g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Feb 28, 2012 at 5:35 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> The most obvious thing to do about this is to consider that one path can
> dominate another on cost only if it is both cheaper *and* produces no
> more rows.

Hmm. Are you sure that's the right rule? I am having trouble
constructing an example, but I feel like there might be cases where
it's possible to have path A, not parameterized, path B, parameterized
by R, and path C, parameterized by S, and maybe also path D,
parameterized by both R and S. In that case, I feel like paths B and
C are incomparable. Even if one is cheaper than the other and also
produces fewer rows, they're not the same rows; so the expense of a
subsequent join might be different with one vs. the other.

Even in the simple case where there's only one possible
parameterization, it seems very difficult to compare the cost of a
parameterized path A with the cost of an unparameterized path B. No
matter how much cheaper A is than B, the eventual nested loop join
might be so inefficient as to completely wipe A out. On the flip
side, even if A is more expensive than B, it's nearly always going to
produce at least somewhat fewer rows. There are degenerate cases
where that might not be true, like a parameterized join where the
table we're index-scanning has only one value in the paramaterized
column, but presumably that's rare. So it may be worth keeping A
around just because the subsequent join could turn out to be much
cheaper with the smaller row count. Thus it seems unlikely that we'll
be able to conclude that either A or B is categorically superior.

If you accept that reasoning, then it seems like there's little if any
point in ever comparing a parameterized path to a non-parameterized
path; or of comparing two differently-parameterized paths against each
other. You basically need a separate bucket for each group of paths,
where they compete against each other but not against
differently-parameterized paths; and then after you form the nested
loop, all the paths that are now parameterized the same way (but
weren't before) can finally go head-to-head against each other.

I almost wonder if the bottom-up approach is the wrong way to tackle
this entirely. Suppose we're trying to build paths for {A B}. We
first build unparameterized paths for {A} and {B} and compute join
paths for {A B}. Now we know the cheapest way to build {A B} without
using parameterization, so we can make some deductions about a plan of
the form:

Nested Loop
-> A
-> B (parameterized by A)

In particular, if we take the best cost either overall or for a path
with the pathkeys that will be produced by this join, and divide by
the number of rows for A, a parameterized path for B only makes sense
if the total cost is less than that value. Now we have a meaningful
bound that we can use to limit consideration of paths for B: anything
that's more expensive than that bound should be chucked. If B is just
a single rel that's not that interesting, but if B is a joinrel, then
the bound applies individually to any subset of its members. So we
start replanning B as a separate join problem, but any path for any
baserel or joinrel that exceeds our cutoff cost gets chucked; and if
any rel ends up with no workable paths, then we just forget the whole
thing.

Of course, the problem with this approach (aside from complexity) is
that you might end up planning the same subproblem multiple times with
different cost bounds. You could cache the results of previous
computations so that you only replan it when the cost bound goes up,
but that's still pretty awful. So maybe this is unworkable. But the
current approach seems problematic, too, because you don't really know
enough to throw very much away at the time that you need to make that
decision.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Parameterized-path cost comparisons need some work
Date: 2012-02-29 18:40:05
Message-ID: 4457.1330540805@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Tue, Feb 28, 2012 at 5:35 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> The most obvious thing to do about this is to consider that one path can
>> dominate another on cost only if it is both cheaper *and* produces no
>> more rows.

> Hmm. Are you sure that's the right rule?

On further reflection it had seemed like we might have to treat the
rowcount as another independent metric of value.

> I am having trouble
> constructing an example, but I feel like there might be cases where
> it's possible to have path A, not parameterized, path B, parameterized
> by R, and path C, parameterized by S, and maybe also path D,
> parameterized by both R and S. In that case, I feel like paths B and
> C are incomparable.

Indeed, and the code already knows that. However, in this example, path
A is capable of dominating the other three (being strictly less
parameterized than them), and B and C are each capable of dominating D.
The problem is just that I'd neglected to consider that rowcount now
also becomes a figure of merit.

> Even if one is cheaper than the other and also
> produces fewer rows, they're not the same rows; so the expense of a
> subsequent join might be different with one vs. the other.

Yeah, if the quals being used are different, that's possible in
principle; but I don't foresee being able to estimate such a thing.
I think just using the number of rows is as good as we're likely to
be able to do.

> Even in the simple case where there's only one possible
> parameterization, it seems very difficult to compare the cost of a
> parameterized path A with the cost of an unparameterized path B.

I don't believe this. We can certainly compare costs, and we can
certainly compare rowcount. The possibility that the specific set
of rows selected might affect subsequent join costs seems to me to
be too second-order to justify ignoring these first-order differences.

The variant of the argument that had occurred to me is that if you
consider that rowcount is an independent metric, then it might be that a
less-parameterized path can never dominate a more-parameterized path,
because even if the former is cheaper it should always generate more
rows. However, I'm not convinced that I believe that --- in particular,
with a poor choice of parameterizing quals it might not be true.
If we did have a more-parameterized path that wasn't any more selective,
we'd really want add_path to notice that and throw it out.

> I almost wonder if the bottom-up approach is the wrong way to tackle
> this entirely.

I don't find this idea too attractive ... and in any case we're not
doing rm -rf src/backend/optimizer/ and start over for 9.2.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Parameterized-path cost comparisons need some work
Date: 2012-02-29 19:02:42
Message-ID: CA+Tgmobp-nAtPjxGCwhVq1=Zn1XXKpfj-G2jcyY6KmHOsOS-qg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 29, 2012 at 1:40 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> I am having trouble
>> constructing an example, but I feel like there might be cases where
>> it's possible to have path A, not parameterized, path B, parameterized
>> by R, and path C, parameterized by S, and maybe also path D,
>> parameterized by both R and S.  In that case, I feel like paths B and
>> C are incomparable.
>
> Indeed, and the code already knows that.  However, in this example, path
> A is capable of dominating the other three (being strictly less
> parameterized than them), and B and C are each capable of dominating D.
> The problem is just that I'd neglected to consider that rowcount now
> also becomes a figure of merit.

In theory, yes, but in practice, won't it nearly always be the case
that a less parameterized path generates more rows, and a more
parameterized path generates less rows, so that neither dominates the
other?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Parameterized-path cost comparisons need some work
Date: 2012-02-29 19:08:43
Message-ID: 4952.1330542523@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Wed, Feb 29, 2012 at 1:40 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Indeed, and the code already knows that. However, in this example, path
>> A is capable of dominating the other three (being strictly less
>> parameterized than them), and B and C are each capable of dominating D.
>> The problem is just that I'd neglected to consider that rowcount now
>> also becomes a figure of merit.

> In theory, yes, but in practice, won't it nearly always be the case
> that a less parameterized path generates more rows, and a more
> parameterized path generates less rows, so that neither dominates the
> other?

I think you're just assuming that without any solid evidence. My point
is precisely that if the more-parameterized path *fails* to generate
fewer rows, we want add_path to notice that and throw it out (or at
least be able to throw it out, if there's not another reason to keep it).

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Parameterized-path cost comparisons need some work
Date: 2012-02-29 22:40:23
Message-ID: CA+TgmoYgmvS7oKF5_r6ManXR1xd3JQW1L0Po=NEC0xzcKQXTyQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 29, 2012 at 2:08 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Wed, Feb 29, 2012 at 1:40 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Indeed, and the code already knows that.  However, in this example, path
>>> A is capable of dominating the other three (being strictly less
>>> parameterized than them), and B and C are each capable of dominating D.
>>> The problem is just that I'd neglected to consider that rowcount now
>>> also becomes a figure of merit.
>
>> In theory, yes, but in practice, won't it nearly always be the case
>> that a less parameterized path generates more rows, and a more
>> parameterized path generates less rows, so that neither dominates the
>> other?
>
> I think you're just assuming that without any solid evidence.  My point
> is precisely that if the more-parameterized path *fails* to generate
> fewer rows, we want add_path to notice that and throw it out (or at
> least be able to throw it out, if there's not another reason to keep it).

Well, my "evidence" is that a parameterized path should pretty much
always include a paramaterized path somewhere in there - otherwise,
what is parameterization doing for us? And that's going to reduce the
row count. I may be missing something, but I'm confused as to why
this isn't nearly tautological.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Parameterized-path cost comparisons need some work
Date: 2012-02-29 23:01:44
Message-ID: 25460.1330556504@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Wed, Feb 29, 2012 at 2:08 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> I think you're just assuming that without any solid evidence. My point
>> is precisely that if the more-parameterized path *fails* to generate
>> fewer rows, we want add_path to notice that and throw it out (or at
>> least be able to throw it out, if there's not another reason to keep it).

> Well, my "evidence" is that a parameterized path should pretty much
> always include a paramaterized path somewhere in there - otherwise,
> what is parameterization doing for us?

Well, yes, we know that much.

> And that's going to reduce the
> row count. I may be missing something, but I'm confused as to why
> this isn't nearly tautological.

We don't know that --- I will agree it's likely, but that doesn't make
it so certain that we can assume it without checking. A join condition
won't necessarily eliminate any rows.

(... thinks about that for awhile ...) One thing we could possibly do
is have indxpath.c arbitrarily reject parameterizations that don't
produce a smaller estimated number of rows than an unparameterized scan.
Admittedly, this still doesn't *prove* the assumption for join
relations, but maybe it brings the odds to where it's okay for add_path
to make such an assumption.

(... thinks some more ...) No, that doesn't get us there, because that
doesn't establish that a more-parameterized path produces fewer rows
than some path that requires less parameterization, yet not none at
all. You really want add_path carrying out those comparisons. In your
previous example, it's entirely possible that path D is dominated by B
or C because of poor choices of join quals.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Parameterized-path cost comparisons need some work
Date: 2012-03-01 13:47:18
Message-ID: CA+TgmoY05or-GS3rKPrehVCin7VS6+yYRR0KyQLV_ZfTXK7-4A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 29, 2012 at 6:01 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Well, my "evidence" is that a parameterized path should pretty much
>> always include a paramaterized path somewhere in there - otherwise,
>> what is parameterization doing for us?
>
> Well, yes, we know that much.

I didn't write what I meant to write there. I meant to say: a
parameterized path is presumably going to contain a parameterized
*index scan* somewhere within. So somewhere we're going to have
something of the form

-> Index Scan blah on blah
Index Cond: someattr = $1

And if that path weren't parameterized, we'd have to read the whole
relation, either with a full index scan, or a sequential scan. Or, I
mean, maybe there's a filter condition, so that no path needs to
retrieve the *whole* relation, but even there the index cond is on top
of that, and it's probably doing something, though I suppose you're
right that there might be cases where it doesn't.

>> And that's going to reduce the
>> row count.  I may be missing something, but I'm confused as to why
>> this isn't nearly tautological.
>
> We don't know that --- I will agree it's likely, but that doesn't make
> it so certain that we can assume it without checking.  A join condition
> won't necessarily eliminate any rows.
>
> (... thinks about that for awhile ...)  One thing we could possibly do
> is have indxpath.c arbitrarily reject parameterizations that don't
> produce a smaller estimated number of rows than an unparameterized scan.
> Admittedly, this still doesn't *prove* the assumption for join
> relations, but maybe it brings the odds to where it's okay for add_path
> to make such an assumption.

That seems to make sense.

> (... thinks some more ...)  No, that doesn't get us there, because that
> doesn't establish that a more-parameterized path produces fewer rows
> than some path that requires less parameterization, yet not none at
> all.  You really want add_path carrying out those comparisons.  In your
> previous example, it's entirely possible that path D is dominated by B
> or C because of poor choices of join quals.

I'm not following this part. Can you explain further? It seems to me
at any rate that we could get pretty far if we could just separate
parameterized paths and unparameterized paths into separate buckets.
Even if we have to do some extra work when comparing parameterized
paths *to each other*, we'd gain a fair amount by avoiding comparing
any of them with the unparameterized paths. Or at least, I hope so.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Parameterized-path cost comparisons need some work
Date: 2012-03-04 05:20:49
Message-ID: 5189.1330838449@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Wed, Feb 29, 2012 at 6:01 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> (... thinks some more ...) No, that doesn't get us there, because that
>> doesn't establish that a more-parameterized path produces fewer rows
>> than some path that requires less parameterization, yet not none at
>> all. You really want add_path carrying out those comparisons. In your
>> previous example, it's entirely possible that path D is dominated by B
>> or C because of poor choices of join quals.

> I'm not following this part. Can you explain further? It seems to me
> at any rate that we could get pretty far if we could just separate
> parameterized paths and unparameterized paths into separate buckets.

To try to get some definitive info about this, I instrumented add_path
to emit a log message any time it compared two paths for which one's
parameterization was a strict subset of the other's, and yet the first
was not estimated to return more rows. Sure enough, I got a lot of
messages, just by running the regression tests (and even more with some
of the other test cases I'd been using for parameterized-path testing).
All of the hits were for equal numbers of rows, though -- there were
no cases with a rows difference in the opposite direction from
expectations.

After looking at the results, I think that the fallacy in what we've
been discussing is this: a parameterized path may well have some extra
selectivity over a less-parameterized one, but perhaps *not enough to be
meaningful*. The cases I was getting hits on were where the rowcount
estimate got rounded off to be the same as for the less-parameterized
path. (In this connection it's worth noting that most of the hits were
for rowcount estimates of only 1 or 2 rows.) So basically, the scenario
is where you have restriction clauses that are already enough to get
down to a small number of rows retrieved, and then you have some join
clauses that are not very selective and don't reduce the rowcount any
further. Or maybe you have some nicely selective join clauses, and then
adding more joins to some other relations doesn't help any further.

In situations like this, we want add_path to reject the ineffective
more-parameterized path as not being an improvement over the
less-parameterized path. Not having it do so might save cycles in
add_path itself, but we'd be being penny-wise and pound-foolish, because
not getting rid of the useless paths will cost us a lot more at the next
join level up.

So I'm back to thinking we need to look explicitly at the rowcount
comparison as well as all the existing conditions in add_path.

One annoying thing about that is that it will reduce the usefulness of
add_path_precheck, because that's called before we compute the rowcount
estimates (and indeed not having to make the rowcount estimates is one
of the major savings from the precheck). I think what we'll have to do
is assume that a difference in parameterization could result in a
difference in rowcount, and hence only a dominant path with exactly the
same parameterization can result in failing the precheck.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Parameterized-path cost comparisons need some work
Date: 2012-03-05 01:53:53
Message-ID: CA+TgmoZctmaqzZ-wQRjHbbftyWZSEURiEG09MJwoUnVMk6Me0w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Mar 4, 2012 at 12:20 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> After looking at the results, I think that the fallacy in what we've
> been discussing is this: a parameterized path may well have some extra
> selectivity over a less-parameterized one, but perhaps *not enough to be
> meaningful*.  The cases I was getting hits on were where the rowcount
> estimate got rounded off to be the same as for the less-parameterized
> path.  (In this connection it's worth noting that most of the hits were
> for rowcount estimates of only 1 or 2 rows.)  So basically, the scenario
> is where you have restriction clauses that are already enough to get
> down to a small number of rows retrieved, and then you have some join
> clauses that are not very selective and don't reduce the rowcount any
> further.  Or maybe you have some nicely selective join clauses, and then
> adding more joins to some other relations doesn't help any further.

OK, makes sense.

> One annoying thing about that is that it will reduce the usefulness of
> add_path_precheck, because that's called before we compute the rowcount
> estimates (and indeed not having to make the rowcount estimates is one
> of the major savings from the precheck).  I think what we'll have to do
> is assume that a difference in parameterization could result in a
> difference in rowcount, and hence only a dominant path with exactly the
> same parameterization can result in failing the precheck.

I wish we had some way of figuring out how much this - and maybe some
of the other new planning possibilities like index-only scans - were
going to cost us on typical medium-to-large join problems. In the
absence of real-world data it's hard to know how worried we should be.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Parameterized-path cost comparisons need some work
Date: 2012-03-05 18:02:35
Message-ID: 23222.1330970555@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Sun, Mar 4, 2012 at 12:20 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> One annoying thing about that is that it will reduce the usefulness of
>> add_path_precheck, because that's called before we compute the rowcount
>> estimates (and indeed not having to make the rowcount estimates is one
>> of the major savings from the precheck). I think what we'll have to do
>> is assume that a difference in parameterization could result in a
>> difference in rowcount, and hence only a dominant path with exactly the
>> same parameterization can result in failing the precheck.

> I wish we had some way of figuring out how much this - and maybe some
> of the other new planning possibilities like index-only scans - were
> going to cost us on typical medium-to-large join problems. In the
> absence of real-world data it's hard to know how worried we should be.

I have been doing testing against a couple of complex queries supplied
by Kevin and Andres. It'd be nice to have a larger sample though ...

I'm a bit concerned that this change will end up removing most of the
usefulness of add_path_precheck. I would not actually cry if that went
away again, because hacking things like that greatly complicated the API
of the join cost functions. But it's nervous-making to be making
decisions like that on the basis of rather small sets of queries.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Parameterized-path cost comparisons need some work
Date: 2012-03-05 18:14:02
Message-ID: CA+TgmobWqQ9ZQQ8CjTJ=pNUU9B2u-a7j=k4qP3gkSy57hQGOuA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Mar 5, 2012 at 1:02 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>  But it's nervous-making to be making
> decisions like that on the basis of rather small sets of queries.

I heartily agree.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Parameterized-path cost comparisons need some work
Date: 2012-04-12 19:27:57
Message-ID: 27109.1334258877@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> So I'm back to thinking we need to look explicitly at the rowcount
> comparison as well as all the existing conditions in add_path.

> One annoying thing about that is that it will reduce the usefulness of
> add_path_precheck, because that's called before we compute the rowcount
> estimates (and indeed not having to make the rowcount estimates is one
> of the major savings from the precheck). I think what we'll have to do
> is assume that a difference in parameterization could result in a
> difference in rowcount, and hence only a dominant path with exactly the
> same parameterization can result in failing the precheck.

I've been experimenting some more with this, and have observed that in
the test cases I'm using, adding rowcount as an additional criterion in
add_path doesn't cost much of anything: it doesn't seem to affect the
runtime significantly, and it only seldom changes the keep/reject
decisions. So that's good news.

Unfortunately, the precheck situation is actually worse than I thought:
there are plenty of cases where parameterized paths can have the exact
same parameterization (that is, same sets of required outer rels) and
yet have different row estimates, because one might use different join
clauses than the other. All you need to be at risk is more than one
join clause between the same two rels, with those clauses matching
different indexes or index columns. This entirely destroys the logic of
add_path_precheck as currently constituted, because it implies we can
never reject a parameterized path before computing its rowcount.

I said upthread that I wouldn't cry if we got rid of add_path_precheck
again, but it still looks like that would cost us a noticeable hit in
planning speed. I've considered three other alternatives:

1. Lobotomize add_path_precheck so it always returns true for a
parameterized path. This sounds horrid, but in the test cases I'm using
it seems that this only results in doing the full path construction for
a very small number of additional paths.

2. Refactor so that we obtain the row estimate during the first not the
second cost estimation step. This doesn't look promising; I have not
actually coded and tested it, but eyeballing gprof numbers for the
current code suggests it would give back a considerable percentage of
the savings from having a precheck at all.

3. Rearrange plan generation so that a parameterized path always uses
all join clauses available from the specified outer rels. (Any that
don't work as indexquals would have to be applied as filter conditions.)
If we did that, then we would be back to a situation where all paths
with the same parameterization should yield the same rowcount, thus
justifying letting add_path_precheck work as it does now.

#3 would amount to pushing quals that would otherwise be checked at the
nestloop join node down to the lowest inner-relation level where they
could be checked. This is something I'd suspected would be a good idea
to start with, but hadn't gotten around to implementing for non-index
quals. It had not occurred to me that it might simplify cost estimation
to always do that.

I'm going to take a closer look at #3, but it may not be practical to
try to squeeze it into 9.2; if not, I think #1 will do as a stopgap.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Parameterized-path cost comparisons need some work
Date: 2012-04-13 14:17:48
Message-ID: CA+TgmoY1GxgZB-Dk3YjA6bMP4r=npW-HbWVrbmMw4uDK302yRg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Apr 12, 2012 at 3:27 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> 1. Lobotomize add_path_precheck so it always returns true for a
> parameterized path.  This sounds horrid, but in the test cases I'm using
> it seems that this only results in doing the full path construction for
> a very small number of additional paths.

Query planner engineering is hard, because it's hard to predict what
kind of queries people will write, but this seems basically sane to
me. Given that (if I recall our previous discuss on this point
correctly) we avoid generating parameterized paths in situations where
we could have simply revised the join order instead, we should only
really be getting any parameterized paths at all in situations where
they are likely to help. Queries involving only inner joins, for
example, should never need a parameterized path covering more than a
single baserel; and even if you've got outer joins in the mix, most of
my queries tend to look like A IJ B IJ C IJ D LJ E LJ F LJ G, rather
than A IJ (B LJ C) which is where we actually need this machinery. If
we spend a little more effort in that case it's quite likely to be
worth it; the trick is just to keep the extra work from bleeding into
the cases where it won't help.

> 3. Rearrange plan generation so that a parameterized path always uses
> all join clauses available from the specified outer rels.  (Any that
> don't work as indexquals would have to be applied as filter conditions.)
> If we did that, then we would be back to a situation where all paths
> with the same parameterization should yield the same rowcount, thus
> justifying letting add_path_precheck work as it does now.
>
> #3 would amount to pushing quals that would otherwise be checked at the
> nestloop join node down to the lowest inner-relation level where they
> could be checked.  This is something I'd suspected would be a good idea
> to start with, but hadn't gotten around to implementing for non-index
> quals.  It had not occurred to me that it might simplify cost estimation
> to always do that.

This seems like it could be quite a significant win. It doesn't
really matter in <= 9.1 because in an old-style parameterized nestloop
the join filter is going to get applied immediately after the index
filter anyway, though I guess it's possible that you might save a
little bit by optimizing the order in which multiple filter conditions
are applied. But if there can be intermediate joins in there then
it's a big deal; and the fact that it makes it easier to compare paths
and prune away bad ones earlier seems like a major benefit as well.
So I would be in favor of doing this if possible, but...

> I'm going to take a closer look at #3, but it may not be practical to
> try to squeeze it into 9.2; if not, I think #1 will do as a stopgap.

....I agree with this, too.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Parameterized-path cost comparisons need some work
Date: 2012-04-17 16:14:41
Message-ID: 10175.1334679281@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Thu, Apr 12, 2012 at 3:27 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> 3. Rearrange plan generation so that a parameterized path always uses
>> all join clauses available from the specified outer rels.  (Any that
>> don't work as indexquals would have to be applied as filter conditions.)
>> If we did that, then we would be back to a situation where all paths
>> with the same parameterization should yield the same rowcount, thus
>> justifying letting add_path_precheck work as it does now.
>>
>> #3 would amount to pushing quals that would otherwise be checked at the
>> nestloop join node down to the lowest inner-relation level where they
>> could be checked.  This is something I'd suspected would be a good idea
>> to start with, but hadn't gotten around to implementing for non-index
>> quals.  It had not occurred to me that it might simplify cost estimation
>> to always do that.

> This seems like it could be quite a significant win.

I've been hacking away on a patch to do this, and attached is something
that I think is pretty close to committable. It needs another going-over
and some new regression test cases, but it seems to work, and it fixes a
number of things besides the above-mentioned issue. In particular, this
has a much more principled approach than HEAD does to the problem of where
to place parameterizable join clauses in the plan tree; that can be seen
in the one change in the existing regression tests, where we no longer
generate a redundant upper-level copy of an OR join clause that the old
code wasn't bright enough to get rid of.

The patch is a bit large because I chose to revise the data representation.
Instead of each Path having its own required_outer, rows, and
param_clauses fields, now a parameterized Path has a pointer to a
ParamPathInfo struct that it shares with other Paths for the same rel and
the same parameterization. This guarantees that such paths will have the
same estimated rowcount, because we only compute that once per
parameterization, which should save some work as well as making the world
safe for add_path_precheck.

The only place where this approach proved a bit tricky was in handling
AppendPaths and MergeAppendPaths, which didn't surprise me because that
was a rough spot for the old way too (and indeed they aren't handled
completely correctly in HEAD). A parameterized path is now *required*
to enforce all clauses that the join clause movement rules assign to it;
but Append and MergeAppend don't do qual checking, and I didn't feel like
changing that. The method that I have settled on is to require all child
paths of a parameterized append to have the exact same parameterization,
IOW we push the qual checks down below the append. Now the interesting
point about that is that we want to support Appends wherein some children
are seqscans and some are indexscans (consider a partitioned table where
the parent is a dummy empty table with no indexes). The "raw" situation
there is that we'll have a plain seqscan path for the parent and then a
collection of similarly-parameterized indexscan paths for the live
partition children. To make it possible to convert that case into a
parameterized append path, I added parameterization support to seqscans
and then wrote "reparameterize_path", which changes a Path to increase
its parameterization level (and thereby assign it more pushed-down join
clauses to check at runtime). That allows us to reconfigure the seqscan
to match the other children. I've also added such support to
SubqueryScan, on the grounds that the other common use of append paths is
UNION ALL across subqueries. We might later want to add parameterization
support to other path types, but this seemed like enough for the moment.

BTW, after writing the code for it I decided to remove creation of
parameterized MergeAppendPaths from allpaths.c, though there is still some
support for them elsewhere. On reflection it seemed to me that the code
was willing to create far too many of these, much more than their
potential usefulness could justify (remember that parameterized paths must
be on the inside of a nestloop, so their sort ordering is typically of
marginal use). We can put that back if we can think of a more restrictive
heuristic for when to create them.

The core of the patch is in the new functions get_baserel_parampathinfo
and get_joinrel_parampathinfo, which look up or construct ParamPathInfos,
and join_clause_is_parameterizable_for and
join_clause_is_parameterizable_within, which control
movement of parameterizable join clauses. (I'm not that thrilled with the
names of the latter two functions, anybody got a better idea?) The rest
of it is pretty much boilerplate changes and replacing ad-hoc logic with
uses of this stuff.

I have a couple of other ideas in mind in the way of mop-up, but they are
not in this patch to keep it from bloating even more. First, I'm thinking
we should get rid of RelOptInfo.baserestrictcost, thus forcing all scan
cost estimators to invoke cost_qual_eval explicitly. That field has been
vestigial from a planning-speed standpoint for a long time, ever since we
started caching eval costs in RestrictInfos. The most commonly used cost
estimators don't use it anymore as of this patch, and it'd likely be best
to have a uniform coding pattern there. Second, I've gotten dissatisfied
with the terminology "required_outer" that was used in the original param
plans patch. I'm considering a global search and replace with
"param_relids" or some variant of that.

Comments?

regards, tom lane

Attachment Content-Type Size
param-plans-revision-1.patch text/x-patch 183.6 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Parameterized-path cost comparisons need some work
Date: 2012-04-17 18:46:23
Message-ID: CA+TgmoZfOxNrnMJp8NgL08Y78-fh7TMCxDtCxDpoF6VqLFAXpg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Apr 17, 2012 at 12:14 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I've been hacking away on a patch to do this, and attached is something
> that I think is pretty close to committable.  It needs another going-over
> and some new regression test cases, but it seems to work, and it fixes a
> number of things besides the above-mentioned issue.  In particular, this
> has a much more principled approach than HEAD does to the problem of where
> to place parameterizable join clauses in the plan tree; that can be seen
> in the one change in the existing regression tests, where we no longer
> generate a redundant upper-level copy of an OR join clause that the old
> code wasn't bright enough to get rid of.
>
> The patch is a bit large because I chose to revise the data representation.
> Instead of each Path having its own required_outer, rows, and
> param_clauses fields, now a parameterized Path has a pointer to a
> ParamPathInfo struct that it shares with other Paths for the same rel and
> the same parameterization.  This guarantees that such paths will have the
> same estimated rowcount, because we only compute that once per
> parameterization, which should save some work as well as making the world
> safe for add_path_precheck.

Seems reasonable.

> The only place where this approach proved a bit tricky was in handling
> AppendPaths and MergeAppendPaths, which didn't surprise me because that
> was a rough spot for the old way too (and indeed they aren't handled
> completely correctly in HEAD).  A parameterized path is now *required*
> to enforce all clauses that the join clause movement rules assign to it;
> but Append and MergeAppend don't do qual checking, and I didn't feel like
> changing that.  The method that I have settled on is to require all child
> paths of a parameterized append to have the exact same parameterization,
> IOW we push the qual checks down below the append.  Now the interesting
> point about that is that we want to support Appends wherein some children
> are seqscans and some are indexscans (consider a partitioned table where
> the parent is a dummy empty table with no indexes).  The "raw" situation
> there is that we'll have a plain seqscan path for the parent and then a
> collection of similarly-parameterized indexscan paths for the live
> partition children.  To make it possible to convert that case into a
> parameterized append path, I added parameterization support to seqscans
> and then wrote "reparameterize_path", which changes a Path to increase
> its parameterization level (and thereby assign it more pushed-down join
> clauses to check at runtime).  That allows us to reconfigure the seqscan
> to match the other children.  I've also added such support to
> SubqueryScan, on the grounds that the other common use of append paths is
> UNION ALL across subqueries.  We might later want to add parameterization
> support to other path types, but this seemed like enough for the moment.

OK.

> BTW, after writing the code for it I decided to remove creation of
> parameterized MergeAppendPaths from allpaths.c, though there is still some
> support for them elsewhere.  On reflection it seemed to me that the code
> was willing to create far too many of these, much more than their
> potential usefulness could justify (remember that parameterized paths must
> be on the inside of a nestloop, so their sort ordering is typically of
> marginal use).  We can put that back if we can think of a more restrictive
> heuristic for when to create them.

I guess the case in which this would matter is if you wrote something
like A LJ (B IJ C) where B and/or C has child tables and the best
method of joining them to each other is a marge join. That doesn't
seem all that likely; normally a hash join or nested loop will be
better. On the flip side I can't see that generating a bunch of extra
paths is going to hurt you much there either; they will fall away
pretty quickly if they aren't useful for merging. Now, if you have
something like A IJ B or A LJ B, where B is partitioned, it's clearly
a waste of time to generate parameterized paths with pathkeys.

> The core of the patch is in the new functions get_baserel_parampathinfo
> and get_joinrel_parampathinfo, which look up or construct ParamPathInfos,
> and join_clause_is_parameterizable_for and
> join_clause_is_parameterizable_within, which control
> movement of parameterizable join clauses.  (I'm not that thrilled with the
> names of the latter two functions, anybody got a better idea?)  The rest
> of it is pretty much boilerplate changes and replacing ad-hoc logic with
> uses of this stuff.

Parameterizable is such a mouthful. I wish we had a better word.

> I have a couple of other ideas in mind in the way of mop-up, but they are
> not in this patch to keep it from bloating even more.  First, I'm thinking
> we should get rid of RelOptInfo.baserestrictcost, thus forcing all scan
> cost estimators to invoke cost_qual_eval explicitly.  That field has been
> vestigial from a planning-speed standpoint for a long time, ever since we
> started caching eval costs in RestrictInfos.  The most commonly used cost
> estimators don't use it anymore as of this patch, and it'd likely be best
> to have a uniform coding pattern there.  Second, I've gotten dissatisfied
> with the terminology "required_outer" that was used in the original param
> plans patch.  I'm considering a global search and replace with
> "param_relids" or some variant of that.

Personally, I find required_outer more clear. YMMV.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parameterized-path cost comparisons need some work
Date: 2012-04-17 19:01:29
Message-ID: 1334689263-sup-361@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Excerpts from Robert Haas's message of mar abr 17 15:46:23 -0300 2012:
> On Tue, Apr 17, 2012 at 12:14 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> > The core of the patch is in the new functions get_baserel_parampathinfo
> > and get_joinrel_parampathinfo, which look up or construct ParamPathInfos,
> > and join_clause_is_parameterizable_for and
> > join_clause_is_parameterizable_within, which control
> > movement of parameterizable join clauses.  (I'm not that thrilled with the
> > names of the latter two functions, anybody got a better idea?)  The rest
> > of it is pretty much boilerplate changes and replacing ad-hoc logic with
> > uses of this stuff.
>
> Parameterizable is such a mouthful. I wish we had a better word.

P13e ?

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Parameterized-path cost comparisons need some work
Date: 2012-04-17 19:05:52
Message-ID: 13665.1334689552@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Tue, Apr 17, 2012 at 12:14 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> BTW, after writing the code for it I decided to remove creation of
>> parameterized MergeAppendPaths from allpaths.c, though there is still some
>> support for them elsewhere. On reflection it seemed to me that the code
>> was willing to create far too many of these, much more than their
>> potential usefulness could justify (remember that parameterized paths must
>> be on the inside of a nestloop, so their sort ordering is typically of
>> marginal use). We can put that back if we can think of a more restrictive
>> heuristic for when to create them.

> I guess the case in which this would matter is if you wrote something
> like A LJ (B IJ C) where B and/or C has child tables and the best
> method of joining them to each other is a marge join.

Well, we already made a policy decision that we weren't going to try
very hard to support merge joins inside parameterized subtrees, because
the potential growth in planning time looked nasty. My thought was that
we might resurrect the parameterized MergeAppendPath code when and if
we reverse that decision. At the moment, in fact, I believe that
add_path is pretty nearly guaranteed to discard a parameterized
MergeAppendPath immediately upon submission, because the fact that it's
sorted isn't given any weight for add_path comparisons, and cost-wise
it's going to look worse than the similarly parameterized plain Append
that we would have submitted just before.

>> Second, I've gotten dissatisfied
>> with the terminology "required_outer" that was used in the original param
>> plans patch. I'm considering a global search and replace with
>> "param_relids" or some variant of that.

> Personally, I find required_outer more clear. YMMV.

Perhaps. What's bothering me is the potential for confusion with outer
joins; the parameter-supplying rels are *not* necessarily on the other
side of an outer join. Anybody else have an opinion about that?

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Parameterized-path cost comparisons need some work
Date: 2012-04-17 19:30:39
Message-ID: CA+TgmobwFS9yHQ8vsaxFd1MROxjvOpGv3f3dTCOFgEY50RgeEA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Apr 17, 2012 at 3:05 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Well, we already made a policy decision that we weren't going to try
> very hard to support merge joins inside parameterized subtrees, because
> the potential growth in planning time looked nasty.  My thought was that
> we might resurrect the parameterized MergeAppendPath code when and if
> we reverse that decision.  At the moment, in fact, I believe that
> add_path is pretty nearly guaranteed to discard a parameterized
> MergeAppendPath immediately upon submission, because the fact that it's
> sorted isn't given any weight for add_path comparisons, and cost-wise
> it's going to look worse than the similarly parameterized plain Append
> that we would have submitted just before.

OK.

>>> Second, I've gotten dissatisfied
>>> with the terminology "required_outer" that was used in the original param
>>> plans patch.  I'm considering a global search and replace with
>>> "param_relids" or some variant of that.
>
>> Personally, I find required_outer more clear.  YMMV.
>
> Perhaps.  What's bothering me is the potential for confusion with outer
> joins; the parameter-supplying rels are *not* necessarily on the other
> side of an outer join.  Anybody else have an opinion about that?

Well, we also use the words "inner" and "outer" to refer to the sides
of any join, regardless of type. Maybe we could call it
"requires_nestloop_with" or "requires_join_to" or something like that.
The thing I don't like about "param_relids" is that "param" can refer
to an awful lot of different things.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Parameterized-path cost comparisons need some work
Date: 2012-04-17 20:33:37
Message-ID: 15371.1334694817@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Tue, Apr 17, 2012 at 3:05 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Personally, I find required_outer more clear. YMMV.

>> Perhaps. What's bothering me is the potential for confusion with outer
>> joins; the parameter-supplying rels are *not* necessarily on the other
>> side of an outer join. Anybody else have an opinion about that?

> Well, we also use the words "inner" and "outer" to refer to the sides
> of any join, regardless of type.

True.

> The thing I don't like about "param_relids" is that "param" can refer
> to an awful lot of different things.

Fair enough. I'll leave required_outer alone then, and adjust some
names in the new patch to be consistent with that.

As far as the other naming issue goes, it struck me that instead of
join_clause_is_parameterizable_xxx, we could call those functions
join_clause_is_movable_xxx, assuming it's okay to commandeer
the notion of "movable" for this particular usage. It seems a bit
less generic than "parameterizable" anyway. The "for" and "within"
bits don't fit with that though. The first one could reasonably
be called "join_clause_is_movable_to", since we're checking if it's
okay to push the clause to precisely that base relation, but I'm a
bit at a loss for a modifier for the other one. "into" would be
appropriate, but "to" and "into" are so close together that people
might get confused.

regards, tom lane


From: Greg Stark <stark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Parameterized-path cost comparisons need some work
Date: 2012-04-18 17:30:16
Message-ID: CAM-w4HNVEVV_OgnYJUtcr8eLwwXx7CJMRXw4Qd9RKxMnOFBAkw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Apr 17, 2012 at 5:14 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I've been hacking away on a patch to do this, and attached is something
> that I think is pretty close to committable.

I have nothing to add but I just wanted to say thank you for taking
the time to write up this explanation. Even when some of us don't
comment on some of the longer, more technical emails, we're definitely
still reading them and I at least find them invaluable for trying to
keep up with the changes over time. It's also been key to establishing
this practice as a good value for other patch authors to try to
emulate.

--
greg


From: Hakan Kocaman <hkocam(at)googlemail(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Parameterized-path cost comparisons need some work
Date: 2012-04-18 17:43:45
Message-ID: CAKoGa8KWZDjsPm2gtsquUx59iRKxN1oWORBF=G_N0TbMJD8Rrw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2012/4/18 Greg Stark <stark(at)mit(dot)edu>

> On Tue, Apr 17, 2012 at 5:14 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > I've been hacking away on a patch to do this, and attached is something
> > that I think is pretty close to committable.
>
> [..]Even when some of us don't
> comment on some of the longer, more technical emails, we're definitely
> still reading them and I at least find them invaluable for trying to
> keep up with the changes over time. [..]
>

+1

hakan


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Parameterized-path cost comparisons need some work
Date: 2012-04-19 01:13:03
Message-ID: 22521.1334797983@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

So while testing this patch I've found out that there is a pretty nasty
bug in HEAD as well as in my current formulation of the patch. Here
is an example using the regression database:

select count(*) from
(tenk1 a join tenk1 b on a.unique1=b.unique2)
left join tenk1 c on a.unique2 = b.unique1 and c.thousand = a.thousand
join int4_tbl on b.thousand = f1;

The correct answer to this query is 10, according to all previous PG
branches, but HEAD is reporting zero. A look at the query plan
soon finds the smoking gun:

Aggregate (cost=264.29..264.30 rows=1 width=0)
-> Nested Loop Left Join (cost=0.00..264.16 rows=50 width=0)
Join Filter: (a.unique2 = b.unique1)
-> Nested Loop (cost=0.00..234.21 rows=50 width=12)
Join Filter: (b.unique2 = a.unique1)
-> Nested Loop (cost=0.00..211.85 rows=50 width=8)
-> Seq Scan on int4_tbl (cost=0.00..1.05 rows=5 width=4)
-> Index Scan using tenk1_thous_tenthous on tenk1 b (cost=0.00..42.04 rows=10 width=12)
Index Cond: (thousand = int4_tbl.f1)
-> Index Scan using tenk1_unique2 on tenk1 a (cost=0.00..0.43 rows=1 width=12)
Index Cond: (unique2 = b.unique1)
-> Index Only Scan using tenk1_thous_tenthous on tenk1 c (cost=0.00..0.45 rows=10 width=4)
Index Cond: (thousand = a.thousand)

The condition a.unique2 = b.unique1 is getting pushed down into
the a/b join, where it should not go; applying it there causes
a/b join rows to be discarded when they ought to survive and
then be null-extended at the left join.

What this demonstrates is that the rule for identifying safe movable
clauses that's used in HEAD is not good enough. It rejects clauses
that reference relations that are on the inside of a left join relative
to the target relation --- but the problematic clause doesn't actually
reference c, so that doesn't trigger. The issue would go away if we
examined required_relids instead of clause_relids, but that causes
other, perfectly safe, optimizations to be discarded.

I believe that the correct formulation of the join clause movement rule
is "outer join clauses can't be pushed into the left-hand side of their
outer join". However, the present representation of clauses doesn't
provide enough information to test this cheaply during scan planning
(indeed maybe it can't be done at all after distribute_qual_to_rels,
because we don't retain any explicit memory of which clauses are above
or below which outer joins). Looking at nullable_relids isn't the right
thing because what that tells you about is what's on the right-hand side
of the outer join, not the left side.

So I'm coming to the conclusion that to get this right, we need to
add another relids field to RestrictInfo and have it filled in during
distribute_qual_to_rels. This is really probably going to end up
cheaper than what we have today, because the join clause movement
tests are somewhat expensive as the code stands, and it should be
possible to reduce the number of operations needed there.

What I have in mind is that the new field would be named something like
outer_left_relids, and would list the relids of all base rels that are
in the left-hand side of the outer join that the clause belongs to
(or be NULL if the clause isn't an outer-join clause). This is cheaply
constructable during distribute_qual_to_rels, we just are not bothering
to record the information at present. Then the join clause movement
checks can easily detect that it would be unsafe to push down such a
clause to any of the listed relations.

regards, tom lane