Re: Cost estimates for parameterized paths

Lists: pgsql-hackers
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Cost estimates for parameterized paths
Date: 2010-09-02 21:31:12
Message-ID: 14624.1283463072@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Awhile back I ranted about replacing the planner's concept of inner
indexscans with a more generalized notion of "parameterized paths":
http://archives.postgresql.org/pgsql-hackers/2009-10/msg00994.php

The executor fixes for that are done, and now I'm grappling with getting
the planner to do something useful with it. The biggest problem I've run
into is that a parameterized path can't really be assigned a fixed cost
in the same way that a normal path can. The current implementation of
cost_index() depends on knowing the size of the outer relation --- that
is, the expected number of execution loops for the indexscan --- in order
to account for cache effects sanely while estimating the average cost of
any one inner indexscan. We know that that is an important thing to do
because the cost estimates seem to be a lot closer to reality now that we
do that than what we were getting before; so dropping the consideration is
entirely out of the question.

The planner is already cheating on this to a considerable extent, because
it estimates the cost of an inner indexscan only once, using the first
outer rel we try to join to. That cost is cached and reused with other
potential outer-rel join partners, even though very different numbers of
outer rows might be involved. This problem will get a lot worse with the
types of plans that I hope the planner will be able to come up with after
this fix goes in, because the indexscan might be at the bottom of a join
nest. So we need a real fix not another hack.

The best idea I can come up with at the moment is to compute "best case"
and "worst case" costs for a parameterized path, corresponding to the
largest and smallest numbers of loops we might expect it to be repeated
for. The largest number of loops could be estimated via the cartesian
product of the sizes of the other tables in the query, for example. The
"worst case" cost is its cost if only repeated once. Then, during path
pruning in add_path, we only discard a parameterized path if its best-case
cost is worse than the worst-case cost of some otherwise comparable path.
Whenever we join the parameterized path with the required outer relation,
we redo the cost calculation using that rel's actual rowcount estimate in
order to form a final cost estimate for the no-longer-parameterized join
path.

While this looks like it would work in principle, I'm concerned that it
would be unable to prune very many parameterized paths, and thus that
planning time might run unacceptably long. The repeated cost calculations
aren't much fun either, although we could probably cache most of that work
if we're willing to throw memory at the problem.

I wonder if anyone's got a better idea ...

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cost estimates for parameterized paths
Date: 2010-09-03 11:51:18
Message-ID: AANLkTi=PUeVs5==SZmcLOYo_vGi+QgZPFjGxLiHdX3Gq@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Sep 2, 2010 at 5:31 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Awhile back I ranted about replacing the planner's concept of inner
> indexscans with a more generalized notion of "parameterized paths":
> http://archives.postgresql.org/pgsql-hackers/2009-10/msg00994.php
>
> The executor fixes for that are done, and now I'm grappling with getting
> the planner to do something useful with it.  The biggest problem I've run
> into is that a parameterized path can't really be assigned a fixed cost
> in the same way that a normal path can.  The current implementation of
> cost_index() depends on knowing the size of the outer relation --- that
> is, the expected number of execution loops for the indexscan --- in order
> to account for cache effects sanely while estimating the average cost of
> any one inner indexscan.  We know that that is an important thing to do
> because the cost estimates seem to be a lot closer to reality now that we
> do that than what we were getting before; so dropping the consideration is
> entirely out of the question.
>
> The planner is already cheating on this to a considerable extent, because
> it estimates the cost of an inner indexscan only once, using the first
> outer rel we try to join to.  That cost is cached and reused with other
> potential outer-rel join partners, even though very different numbers of
> outer rows might be involved.  This problem will get a lot worse with the
> types of plans that I hope the planner will be able to come up with after
> this fix goes in, because the indexscan might be at the bottom of a join
> nest.  So we need a real fix not another hack.
>
> The best idea I can come up with at the moment is to compute "best case"
> and "worst case" costs for a parameterized path, corresponding to the
> largest and smallest numbers of loops we might expect it to be repeated
> for.  The largest number of loops could be estimated via the cartesian
> product of the sizes of the other tables in the query, for example.  The
> "worst case" cost is its cost if only repeated once.  Then, during path
> pruning in add_path, we only discard a parameterized path if its best-case
> cost is worse than the worst-case cost of some otherwise comparable path.
> Whenever we join the parameterized path with the required outer relation,
> we redo the cost calculation using that rel's actual rowcount estimate in
> order to form a final cost estimate for the no-longer-parameterized join
> path.

Interestingly, I previously proposed almost exactly this approach to
handle a couple of other problems:

http://archives.postgresql.org/pgsql-hackers/2009-09/msg01379.php
(index only scans)
http://archives.postgresql.org/pgsql-hackers/2009-10/msg00125.php
(per-query work mem)

I'm not entirely sure whether we can use this approach for more than
one kind of problem at a time; if we can't, it's probably not a good
idea to do it at all. I also fear that any venture into this area
will involve slowing down add_path(), which is a hotspot when
planning large join nests. That might not be a win, if this is the
only case it allows us to improve.

*thinks*

It strikes me that the outer tuple count is essentially being used to
derate the cost of the index probes. So another way to look at this
is that the best-case cost of an index probe is just the CPU cost, and
the worst-case cost of an index probe is the CPU cost plus the disk
cost. So your fear that not many parameterized paths will be
discarded is probably valid. But then, maybe that's OK. It seems to
me that the critical point is to make sure that we don't form them in
the first place in cases where we could get the same benefit by
commuting the joins. Once we've gotten to the point of considering a
plan of this type, the chances that it's actually the best plan seem
pretty high.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cost estimates for parameterized paths
Date: 2010-09-03 18:04:13
Message-ID: 26520.1283537053@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Thu, Sep 2, 2010 at 5:31 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> The best idea I can come up with at the moment is to compute "best case"
>> and "worst case" costs for a parameterized path,

> Interestingly, I previously proposed almost exactly this approach to
> handle a couple of other problems:

I thought it seemed familiar ;-)

> I'm not entirely sure whether we can use this approach for more than
> one kind of problem at a time; if we can't, it's probably not a good
> idea to do it at all.

I don't see why we couldn't do it in principle. The issue of course
is how many paths survive, and can we tolerate that from a planner
performance standpoint?

On reflection I think that for parameterized paths the problem won't be
too bad, because (a) we'll ignore parameterized paths except when
considering a join to the right outer rel, so their presence in the
rel's pathlist won't cost much otherwise, and (b) we will only generate
them when there's a suitable joinclause, so the number of potential
paths will be limited. I am not sure those statements can be made for
the other applications you suggested, though.

There's probably not much we can do at this point except code up the
idea and see how well it works.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cost estimates for parameterized paths
Date: 2010-09-03 20:25:17
Message-ID: AANLkTi=xxhWJ4u0Eqsb0+D04L4+M_MDvEvqv-=SpmG_a@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Sep 3, 2010 at 2:04 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> On reflection I think that for parameterized paths the problem won't be
> too bad, because (a) we'll ignore parameterized paths except when
> considering a join to the right outer rel, so their presence in the
> rel's pathlist won't cost much otherwise,

Hmm. Maybe they should go into a separate path list, and perhaps we
could do the min/max calculations only with that pathlist (at least
for now), thus avoiding taking a generalized penalty to handle this
specific case. IIUC, a parameterized path should never cause an
unparamaterized path to be thrown out, even if the unparameterized
path costs more than the worst-case cost for the parameterized path,
because the parameterized path constrains the possible join strategies
higher up, so what looked like a great idea at first blush might turn
out to suck when the chips are down. Then, too, I'm not sure we can
even guarantee it will always be possible to form a valid plan around
a given parameterized path, which is another good reason not to
discard any unparameterized alternatives that may exist.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cost estimates for parameterized paths
Date: 2010-09-03 21:53:53
Message-ID: 536.1283550833@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Fri, Sep 3, 2010 at 2:04 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> On reflection I think that for parameterized paths the problem won't be
>> too bad, because (a) we'll ignore parameterized paths except when
>> considering a join to the right outer rel, so their presence in the
>> rel's pathlist won't cost much otherwise,

> Hmm. Maybe they should go into a separate path list, and perhaps we
> could do the min/max calculations only with that pathlist (at least
> for now), thus avoiding taking a generalized penalty to handle this
> specific case. IIUC, a parameterized path should never cause an
> unparamaterized path to be thrown out,

Yeah, but the converse isn't true. I had considered the idea of keeping
parameterized paths in a separate list, but you'd still have to examine
the main list to look for unparameterized paths that might dominate them.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cost estimates for parameterized paths
Date: 2010-09-04 00:35:33
Message-ID: AANLkTik25SAqBQOCjj68SK1OctMk8_C5+=ELCM+uMxOX@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Sep 3, 2010 at 5:53 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Fri, Sep 3, 2010 at 2:04 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> On reflection I think that for parameterized paths the problem won't be
>>> too bad, because (a) we'll ignore parameterized paths except when
>>> considering a join to the right outer rel, so their presence in the
>>> rel's pathlist won't cost much otherwise,
>
>> Hmm.  Maybe they should go into a separate path list, and perhaps we
>> could do the min/max calculations only with that pathlist (at least
>> for now), thus avoiding taking a generalized penalty to handle this
>> specific case.  IIUC, a parameterized path should never cause an
>> unparamaterized path to be thrown out,
>
> Yeah, but the converse isn't true.  I had considered the idea of keeping
> parameterized paths in a separate list, but you'd still have to examine
> the main list to look for unparameterized paths that might dominate them.

Definitely true, but if it avoids slowing down add_path() in the
common case, it's worth it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Cost estimates for parameterized paths
Date: 2011-11-09 22:12:46
Message-ID: 21254.1320876766@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

More than a year ago, I wrote in
http://archives.postgresql.org/message-id/14624.1283463072@sss.pgh.pa.us

> Awhile back I ranted about replacing the planner's concept of inner
> indexscans with a more generalized notion of "parameterized paths":
> http://archives.postgresql.org/pgsql-hackers/2009-10/msg00994.php

> The executor fixes for that are done, and now I'm grappling with getting
> the planner to do something useful with it. The biggest problem I've run
> into is that a parameterized path can't really be assigned a fixed cost
> in the same way that a normal path can. The current implementation of
> cost_index() depends on knowing the size of the outer relation --- that
> is, the expected number of execution loops for the indexscan --- in order
> to account for cache effects sanely while estimating the average cost of
> any one inner indexscan.

Since this project has been stalled for so long, I am thinking that what
I need to do to make some progress is to punt on the repeated-execution
cache effects problem, at least for the first cut. I propose costing
parameterized inner paths on the "worst case" basis that they're only
executed once, and don't get any benefit from caching across repeated
executions. This seems like a reasonably sane first-order approximation
on two grounds:

1. In most of the cases where such a plan is of interest, the outer
relation for the nestloop actually does provide only one or a few rows.
If it generates a lot of rows, you probably don't want a nestloop
anyhow.

2. In the cases where we really care, the alternatives are so much worse
that the parameterized nestloop will win even if it's estimated very
conservatively.

Another thing in the back of my mind is that the whole issue of cache
effects is something we know we don't model very well, so putting large
amounts of time into enlarging the present approach to handle more
complicated plan structures may be misplaced effort anyway.

So unless somebody's got a better idea, I'm going to push forward with
getting the parameterized-path infrastructure in place, and just settle
for a crude cost model for the moment. Perhaps the implementation
experience will shake loose some new ideas about how to do the cost
modeling.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cost estimates for parameterized paths
Date: 2011-11-10 02:06:05
Message-ID: CA+Tgmob+ba4BEbV1yCH2yrQpdpj9=jTf3nP4jMrgy+Q9AFt7ow@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Nov 9, 2011 at 5:12 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> More than a year ago, I wrote in
> http://archives.postgresql.org/message-id/14624.1283463072@sss.pgh.pa.us
>
>> Awhile back I ranted about replacing the planner's concept of inner
>> indexscans with a more generalized notion of "parameterized paths":
>> http://archives.postgresql.org/pgsql-hackers/2009-10/msg00994.php
>
>> The executor fixes for that are done, and now I'm grappling with getting
>> the planner to do something useful with it.  The biggest problem I've run
>> into is that a parameterized path can't really be assigned a fixed cost
>> in the same way that a normal path can.  The current implementation of
>> cost_index() depends on knowing the size of the outer relation --- that
>> is, the expected number of execution loops for the indexscan --- in order
>> to account for cache effects sanely while estimating the average cost of
>> any one inner indexscan.
>
> Since this project has been stalled for so long, I am thinking that what
> I need to do to make some progress is to punt on the repeated-execution
> cache effects problem, at least for the first cut.  I propose costing
> parameterized inner paths on the "worst case" basis that they're only
> executed once, and don't get any benefit from caching across repeated
> executions.  This seems like a reasonably sane first-order approximation
> on two grounds:
>
> 1. In most of the cases where such a plan is of interest, the outer
> relation for the nestloop actually does provide only one or a few rows.
> If it generates a lot of rows, you probably don't want a nestloop
> anyhow.
>
> 2. In the cases where we really care, the alternatives are so much worse
> that the parameterized nestloop will win even if it's estimated very
> conservatively.

I agree, on all counts. Errors that make new planner possibilities
look unduly expensive aren't as serious as those that go the other
way, because the alternative is that you can never generate the new
plan at all. That's why I'm sweating about the costing index-only
scans a bit.

> Another thing in the back of my mind is that the whole issue of cache
> effects is something we know we don't model very well, so putting large
> amounts of time into enlarging the present approach to handle more
> complicated plan structures may be misplaced effort anyway.

True. And I think that might not even be the highest priority project
to tackle anyway. The things that are hurting people most routinely
and hardest to fix with existing tools seem to be things like
cross-column correlation, and other selectivity estimation errors.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company