Function execution costs 'n all that

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Function execution costs 'n all that
Date: 2007-01-15 16:15:07
Message-ID: 16268.1168877707@sss.pgh.pa.us
Lists: pgsql-hackers

So I've been working on the scheme I suggested a few days ago of
representing "equivalence classes" of variables explicitly, and avoiding
the current ad-hocery of generating and then removing redundant clauses
in favor of generating only the ones we want in the first place. Any
clause that looks like an equijoin gets sent to the EquivalenceClass
machinery by distribute_qual_to_rels, and not put into the
restrictlist/joinlist data structure at all. Then we make passes over
the EquivalenceClass lists at appropriate times to generate the clauses
we want. This is turning over well enough now to pass the regression
tests, but I noticed that one query in opr_sanity got a whole lot slower
than before. The query is

SELECT p1.opcname, p1.opcfamily
FROM pg_opclass AS p1
WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS p2
                 WHERE p2.amopfamily = p1.opcfamily
                 AND binary_coercible(p1.opcintype, p2.amoplefttype));

and investigation showed that the plan changed from (8.2 and before)

Seq Scan on pg_opclass p1  (cost=0.00..393.94 rows=51 width=68)
  Filter: (NOT (subplan))
  SubPlan
    ->  Seq Scan on pg_amop p2  (cost=0.00..7.66 rows=2 width=0)
          Filter: ((amopfamily = $0) AND binary_coercible($1, amoplefttype))

to

Seq Scan on pg_opclass p1  (cost=0.00..393.94 rows=51 width=68)
  Filter: (NOT (subplan))
  SubPlan
    ->  Seq Scan on pg_amop p2  (cost=0.00..7.66 rows=2 width=0)
          Filter: (binary_coercible($1, amoplefttype) AND (amopfamily = $0))

thus resulting in many more calls of binary_coercible() which is a
pretty expensive function. This is not too surprising: the clause
p2.amopfamily = p1.opcfamily got diverted through the EquivalenceClass
code for just long enough to end up behind the other one in the table's
restrictlist :-(

In short, this approach results in a whole lot less stability in the
order in which WHERE clauses are evaluated. That might be a killer
objection to the whole thing, but on the other hand we've never made
any strong promises about WHERE evaluation order.

Instead, I'm thinking it might be time to re-introduce some notion of
function execution cost into the system, and make use of that info to
sort WHERE clauses into a reasonable execution order. This example
would be fixed with even a very stupid rule-of-thumb about SQL functions
being more expensive than C functions, but if we're going to go to the
trouble it seems like it'd be a good idea to provide a way to label
user-defined functions with execution costs.

Would a simple constant value be workable, or do we need some more
complex model (and if so what)?

Comments?

regards, tom lane


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Function execution costs 'n all that
Date: 2007-01-15 16:50:27
Message-ID: 45ABB0D3.7040606@enterprisedb.com
Lists: pgsql-hackers

Tom Lane wrote:
> Instead, I'm thinking it might be time to re-introduce some notion of
> function execution cost into the system, and make use of that info to
> sort WHERE clauses into a reasonable execution order.

That sounds like a straightforward idea.

> This example
> would be fixed with even a very stupid rule-of-thumb about SQL functions
> being more expensive than C functions, but if we're going to go to the
> trouble it seems like it'd be a good idea to provide a way to label
> user-defined functions with execution costs.

Agreed.

> Would a simple constant value be workable, or do we need some more
> complex model (and if so what)?

A simple constant would probably be enough. If we want anything fancier
than that, it should be up to the author of the function to define the
cost model as well. I'm envisioning that you could attach a custom cost
function to a user-defined function which would return the estimated CPU
cost. And # of rows returned for a set-returning function.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Function execution costs 'n all that
Date: 2007-01-15 17:00:39
Message-ID: 16832.1168880439@sss.pgh.pa.us
Lists: pgsql-hackers

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> Tom Lane wrote:
>> Would a simple constant value be workable, or do we need some more
>> complex model (and if so what)?

> A simple constant would probably be enough. If we want anything fancier
> than that, it should be up to the author of the function to define the
> cost model as well. I'm envisioning that you could attach a custom cost
> function to a user-defined function which would return the estimated CPU
> cost. And # of rows returned for a set-returning function.

But what will such an estimation function work on? In general the
planner does not have the value(s) that will be passed to the actual
function at runtime. It might have expressions or estimates but
those data structures are certainly not something we could pass to
non-C-coded functions. Are we willing to restrict these functions
to being coded in C, as selectivity estimation functions are?

If we go this route it seems like we'll need four more columns in
pg_proc (cost estimation function OID, rowcount estimation function OID,
fallback cost constant, fallback rowcount constant).

BTW, I'm thinking that a "cost constant" probably ought to be measured
in units of cpu_operator_cost. The default for built-in functions would
thus be 1, at least till such time as someone wants to refine the
estimates. We'd probably want the default for PL and SQL functions to
be 10 or 100 or so.
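
For a rough sense of scale, with the stock cpu_operator_cost of 0.0025
(illustrative arithmetic only):

    per-call charge for a function marked 100:  100 * 0.0025 = 0.25
    used as a filter over 10000 rows:           10000 * 0.25 = 2500 cost units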

regards, tom lane


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Function execution costs 'n all that
Date: 2007-01-15 17:11:05
Message-ID: 45ABB5A9.6080201@enterprisedb.com
Lists: pgsql-hackers

Tom Lane wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
>> A simple constant would probably be enough. If we want anything fancier
>> than that, it should be up to the author of the function to define the
>> cost model as well. I'm envisioning that you could attach a custom cost
>> function to a user-defined function which would return the estimated CPU
>> cost. And # of rows returned for a set-returning function.
>
> But what will such an estimation function work on? In general the
> planner does not have the value(s) that will be passed to the actual
> function at runtime. It might have expressions or estimates but
> those data structures are certainly not something we could pass to
> non-C-coded functions. Are we willing to restrict these functions
> to being coded in C, as selectivity estimation functions are?

Yeah, I don't know. If the planner knows the actual value, that would
certainly be the easiest for the UDF writer to work with. Anything more
than that gets really complicated.

> If we go this route it seems like we'll need four more columns in
> pg_proc (cost estimation function OID, rowcount estimation function OID,
> fallback cost constant, fallback rowcount constant).

What would the fallbacks be for?

> BTW, I'm thinking that a "cost constant" probably ought to be measured
> in units of cpu_operator_cost. The default for built-in functions would
> thus be 1, at least till such time as someone wants to refine the
> estimates. We'd probably want the default for PL and SQL functions to
> be 10 or 100 or so.

Agreed.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Function execution costs 'n all that
Date: 2007-01-15 17:19:03
Message-ID: 17077.1168881543@sss.pgh.pa.us
Lists: pgsql-hackers

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> Tom Lane wrote:
>> If we go this route it seems like we'll need four more columns in
>> pg_proc (cost estimation function OID, rowcount estimation function OID,
>> fallback cost constant, fallback rowcount constant).

> What would the fallbacks be for?

By "fallback" I meant "this is what to use if no estimation function is
provided".

I'm envisioning that the CREATE FUNCTION syntax would add optional
clauses

COST function-name-or-numeric-constant
ROWS function-name-or-numeric-constant

that would be used to fill these columns.

regards, tom lane


From: Richard Troy <rtroy(at)ScienceTools(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: Function execution costs 'n all that
Date: 2007-01-15 18:51:46
Message-ID: Pine.LNX.4.33.0701151033290.31249-100000@denzel.in
Lists: pgsql-hackers

On Mon, 15 Jan 2007, Tom Lane wrote:

> So I've been working on the scheme I suggested a few days ago of
> representing "equivalence classes" of variables explicitly, and avoiding
> the current ad-hocery of generating and then removing redundant clauses
> in favor of generating only the ones we want in the first place. Any
> clause that looks like an equijoin gets sent to the EquivalenceClass
> machinery by distribute_qual_to_rels, and not put into the
> restrictlist/joinlist data structure at all. Then we make passes over
> the EquivalenceClass lists at appropriate times to generate the clauses
> we want. This is turning over well enough now to pass the regression
> tests,

That was quick...

> In short, this approach results in a whole lot less stability in the
> order in which WHERE clauses are evaluated. That might be a killer
> objection to the whole thing, but on the other hand we've never made
> any strong promises about WHERE evaluation order.

Showing my ignorance here, but I've never been a fan of "syntax based
optimization," though it is better than no optimization. If people are
counting on order for optimization, then, hmmm... If you can provide a way
to at least _try_ to do better, then don't worry about it. It will improve
with time.

> Instead, I'm thinking it might be time to re-introduce some notion of
> function execution cost into the system, and make use of that info to
> sort WHERE clauses into a reasonable execution order.

Ingres did/does it that way, IIRC. It's a solid strategy.

> This example
> would be fixed with even a very stupid rule-of-thumb about SQL functions
> being more expensive than C functions, but if we're going to go to the
> trouble it seems like it'd be a good idea to provide a way to label
> user-defined functions with execution costs.
>
> Would a simple constant value be workable, or do we need some more
> complex model (and if so what)?

Ingres would, if I'm not mistaken, gain through historical use via
histograms. Short of that, you've got classes of functions, aggregations,
for example, and there's sure to be missing information to make a great
decision at planning time. However, I take it that the cost here is
primarily CPU and not I/O. I therefore propose that the engine evaluate -
benchmark, if you will - all functions as they are ingested, or
vacuum-like at some later date (when valid data for testing may exist),
and assign a cost relative to what it already knows - the built-ins, for
example. Doing so could allow this strategy to be functional in short
order and be improved with time so all the work doesn't have to be
implemented on day 1. And, DBA/sys-admin tweaking can always be done by
updating the catalogues.

HTH,
Richard

--
Richard Troy, Chief Scientist
Science Tools Corporation
510-924-1363 or 202-747-1263
rtroy(at)ScienceTools(dot)com, http://ScienceTools.com/


From: Neil Conway <neilc(at)samurai(dot)com>
To: Richard Troy <rtroy(at)ScienceTools(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Function execution costs 'n all that
Date: 2007-01-15 18:54:48
Message-ID: 1168887288.6174.109.camel@localhost.localdomain
Lists: pgsql-hackers

On Mon, 2007-01-15 at 10:51 -0800, Richard Troy wrote:
> I therefore propose that the engine evaluate -
> benchmark, if you will - all functions as they are ingested, or
> vacuum-like at some later date (when valid data for testing may exist),
> and assign a cost relative to what it already knows - the built-ins, for
> example.

That seems pretty unworkable. It is unsafe, for one: evaluating a
function may have side effects (inside or outside the database), so the
DBMS cannot just invoke user-defined functions at whim. Also, the
relationship between a function's arguments and its performance will
often be highly complex -- it would be very difficult, not to mention
computationally infeasible, to reconstruct that relationship
automatically, especially without any real knowledge about the
function's behavior.

-Neil


From: Brian Hurt <bhurt(at)janestcapital(dot)com>
To: Neil Conway <neilc(at)samurai(dot)com>
Cc: Richard Troy <rtroy(at)ScienceTools(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Function execution costs 'n all that
Date: 2007-01-15 19:45:46
Message-ID: 45ABD9EA.3060703@janestcapital.com
Lists: pgsql-hackers

Neil Conway wrote:

>On Mon, 2007-01-15 at 10:51 -0800, Richard Troy wrote:
>
>
>>I therefore propose that the engine evaluate -
>>benchmark, if you will - all functions as they are ingested, or
>>vacuum-like at some later date (when valid data for testing may exist),
>>and assign a cost relative to what it already knows - the built-ins, for
>>example.
>>
>>
>
>That seems pretty unworkable. It is unsafe, for one: evaluating a
>function may have side effects (inside or outside the database), so the
>DBMS cannot just invoke user-defined functions at whim. Also, the
>relationship between a function's arguments and its performance will
>often be highly complex -- it would be very difficult, not to mention
>computationally infeasible, to reconstruct that relationship
>automatically, especially without any real knowledge about the
>function's behavior.
>
>
Non-developer here, but we use a lot of plpgsql functions at work. And
the functions we use fall into two broad, ill-defined categories-
"expensive" functions and "cheap" functions. What I'd like as a user is
some way to tell the planner "this function is expensive- prefer plans
which call this function less even if they're otherwise more expensive"
or "this function is cheap, prefer plans that are otherwise less
expensive even if they call this function more often". Precise cost
estimates aren't that important, IMHO.

Brian


From: Richard Troy <rtroy(at)ScienceTools(dot)com>
To: Neil Conway <neilc(at)samurai(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: Function execution costs 'n all that
Date: 2007-01-15 19:47:10
Message-ID: Pine.LNX.4.33.0701151144290.31545-100000@denzel.in
Lists: pgsql-hackers


On Mon, 15 Jan 2007, Neil Conway wrote:
> On Mon, 2007-01-15 at 10:51 -0800, Richard Troy wrote:
> > I therefore propose that the engine evaluate -
> > benchmark, if you will - all functions as they are ingested, or
> > vacuum-like at some later date (when valid data for testing may exist),
> > and assign a cost relative to what it already knows - the built-ins, for
> > example.
>
> That seems pretty unworkable. It is unsafe, for one: evaluating a
> function may have side effects (inside or outside the database), so the
> DBMS cannot just invoke user-defined functions at whim. Also, the
> relationship between a function's arguments and its performance will
> often be highly complex -- it would be very difficult, not to mention
> computationally infeasible, to reconstruct that relationship
> automatically, especially without any real knowledge about the
> function's behavior.
>
> -Neil

Hi Neil,

Tom had already proposed:
>
> I'm envisioning that the CREATE FUNCTION syntax would add optional
> clauses
>
> COST function-name-or-numeric-constant
> ROWS function-name-or-numeric-constant
>
> that would be used to fill these columns.

I was considering these ideas in the mix; let the user provide either a
numeric or a function, the distinction here being that instead of running
that function at planning-time, it could be run "off-line", so to speak.

Richard

--
Richard Troy, Chief Scientist
Science Tools Corporation
510-924-1363 or 202-747-1263
rtroy(at)ScienceTools(dot)com, http://ScienceTools.com/


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Brian Hurt <bhurt(at)janestcapital(dot)com>
Cc: Neil Conway <neilc(at)samurai(dot)com>, Richard Troy <rtroy(at)ScienceTools(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Function execution costs 'n all that
Date: 2007-01-15 20:05:06
Message-ID: 19701.1168891506@sss.pgh.pa.us
Lists: pgsql-hackers

Brian Hurt <bhurt(at)janestcapital(dot)com> writes:
> Non-developer here, but we use a lot of plpgsql functions at work. And
> the functions we use fall into two broad, ill-defined categories-
> "expensive" functions and "cheap" functions. What I'd like as a user is
> some way to tell the planner "this function is expensive- prefer plans
> which call this function less even if they're otherwise more expensive"
> or "this function is cheap, prefer plans that are otherwise less
> expensive even if they call this function more often". Precise cost
> estimates aren't that important, IMHO.

Right, so a plain constant cost would be plenty for your situation.

I suspect there's an 80/20 rule at work here --- the estimator-function
side of this will take most of the effort to design/implement, but not
get used nearly as much as the plain-constant form ... maybe we should
just do the constant for starters and see how many people really want to
write C-code estimators ...

regards, tom lane


From: Neil Conway <neilc(at)samurai(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Brian Hurt <bhurt(at)janestcapital(dot)com>, Richard Troy <rtroy(at)ScienceTools(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Function execution costs 'n all that
Date: 2007-01-15 20:35:32
Message-ID: 1168893332.6174.127.camel@localhost.localdomain
Lists: pgsql-hackers

On Mon, 2007-01-15 at 15:05 -0500, Tom Lane wrote:
> maybe we should just do the constant for starters and see how many
> people really want to write C-code estimators ...

+1

BTW, your proposal would still pushdown all qualifiers, right?
Hellerstein's xfunc work discusses situations in which it makes sense to
pullup expensive qualifiers above joins, for example, in order to reduce
the number of tuples the qualifier is applied to. Unfortunately, this
would probably increase the optimizer's search space by a fairly
significant degree, so it might need to be controlled by a GUC variable,
or only applied when the estimated cost of applying a qualifier is
particularly large relative to the total estimated cost of the plan.

-Neil


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Neil Conway <neilc(at)samurai(dot)com>
Cc: Brian Hurt <bhurt(at)janestcapital(dot)com>, Richard Troy <rtroy(at)ScienceTools(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Function execution costs 'n all that
Date: 2007-01-15 20:46:31
Message-ID: 22609.1168893991@sss.pgh.pa.us
Lists: pgsql-hackers

Neil Conway <neilc(at)samurai(dot)com> writes:
> BTW, your proposal would still pushdown all qualifiers, right?

Yeah, I have no intention of readopting xfunc in the near future ...
especially seeing that it's possible for the user to force that
sort of thing if he really has to.

SELECT * FROM (SELECT ... OFFSET 0) ss
WHERE expensive_function(...);
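
For instance (table and function names invented for illustration), the
OFFSET 0 keeps the sub-select from being flattened, so the expensive qual
is only applied to the rows that survive the cheap inner conditions:

SELECT *
FROM (SELECT * FROM orders WHERE status = 'open' OFFSET 0) ss
WHERE expensive_check(ss.payload);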

regards, tom lane


From: Mark Cave-Ayland <mark(dot)cave-ayland(at)ilande(dot)co(dot)uk>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Brian Hurt <bhurt(at)janestcapital(dot)com>, Neil Conway <neilc(at)samurai(dot)com>, Richard Troy <rtroy(at)ScienceTools(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Function execution costs 'n all that
Date: 2007-01-16 08:23:12
Message-ID: 1168935792.5652.46.camel@mca-desktop
Lists: pgsql-hackers

On Mon, 2007-01-15 at 15:05 -0500, Tom Lane wrote:
> Brian Hurt <bhurt(at)janestcapital(dot)com> writes:
> > Non-developer here, but we use a lot of plpgsql functions at work. And
> > the functions we use fall into two broad, ill-defined categories-
> > "expensive" functions and "cheap" functions. What I'd like as a user is
> > some way to tell the planner "this function is expensive- prefer plans
> > which call this function less even if they're otherwise more expensive"
> > or "this function is cheap, prefer plans that are otherwise less
> > expensive even if they call this function more often". Precise cost
> > estimates aren't that important, IMHO.
>
> Right, so a plain constant cost would be plenty for your situation.
>
> I suspect there's an 80/20 rule at work here --- the estimator-function
> side of this will take most of the effort to design/implement, but not
> get used nearly as much as the plain-constant form ... maybe we should
> just do the constant for starters and see how many people really want to
> write C-code estimators ...
>
> regards, tom lane

Hi Tom et al,

Having worked with stored procedures on large datasets for reporting, I
would say that it would be useful to have a non-constant estimator for
the number of rows, whereas a single CPU cost constant should be fine.
Where I have struggled with this has been joining onto slightly more
exotic queries when doing large scale data processing as part of a
custom report or an application upgrade.

Using PL/PGSQL I would find it useful to have access to the constants
passed into a function to be used to help provide a row count estimate
(typically useful for passing in table/column names), e.g.

SELECT * FROM my_func('my_table1') AS t1, my_table2 AS t2
WHERE t1.id = t2.id;

CREATE FUNCTION my_func(text) AS $$
...
$$ LANGUAGE 'plpgsql' COST 1.0 ROWS my_func_row_cost;

In my cost function, I could then estimate the number of rows using
something like below, where all constants are passed into the cost
function as parameters, e.g.:

CREATE FUNCTION my_func_row_cost(text) RETURNS bigint AS $$
DECLARE
    foo bigint;
BEGIN
    -- $1 is a table name here, so quote it as an identifier
    EXECUTE 'SELECT COUNT(*) FROM ' || quote_ident($1) INTO foo;
    RETURN foo;
END;
$$ LANGUAGE plpgsql;

In the case where a parameter was not a constant but a column name, then
it would be reasonable in my mind to simply replace that parameter with
NULL when passing to the row cost function, e.g.

SELECT * FROM my_table1
WHERE my_table1.junk = (SELECT my_func(my_table1.name));

In this case, the text parameter passed to my_func_row_cost would be
replaced by NULL to indicate that it was non-constant.

Of course, even with constants passed upon input, it still may not be
possible to calculate a number of rows that can be returned - it could
be the case that the only parameter passed to cost function has been
converted to NULL because it is a column name. Perhaps in this case it
would be useful to specify returning NULL from my_func_row_cost means "I
can't return anything meaningful, so use the fallback values".

Kind regards,

Mark.


From: Gregory Stark <gsstark(at)mit(dot)edu>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Function execution costs 'n all that
Date: 2007-01-16 14:13:28
Message-ID: 87lkk3qctz.fsf@stark.xeocode.com
Lists: pgsql-hackers


> Tom Lane wrote:
> > Instead, I'm thinking it might be time to re-introduce some notion of
> > function execution cost into the system, and make use of that info to
> > sort WHERE clauses into a reasonable execution order.

I imagine you've thought of this already but just in case, the cost of the
function call has to be combined with the selectivity to get this right. If
you can do an expensive but very selective clause first and save 100 cheap
calls that almost always return true, it may still be worthwhile.
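
A toy illustration with invented numbers, assuming independent clauses:
say A costs 1 with selectivity 0.9, and B costs 5 with selectivity 0.01.
Then, per input row,

    evaluate A first:  1 + 0.9  * 5 = 5.50
    evaluate B first:  5 + 0.01 * 1 = 5.01

so running the expensive-but-selective clause first wins; the textbook
rule of thumb is to order the arms of an AND by cost/(1 - selectivity),
ascending.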

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gregory Stark <gsstark(at)mit(dot)edu>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Function execution costs 'n all that
Date: 2007-01-16 16:13:56
Message-ID: 20907.1168964036@sss.pgh.pa.us
Lists: pgsql-hackers

Gregory Stark <gsstark(at)mit(dot)edu> writes:
> I imagine you've thought of this already but just in case, the cost of the
> function call has to be combined with the selectivity to get this right. If
> you can do an expensive but very selective clause first and save 100 cheap
> calls that almost always return true it may still be worthwhile.

I've thought of it, but I haven't figured out a reasonable algorithm for
ordering the clauses in view of that. Have you?

regards, tom lane


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Gregory Stark" <gsstark(at)mit(dot)edu>, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Function execution costs 'n all that
Date: 2007-01-16 16:46:53
Message-ID: 87wt3mly0y.fsf@stark.xeocode.com
Lists: pgsql-hackers

"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Gregory Stark <gsstark(at)mit(dot)edu> writes:
>> I imagine you've thought of this already but just in case, the cost of the
>> function call has to be combined with the selectivity to get this right. If
>> you can do an expensive but very selective clause first and save 100 cheap
>> calls that almost always return true it may still be worthwhile.
>
> I've thought of it, but I haven't figured out a reasonable algorithm for
> ordering the clauses in view of that. Have you?

Hum, I hadn't tried. Now that I think about it, it's certainly not obvious.

And picturing the possible failure modes, I would rather it execute cheap
expressions more often than necessary than call some user-defined perl
function that could be doing I/O or waiting on other resources any more
than absolutely necessary. So I guess what you originally described is
safest.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Function execution costs 'n all that
Date: 2007-01-17 20:07:36
Message-ID: 45AE8208.9000206@cheapcomplexdevices.com
Lists: pgsql-hackers

Tom Lane wrote:
>
> BTW, I'm thinking that a "cost constant" probably ought to be measured
> in units of cpu_operator_cost. The default for built-in functions would
> thus be 1, at least till such time as someone wants to refine the
> estimates. We'd probably want the default for PL and SQL functions to
> be 10 or 100 or so.

Any chance that costs could eventually change to real-world units?

It's hard for me to guess how many cpu_operator_cost units
something might take, but relatively easy for me to measure
or estimate, in fractions of a second, how long something takes.

I could imagine having the other planner costs be measured in seconds
too - perhaps with the goal of eventually writing some auto-config
code that tries to measure values like cpu_tuple_cost on a given
piece of hardware.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Function execution costs 'n all that
Date: 2007-01-17 20:13:26
Message-ID: 4748.1169064806@sss.pgh.pa.us
Lists: pgsql-hackers

Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com> writes:
> Tom Lane wrote:
>> BTW, I'm thinking that a "cost constant" probably ought to be measured
>> in units of cpu_operator_cost.

> Any chance that costs could eventually change to real-world units?

Define "real world units".

If you like you can try to adjust things so that cpu_operator_cost bears
some relation to elapsed-msec-on-your-own-machine, and scale everything
else accordingly; indeed that's why we added seq_page_cost in 8.2.
But it's awfully hard to see the numbers being very portable.

regards, tom lane


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Neil Conway <neilc(at)samurai(dot)com>
Cc: Richard Troy <rtroy(at)ScienceTools(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Function execution costs 'n all that
Date: 2007-01-18 00:37:53
Message-ID: 1169080673.19505.2.camel@dogma.v10.wvs
Lists: pgsql-hackers

On Mon, 2007-01-15 at 13:54 -0500, Neil Conway wrote:
> On Mon, 2007-01-15 at 10:51 -0800, Richard Troy wrote:
> > I therefore propose that the engine evaluate -
> > benchmark, if you will - all functions as they are ingested, or
> > vacuum-like at some later date (when valid data for testing may exist),
> > and assign a cost relative to what it already knows - the built-ins, for
> > example.
>
> That seems pretty unworkable. It is unsafe, for one: evaluating a
> function may have side effects (inside or outside the database), so the

Would any form of cost estimate have meaning if the function has side
effects? If it's a volatile function, doesn't that mean that the planner
can't avoid or favor executing it?

Regards,
Jeff Davis


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Neil Conway <neilc(at)samurai(dot)com>, Richard Troy <rtroy(at)ScienceTools(dot)com>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Function execution costs 'n all that
Date: 2007-01-18 00:50:31
Message-ID: 15293.1169081431@sss.pgh.pa.us
Lists: pgsql-hackers

Jeff Davis <pgsql(at)j-davis(dot)com> writes:
> Would any form of cost estimate have meaning if the function has side
> effects? If it's a volatile function, doesn't that mean that the planner
> can't avoid or favor executing it?

No, not really. If the function is down inside a sub-select or
something like that, the number of executions could depend on any number
of factors (like whether we put it on the inside or outside of a join)
--- and if it's expensive then we will and should try to arrange the
query to minimize the number of executions. We're certainly not going
to drop back to all-plain-nestloops just because the query contains one
volatile function.

(Now mind you, a smart user will probably avoid using a volatile
function like that, but I don't think it's an adequate argument for
saying that we don't need cost information.)

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Neil Conway <neilc(at)samurai(dot)com>
Cc: Brian Hurt <bhurt(at)janestcapital(dot)com>, Richard Troy <rtroy(at)ScienceTools(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Function execution costs 'n all that
Date: 2007-01-21 01:39:13
Message-ID: 17062.1169343553@sss.pgh.pa.us
Lists: pgsql-hackers

Neil Conway <neilc(at)samurai(dot)com> writes:
> On Mon, 2007-01-15 at 15:05 -0500, Tom Lane wrote:
>> maybe we should just do the constant for starters and see how many
>> people really want to write C-code estimators ...

> +1

It seemed like that was the list's consensus, so I'll go off and do the
simple-constant case. We can always extend it later.

For reference, the plan is to add these options to CREATE/ALTER
FUNCTION:

COST float-constant (defaults to 1)
ROWS float-constant (defaults to 1000)

feeding into two float4 columns added to pg_proc. ROWS is only
allowed/meaningful for set-returning functions. COST is implicitly
scaled by cpu_operator_cost.
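
So usage would look something like this (function names and bodies are
placeholders, just to show the clauses):

CREATE FUNCTION is_private(inet) RETURNS boolean AS $$
    SELECT $1 << '10.0.0.0/8'::inet;   -- stand-in for an expensive check
$$ LANGUAGE sql COST 1000;

CREATE FUNCTION sample_ids(integer) RETURNS SETOF integer AS $$
    SELECT generate_series(1, $1);
$$ LANGUAGE sql COST 100 ROWS 500;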

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Function execution costs 'n all that
Date: 2007-01-22 03:51:39
Message-ID: 1998.1169437899@sss.pgh.pa.us
Lists: pgsql-hackers

I complained about how:
> The query is

> SELECT p1.opcname, p1.opcfamily
> FROM pg_opclass AS p1
> WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS p2
>                  WHERE p2.amopfamily = p1.opcfamily
>                  AND binary_coercible(p1.opcintype, p2.amoplefttype));

> and investigation showed that the plan changed from (8.2 and before)

> Seq Scan on pg_opclass p1  (cost=0.00..393.94 rows=51 width=68)
>   Filter: (NOT (subplan))
>   SubPlan
>     ->  Seq Scan on pg_amop p2  (cost=0.00..7.66 rows=2 width=0)
>           Filter: ((amopfamily = $0) AND binary_coercible($1, amoplefttype))

> to

> Seq Scan on pg_opclass p1  (cost=0.00..393.94 rows=51 width=68)
>   Filter: (NOT (subplan))
>   SubPlan
>     ->  Seq Scan on pg_amop p2  (cost=0.00..7.66 rows=2 width=0)
>           Filter: (binary_coercible($1, amoplefttype) AND (amopfamily = $0))

Now that some function-cost smarts are in there, I expected to see the
plan go back to the first case, but what I actually see in CVS HEAD is

Seq Scan on pg_opclass p1  (cost=0.00..660.35 rows=51 width=68)
  Filter: (NOT (subplan))
  SubPlan
    ->  Bitmap Heap Scan on pg_amop p2  (cost=4.29..8.60 rows=2 width=0)
          Recheck Cond: (amopfamily = $0)
          Filter: binary_coercible($1, amoplefttype)
          ->  Bitmap Index Scan on pg_amop_fam_strat_index  (cost=0.00..4.29 rows=5 width=0)
                Index Cond: (amopfamily = $0)

The reason this happens is that cost_qual_eval charges the entire cost of
evaluating all the arms of an AND, even though we'll drop out as soon as
something returns FALSE; and so the planner is led to avoid the seqscan
because it now appears to have a high filter-condition evaluation cost,
in favor of a plan that will evaluate the filter condition many fewer
times. In reality those two plans will call binary_coercible() exactly
the same number of times, and so this is a bogus reason to switch.
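
Schematically, with c_eq and c_func the per-row costs of the two arms,
s_eq the selectivity of the equality arm, and N rows scanned (this just
restates the point, it's not the exact code):

    what cost_qual_eval charges:  N * (c_eq + c_func)
    what actually happens:        roughly N * (c_eq + s_eq * c_func)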

I'm kind of inclined to leave it alone though, because the second plan
seems a bit more "failsafe". To do anything differently, we'd have to
order the qual conditions the way we expect to execute them before
any use of cost_qual_eval, which sounds expensive; and as noted in
an upthread discussion with Greg, relying on the correctness of *both*
cost and selectivity estimates seems a tad fragile.

Comments?

regards, tom lane


From: Mark Dilger <pgsql(at)markdilger(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Function execution costs 'n all that
Date: 2007-01-29 02:05:48
Message-ID: 45BD567C.5080604@markdilger.com
Lists: pgsql-hackers

Tom Lane wrote:
> Would a simple constant value be workable, or do we need some more
> complex model (and if so what)?

Consider:

ANALYZE myfunc(integer) ON (SELECT myfunc(7)) WITH RATIO 0.03;

ANALYZE myfunc(text,text) ON (SELECT myfunc(mt.a,mt.b) FROM mytable mt)
    WITH RATIO 1.071;

ANALYZE myfunc(text,text) ON (
    SELECT myfunc(mt.a,mt.b) FROM mytable mt
    UNION
    SELECT myfunc(ot.a,ot.b) FROM othertable ot
) WITH RATIO 0.5;

These commands could turn on function timing for the lifespan of the query, with
statistics gathered about the given function's runtimes. The "WITH RATIO"
clause would be there to translate runtimes (in milliseconds) into units of
cpu_operator_cost. The "WITH RATIO" clause could be optional, with a default
ratio taken from the postgresql.conf file, if any exists, and finally defaulting
to a hardcoded "reasonable" value. Users would be well advised to adopt a
consistent policy regarding system load at the time that various analyze
functions are run.

If the function has side effects, it would be the user's responsibility to not
analyze the function unless those side effects are acceptable. The user can
only analyze those queries that the user has permissions to run, so there
shouldn't be any additional ability to generate side-effects beyond what the
user already has permission to do.

The syntax might need some adjusting to make the parser happy and to avoid new
reserved words. The syntax used above is just an example.

It seems to me that the above system would work perfectly well for collecting
the number of rows returned from a set returning function, not just the run times.

mark


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Mark Dilger <pgsql(at)markdilger(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Function execution costs 'n all that
Date: 2007-01-29 02:15:32
Message-ID: 27442.1170036932@sss.pgh.pa.us
Lists: pgsql-hackers

Mark Dilger <pgsql(at)markdilger(dot)com> writes:
> Tom Lane wrote:
>> Would a simple constant value be workable, or do we need some more
>> complex model (and if so what)?

> Consider:
> ANALYZE myfunc(integer) ON (SELECT myfunc(7)) WITH RATIO 0.03;
> ...
> It seems to me that the above system would work perfectly well for
> collecting the number of rows returned from a set returning function,
> not just the run times.

I don't really think that data collection is the bottleneck. If a
constant estimate isn't good enough for you, then you need some kind of
model of how the runtime or number of rows varies with the function's
inputs ... and I hardly see how something like the above is likely to
figure out how to fit a good model. Or at least, if you think it can,
then you skipped all the interesting bits.

One other point is that we already know that sampling overhead and
measurement error are significant problems when trying to measure
intervals on the order of one Plan-node execution. I'm afraid that
would get a great deal worse if we try to use a similar approach to
timing individual function calls.

regards, tom lane


From: Mark Dilger <pgsql(at)markdilger(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Function execution costs 'n all that
Date: 2007-01-29 02:50:19
Message-ID: 45BD60EB.2060106@markdilger.com
Lists: pgsql-hackers

Tom Lane wrote:
> Mark Dilger <pgsql(at)markdilger(dot)com> writes:
>> Tom Lane wrote:
>>> Would a simple constant value be workable, or do we need some more
>>> complex model (and if so what)?
>
>> Consider:
>> ANALYZE myfunc(integer) ON (SELECT myfunc(7)) WITH RATIO 0.03;
>> ...
>> It seems to me that the above system would work perfectly well for
>> collecting the number of rows returned from a set returning function,
>> not just the run times.
>
> I don't really think that data collection is the bottleneck.

Ahh, I'm not just thinking about data collection. I'm thinking about usability
for non-hackers who know enough plpgsql to write a function and then want to
train the system to plan for it appropriately. It's a much easier task for a
novice user to say "go away and figure out how expensive this thing is" than for
a novice to think about things like statistical variance, etc. We don't demand
that users have that kind of knowledge to write queries or run analyze on
tables, so why would they need that kind of knowledge to write a function?

> If a
> constant estimate isn't good enough for you, then you need some kind of
> model of how the runtime or number of rows varies with the function's
> inputs ... and I hardly see how something like the above is likely to
> figure out how to fit a good model. Or at least, if you think it can,
> then you skipped all the interesting bits.

I am (perhaps naively) imagining that the user will train the database over the
same query as the one that will actually get used most often in production. In
the case that the query modifies the table, the user could train the database
over a copy of that table. The data collected by the analyze phase would just
be constant stuff like average and stddev. That would make the job of the
planner / cost estimator easier, right? It could treat the function as a
constant cost function.

> One other point is that we already know that sampling overhead and
> measurement error are significant problems when trying to measure
> intervals on the order of one Plan-node execution. I'm afraid that
> would get a great deal worse if we try to use a similar approach to
> timing individual function calls.

The query could be run with the arguments passed to "myfunc" being recorded to a
temporary table. After the query is complete (and the temporary table
populated), data from the temp table could be pulled into memory in batches,
with the "myfunc" run on them again in a tight loop. The loop itself could be
timed, rather than each iteration. The sum of all the timings for the various
loops would then be the final runtime which would be divided by the total number
of rows to get the average runtime per call. The downside is that I don't see
how you retrieve the standard deviation. (I also don't know if the planner
knows how to use standard deviation information, so perhaps this is a non-issue.)

A further refinement would be to batch the inputs based on properties of the
input data. For text, you could run a batch of short text first, a batch of
medium second, and a batch of long text last, and use a best-fit line to
determine runtime cost as a function of input text length. I'm not sure how
such a refinement would be done for fixed size datatypes. And for some text
functions the runtime won't vary with length but with some other property anyway.

mark