What about utility to calculate planner cost constants?

From: "Tambet Matiisen" <t(dot)matiisen(at)aprote(dot)ee>
To: <pgsql-performance(at)postgresql(dot)org>
Subject: What about utility to calculate planner cost constants?
Date: 2005-03-21 10:05:56
Message-ID: A66A11DBF5525341AEF6B8DE39CDE77008805C@black.aprote.com
Lists: pgsql-performance


I was following the cpu_tuple_cost thread and wondering whether it would be
possible to make a PQA-style utility to calculate configuration-specific
values for the planner cost constants. It could make use of the output of
log_(statement|parser|planner|executor)_stats, though I'm not sure whether
that output contains anything useful for this purpose.

Otherwise it could just collect statements, run EXPLAIN ANALYZE for all
of them and then play with planner cost constants to get the estimated
values as close as possible to actual values. Something like Goal Seek
in Excel, if you pardon my reference to MS :).
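
In rough Python, the Goal Seek idea might look something like the sketch
below. It assumes psycopg2 and a PostgreSQL recent enough to support
EXPLAIN (ANALYZE, FORMAT JSON); the connection string and the two
statements in WORKLOAD are placeholders, and it only searches a single
constant (random_page_cost), re-executing the whole workload on every probe.

import json
import math
import statistics

import psycopg2

# Statements collected from the log; placeholders here.
WORKLOAD = [
    "SELECT count(*) FROM some_table WHERE id < 1000",
    "SELECT * FROM some_table ORDER BY created_at LIMIT 50",
]

def explain_analyze(cur, sql):
    # Returns (estimated total cost, actual total time in ms) for one query.
    cur.execute("EXPLAIN (ANALYZE, FORMAT JSON) " + sql)
    raw = cur.fetchone()[0]
    plan = (raw if isinstance(raw, list) else json.loads(raw))[0]["Plan"]
    return plan["Total Cost"], plan["Actual Total Time"]

def divergence(cur, candidate):
    # Cost units are arbitrary, so instead of comparing cost to milliseconds
    # directly, measure how consistent the estimate/actual ratio is across
    # the workload: the smaller the spread, the better the constant fits.
    cur.execute(f"SET random_page_cost = {candidate}")
    ratios = []
    for sql in WORKLOAD:
        cost, ms = explain_analyze(cur, sql)
        ratios.append(math.log(max(cost, 0.001) / max(ms, 0.001)))
    return statistics.pvariance(ratios)

with psycopg2.connect("dbname=mydb") as conn:
    with conn.cursor() as cur:
        candidates = [c / 2 for c in range(2, 17)]   # 1.0 .. 8.0
        best = min(candidates, key=lambda c: divergence(cur, c))
        print("best-fitting random_page_cost:", best)

Searching several constants at once this way gets expensive quickly, of
course.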

A somewhat similar project seems to be pgautotune from pgFoundry, but it
only considers shared_buffers, sort_mem and vacuum_mem. And it seems to
use synthetic data instead of the actual database and actual statements
from the log. And it has been inactive for a while.

Tambet


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: "Tambet Matiisen" <t(dot)matiisen(at)aprote(dot)ee>
Cc: <pgsql-performance(at)postgresql(dot)org>
Subject: Re: What about utility to calculate planner cost constants?
Date: 2005-03-21 17:51:06
Message-ID: 200503210951.06044.josh@agliodbs.com
Lists: pgsql-performance

Tambet,

> I was following the cpu_tuple_cost thread and wondering whether it would
> be possible to make a PQA-style utility to calculate configuration-specific
> values for the planner cost constants. It could make use of the output of
> log_(statement|parser|planner|executor)_stats, though I'm not sure whether
> that output contains anything useful for this purpose.

Yeah, that's something I need to look at.

> Otherwise it could just collect statements, run EXPLAIN ANALYZE for all
> of them and then play with planner cost constants to get the estimated
> values as close as possible to actual values. Something like Goal Seek
> in Excel, if you pardon my reference to MS :).

That's not really practical. There are currently 5 major query tuning
parameters, not counting the memory adjustments which really can't be left
out. You can't realistically test all combinations of 6 variables.

> A somewhat similar project seems to be pgautotune from pgFoundry, but it
> only considers shared_buffers, sort_mem and vacuum_mem. And it seems to
> use synthetic data instead of the actual database and actual statements
> from the log. And it has been inactive for a while.

Yeah, pg_autotune is a dead project. Once we got OSDL able to run tests, we
came up with some rules-of-thumb which are more useful than autotune's
output. More importantly, the approach doesn't scale to the 15-20 GUCs which
we'd realistically want to test.

--
Josh Berkus
Aglio Database Solutions
San Francisco


From: Thomas F(dot)O'Connell <tfo(at)sitening(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: "Tambet Matiisen" <t(dot)matiisen(at)aprote(dot)ee>, <pgsql-performance(at)postgresql(dot)org>
Subject: Re: What about utility to calculate planner cost constants?
Date: 2005-03-21 18:18:17
Message-ID: 11322edfb476c82809712b9796f78a5f@sitening.com
Lists: pgsql-performance

If by not practical you mean, "no one has implemented a multivariable
testing approach," I'll agree with you. But multivariable testing is
definitely a valid statistical approach to solving just such problems.

-tfo

--
Thomas F. O'Connell
Co-Founder, Information Architect
Sitening, LLC
http://www.sitening.com/
110 30th Avenue North, Suite 6
Nashville, TN 37203-6320
615-260-0005

On Mar 21, 2005, at 11:51 AM, Josh Berkus wrote:

> That's not really practical. There are currently 5 major query tuning
> parameters, not counting the memory adjustments which really can't be
> left
> out. You can't realistically test all combinations of 6 variables.


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Thomas F(dot)O'Connell <tfo(at)sitening(dot)com>
Cc: "Tambet Matiisen" <t(dot)matiisen(at)aprote(dot)ee>, pgsql-performance(at)postgresql(dot)org
Subject: Re: What about utility to calculate planner cost constants?
Date: 2005-03-21 22:59:56
Message-ID: 200503211459.56243.josh@agliodbs.com
Lists: pgsql-performance

Thomas,

> If by not practical you mean, "no one has implemented a multivariable
> testing approach," I'll agree with you. But multivariable testing is
> definitely a valid statistical approach to solving just such problems.

Well, not practical as in: "would take either $10 million in equipment or
10,000 hours or both"

--Josh

--

Josh Berkus
Aglio Database Solutions
San Francisco


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: "Tambet Matiisen" <t(dot)matiisen(at)aprote(dot)ee>, <pgsql-performance(at)postgresql(dot)org>
Subject: Re: What about utility to calculate planner cost constants?
Date: 2005-03-22 04:42:50
Message-ID: 87acowcmut.fsf@stark.xeocode.com
Lists: pgsql-performance

Josh Berkus <josh(at)agliodbs(dot)com> writes:

> > Otherwise it could just collect statements, run EXPLAIN ANALYZE for all
> > of them and then play with planner cost constants to get the estimated
> > values as close as possible to actual values. Something like Goal Seek
> > in Excel, if you pardon my reference to MS :).
>
> That's not really practical. There are currently 5 major query tuning
> parameters, not counting the memory adjustments which really can't be left
> out. You can't realistically test all combinations of 6 variables.

I don't think it would be very hard at all actually.

It's just a linear algebra problem with a bunch of independent variables and a
system of equations. Solving for values for all of them is a straightforward
problem.

Of course in reality these variables aren't actually independent because the
costing model isn't perfect. But that wouldn't be a problem, it would just
reduce the accuracy of the results.

What's needed is for the explain plan to total up the costing penalties
independently. So the result would be something like

1000 * random_page_cost + 101 * sequential_page_cost + 2000 * index_tuple_cost
+ ...

In other words a tuple like <1000,101,2000,...>

And explain analyze would produce the above tuple along with the resulting
time.

Some program would have to gather these values from the log or stats data and
gather them up into a large linear system and solve for values that minimize
the divergence from the observed times.
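
As a very rough sketch of that solving step (Python with numpy; the
operation counts and timings below are entirely made up for illustration),
an ordinary least-squares fit would do:

import numpy as np

# One row per logged query: counts of <random page, seq page, index tuple,
# cpu tuple> operations, as produced by the hypothetical explain output.
counts = np.array([
    [1000.0,  101.0, 2000.0,  5000.0],
    [  10.0, 5000.0,    0.0, 50000.0],
    [ 400.0,   40.0,  900.0,  2500.0],
    [   0.0, 8000.0,    0.0, 80000.0],
    [ 250.0,  600.0,  500.0, 12000.0],
])
observed_ms = np.array([950.0, 830.0, 400.0, 1200.0, 500.0])

# Least squares: per-operation costs (in ms per operation) that minimize the
# squared divergence between predicted and observed times.
costs, residual, rank, _ = np.linalg.lstsq(counts, observed_ms, rcond=None)

for name, value in zip(
        ["random_page", "seq_page", "index_tuple", "cpu_tuple"], costs):
    print(f"{name}_cost ~ {value:.4f} ms/op")

# Or normalized so a sequential page fetch costs 1, as the GUCs are today:
print("normalized:", costs / costs[1])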

(cost penalties are currently normalized to sequential_page_cost being 1.
That could be maintained, or it could be changed to be normalized to an
expected 1ms.)

(Also, currently explain analyze has overhead that makes this impractical.
Ideally it could subtract out its overhead so the solutions would be accurate
enough to be useful)

--
greg


From: Dawid Kuroczko <qnex42(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: What about utility to calculate planner cost constants?
Date: 2005-03-22 09:58:43
Message-ID: 758d5e7f0503220158cec718e@mail.gmail.com
Lists: pgsql-performance

On Mon, 21 Mar 2005 14:59:56 -0800, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> > If by not practical you mean, "no one has implemented a multivariable
> > testing approach," I'll agree with you. But multivariable testing is
> > definitely a valid statistical approach to solving just such problems.
> Well, not practical as in: "would take either $10 million in equipment or
> 10,000 hours or both"

I think you don't need to EXPLAIN ANALYZE each query with different GUCs;
you would only need EXPLAIN most of the time (which is way quicker).
Once you get 'near' the actual values, you would do EXPLAIN ANALYZE to
verify the variables.
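
Roughly, in Python with psycopg2 (the workload and connection string are
placeholders, and this assumes a server that understands EXPLAIN
(FORMAT JSON)), it could look like this: two expensive passes in total, and
only cheap EXPLAINs inside the search loop.

import json
import math
import statistics

import psycopg2

WORKLOAD = [
    "SELECT count(*) FROM some_table WHERE id < 1000",
    "SELECT * FROM some_table ORDER BY created_at LIMIT 50",
]

def plan(cur, sql, analyze=False):
    opts = "ANALYZE, FORMAT JSON" if analyze else "FORMAT JSON"
    cur.execute(f"EXPLAIN ({opts}) {sql}")
    raw = cur.fetchone()[0]
    return (raw if isinstance(raw, list) else json.loads(raw))[0]["Plan"]

with psycopg2.connect("dbname=mydb") as conn, conn.cursor() as cur:
    # Expensive pass, done once: record the actual runtimes.
    actual_ms = [plan(cur, q, analyze=True)["Actual Total Time"]
                 for q in WORKLOAD]

    def spread(rpc):
        # Cheap pass: plain EXPLAIN only plans, it never executes the query.
        cur.execute(f"SET random_page_cost = {rpc}")
        ratios = [math.log(max(plan(cur, q)["Total Cost"], 0.001) / max(t, 0.001))
                  for q, t in zip(WORKLOAD, actual_ms)]
        return statistics.pvariance(ratios)

    best = min((c / 2 for c in range(2, 17)), key=spread)

    # Expensive pass, done once more: verify the chosen value against reality.
    cur.execute(f"SET random_page_cost = {best}")
    print("chosen:", best,
          [plan(cur, q, analyze=True)["Actual Total Time"] for q in WORKLOAD])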

Regards,
Dawid


From: Richard Huxton <dev(at)archonet(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Tambet Matiisen <t(dot)matiisen(at)aprote(dot)ee>, pgsql-performance(at)postgresql(dot)org
Subject: Re: What about utility to calculate planner cost constants?
Date: 2005-03-22 11:56:24
Message-ID: 424007E8.1020700@archonet.com
Lists: pgsql-performance

Greg Stark wrote:
> Josh Berkus <josh(at)agliodbs(dot)com> writes:
>>That's not really practical. There are currently 5 major query tuning
>>parameters, not counting the memory adjustments which really can't be left
>>out. You can't realistically test all combinations of 6 variables.
>
> I don't think it would be very hard at all actually.
[snip]
> What's needed is for the explain plan to total up the costing penalties
> independently. So the result would be something like
>
> 1000 * random_page_cost + 101 * sequential_page_cost + 2000 * index_tuple_cost
> + ...
>
> In other words a tuple like <1000,101,2000,...>
>
> And explain analyze would produce the above tuple along with the resulting
> time.
>
> Some program would have to gather these values from the log or stats data and
> gather them up into a large linear system and solve for values that minimize
> the divergence from the observed times.

You'd only need to log them if they diverged from expected anyway. That
should result in fairly low activity pretty quickly (or we're wasting
our time). Should they go to the stats collector rather than logs?

> (Also, currently explain analyze has overhead that makes this impractical.
> Ideally it could subtract out its overhead so the solutions would be accurate
> enough to be useful)

Don't we only need the top-level figures though? There's no need to
record timings for each stage, just work completed.

--
Richard Huxton
Archonet Ltd


From: Christopher Browne <cbbrowne(at)acm(dot)org>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: What about utility to calculate planner cost constants?
Date: 2005-03-22 13:09:40
Message-ID: m34qf3aktn.fsf@knuth.knuth.cbbrowne.com
Lists: pgsql-performance

Martha Stewart called it a Good Thing when gsstark(at)mit(dot)edu (Greg Stark) wrote:
> I don't think it would be very hard at all actually.
>
> It's just a linear algebra problem with a bunch of independent
> variables and a system of equations. Solving for values for all of
> them is a straightforward problem.
>
> Of course in reality these variables aren't actually independent
> because the costing model isn't perfect. But that wouldn't be a
> problem, it would just reduce the accuracy of the results.

Are you certain it's a linear system? I'm not. If it was a matter of
minimizing a linear expression subject to some set of linear
equations, then we could model this as a Linear Program for which
there are some perfectly good solvers available. (Few with BSD-style
licenses, but we could probably get some insight out of running for a
while with something that's there...)

I think there's good reason to consider it to be distinctly
NON-linear, which makes it way more challenging to solve the problem.

There might well be some results to be gotten out of a linear
approximation; the Grand Challenge is to come up with the model in the
first place...
--
wm(X,Y):-write(X),write('@'),write(Y). wm('cbbrowne','gmail.com').
http://linuxdatabases.info/info/or.html
"Tom Christiansen asked me, "Chip, is there anything that you like
that isn't big and complicated?" C++, EMACS, Perl, Unix, English-no, I
guess not." -- Chip Salzenberg, when commenting on Perl6/C++


From: Kenneth Marshall <ktm(at)it(dot)is(dot)rice(dot)edu>
To: Christopher Browne <cbbrowne(at)acm(dot)org>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: What about utility to calculate planner cost constants?
Date: 2005-03-22 14:00:50
Message-ID: 20050322140050.GQ3087@it.is.rice.edu
Lists: pgsql-performance

On Tue, Mar 22, 2005 at 08:09:40AM -0500, Christopher Browne wrote:
> Martha Stewart called it a Good Thing when gsstark(at)mit(dot)edu (Greg Stark) wrote:
> > I don't think it would be very hard at all actually.
> >
> > It's just a linear algebra problem with a bunch of independent
> > variables and a system of equations. Solving for values for all of
> > them is a straightforward problem.
> >
> > Of course in reality these variables aren't actually independent
> > because the costing model isn't perfect. But that wouldn't be a
> > problem, it would just reduce the accuracy of the results.
>
> Are you certain it's a linear system? I'm not. If it was a matter of
> minimizing a linear expression subject to some set of linear
> equations, then we could model this as a Linear Program for which
> there are some perfectly good solvers available. (Few with BSD-style
> licenses, but we could probably get some insight out of running for a
> while with something that's there...)
>
> I think there's good reason to consider it to be distinctly
> NON-linear, which makes it way more challenging to solve the problem.
>
Non-linear optimization works very well in many cases. Issues such
as local minima can be addressed. In a sense, the planner output
can be treated as a blackbox function and the "goodness" of the
solution is how well it approximates the actual query times. In this
case, it will be imperative to constrain some of the values to prevent
"crazy" configurations.

Ken


From: Bruno Wolff III <bruno(at)wolff(dot)to>
To: Christopher Browne <cbbrowne(at)acm(dot)org>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: What about utility to calculate planner cost constants?
Date: 2005-03-22 15:17:07
Message-ID: 20050322151707.GA5486@wolff.to
Lists: pgsql-performance

On Tue, Mar 22, 2005 at 08:09:40 -0500,
Christopher Browne <cbbrowne(at)acm(dot)org> wrote:
>
> Are you certain it's a linear system? I'm not. If it was a matter of
> minimizing a linear expression subject to some set of linear
> equations, then we could model this as a Linear Program for which
> there are some perfectly good solvers available. (Few with BSD-style
> licenses, but we could probably get some insight out of running for a
> while with something that's there...)

For fewer than 100 equations and 100 unknowns, you should be able to use
naive solvers. After that you don't get very accurate answers without
being more careful. I still have my numerical analysis textbooks around
and can look up algorithms for doing this without too much trouble.
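
The "being more careful" part is mostly about watching the conditioning of
the system. In numpy terms (with an invented, nearly collinear coefficient
matrix of per-query operation counts):

import numpy as np

# Real workloads tend to mix the same operations in similar proportions,
# which makes the rows nearly collinear; these numbers are invented.
counts = np.array([
    [1000.0, 101.0, 2000.0],
    [1001.0, 100.0, 2003.0],
    [ 999.0, 102.0, 1998.0],
])

# A huge condition number means small timing noise produces wild swings in
# the solved cost constants, so a naive solve shouldn't be trusted there.
print("condition number:", np.linalg.cond(counts))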


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Richard Huxton <dev(at)archonet(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Josh Berkus <josh(at)agliodbs(dot)com>, Tambet Matiisen <t(dot)matiisen(at)aprote(dot)ee>, pgsql-performance(at)postgresql(dot)org
Subject: Re: What about utility to calculate planner cost constants?
Date: 2005-03-22 16:19:40
Message-ID: 87y8cfbqlf.fsf@stark.xeocode.com
Lists: pgsql-performance

Richard Huxton <dev(at)archonet(dot)com> writes:

> You'd only need to log them if they diverged from expected anyway. That should
> result in fairly low activity pretty quickly (or we're wasting our time).
> Should they go to the stats collector rather than logs?

I think you need to log them all. Otherwise when you go to analyze the numbers
and come up with ideal values you're going to be basing your optimization on a
skewed subset.

I don't know whether the stats collector or the logs is better suited to this.

> > (Also, currently explain analyze has overhead that makes this impractical.
> > Ideally it could subtract out its overhead so the solutions would be accurate
> > enough to be useful)
>
> Don't we only need the top-level figures though? There's no need to record
> timings for each stage, just work completed.

I guess you only need top level values. But you also might want some flag if
the row counts for any node were far off. In that case perhaps you would want
to discard the data point.

--
greg


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Christopher Browne <cbbrowne(at)acm(dot)org>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: What about utility to calculate planner cost constants?
Date: 2005-03-22 17:30:05
Message-ID: 87hdj3bnc2.fsf@stark.xeocode.com
Lists: pgsql-performance


Christopher Browne <cbbrowne(at)acm(dot)org> writes:

> Are you certain it's a linear system?

If you just consider the guc parameters that tell postgres how long various
real world operations take (all the *_cost parameters) then it's a linear
system. It has to be. The resulting time is just a sum of the times for some
number of each of these real world operations.

If you include parameters like the geqo_* parameters or the hypothetical
parameter that controls what selectivity to assume for clauses with unknown
selectivity then no, it wouldn't be.

But if you assume the estimated row counts are correct and you're just trying
to solve for the parameters to come up with the most accurate cost for the
current hardware then I think you're golden.

> There might well be some results to be gotten out of a linear
> approximation; the Grand Challenge is to come up with the model in the
> first place...

Indeed. The model's not perfect now of course, and it'll never really be
perfect since some of the parameters represent operations that aren't always a
consistent cost. But you should be able to solve for the values that result in
the most accurate totals the most often. There may be some tradeoffs (and
therefore new guc variables :)

PS

It occurs to me that there's no reason to use the unreliable EXPLAIN counts of
the costs. You may as well account accurately for them and use the actual
values used in performing the query. This means there's no reason to discard
inaccurately estimated data points.

Moreover, the overhead issue is a non-issue, since you only need the total
time and the total costs. You would have the overhead of performing lots of
increments on those costs, but you only have to do two gettimeofdays: once at
the beginning and once at the end.

--
greg


From: Richard Huxton <dev(at)archonet(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Tambet Matiisen <t(dot)matiisen(at)aprote(dot)ee>, pgsql-performance(at)postgresql(dot)org
Subject: Re: What about utility to calculate planner cost constants?
Date: 2005-03-22 17:53:05
Message-ID: 42405B81.20107@archonet.com
Lists: pgsql-performance

Greg Stark wrote:
> Richard Huxton <dev(at)archonet(dot)com> writes:
>
>>You'd only need to log them if they diverged from expected anyway. That should
>>result in fairly low activity pretty quickly (or we're wasting our time).
>>Should they go to the stats collector rather than logs?
>
> I think you need to log them all. Otherwise when you go to analyze the numbers
> and come up with ideal values you're going to be basing your optimization on a
> skewed subset.

I can see your thinking; I must admit I was thinking of a more iterative
process: estimate deltas, change config, check, repeat. I'm not
convinced there are "ideal" values with a changing workload - for
example, random_page_cost will presumably vary depending on how much
contention there is for random seeks. Certainly, effective_cache_size
could vary.

> I don't know whether the stats collector or the logs is better suited to this.
>
>>>(Also, currently explain analyze has overhead that makes this impractical.
>>>Ideally it could subtract out its overhead so the solutions would be accurate
>>>enough to be useful)
>>
>>Don't we only need the top-level figures though? There's no need to record
>>timings for each stage, just work completed.
>
> I guess you only need top level values. But you also might want some flag if
> the row counts for any node were far off. In that case perhaps you would want
> to discard the data point.

I think you'd need to adjust work-estimates by actual-rows / estimated-rows.

I _was_ trying to think of a clever way of using row mis-estimates to
correct statistics automatically. This was triggered by the discussion a
few weeks ago about hints to the planner and the recent talk about plan
cacheing. Some sort of feedback loop so the planner would know its
estimates were off should be a big win from an ease-of-use point of
view. Didn't look easy to do though :-(

--
Richard Huxton
Archonet Ltd


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Christopher Browne <cbbrowne(at)acm(dot)org>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: What about utility to calculate planner cost constants?
Date: 2005-03-22 18:34:37
Message-ID: 10759.1111516477@sss.pgh.pa.us
Lists: pgsql-performance

Christopher Browne <cbbrowne(at)acm(dot)org> writes:
> Martha Stewart called it a Good Thing when gsstark(at)mit(dot)edu (Greg Stark) wrote:
>> It's just a linear algebra problem with a bunch of independent
>> variables and a system of equations. Solving for values for all of
>> them is a straightforward problem.

> Are you certain it's a linear system? I'm not.

I'm quite certain it isn't a linear system, because the planner's cost
models include nonlinear equations.

While I don't have a whole lot of hard evidence to back this up, my
belief is that our worst problems stem not from bad parameter values
but from wrong models. In particular we *know* that the cost model for
nestloop-inner-indexscan joins is wrong, because it doesn't account for
cacheing effects across repeated scans. There are some other obvious
weak spots as well. It could be argued that we ought to allow the
system to assume index cacheing even for standalone queries, on the
grounds that if you are doing a query often enough to care about it,
there was probably a recent use of the same query that pulled in the
upper index levels. The current cost models all assume starting from
ground zero with empty caches for each query, and that is surely not
reflective of many real-world cases.

I've looked at fixing this a couple times, but so far my attempts
to devise a more believable index access cost estimator have come
out with numbers higher than the current estimates ... not the
direction we want it to go :-(

Anyway, I see little point in trying to build an automatic parameter
optimizer until we have cost models whose parameters are more stable
than the current ones.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-performance(at)postgresql(dot)org
Subject: Re: What about utility to calculate planner cost constants?
Date: 2005-03-22 18:56:10
Message-ID: 10950.1111517770@sss.pgh.pa.us
Lists: pgsql-performance

Greg Stark <gsstark(at)mit(dot)edu> writes:
> Christopher Browne <cbbrowne(at)acm(dot)org> writes:
>> Are you certain it's a linear system?

> If you just consider the guc parameters that tell postgres how long various
> real world operations take (all the *_cost parameters) then it's a linear
> system. It has to be.

No it doesn't. Think caching effects for instance. We do have cache
effects in the cost models, even though they are wrong as per my nearby
rant...

regards, tom lane


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-performance(at)postgresql(dot)org
Subject: Re: What about utility to calculate planner cost constants?
Date: 2005-03-22 21:28:18
Message-ID: 87is3j9xql.fsf@stark.xeocode.com
Lists: pgsql-performance


Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Christopher Browne <cbbrowne(at)acm(dot)org> writes:
> > Martha Stewart called it a Good Thing when gsstark(at)mit(dot)edu (Greg Stark) wrote:
> >> It's just a linear algebra problem with a bunch of independent
> >> variables and a system of equations. Solving for values for all of
> >> them is a straightforward problem.
>
> > Are you certain it's a linear system? I'm not.
>
> I'm quite certain it isn't a linear system, because the planner's cost
> models include nonlinear equations.

The equations will all be linear for the *_cost variables. If they weren't,
they would be meaningless; the units would be wrong. Things like caching are
just going to be the linear factors that determine how many random page costs
and sequential page costs to charge the query.

> While I don't have a whole lot of hard evidence to back this up, my
> belief is that our worst problems stem not from bad parameter values
> but from wrong models.

I think these are orthogonal issues.

The times spent in real-world operations like random page accesses, sequential
page accesses, cpu operations, index lookups, etc. are all measurable
quantities. They can be directly measured or approximated by looking at the
resulting net times. Measuring these things instead of asking the user to
provide them is just a nicer user experience.

Separately, plugging these values into more and more accurate models will come
up with better estimates for how many of these operations a query will need to
perform.

> Anyway, I see little point in trying to build an automatic parameter
> optimizer until we have cost models whose parameters are more stable
> than the current ones.

Well, what people currently do is tweak the physical values until they produce
results for their workload that match reality. It would be neat if postgres
could do this automatically.

Arguably, the more accurate the cost model, the less motivation there is for
automatic adjustments, since you could easily plug in accurate values from the
hardware specs. But actually I think it'll always be a nice feature.

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-performance(at)postgresql(dot)org
Subject: Re: What about utility to calculate planner cost constants?
Date: 2005-03-22 21:48:15
Message-ID: 14922.1111528095@sss.pgh.pa.us
Lists: pgsql-performance

Greg Stark <gsstark(at)mit(dot)edu> writes:
> The times spent in real-world operations like random page accesses, sequential
> page accesses, cpu operations, index lookups, etc. are all measurable
> quantities. They can be directly measured or approximated by looking at the
> resulting net times.

That's the theory, all right, and that's why I've been resistant to
lowering random_page_cost just because "it gives better answers".
To the extent that you believe that is a real physical parameter with
a definable value (which is a bit debatable actually, but never mind),
it should be possible to measure it by experiment.

The difficulty with the notion of doing that measurement by timing
Postgres operations is that it's a horribly bad experimental setup.
You have no way to isolate the effects of just one variable, or even
a small number of variables, which you really need to do if you want
to estimate with any degree of precision. What's more, there are plenty
of relevant factors that aren't in the model at all (such as the extent
of other load on the machine), and so the noise in the measurements
will be enormous.

And you can't just dismiss the issue of wrong cost models and say we can
get numbers anyway. We see examples almost every day on this list where
the planner is so far off about indexscan vs seqscan costs that you'd
have to put random_page_cost far below 1 to make its numbers line up
with reality. That's not a matter of tuning the parameter, it's
evidence that the cost model is wrong. If you try to solve for the
"right value" of the parameter by comparing estimated and actual costs,
you'll get garbage, even without any issues about noisy measurements
or numerical instability of your system of equations.

regards, tom lane


From: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-performance(at)postgresql(dot)org, Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
Subject: Re: What about utility to calculate planner cost constants?
Date: 2005-03-22 22:59:44
Message-ID: 4240A360.1030904@cheapcomplexdevices.com
Lists: pgsql-performance

Tom Lane wrote:
>
> And you can't just dismiss the issue of wrong cost models and say we can
> get numbers anyway.

Is there a way to see more details about the cost estimates?
EXPLAIN ANALYZE seems to show the total time and rows, but not
information like how many disk pages were accessed.

I get the feeling that sometimes the number of rows is estimated
very well, but the amount of disk I/O is way off.

Sometimes the number of pages read/written is grossly
overestimated (if tables have a lot of locally clustered data)
or underestimated if a sort barely exceeds sort_mem.

Perhaps an EXPLAIN ANALYZE VERBOSE that would add info like this:

Index scan ([...]estimated 1000 pages read) (actual[...] 10 pages read)

would help track those down?