The science of optimization in practical terms?

From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: The science of optimization in practical terms?
Date: 2009-02-13 00:06:31
Message-ID: 1234483591.9467.188.camel@jd-laptop.pragmaticzealot.org
Lists: pgsql-hackers

Hello,

I was helping a customer today with what is becoming a common theme with
a lot of work we do. Basically, "It was working fine until recently."
Now 90% of the time it is as simple as running an ANALYZE VERBOSE and
picking apart relations that aren't being maintained properly and adjust
autovacuum or vacuum appropriately. If it isn't that, it is usually
something like increasing effective_cache_size, or
default_statistics_target.
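
For reference, the first-pass check I mean is usually nothing fancier
than something like this (a rough sketch; what counts as "too many dead
tuples" is a judgment call):

SELECT relname, n_live_tup, n_dead_tup,
       last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;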

However, in recent times I have found that increasing cpu_tuple_cost,
cpu_operator_cost and cpu_index_tuple_cost to be very useful. This is
always in the scenario of, "queries were running fine for months and
then all of a sudden, they are not". It is also always on systems that
we are already maintaining and thus (in theory) are in good shape.

So my question is, what is the science in practical terms behind those
parameters? Normally I would just accept it as another PostgreSQL
idiosyncrasy but the performance differences I am talking about are
large. After changing cpu_tuple_cost and cpu_operator_cost today to 0.5
I decreased two queries from 10 seconds and 15 seconds to 2 seconds and
~900 ms respectively.
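
For the curious, the change itself is just a pair of settings; the
experiment looks roughly like this, with the table and query being
made-up stand-ins for the customer's:

-- defaults are cpu_tuple_cost = 0.01 and cpu_operator_cost = 0.0025
SET cpu_tuple_cost = 0.5;
SET cpu_operator_cost = 0.5;
EXPLAIN ANALYZE
SELECT customer_id, count(*)
FROM orders
WHERE created_at > now() - interval '7 days'
GROUP BY customer_id;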

Sincerely,

Joshua D. Drake

--
PostgreSQL - XMPP: jdrake(at)jabber(dot)postgresql(dot)org
Consulting, Development, Support, Training
503-667-4564 - http://www.commandprompt.com/
The PostgreSQL Company, serving since 1997


From: Bernd Helmle <mailings(at)oopsware(dot)de>
To: jd(at)commandprompt(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: The science of optimization in practical terms?
Date: 2009-02-13 13:05:41
Message-ID: 21D3E80B28D62569C2E8DFF6@teje
Lists: pgsql-hackers

--On Thursday, February 12, 2009 16:06:31 -0800 "Joshua D. Drake"
<jd(at)commandprompt(dot)com> wrote:

> However, in recent times I have found that increasing cpu_tuple_cost,
> cpu_operator_cost and cpu_index_tuple_cost to be very useful. This is
> always in the scenario of, "queries were running fine for months and
> then all of a sudden, they are not". It is also always on systems that
> we are already maintaining and thus (in theory) are in good shape.

Hmm have you tried seq_page_cost and random_page_cost, too? I found them
really important especially if you have a steadily growing database or a
fully cached database to reflect the "real" disk access costs.
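
For a fully cached database that usually means pulling the page costs
down toward the CPU costs, something in this direction (the numbers are
guesswork, of course):

-- defaults: seq_page_cost = 1.0, random_page_cost = 4.0
SET seq_page_cost = 0.1;
SET random_page_cost = 0.1;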

--
Thanks

Bernd


From: Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>
To: Bernd Helmle <mailings(at)oopsware(dot)de>
Cc: jd(at)commandprompt(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: The science of optimization in practical terms?
Date: 2009-02-13 20:10:44
Message-ID: 2BBB3EB3-A155-4AE4-BCF1-0F2B622D8D36@pointblue.com.pl
Lists: pgsql-hackers

yet more arguments, to let postgresql estimate those automatically.


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>
Cc: Bernd Helmle <mailings(at)oopsware(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: The science of optimization in practical terms?
Date: 2009-02-13 20:27:39
Message-ID: 1234556859.7908.22.camel@jd-laptop.pragmaticzealot.org
Lists: pgsql-hackers

On Fri, 2009-02-13 at 20:10 +0000, Grzegorz Jaskiewicz wrote:
> yet more arguments, to let postgresql estimate those automatically.
>

Well I haven't seen any arguments actually. Which was the point of my
original question. I don't think anyone actually knows what these knobs
change, in practice.

Joshua D. Drake

>
--
PostgreSQL - XMPP: jdrake(at)jabber(dot)postgresql(dot)org
Consulting, Development, Support, Training
503-667-4564 - http://www.commandprompt.com/
The PostgreSQL Company, serving since 1997


From: Hannu Krosing <hannu(at)2ndQuadrant(dot)com>
To: jd(at)commandprompt(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: The science of optimization in practical terms?
Date: 2009-02-13 20:56:42
Message-ID: 1234558602.7339.8.camel@huvostro
Lists: pgsql-hackers

On Thu, 2009-02-12 at 16:06 -0800, Joshua D. Drake wrote:
> Hello,
>
> I was helping a customer today with what is becoming a common theme with
> a lot of work we do. Basically, "It was working fine until recently."
> Now 90% of the time it is as simple as running an ANALYZE VERBOSE and
> picking apart relations that aren't being maintained properly and adjust
> autovacuum or vacuum appropriately. If it isn't that, it is usually
> something like increasing effective_cache_size, or
> default_statistics_target.
>
> However, in recent times I have found that increasing cpu_tuple_cost,
> cpu_operator_cost and cpu_index_tuple_cost to be very useful. This is
> always in the scenario of, "queries were running fine for months and
> then all of a sudden, they are not". It is also always on systems that
> we are already maintaining and thus (in theory) are in good shape.
>
> So my question is, what is the science in practical terms behind those
> parameters?

In the most general terms, increasing these will favor index access over
seqscans (well, increasing _only_ cpu_index_tuple_cost won't, but still).

Basically, for any multi-user database growing anywhere near the point
where it will no longer fit mostly in cache, it is a good rule of thumb
to push postgres in the direction of selecting plans with mostly index
access, as PostgreSQL's planner is not aware of other queries running in
parallel and thus cannot know by itself that it should.

Things are fast as long as the main tables stay in cache (preferably in
pg shared memory, but the OS cache is also quite OK) and then suddenly
deteriorate once some part of them no longer fits.

You can watch for these things by monitoring pg_stat_user_tables /
pg_stat_user_indexes and making educated guesses about whether some
tables should or should not have that much traffic of the type indicated
there.
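
Even something as simple as this already tells a lot (just a sketch):

SELECT relname, seq_scan, seq_tup_read, idx_scan, idx_tup_fetch
FROM pg_stat_user_tables
ORDER BY seq_tup_read DESC
LIMIT 20;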

> Normally I would just accept it as another PostgreSQL
> idiosyncrasy but the performance differences I am talking about are
> large. After changing cpu_tuple_cost and cpu_operator_cost today to 0.5
> I decreased two queries from 10 seconds and 15 seconds to 2 seconds and
> ~900 ms respectively.
>
> Sincerely,
>
> Joshua D. Drake
>
>
> --
> PostgreSQL - XMPP: jdrake(at)jabber(dot)postgresql(dot)org
> Consulting, Development, Support, Training
> 503-667-4564 - http://www.commandprompt.com/
> The PostgreSQL Company, serving since 1997
>

--
------------------------------------------
Hannu Krosing http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
Services, Consulting and Training


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: jd(at)commandprompt(dot)com
Cc: Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>, Bernd Helmle <mailings(at)oopsware(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: The science of optimization in practical terms?
Date: 2009-02-13 21:20:11
Message-ID: 603c8f070902131320n47904f8cr2f95a5f01e496e85@mail.gmail.com
Lists: pgsql-hackers

On Fri, Feb 13, 2009 at 3:27 PM, Joshua D. Drake <jd(at)commandprompt(dot)com> wrote:
> On Fri, 2009-02-13 at 20:10 +0000, Grzegorz Jaskiewicz wrote:
>> yet more arguments, to let postgresql estimate those automatically.
>
> Well I haven't seen any arguments actually. Which was the point of my
> original question. I don't think anyone actually knows what these knobs
> change, in practice.

Well, in broad strokes, it seems to me that what they do is fairly
obvious: they affect the planner's willingness to choose plans that
touch more pages vs. plans that involve more CPU overhead (e.g. qual
evaluation). If the database is mostly or entirely in shared buffers
or the system buffer cache, and CPU consumption is a problem, then
raising the CPU costs is apt to help.

I think the root of this problem is that we can't model caching
effects. random_page_cost > seq_page_cost models the cost of seeks,
but min(random_page_cost, seq_page_cost) >> max(cpu_tuple_cost,
cpu_index_tuple_cost, cpu_operator_cost) models the fact that reading
from disk, even sequentially, is always slow. Unfortunately, if the
whole database is likely already in memory, which seems to be a pretty
common scenario even for relatively large databases (because people
buy more memory to make them fit), then it's just wrong.
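
For what it's worth, the stock settings encode exactly that
relationship; you can see them side by side with:

SELECT name, setting
FROM pg_settings
WHERE name IN ('seq_page_cost', 'random_page_cost', 'cpu_tuple_cost',
               'cpu_index_tuple_cost', 'cpu_operator_cost');
-- stock values: 1.0, 4.0, 0.01, 0.005 and 0.0025 respectively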

If we had a good knowledge of which pages were apt to be cached, we
could add a GUC cached_page_cost with a default value of maybe 0.2,
and presumably we'd get better plans that way. The bad news is that
it's pretty difficult to get that knowledge (and of course it could
change after the fact if the usage pattern of the database shifts
dramatically). The good news is that experimentation is possible.
For example, we could:

- Assume that small relations are more likely to be cached (so derate
page costs when accessing them).
- Allow users to override the page cost on a per-rel basis using a reloption.
- Gather statistics on relation access patterns and use that to
estimate the fraction of a relation likely to be in cache.

If your whole database stays in memory all the time, I would guess
that you could either raise the CPU costs or drop the page costs quite
substantially and that would probably work out fine. What's tougher
is to still be able to generate good plans when only part of the
database fits in memory, or there's other activity on the system that
is periodically purging portions of the system cache.
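
(If someone wants to experiment with that today, the least invasive way
is probably per-database rather than editing postgresql.conf; with
"mydb" and the values as placeholders:

ALTER DATABASE mydb SET seq_page_cost = 0.1;
ALTER DATABASE mydb SET random_page_cost = 0.1;

and then compare plans on the problem queries before and after.)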

...Robert


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: jd(at)commandprompt(dot)com, Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>, Bernd Helmle <mailings(at)oopsware(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: The science of optimization in practical terms?
Date: 2009-02-15 18:16:32
Message-ID: Pine.GSO.4.64.0902151238070.1312@westnet.com
Lists: pgsql-hackers

On Fri, 13 Feb 2009, Robert Haas wrote:

> Gather statistics on relation access patterns and use that to estimate
> the fraction of a relation likely to be in cache.

At one point I had a hacked background writer that collected statistics
about the contents of the buffer cache. Since it's obtaining a lock on
the buffer header anyway, it's a perfectly good place to note what
relfileid the buffer is associated with. If you set aside some fixed
amount of space to hold information about the most popular relations
(perhaps using a continuous top-k model, see
http://www.mysmu.edu/faculty/kyriakos/topk-SIGMOD06.pdf ), you can end up
with enough data to estimate how much data in shared_buffers exists for
the most cached relations in there.
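
For a static snapshot of the same information, the pg_buffercache
contrib module can already be queried by hand; roughly (assuming the
default 8kB block size):

SELECT c.relname, count(*) AS buffers,
       pg_size_pretty(count(*) * 8192) AS buffered
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = c.relfilenode
WHERE b.reldatabase IN (0, (SELECT oid FROM pg_database
                            WHERE datname = current_database()))
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;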

In a typical recommended tuning nowadays, we can only expect that to
sample about 1/3 of the total caching happening (presuming
shared_buffers=1/4 RAM and effective_cache_size~=3/4 RAM). While in
general it's nice to think that shared_buffers has a similar makeup to
what the OS is caching, it's not hard to discover common cases where this
would not be the case. Particularly given the VACUUM/seq scan ring-buffer
improvements in 8.3, it's easy to imagine scanning a table that's
2*shared_buffers in size showing only 256KB in shared_buffers, while the
whole thing is available in the OS cache.

I had a eureka moment where I realized I could hook the buffer eviction
code to model that. Decrement the count for that relation in the main
top-k count, then have a second count that assumes the last
2*shared_buffers evicted are also still cached. That would accurately
model the ring-buffer case and improve the quality of the model in
general. Updating those stats on every eviction would add some overhead,
but if the background writer is doing enough of them for you that should
at least be asynchronous from when most backends are blocked waiting for
an eviction.

And that's as far as I got before I had to return to real work again.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Greg Smith" <gsmith(at)gregsmith(dot)com>
Cc: <jd(at)commandprompt(dot)com>,"Bernd Helmle" <mailings(at)oopsware(dot)de>, "Grzegorz Jaskiewicz" <gj(at)pointblue(dot)com(dot)pl>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: The science of optimization in practical terms?
Date: 2009-02-15 19:32:02
Message-ID: 49981952.EE98.0025.0@wicourts.gov
Lists: pgsql-hackers

>>> Greg Smith <gsmith(at)gregsmith(dot)com> wrote:
> have a second count that assumes the last
> 2*shared_buffers evicted are also still cached.

Perhaps it would be better to assume that the external cache is
effective_cache_size - shared_buffers? Of course, we would need to
have some heuristics to cover odd settings (like effective_cache_size
less than shared_buffers).

-Kevin


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: jd(at)commandprompt(dot)com, Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>, Bernd Helmle <mailings(at)oopsware(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: The science of optimization in practical terms?
Date: 2009-02-16 03:54:13
Message-ID: 603c8f070902151954v42cce4d0qf68dbec090a64357@mail.gmail.com
Lists: pgsql-hackers

On Sun, Feb 15, 2009 at 1:16 PM, Greg Smith <gsmith(at)gregsmith(dot)com> wrote:
> On Fri, 13 Feb 2009, Robert Haas wrote:
>> Gather statistics on relation access patterns and use that to estimate the
>> fraction of a relation likely to be in cache.
>
> At one point I had a hacked background writer that collected statistics
> about the contents of the buffer cache. Since it's obtaining a lock on the
> buffer header anyway, it's a perfectly good place to note what relfileid the
> buffer is associated with. If you set aside some fixed amount of space to
> hold information about the most popular relations (perhaps using a
> continuous top-k model, see
> http://www.mysmu.edu/faculty/kyriakos/topk-SIGMOD06.pdf ), you can end up
> with enough data to estimate how much data in shared_buffers exists for the
> most cached relations in there.
>
> In a typical recommended tuning nowadays, we can only expect that to sample
> about 1/3 of the total caching happening (presuming shared_buffers=1/4 RAM
> and effective_cache_size~=3/4 RAM). While in general it's nice to think
> that shared_buffers has a similar makeup to what the OS is caching, it's not
> hard to discover common cases where this would not be the case.
> Particularly given the VACUUM/seq scan ring-buffer improvements in 8.3,
> it's easy to imagine scanning a table that's 2*shared_buffers in size
> showing only 256KB in shared_buffers, while the whole thing is available in
> the OS cache.
>
> I had a eureka moment where I realized I could hook the buffer eviction code
> to model that. Decrement the count for that relation in the main top-k
> count, then have a second count that assumes the last 2*shared_buffers
> evicted are also still cached. That would accurately model the ring-buffer
> case and improve the quality of the model in general. Updating those stats
> on every eviction would add some overhead, but if the background writer is
> doing enough of them for you that should at least be asynchronous from when
> most backends are blocked waiting for an eviction.
>
> And that's as far as I got before I had to return to real work again.

This seems plausible, but I'm not totally sold: predicting the
contents of the operating system buffer cache sounds like it might be
pretty tough. And do we even need to go that far? I'm kind of
wondering whether we might be able to leverage the information that
the statistics collector already gathers for this purpose - in
particular, the information on blocks fetched and read. That might
not exactly model the current contents of the buffer cache, but it's
certainly a measure of popularity, and that may be all we really need.
We're not going to invalidate every plan in the system on every
buffer eviction, so plans have to be based not so much on what is in
the buffer cache right now but on what we have a reasonable
expectation of finding there in the typical case.

Consider, for example, the degenerate (but not necessarily uncommon)
case where the entire database can fit within shared_buffers, or
perhaps shared_buffers + OS cache. ISTM we're going to want to plan
as if the entire database is in cache all the time, even though that
might not always be true - right after restart, for example.

I kind of feel like the algorithm for predicting the cache hit ratio
ought to be some sort of ANALYZE-like process that runs periodically
and stores away a value for each table. When we run the algorithm, we
take a user-supplied estimate of the total cache available on the
machine (effective_cache_size, or some new GUC we invent for this
purpose) and divvy it up among all the relations in the database in
proportion to popularity score. That is, we divide the number of
popularity points for the most popular relation by the total of all
popularity values and assign that amount of cache to the most popular
relation (but not more than the relation size). We then repeat this
algorithm with the remaining relations and the remaining amount of
cache, and then approximate the cache hit ratio for each relation as
the allocated amount of cache divided by the relation size. To
prevent plans from changing too abruptly, the popularity function
should use some sort of sliding window.
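
A throwaway version of that dole-out loop, just to make the arithmetic
concrete (the rel_stats table and its contents are entirely made up, it
needs a server new enough to have DO blocks, and this is a sketch rather
than a proposed implementation):

DO $$
DECLARE
    remaining_cache    bigint := 1024 * 1024 * 1024;  -- pretend 1GB of cache
    remaining_accesses bigint;
    alloc              bigint;
    r                  record;
BEGIN
    SELECT sum(accesses) INTO remaining_accesses FROM rel_stats;
    FOR r IN SELECT * FROM rel_stats ORDER BY accesses DESC LOOP
        -- give the relation its share of what's left, capped at its size
        alloc := least(r.relbytes,
                       remaining_cache * r.accesses / remaining_accesses);
        RAISE NOTICE '%: % of % bytes assumed cached (hit ratio ~%)',
            r.relname, alloc, r.relbytes,
            round(alloc::numeric / r.relbytes, 2);
        remaining_cache    := remaining_cache - alloc;
        remaining_accesses := remaining_accesses - r.accesses;
    END LOOP;
END $$;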

Where you're going to run into trouble with any algorithm of this type
is databases that switch between multiple, distinct workloads. I have
no idea what to do about that problem. You might also run into
problems with relations that have "hot" segments that are accessed
frequently and stay cached, and "cold" segments that are never
touched: if 20% of the relation is in cache, but that's the only 20%
of the relation we ever access, then our hit rate will be 100% rather
than 20%.

But even a primitive algorithm would probably be a lot better than
what we have now. I'm guessing that there are a lot of databases where
either the whole database fits in cache, or a decent chunk of
relatively small core relations fit in cache and then there are some
big or infrequently-used ones that don't.

...Robert


From: decibel <decibel(at)decibel(dot)org>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, jd(at)commandprompt(dot)com, Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>, Bernd Helmle <mailings(at)oopsware(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: The science of optimization in practical terms?
Date: 2009-02-17 16:14:56
Message-ID: 00BD3A7D-CBEC-4B26-A23E-A0C83B8EB63D@decibel.org
Lists: pgsql-hackers

On Feb 15, 2009, at 9:54 PM, Robert Haas wrote:
> On Sun, Feb 15, 2009 at 1:16 PM, Greg Smith <gsmith(at)gregsmith(dot)com>
> wrote:
>> On Fri, 13 Feb 2009, Robert Haas wrote:
> This seems plausible, but I'm not totally sold: predicting the
> contents of the operating system buffer cache sounds like it might be
> pretty tough. And do we even need to go that far? I'm kind of
> wondering whether we might be able to leverage the information that
> the statistics collector already gathers for this purpose - in
> particular, the information on blocks fetched and read. That might
> not exactly model the current contents of the buffer cache, but it's
> certainly a measure of popularity, and that may be all we really need.
> We're not going to invalidate every plan in the system on every
> buffer eviction, so plans have to be based not so much on what is in
> the buffer cache right now but on what we have a reasonable
> expectation of finding there in the typical case.
>
> Consider, for example, the degenerate (but not necessarily uncommon)
> case where the entire database can fit within shared_buffers, or
> perhaps shared_buffers + OS cache. ISTM we're going to want to plan
> as if the entire database is in cache all the time, even though that
> might not always be true - right after restart, for example.

The shared_buffers + OS cache example is a reason why simply
examining shared_buffers isn't likely to work well; in that case it
definitely would not reflect reality. Though really, in that case we
should be able to simply look at eff_cache_size as well as the size
of the database and conclude that everything should be in memory.

Actually, a simple algorithm that might work really well would be to
calculate relation cache odds as ( number of page accesses for
relation / number of page accesses for all relations ) * ( sum
(relpages)*BLKSZ / eff_cache_size ), where number of page accesses
would be both from relcache and not. One thing this doesn't address
though is the report from a few months ago that accessing small
tables is still faster with an index scan, even if we know the whole
thing is in cache (I don't remember if that was ever resolved...)

Another idea would be to look at an efficient way to measure how long
it actually takes to pull data from the OS. This has been suggested
in the past, but the idea there was to measure every block access,
and the concern was the overhead of the timing calls. But what if we
sampled instead? Or, what if we read multiple blocks at one time in
the cases where we knew we'd need to (seqscan and an index scan
needing more than one tuple). Another option would by an async IO
process that is responsible for measuring this stuff; I believe some
people have played with async IO and gotten good results.

Of course, on dtrace platforms we could just plug into dtrace...

> You might also run into
> problems with relations that have "hot" segments that are accessed
> frequently and stay cached, and "cold" segments that are never
> touched: if 20% of the relation is in cache, but that's the only 20%
> of the relation we ever access, then our hit rate will be 100% rather
> than 20%.

Yes, but that would be accurate :)

In reality, I think we need to re-visit the idea of evaluating how
close a chosen query plan is matching reality as we're running. If we
thought we'd be seeing a 100% hit rate but in reality it's much lower
we could re-plan (of course this probably only makes sense for
queries that take many seconds).

> But even a primitive algorithm would probably be a lot better than
> what we have now. I'm guessing that there are a lot of databases where
> either the whole database fits in cache, or a decent chunk of
> relatively small core relations fit in cache and then there are some
> big or infrequently-used ones that don't.

--
Decibel!, aka Jim C. Nasby, Database Architect decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: decibel <decibel(at)decibel(dot)org>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, jd(at)commandprompt(dot)com, Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>, Bernd Helmle <mailings(at)oopsware(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: The science of optimization in practical terms?
Date: 2009-02-18 05:23:37
Message-ID: 603c8f070902172123m3c424e9bxcf46ad4e4b380bf0@mail.gmail.com
Lists: pgsql-hackers

> Actually, a simple algorithm that might work really well would be to
> calculate relation cache odds as ( number of page accesses for relation /
> number of page accesses for all relations ) * ( sum(relpages)*BLKSZ /
> eff_cache_size ), where number of page accesses would be both from relcache
> and not.

I don't think that formula makes any sense. If effective_cache_size
is in the denominator, then increasing it will make the odds of
finding the page in cache go down.

> One thing this doesn't address though is the report from a few
> months ago that accessing small tables is still faster with an index scan,
> even if we know the whole thing is in cache (I don't remember if that was
> ever resolved...)

I'm not sure if this is what you're referring to, but there was a
relatively recent post on, I believe, -performance, where a bitmap
index scan that hit almost the entire table beat out a seqscan. I
don't think there was any further discussion and I'm still mystified
as to how it's possible.

> Another idea would be to look at an efficient way to measure how long it
> actually takes to pull data from the OS. This has been suggested in the
> past, but the idea there was to measure every block access, and the concern
> was the overhead of the timing calls. But what if we sampled instead? Or,
> what if we read multiple blocks at one time in the cases where we knew we'd
> need to (seqscan and an index scan needing more than one tuple). Another
> option would be an async IO process that is responsible for measuring this
> stuff; I believe some people have played with async IO and gotten good
> results.
>
> Of course, on dtrace platforms we could just plug into dtrace...
>
>> You might also run into
>> problems with relations that have "hot" segments that are accessed
>> frequently and stay cached, and "cold" segments that are never
>> touched: if 20% of the relation is in cache, but that's the only 20%
>> of the relation we ever access, then our hit rate will be 100% rather
>> than 20%.
>
> Yes, but that would be accurate :)

No, we'd predict the hit rate to be 20%, but the real hit rate would be 100%.

> In reality, I think we need to re-visit the idea of evaluating how close a
> chosen query plan is matching reality as we're running. If we thought we'd
> be seeing a 100% hit rate but in reality it's much lower we could re-plan
> (of course this probably only makes sense for queries that take many
> seconds).

I don't think it's going to be very practical to re-plan the query in
its entirety, because then you'd have to somehow undo all of the work
you'd done thus far (including side effects, if any), which might not
be possible and certainly isn't easy. What might be practical is to
bail out of a nested loop that turns out to iterate more times than
expected and hash the inner rel, or even sort the remaining portion of
the outer rel and the entire inner rel and then merge-join them. The
problem is that these sorts of changes can alter the order in which
results are generated, and if the parent node is something like a
merge-join that needs the results to be ordered in a particular way,
then you've got a problem. Furthermore, it's not easy to decide when
to switch strategies. If you decide to switch to a hash join just
prior to what would have been the last iteration of the nested loop,
you lose.

I'm interested to know whether anyone else shares my belief that
nested loops are the cause of most really bad plans. What usually
happens to me is that the planner develops some unwarranted optimism
about the number of rows likely to be generated by the outer side of
the join and decides that it's not worth sorting the inner side or
building a hash table or using an index, and that the right thing to
do is just rescan the inner node on every pass. When the outer side
returns three or four orders of magnitude more results than expected,
ka-pow!

Another approach to this problem might be to try to make the planner a
little more cautious about choosing nested loops in the first place.
Given a column a with the values 1 .. 10^6, the planner estimates the
number of rows for a = X as 1, a in (X1..Xn) as n, a not in (X1..Xn)
as 10^6-n, and a < X for all X < 100 as 100. These are all pretty
reasonable estimates (one could hope to get a different result for a <
5 than a < 100). But as soon as you use some operation that the
planner knows nothing about, the guesses get really bad:

CREATE TABLE big (id serial, x text);
INSERT INTO big (x) SELECT random() FROM generate_series(1,1000000);
ANALYZE;
EXPLAIN SELECT * FROM big WHERE id % 2 = 0 AND (id + 0) % 2 = 0 AND
(id - 0) % 2 = 0;

QUERY PLAN
------------------------------------------------------------------------------
 Seq Scan on big  (cost=0.00..36375.00 rows=1 width=22)
   Filter: (((id % 2) = 0) AND (((id + 0) % 2) = 0) AND (((id - 0) % 2) = 0))

The fact that the selectivity of an unknown expression is arbitrarily
set to 0.005 is a defensible choice that doesn't happen to match my
experience, but the idea that with three unknown expressions the
selectivity is now (0.005)^3 = 0.000000125 seems hard to justify.
That's a sufficiently small selectivity that it will often result in
nestloop plans, and they tend not to be good when the real selectivity
turns out to be, say, 0.1.

>> But even a primitive algorithm would probably be a lot better than
>> what we have now. I'm guessing that there are a lot of databases where
>> either the whole database fits in cache, or a decent chunk of
>> relatively small core relations fit in cache and then there are some
>> big or infrequently-used ones that don't.

...Robert


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: decibel <decibel(at)decibel(dot)org>, Greg Smith <gsmith(at)gregsmith(dot)com>, jd(at)commandprompt(dot)com, Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>, Bernd Helmle <mailings(at)oopsware(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: The science of optimization in practical terms?
Date: 2009-02-18 06:34:25
Message-ID: 807.1234938865@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> I'm interested to know whether anyone else shares my belief that
> nested loops are the cause of most really bad plans. What usually
> happens to me is that the planner develops some unwarranted optimism
> about the number of rows likely to be generated by the outer side of
> the join and decides that it's not worth sorting the inner side or
> building a hash table or using an index, and that the right thing to
> do is just rescan the inner node on every pass. When the outer side
> returns three or four orders of magnitude more results than expected,
> ka-pow!

And then there is the other half of the world, who complain because it
*didn't* pick a nestloop for some query that would have run in much less
time if it had.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: decibel <decibel(at)decibel(dot)org>, Greg Smith <gsmith(at)gregsmith(dot)com>, jd(at)commandprompt(dot)com, Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>, Bernd Helmle <mailings(at)oopsware(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: The science of optimization in practical terms?
Date: 2009-02-18 12:50:40
Message-ID: 603c8f070902180450q653d0a2mc249607b2fe9bf64@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 18, 2009 at 1:34 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> I'm interested to know whether anyone else shares my belief that
>> nested loops are the cause of most really bad plans. What usually
>> happens to me is that the planner develops some unwarranted optimism
>> about the number of rows likely to be generated by the outer side of
>> the join and decides that it's not worth sorting the inner side or
>> building a hash table or using an index, and that the right thing to
>> do is just rescan the inner node on every pass. When the outer side
>> returns three or four orders of magnitude more results than expected,
>> ka-pow!
>
> And then there is the other half of the world, who complain because it
> *didn't* pick a nestloop for some query that would have run in much less
> time if it had.

Well, that's my question: is that really the other half of the world,
or is it the other 5% of the world? And how does it happen? In my
experience, most bad plans are caused by bad selectivity estimates,
and the #1 source of bad selectivity estimates is selectivity
estimates for unknown expressions.

(Now it appears that Josh is having problems that are caused by
overestimating the cost of a page fetch, perhaps due to caching
effects. Those are discussed upthread, and I'm still interested to
see whether we can arrive at any sort of consensus about what might be
a reasonable approach to attacking that problem. My own experience
has been that this problem is not quite as bad, because it can throw
the cost off by a factor of 5, but not by a factor of 800,000, as in
my example of three unknown expressions with a combined selectivity of
0.1.)

...Robert


From: Sam Mason <sam(at)samason(dot)me(dot)uk>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: The science of optimization in practical terms?
Date: 2009-02-18 15:13:44
Message-ID: 20090218151344.GQ32672@frubble.xen.chris-lamb.co.uk
Lists: pgsql-hackers

On Wed, Feb 18, 2009 at 01:34:25AM -0500, Tom Lane wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> > I'm interested to know whether anyone else shares my belief that
> > nested loops are the cause of most really bad plans. What usually
> > happens to me is that the planner develops some unwarranted optimism
> > about the number of rows likely to be generated by the outer side of
> > the join and decides that it's not worth sorting the inner side or
> > building a hash table or using an index, and that the right thing to
> > do is just rescan the inner node on every pass. When the outer side
> > returns three or four orders of magnitude more results than expected,
> > ka-pow!
>
> And then there is the other half of the world, who complain because it
> *didn't* pick a nestloop for some query that would have run in much less
> time if it had.

There's something called "interval arithmetic" I heard about recently
that may help here. The basic idea is to store two values, representing
the upper and lower bounds of a number, instead of a single exact value.
That way you know that the number is going to be somewhere in between,
and round-off becomes a non-issue because you know when it has happened
(e.g. the FPU is set up to always round the lower bound down and the
upper bound up). Round-off isn't a problem here, but it's one of the
algebra's nice properties.

If the planning was done with some sort of interval then you'd be
able to encode information about how well your stats characterized the
underlying data. Traditionally awkward things like amount of cache
would serve to drop the lower bound, but not alter the upper. The
planner would then automatically propagate performance information through the
calculations, i.e. a nested loop with a tight estimate on a small number
of rows joined to a table with a wider estimate of a small number of
rows would keep the low lower bound but the upper-bound would tend to
make the planner stay away.

That said, I can't decide if that would help in any meaningful way! A
good counter argument seems to be why not just always go with the upper
bound? There are extensions to vanilla interval arithmetic that allow
distributions to be modeled instead of just an unknown interval. You'd
then be able to use some sort of 95% confidence interval instead of a
horribly pessimistic worst case.

--
Sam http://samason.me.uk/


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Sam Mason <sam(at)samason(dot)me(dot)uk>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: The science of optimization in practical terms?
Date: 2009-02-18 15:57:52
Message-ID: 603c8f070902180757m15b0bd23ib7256cc8ef44447e@mail.gmail.com
Lists: pgsql-hackers

> If the planning was done with some sort of interval then you'd be
> able to encode information about how well your stats characterized the
> underlying data. Traditionally awkward things like amount of cache
> would serve to drop the lower bound, but not alter the upper. The
> planner would then automatically propagate performance information through the
> calculations, i.e. a nested loop with a tight estimate on a small number
> of rows joined to a table with a wider estimate of a small number of
> rows would keep the low lower bound but the upper-bound would tend to
> make the planner stay away.

Yeah, I thought about this too, but it seems like overkill for the
problem at hand, and as you say it's not clear you'd get any benefit
out of the upper bound anyway. I was thinking of something simpler:
instead of directly multiplying 0.005 into the selectivity every time
you find something incomprehensible, keep a count of the number of
incomprehensible things you saw and at the end multiply by 0.005/N.
That way more unknown quals look more restrictive than fewer, but
things only get linearly wacky instead of exponentially wacky.
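
To put numbers on it with the three-unknown-quals case from earlier in
the thread (just arithmetic):

SELECT 0.005 ^ 3 AS multiplied_in,   -- 0.000000125
       0.005 / 3 AS divided_by_n;    -- ~0.00167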

...Robert


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Sam Mason <sam(at)samason(dot)me(dot)uk>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: The science of optimization in practical terms?
Date: 2009-02-18 16:46:41
Message-ID: 8393.1234975601@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> Yeah, I thought about this too, but it seems like overkill for the
> problem at hand, and as you say it's not clear you'd get any benefit
> out of the upper bound anyway. I was thinking of something simpler:
> instead of directly multiplying 0.005 into the selectivity every time
> you find something incomprehensible, keep a count of the number of
> incomprehensible things you saw and at the end multiply by 0.005/N.
> That way more unknown quals look more restrictive than fewer, but
> things only get linearly wacky instead of exponentially wacky.

clauselist_selectivity could perhaps apply such a heuristic, although
I'm not sure how it could recognize "default" estimates from the various
specific estimators, since they're mostly all different.

Personally I've not seen all that many practical cases where the
estimator simply hasn't got a clue at all. What's far more commonly
complained of IME is failure to handle *correlated* conditions in
an accurate fashion. Maybe we should just discount the product
selectivity all the time, not only when we think the components are
default estimates.

regards, tom lane


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, decibel <decibel(at)decibel(dot)org>, Greg Smith <gsmith(at)gregsmith(dot)com>, Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>, Bernd Helmle <mailings(at)oopsware(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: The science of optimization in practical terms?
Date: 2009-02-18 17:33:38
Message-ID: 1234978418.1775.13.camel@jd-laptop.pragmaticzealot.org
Lists: pgsql-hackers

On Wed, 2009-02-18 at 07:50 -0500, Robert Haas wrote:

> (Now it appears that Josh is having problems that are caused by
> overestimating the cost of a page fetch, perhaps due to caching
> effects. Those are discussed upthread, and I'm still interested to
> see whether we can arrive at any sort of consensus about what might be
> a reasonable approach to attacking that problem. My own experience
> has been that this problem is not quite as bad, because it can throw
> the cost off by a factor of 5, but not by a factor of 800,000, as in
> my example of three unknown expressions with a combined selectivity of
> 0.1.)
>

Well a very big problem with any solution is that we are creating a
solution for a 2% problem. 98% of the postgresql installations out there
will never need to adjust cpu_tuple_cost, cpu_index_tuple_cost,
cpu_operator_cost, random_page_cost etc... They can get by just fine
with a tweak to shared_buffers, work_mem, effective_cache_size and
default_statistics_target.

What I think should happen is to do some testing on "normal" installs
and see if upping those parameters to .5 (or some other amount) hinders
those 98% of installs. If it doesn't hinder them, then we should up the
default and walk away.

Joshua D. Drake

> ...Robert
>
--
PostgreSQL - XMPP: jdrake(at)jabber(dot)postgresql(dot)org
Consulting, Development, Support, Training
503-667-4564 - http://www.commandprompt.com/
The PostgreSQL Company, serving since 1997


From: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, decibel <decibel(at)decibel(dot)org>, Greg Smith <gsmith(at)gregsmith(dot)com>, jd(at)commandprompt(dot)com, Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>, Bernd Helmle <mailings(at)oopsware(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: The science of optimization in practical terms?
Date: 2009-02-18 19:46:32
Message-ID: 499C6598.9020604@cheapcomplexdevices.com
Lists: pgsql-hackers

Robert Haas wrote:
> experience, most bad plans are caused by bad selectivity estimates,
> and the #1 source of bad selectivity estimates is selectivity
> estimates for unknown expressions.

ISTM unknown expressions should be modeled as a range of
values rather than one single arbitrary value.

For example, rather than just guessing 1000 rows, if an
unknown expression picked a wide range (say, 100 - 10000
rows; or maybe even 1 - table_size), the planner could
choose a plan which wouldn't be pathologically slow
regardless of if the guess was too low or too high.

For that matter, it seems that if all estimates used a range
rather than a single value, we would in general produce
less fragile plans.


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, decibel <decibel(at)decibel(dot)org>, Greg Smith <gsmith(at)gregsmith(dot)com>, jd(at)commandprompt(dot)com, Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>, Bernd Helmle <mailings(at)oopsware(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: The science of optimization in practical terms?
Date: 2009-02-18 20:13:14
Message-ID: 603c8f070902181213q1ad6693ci6169ad144eeb5f47@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 18, 2009 at 2:46 PM, Ron Mayer
<rm_pg(at)cheapcomplexdevices(dot)com> wrote:
> Robert Haas wrote:
>> experience, most bad plans are caused by bad selectivity estimates,
>> and the #1 source of bad selectivity estimates is selectivity
>> estimates for unknown expressions.
>
> ISTM unknown expressions should be modeled as a range of
> values rather than one single arbitrary value.
>
> For example, rather than just guessing 1000 rows, if an
> unknown expression picked a wide range (say, 100 - 10000
> rows; or maybe even 1 - table_size), the planner could
> choose a plan which wouldn't be pathologically slow
> regardless of if the guess was too low or too high.
>
> For that matter, it seems that if all estimates used a range
> rather than a single value, we would in general produce
> less fragile plans.

It would be interesting to find out if something like this could be
made to work, but it's more than I'd be willing to bite off. I think
this would require reworking large portions of the planner, and I am
doubtful that it could be done without a substantial loss of
performance. The existing code considers A LOT of plans, to the point
where even a few more or fewer floating-point operations per plan
result in a change in planning time that shows up in
macro-benchmarks.

If we could somehow tamp down the amount of time considering plans
that turn out to be dead ends, it might free up some time to perform
some of these other computations. But I'm not sure how to go about
that. The best ideas I've come up with so far involve refactoring
joinpath.c to eliminate some of the duplicate computation and/or
somehow be more intelligent about which nested loops we generate. But
I haven't come up with anything yet that's demonstrably better than
the add_path patch that I submitted a few weeks ago, which is not bad
but not earth-shattering either. At any rate, we'd need to save quite
a bit to pay for carting around best and worst case costs for every
plan we consider.

...Robert


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>, decibel <decibel(at)decibel(dot)org>, Greg Smith <gsmith(at)gregsmith(dot)com>, jd(at)commandprompt(dot)com, Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>, Bernd Helmle <mailings(at)oopsware(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: The science of optimization in practical terms?
Date: 2009-02-18 20:32:53
Message-ID: 21655.1234989173@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> ... At any rate, we'd need to save quite
> a bit to pay for carting around best and worst case costs for every
> plan we consider.

Another problem with this is it doesn't really do anything to solve the
problem we were just discussing, namely having an intelligent way of
combining inaccurate estimates for WHERE clauses. If you just take a
range of plausible values and multiply then it doesn't take very many
clauses to get to a range of [0,1] --- or at least a range of
probabilities wide enough to be unhelpful.

An idea that I think has been mentioned before is to try to identify
cases where we can *prove* there is at most one row emitted by a
sub-path (eg, because of a unique index, DISTINCT subplan, etc). Then
we could penalize nestloops with outer relations that weren't provably a
single row. This is basically restricting the notion of estimation
confidence to a special case that's particularly important for SQL.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>, decibel <decibel(at)decibel(dot)org>, Greg Smith <gsmith(at)gregsmith(dot)com>, jd(at)commandprompt(dot)com, Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>, Bernd Helmle <mailings(at)oopsware(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: The science of optimization in practical terms?
Date: 2009-02-18 20:50:28
Message-ID: 603c8f070902181250q7d88de9as479e5ea379733c94@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 18, 2009 at 3:32 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> ... At any rate, we'd need to save quite
>> a bit to pay for carting around best and worst case costs for every
>> plan we consider.
>
> Another problem with this is it doesn't really do anything to solve the
> problem we were just discussing, namely having an intelligent way of
> combining inaccurate estimates for WHERE clauses. If you just take a
> range of plausible values and multiply then it doesn't take very many
> clauses to get to a range of [0,1] --- or at least a range of
> probabilities wide enough to be unhelpful.

Yeah.

> An idea that I think has been mentioned before is to try to identify
> cases where we can *prove* there is at most one row emitted by a
> sub-path (eg, because of a unique index, DISTINCT subplan, etc). Then
> we could penalize nestloops with outer relations that weren't provably a
> single row. This is basically restricting the notion of estimation
> confidence to a special case that's particularly important for SQL.

I thought about this, too, and I agree. Having this information
available would also be very helpful for join removal. I believe that
you did some work on this for SEMI/ANTI-join support in the form of
query_is_distinct_for, but I'm not sure if that takes the right sort
of inputs for what we need here. (It also doesn't seem to consider
the case of a baserel with a unique index for some reason...)

...Robert


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>, decibel <decibel(at)decibel(dot)org>, Greg Smith <gsmith(at)gregsmith(dot)com>, jd(at)commandprompt(dot)com, Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>, Bernd Helmle <mailings(at)oopsware(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: The science of optimization in practical terms?
Date: 2009-02-18 22:22:05
Message-ID: 1234995725.3973.161.camel@ebony.2ndQuadrant
Lists: pgsql-hackers


On Wed, 2009-02-18 at 15:32 -0500, Tom Lane wrote:

> An idea that I think has been mentioned before is to try to identify
> cases where we can *prove* there is at most one row emitted by a
> sub-path (eg, because of a unique index, DISTINCT subplan, etc). Then
> we could penalize nestloops with outer relations that weren't provably a
> single row. This is basically restricting the notion of estimation
> confidence to a special case that's particularly important for SQL.

Proof seems the best way forward. IIRC the reason we didn't do this before
HOT is that unique index scans did often return many more than one row.
Now we have a much better chance of it being true.

As you say, propagation of error makes an error bars approach pointless
too quickly to be worth pursuing.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>, decibel <decibel(at)decibel(dot)org>, Greg Smith <gsmith(at)gregsmith(dot)com>, jd(at)commandprompt(dot)com, Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>, Bernd Helmle <mailings(at)oopsware(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: The science of optimization in practical terms?
Date: 2009-02-18 23:37:08
Message-ID: 27636.1235000228@sss.pgh.pa.us
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> On Wed, 2009-02-18 at 15:32 -0500, Tom Lane wrote:
>> An idea that I think has been mentioned before is to try to identify
>> cases where we can *prove* there is at most one row emitted by a
>> sub-path (eg, because of a unique index, DISTINCT subplan, etc).

> Proof seems best way forward. IIRC the reason we didn't do this before
> HOT is that unique index scans did often return many more than one row.

But those extra tuples didn't make it up to the point of the join, so
they weren't really a problem for nestloop speed. IMO the reason this
hasn't been done to date is that until recently we didn't have any
mechanism to ensure a plan got invalidated if some constraint it was
depending on (like uniqueness) went away. Now we do, so it would be
safe to rely on the constraint for proof purposes.

regards, tom lane


From: decibel <decibel(at)decibel(dot)org>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, jd(at)commandprompt(dot)com, Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>, Bernd Helmle <mailings(at)oopsware(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: The science of optimization in practical terms?
Date: 2009-02-21 00:25:41
Message-ID: 9D119CC4-0DB8-4144-A109-3032D9E77DB3@decibel.org
Lists: pgsql-hackers

On Feb 17, 2009, at 11:23 PM, Robert Haas wrote:
>> Actually, a simple algorithm that might work really well would be to
>> calculate relation cache odds as ( number of page accesses for
>> relation /
>> number of page accesses for all relations ) * ( sum(relpages)*BLKSZ /
>> eff_cache_size ), where number of page accesses would be both from
>> relcache
>> and not.
>
> I don't think that formula makes any sense. If effective_cache_size
> is in the denominator, then increasing it will make the odds of
> finding the page in cache go down.

Yes, sorry... I got that part of the equation upside-down. It should be:

( number of page accesses for relation / number of page accesses for
all relations ) * ( eff_cache_size / sum(relpages)*BLKSZ )

>> One thing this doesn't address though is the report from a few
>> months ago that accessing small tables is still faster with an
>> index scan,
>> even if we know the whole thing is in cache (I don't remember if
>> that was
>> ever resolved...)
>
> I'm not sure if this is what you're referring to, but there was a
> relatively recent post on, I believe, -performance, where a bitmap
> index scan that hit almost the entire table beat out a seqscan. I
> don't think there was any further discussion and I'm still mystified
> as to how it's possible.

What I was thinking of was that when dealing with a very small table
(one or maybe a few pages), the planner thinks that a seqscan is the
fastest way to get a single row, but it's actually faster to use an
index. The bitmap case is even more interesting. Something is
seriously screwy with small seqscans it seems.
--
Decibel!, aka Jim C. Nasby, Database Architect decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: decibel <decibel(at)decibel(dot)org>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, jd(at)commandprompt(dot)com, Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>, Bernd Helmle <mailings(at)oopsware(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: The science of optimization in practical terms?
Date: 2009-02-21 02:15:16
Message-ID: 603c8f070902201815l21e71c43yc5340ac3f725cd01@mail.gmail.com
Lists: pgsql-hackers

On Fri, Feb 20, 2009 at 7:25 PM, decibel <decibel(at)decibel(dot)org> wrote:
> On Feb 17, 2009, at 11:23 PM, Robert Haas wrote:
>>>
>>> Actually, a simple algorithm that might work really well would be to
>>> calculate relation cache odds as ( number of page accesses for relation /
>>> number of page accesses for all relations ) * ( sum(relpages)*BLKSZ /
>>> eff_cache_size ), where number of page accesses would be both from
>>> relcache
>>> and not.
>>
>> I don't think that formula makes any sense. If effective_cache_size
>> is in the denominator, then increasing it will make the odds of
>> finding the page in cache go down.
>
> Yes, sorry... I got that part of the equation upside-down. It should be:
>
> ( number of page accesses for relation / number of page accesses for all
> relations ) * ( eff_cache_size / sum(relpages)*BLKSZ )

Well, that makes more sense, but it's still not right. Suppose I have
ten equal-sized relations whose total size is equal to
effective_cache_size. Relations 1-5 each get 15% of the page accesses
and relations 6-10 each get 5% of the page accesses. Under your
formula, relations 1-5 will be 150% in cache and relations 6-10 will
be 50% in cache. In reality, assuming sufficient frequency of access,
100% of all ten relations will be in cache.

I don't think there's any way to do this that doesn't involve some
sort of iterative process. What you need to do is compute (# of page
accesses for this relation / number of page accesses for all
relations) * effective_cache_size, dole out that amount of cache to it
(capped at 100% of the relation size), and then decrement
effective_cache_size by the amount of cache you doled out and the
number of page accesses by the number for that relation, and then
rerun for the second-most-popular relation.

For example, suppose (in the example above) that effective_cache_size
= 1GB and there are 10K page accesses total.

Relation 1: MIN(1.5K/10K * 1GB, 100MB) = MIN(150MB, 100MB) = 100MB
Relation 2: MIN(1.5K/8.5K * 900MB, 100MB) = MIN(159MB, 100MB) = 100MB
Relation 3: MIN(1.5K/7K * 800MB, 100MB) = MIN(171MB, 100MB) = 100MB
Relation 4: MIN(1.5K/5.5K * 700MB, 100MB) = MIN(190MB, 100MB) = 100MB
Relation 5: MIN(1.5K/4K * 600MB, 100MB) = MIN(225MB, 100MB) = 100MB
Relation 6: MIN(0.5K/2.5K * 500MB, 100MB) = MIN(100MB, 100MB) = 100MB
Relation 7: MIN(0.5K/2.0K * 400MB, 100MB) = MIN(100MB, 100MB) = 100MB
Relation 8: MIN(0.5K/1.5K * 300MB, 100MB) = MIN(100MB, 100MB) = 100MB
Relation 9: MIN(0.5K/1.0K * 200MB, 100MB) = MIN(100MB, 100MB) = 100MB
Relation 10: MIN(0.5K/0.5K * 100MB, 100MB) = MIN(100MB, 100MB) = 100MB

>>> One thing this doesn't address though is the report from a few
>>> months ago that accessing small tables is still faster with an index
>>> scan,
>>> even if we know the whole thing is in cache (I don't remember if that was
>>> ever resolved...)
>>
>> I'm not sure if this is what you're referring to, but there was a
>> relatively recent post on, I believe, -performance, where a bitmap
>> index scan that hit almost the entire table beat out a seqscan. I
>> don't think there was any further discussion and I'm still mystified
>> as to how it's possible.
>
> What I was thinking of was that when dealing with a very small table (one or
> maybe a few pages), the planner thinks that a seqscan is the fastest way to
> get a single row, but it's actually faster to use an index. The bitmap case
> is even more interesting. Something is seriously screwy with small seqscans
> it seems.

Do you have a good test case for this? I'd like to poke at it. It's
really just because the planner thinks that accessing the index pages
will cost a disk read, which is often false in practice. Does it help
if you set random_page_cost = seq_page_cost = 0.2?
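
If nobody has a handy real-world case, I'd probably start with a
contrived one along these lines (entirely made up, just to have
something to poke at):

CREATE TEMP TABLE tiny AS
    SELECT g AS id, md5(g::text) AS t
    FROM generate_series(1, 200) g;
CREATE INDEX tiny_id_idx ON tiny (id);
ANALYZE tiny;

EXPLAIN ANALYZE SELECT * FROM tiny WHERE id = 42;   -- planner's pick
SET enable_seqscan = off;
EXPLAIN ANALYZE SELECT * FROM tiny WHERE id = 42;   -- force the index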

The case I mentioned is qualitatively different because not only is
the planner wrong, but the observed behavior is somewhat inexplicable.
I have a feeling this may have to do with the fact that bitmap
indices can identify individual tuples on the page when the tbm is
non-lossy. Consulting the index (which is almost free if the page is
already in shared_buffers) not only finds the right page, but lets you
skip the CPU overhead of testing the quals against the irrelevant
tuples on that page. But we need to do some better testing here to
figure out what is really going on.

...Robert