Re: Better default_statistics_target

From: Greg Sabino Mullane <greg(at)turnstep(dot)com>
To: pgsql-patches(at)postgresql(dot)org
Subject: Better default_statistics_target
Date: 2007-11-14 04:38:08
Message-ID: 1195015088.24501.58.camel@greg-laptop
Lists: pgsql-hackers pgsql-patches

Per a recent bug in which the planner can behave very differently at targets
below 100, and accounting for the fact that ANALYZE is still plenty fast on
today's systems even at a tenfold increase, attached is a patch to change
default_statistics_target from 10 to 100.
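
For anyone who wants to try this without the patch, the same change can be
made at runtime; a quick sketch (the table and column names here are
invented):

  # in postgresql.conf, cluster-wide:
  default_statistics_target = 100

  -- or per session, before a manual run:
  SET default_statistics_target = 100;
  ANALYZE;

  -- or per column, overriding the default:
  ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 100;
  ANALYZE orders;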

Attachment Content-Type Size
better_default_stats.patch text/x-patch 3.7 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Sabino Mullane <greg(at)turnstep(dot)com>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: Better default_statistics_target
Date: 2007-11-14 05:00:06
Message-ID: 13872.1195016406@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Greg Sabino Mullane <greg(at)turnstep(dot)com> writes:
> Per a recent bug in which the planner can behave very differently at targets
> below 100, and accounting for the fact that ANALYZE is still plenty fast on
> today's systems even at a tenfold increase, attached is a patch to change
> default_statistics_target from 10 to 100.

This is not happening without a whole lot more evidence (as in, more
than zero) to back up the choice of value.

regards, tom lane


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Sabino Mullane <greg(at)turnstep(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Better default_statistics_target
Date: 2007-11-18 14:42:54
Message-ID: 1195396974.4217.13.camel@ebony.site
Lists: pgsql-hackers pgsql-patches

On Wed, 2007-11-14 at 00:00 -0500, Tom Lane wrote:
> Greg Sabino Mullane <greg(at)turnstep(dot)com> writes:
> > Per a recent bug in which the planner can behave very differently at targets
> > below 100, and accounting for the fact that ANALYZE is still plenty fast on
> > today's systems even at a tenfold increase, attached is a patch to change
> > default_statistics_target from 10 to 100.
>
> This is not happening without a whole lot more evidence (as in, more
> than zero) to back up the choice of value.

The choice of 100 is because of the way the LIKE estimator is
configured. Greg is not suggesting he measured it and found 100 to be
best, he is saying that the LIKE operator is hard-coded at 100 and so
the stats_target should reflect that.

Setting it to 100 for all columns because of LIKE doesn't make much
sense. I think we should set stats target differently depending upon the
data type, but that's probably an 8.4 thing. Long text fields that might
use LIKE should be set to 100. CHAR(1) and general fields should be set
to 10.
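
Per-column settings of that sort can already be expressed today; a sketch,
with an invented table:

  ALTER TABLE docs ALTER COLUMN body   SET STATISTICS 100; -- long text, a likely LIKE target
  ALTER TABLE docs ALTER COLUMN status SET STATISTICS 10;  -- CHAR(1) flag
  ANALYZE docs;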

Two thoughts:

- why did we pick 100 for the LIKE operator?

- should we document the better selectivity for LIKE operators at 100?

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Greg Sabino Mullane" <greg(at)turnstep(dot)com>, <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Better default_statistics_target
Date: 2007-11-19 08:15:17
Message-ID: 87ejemvega.fsf@oxford.xeocode.com
Lists: pgsql-hackers pgsql-patches

"Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:

> Long text fields that might use LIKE should be set to 100. CHAR(1) and
> general fields should be set to 10.

I could see arguing that either way. Longer fields are capable of more
precision and so may need more buckets to predict well. On the other hand,
longer fields take more space and take longer to compare, so to make
consistent use of resources you would want to avoid storing and comparing
large numbers of them, whereas you could afford much larger targets for
small, quick columns.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!


From: Matteo Beccati <php(at)beccati(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Sabino Mullane <greg(at)turnstep(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Better default_statistics_target
Date: 2007-11-21 16:30:37
Message-ID: 47445D2D.1060305@beccati.com
Lists: pgsql-hackers pgsql-patches

Hi,

> - why did we pick 100 for the LIKE operator?

http://archives.postgresql.org/pgsql-hackers/2006-09/msg01715.php

> - should we document the better selectivity for LIKE operators at 100?

Probably... if I'm the only one who knows about it ;)

--
Matteo Beccati

Openads - http://www.openads.org


From: "Greg Sabino Mullane" <greg(at)turnstep(dot)com>
To: pgsql-patches(at)postgresql(dot)org
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Better default_statistics_target
Date: 2007-12-05 14:26:17
Message-ID: dda58892d32fdbc418e1639a04f3c9b4@biglumber.com
Lists: pgsql-hackers pgsql-patches



Simon spoke:
> The choice of 100 is because of the way the LIKE estimator is
> configured. Greg is not suggesting he measured it and found 100 to be
> best, he is saying that the LIKE operator is hard-coded at 100 and so
> the stats_target should reflect that.

Exactly.

> Setting it to 100 for all columns because of LIKE doesn't make much
> sense. I think we should set stats target differently depending upon the
> data type, but that's probably an 8.4 thing. Long text fields that might
> use LIKE should be set to 100. CHAR(1) and general fields should be set
> to 10.

Agreed, this would be a nice 8.4 thing. But what about 8.3 and 8.2? Is
there a reason not to make this change? I know I've been lazy and not run
any absolute figures, but rough tests show that raising it (from 10 to
100) results in a very minor increase in analyze time, even for large
databases. I think the burden of a slightly slower analyze time, which
can be easily adjusted, both in postgresql.conf and right before running
an analyze, is very small compared to the pain of some queries - which worked
before - suddenly running much, much slower for no apparent reason at all.
Sure, 100 may have been chosen somewhat arbitrarily for the LIKE thing,
but this is a current real-world performance regression (aka a bug,
according to a nearby thread). Almost everyone agrees that 10 is too low,
so why not make it 100, throw a big warning in the release notes, and
then start some serious re-evaluation for 8.4?

--
Greg Sabino Mullane greg(at)turnstep(dot)com
End Point Corporation
PGP Key: 0x14964AC8 200712050920
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8



From: "Guillaume Smet" <guillaume(dot)smet(at)gmail(dot)com>
To: "Greg Sabino Mullane" <greg(at)turnstep(dot)com>
Cc: pgsql-patches(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Better default_statistics_target
Date: 2007-12-05 17:49:00
Message-ID: 1d4e0c10712050949q783d0084u10f1313b3750414a@mail.gmail.com
Lists: pgsql-hackers pgsql-patches

On Dec 5, 2007 3:26 PM, Greg Sabino Mullane <greg(at)turnstep(dot)com> wrote:
> Agreed, this would be a nice 8.4 thing. But what about 8.3 and 8.2? Is
> there a reason not to make this change? I know I've been lazy and not run
> any absolute figures, but rough tests show that raising it (from 10 to
> 100) results in a very minor increase in analyze time, even for large
> databases. I think the burden of a slightly slower analyze time, which
> can be easily adjusted, both in postgresql.conf and right before running
> an analyze, is very small compared to the pain of some queries - which worked
> before - suddenly running much, much slower for no apparent reason at all.

As Tom stated earlier, the ANALYZE slowdown is far from being the
only consequence. The planner will also have more work to do, and
that's the hard point IMHO.

Without studying the impacts of this change on a large set of queries
in different cases, it's quite hard to know for sure that it won't
have a negative impact in a lot of cases.

It's a bit too late in the cycle to change that IMHO, especially
without any numbers.

--
Guillaume


From: Chris Browne <cbbrowne(at)acm(dot)org>
To: pgsql-patches(at)postgresql(dot)org
Subject: Re: Better default_statistics_target
Date: 2007-12-05 20:13:48
Message-ID: 608x48yk5f.fsf@dba2.int.libertyrms.com
Lists: pgsql-hackers pgsql-patches

guillaume(dot)smet(at)gmail(dot)com ("Guillaume Smet") writes:
> On Dec 5, 2007 3:26 PM, Greg Sabino Mullane <greg(at)turnstep(dot)com> wrote:
>> Agreed, this would be a nice 8.4 thing. But what about 8.3 and 8.2? Is
>> there a reason not to make this change? I know I've been lazy and not run
>> any absolute figures, but rough tests show that raising it (from 10 to
>> 100) results in a very minor increase in analyze time, even for large
>> databases. I think the burden of a slightly slower analyze time, which
>> can be easily adjusted, both in postgresql.conf and right before running
>> an analyze, is very small compared to the pain of some queries - which worked
>> before - suddenly running much, much slower for no apparent reason at all.
>
> As Tom stated earlier, the ANALYZE slowdown is far from being the
> only consequence. The planner will also have more work to do, and
> that's the hard point IMHO.
>
> Without studying the impacts of this change on a large set of queries
> in different cases, it's quite hard to know for sure that it won't
> have a negative impact in a lot of cases.
>
> It's a bit too late in the cycle to change that IMHO, especially
> without any numbers.

I have the theory (thus far not borne out by any numbers) that it
might be a useful approach to try to go through the DB schema and use
what information is there to try to come up with better numbers on a
per-column basis.

As a "first order" perspective on things:

- Any columns marked "unique" could keep to having somewhat smaller
numbers of bins in the histogram because we know that uniqueness
will keep values dispersed at least somewhat.

Ditto for "SERIAL" types.

- Columns NOT marked unique should imply adding some bins to the
histogram.

- Datestamps tend to imply temporal dispersion, ergo "somewhat fewer
bins." Similar for floats.

- Discrete values (integer, text) frequently see less dispersion,
-> "more bins"

Then could come a "second order" perspective, where data would
actually get sampled from pg_statistic.

- If we look at the number of distinct histogram bins used, for a
particular column, and find that there are some not used, we might
drop bins.

- We might try doing some summary statistics to see how many unique
values there actually are, on each column, and increase the number of
bins if they're all in use, and there are other values that *are*
frequently used.

Maybe cheaper, if we find that pg_statistic tells us that all bins
are in use, and extrapolation shows that there's a lot of the table
NOT represented, we increase the number of bins.

There might even be a "third order" analysis, where you'd try to
collect additional data from the table, and analytically try to
determine appropriate numbers of bins...

Thus, we don't have a universal increase in the amount of statistics
collected - the added stats are localized to places where there is
some reason to imagine them useful.
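
As a very rough sketch, the "second order" pass could start from the
pg_stats view; the thresholds here are invented:

  SELECT schemaname, tablename, attname, n_distinct,
         CASE WHEN n_distinct < 0   THEN 10   -- scales with row count: unique-ish
              WHEN n_distinct <= 10 THEN 10   -- near-boolean: could shrink
              ELSE 100                        -- many repeated values: could widen
         END AS suggested_target
    FROM pg_stats
   WHERE schemaname NOT IN ('pg_catalog', 'information_schema');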
--
let name="cbbrowne" and tld="acm.org" in String.concat "@" [name;tld];;
http://cbbrowne.com/info/nonrdbms.html
There was a young lady of Crewe
Whose limericks stopped at line two.


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Chris Browne" <cbbrowne(at)acm(dot)org>
Cc: <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Better default_statistics_target
Date: 2007-12-06 00:00:30
Message-ID: 878x484rq9.fsf@oxford.xeocode.com
Lists: pgsql-hackers pgsql-patches


"Chris Browne" <cbbrowne(at)acm(dot)org> writes:

> - Any columns marked "unique" could keep to having somewhat smaller
> numbers of bins in the histogram because we know that uniqueness
> will keep values dispersed at least somewhat.

I think you're on the wrong track. It's not dispersal that's significant but
how evenly the values are dispersed. If the values are evenly spread
throughout the region from low to high bound then we just need the single
bucket telling us the low and high bound and how many values there are. If
they're unevenly distributed then we need enough buckets to be able to
distinguish the dense areas from the sparse areas.

Perhaps something like starting with 1 bucket, splitting it into 2, and
seeing if the distributions are similar, in which case we stop; if not,
repeat for each bucket.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!


From: Decibel! <decibel(at)decibel(dot)org>
To: Guillaume Smet <guillaume(dot)smet(at)gmail(dot)com>
Cc: Greg Sabino Mullane <greg(at)turnstep(dot)com>, pgsql-patches(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] Better default_statistics_target
Date: 2007-12-06 18:28:13
Message-ID: 20071206182812.GY59974@decibel.org
Lists: pgsql-hackers pgsql-patches

On Wed, Dec 05, 2007 at 06:49:00PM +0100, Guillaume Smet wrote:
> On Dec 5, 2007 3:26 PM, Greg Sabino Mullane <greg(at)turnstep(dot)com> wrote:
> > Agreed, this would be a nice 8.4 thing. But what about 8.3 and 8.2? Is
> > there a reason not to make this change? I know I've been lazy and not run
> > any absolute figures, but rough tests show that raising it (from 10 to
> > 100) results in a very minor increase in analyze time, even for large
> > databases. I think the burden of a slightly slower analyze time, which
> > can be easily adjusted, both in postgresql.conf and right before running
> > an analyze, is very small compared to the pain of some queries - which worked
> > before - suddenly running much, much slower for no apparent reason at all.
>
> As Tom stated earlier, the ANALYZE slowdown is far from being the
> only consequence. The planner will also have more work to do, and
> that's the hard point IMHO.

How much more? Doesn't it now use a binary search? If so, ISTM that
going from 10 to 100 would at worst double the time spent finding the
bucket we need. Considering that we're talking something that takes
microseconds, and that there's a huge penalty to be paid if you have bad
stats estimates, that doesn't seem that big a deal. And on modern
machines it's not like the additional space in the catalogs is going to
kill us.
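(Back-of-envelope: a binary search over n histogram bounds costs about
log2(n) probes, and log2(100)/log2(10) = 2, hence "double".)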

FWIW, I've never seen anything but a performance increase or no change
when going from 10 to 100. In most cases there's a noticeable
improvement since it's common to have over 100k rows in a table, and
there's just no way to capture any kind of a real picture of that with
only 10 buckets.
--
Decibel!, aka Jim C. Nasby, Database Architect decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Chris Browne <cbbrowne(at)acm(dot)org>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: Better default_statistics_target
Date: 2007-12-06 19:34:42
Message-ID: 1196969682.4255.455.camel@ebony.site
Lists: pgsql-hackers pgsql-patches

On Wed, 2007-12-05 at 15:13 -0500, Chris Browne wrote:

> I have the theory (thus far not borne out by any numbers) that it
> might be a useful approach to try to go through the DB schema and use
> what information is there to try to come up with better numbers on a
> per-column basis.

Yeh, agreed.

The difficulty is making this work for generic datatypes.

> - Datestamps tend to imply temporal dispersion, ergo "somewhat fewer
> bins." Similar for floats.

Hmmm, not sure about that one. Some date/time columns can change very
quickly over time, so the stats are frequently out of date.

> Then could come a "second order" perspective, where data would
> actually get sampled from pg_statistic.
>
> - If we look at the number of distinct histogram bins used, for a
> particular column, and find that there are some not used, we might
> drop bins.

The histograms are height-balanced, so all the bins are always used.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


From: "Christopher Browne" <cbbrowne(at)gmail(dot)com>
To: Decibel! <decibel(at)decibel(dot)org>
Cc: "Guillaume Smet" <guillaume(dot)smet(at)gmail(dot)com>, "Greg Sabino Mullane" <greg(at)turnstep(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] Better default_statistics_target
Date: 2008-01-28 23:14:05
Message-ID: d6d6637f0801281514v18c5119cwa375774f100760d8@mail.gmail.com
Lists: pgsql-hackers pgsql-patches

On Dec 6, 2007 6:28 PM, Decibel! <decibel(at)decibel(dot)org> wrote:
> FWIW, I've never seen anything but a performance increase or no change
> when going from 10 to 100. In most cases there's a noticeable
> improvement since it's common to have over 100k rows in a table, and
> there's just no way to capture any kind of a real picture of that with
> only 10 buckets.

I'd be more inclined to try to do something that was at least somewhat
data aware.

The "interesting theory" that I'd like to verify if I had a chance
would be to run through a by-column tuning using a set of heuristics.
My "first order approximation" would be:

- If a column defines a unique key, then we know there will be no
clustering of values, so no need to increase the count...

- If a column contains a datestamp, then the distribution of values is
likely to be temporal, so no need to increase the count...

- If a column has a highly constricted set of values (e.g. - boolean),
then we might *decrease* the count.

- We might run a query that runs across the table, looking at
frequencies of values, and if it finds a lot of repeated values, we'd
increase the count.

That's a bit "hand-wavy," but that could lead to both increases and
decreases in the histogram sizes. Given that, we can expect the
overall stat sizes to not forcibly need to grow *enormously*, because
we can hope for there to be cases of shrinkage.

--
http://linuxfinances.info/info/linuxdistributions.html
"The definition of insanity is doing the same thing over and over and
expecting different results." -- assortedly attributed to Albert
Einstein, Benjamin Franklin, Rita Mae Brown, and Rudyard Kipling


From: Decibel! <decibel(at)decibel(dot)org>
To: Christopher Browne <cbbrowne(at)gmail(dot)com>
Cc: Guillaume Smet <guillaume(dot)smet(at)gmail(dot)com>, Greg Sabino Mullane <greg(at)turnstep(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] Better default_statistics_target
Date: 2008-01-30 22:58:48
Message-ID: 20080130225848.GE1212@decibel.org
Lists: pgsql-hackers pgsql-patches

On Mon, Jan 28, 2008 at 11:14:05PM +0000, Christopher Browne wrote:
> On Dec 6, 2007 6:28 PM, Decibel! <decibel(at)decibel(dot)org> wrote:
> > FWIW, I've never seen anything but a performance increase or no change
> > when going from 10 to 100. In most cases there's a noticeable
> > improvement since it's common to have over 100k rows in a table, and
> > there's just no way to capture any kind of a real picture of that with
> > only 10 buckets.
>
> I'd be more inclined to try to do something that was at least somewhat
> data aware.
>
> The "interesting theory" that I'd like to verify if I had a chance
> would be to run through a by-column tuning using a set of heuristics.
> My "first order approximation" would be:
>
> - If a column defines a unique key, then we know there will be no
> clustering of values, so no need to increase the count...
>
> - If a column contains a datestamp, then the distribution of values is
> likely to be temporal, so no need to increase the count...
>
> - If a column has a highly constricted set of values (e.g. - boolean),
> then we might *decrease* the count.
>
> - We might run a query that runs across the table, looking at
> frequencies of values, and if it finds a lot of repeated values, we'd
> increase the count.
>
> That's a bit "hand-wavy," but it could lead to both increases and
> decreases in the histogram sizes. Given that, we can expect that the
> overall stats sizes won't necessarily grow *enormously*, because we
> can hope for there to be cases of shrinkage.

I think that before doing any of that you'd be much better off
investigating how much performance penalty there is for maxing out
default_statistics_target. If, as I suspect, it's essentially 0 on
modern hardware, then I don't think it's worth any more effort.

BTW, that investigation wouldn't just be academic either; if we could
convince ourselves that there normally wasn't any cost associated with a
high default_statistics_target, we could increase the default, which
would reduce the amount of traffic we'd see on -performance about bad
query plans.
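
A crude way to measure it from psql, with invented table names (EXPLAIN
without executing is mostly planning time):

  \timing on
  SET default_statistics_target = 10;
  ANALYZE;
  EXPLAIN SELECT * FROM orders o JOIN customers c ON c.id = o.customer_id;
  SET default_statistics_target = 1000;
  ANALYZE;
  EXPLAIN SELECT * FROM orders o JOIN customers c ON c.id = o.customer_id;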
--
Decibel!, aka Jim C. Nasby, Database Architect decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Decibel!" <decibel(at)decibel(dot)org>
Cc: "Christopher Browne" <cbbrowne(at)gmail(dot)com>, "Guillaume Smet" <guillaume(dot)smet(at)gmail(dot)com>, "Greg Sabino Mullane" <greg(at)turnstep(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCHES] Better default_statistics_target
Date: 2008-01-30 23:08:45
Message-ID: 873asevrlu.fsf@oxford.xeocode.com
Lists: pgsql-hackers pgsql-patches

"Decibel!" <decibel(at)decibel(dot)org> writes:

> I think that before doing any of that you'd be much better off
> investigating how much performance penalty there is for maxing out
> default_statistics_target. If, as I suspect, it's essentially 0 on
> modern hardware, then I don't think it's worth any more effort.

That's not my experience. Even just raising it to 100 multiplies the number of
rows ANALYZE has to read by 10. And the arrays for every column become ten
times larger. Eventually they start being toasted...
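(ANALYZE's sample size is roughly 300 * stats_target rows per table, so
~3,000 rows at the default of 10 versus ~30,000 at 100.)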

> BTW, that investigation wouldn't just be academic either; if we could
> convince ourselves that there normally wasn't any cost associated with a
> high default_statistics_target, we could increase the default, which
> would reduce the amount of traffic we'd see on -performance about bad
> query plans.

I suspect we could raise it; we just don't know by how much.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!


From: "Guillaume Smet" <guillaume(dot)smet(at)gmail(dot)com>
To: "Gregory Stark" <stark(at)enterprisedb(dot)com>
Cc: Decibel! <decibel(at)decibel(dot)org>, "Christopher Browne" <cbbrowne(at)gmail(dot)com>, "Greg Sabino Mullane" <greg(at)turnstep(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] Better default_statistics_target
Date: 2008-01-30 23:36:32
Message-ID: 1d4e0c10801301536n818fac7kd0a05a7060db131e@mail.gmail.com
Lists: pgsql-hackers pgsql-patches

On Jan 31, 2008 12:08 AM, Gregory Stark <stark(at)enterprisedb(dot)com> wrote:
> "Decibel!" <decibel(at)decibel(dot)org> writes:
>
> > I think that before doing any of that you'd be much better off
> > investigating how much performance penalty there is for maxing out
> > default_statistics_target. If, as I suspect, it's essentially 0 on
> > modern hardware, then I don't think it's worth any more effort.
>
> That's not my experience. Even just raising it to 100 multiplies the number of
> rows ANALYZE has to read by 10. And the arrays for every column become ten
> times larger. Eventually they start being toasted...

+1. From the tests I did on our new server, I set
default_statistics_target to 30. Those tests were mainly based on the
ANALYZE time though, not the planner overhead introduced by larger
statistics - with higher values, I considered the ANALYZE time too
high for the benefits. I set it higher on a per-column basis only where
I can see it leads to better stats, but from all the tests I did so far,
30 was sufficient for our data set.

--
Guillaume


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Guillaume Smet" <guillaume(dot)smet(at)gmail(dot)com>
Cc: "Gregory Stark" <stark(at)enterprisedb(dot)com>, Decibel! <decibel(at)decibel(dot)org>, "Christopher Browne" <cbbrowne(at)gmail(dot)com>, "Greg Sabino Mullane" <greg(at)turnstep(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] Better default_statistics_target
Date: 2008-01-31 00:19:53
Message-ID: 27192.1201738793@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

"Guillaume Smet" <guillaume(dot)smet(at)gmail(dot)com> writes:
> On Jan 31, 2008 12:08 AM, Gregory Stark <stark(at)enterprisedb(dot)com> wrote:
>> That's not my experience. Even just raising it to 100 multiplies the number of
>> rows ANALYZE has to read by 10. And the arrays for every column become ten
>> times larger. Eventually they start being toasted...

> +1. From the tests I did on our new server, I set
> default_statistics_target to 30. Those tests were mainly based on the
> ANALYZE time though, not the planner overhead introduced by larger
> statistics - with higher values, I considered the ANALYZE time too
> high for the benefits.

eqjoinsel(), for one, is O(N^2) in the number of MCV values kept.
Possibly this could be improved, but in general I'd be real wary
of pushing the default to the moon without some explicit testing of
the impact on planning time.
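(Concretely: with 100 MCVs kept on each side of a join, that's up to
100 * 100 = 10,000 operator invocations per join clause, versus
10 * 10 = 100 at the current default.)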

regards, tom lane


From: "Christopher Browne" <cbbrowne(at)gmail(dot)com>
To: Decibel! <decibel(at)decibel(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] Better default_statistics_target
Date: 2008-01-31 02:13:37
Message-ID: d6d6637f0801301813n64fa58eu76385cf8a621907@mail.gmail.com
Lists: pgsql-hackers pgsql-patches

On Jan 30, 2008 5:58 PM, Decibel! <decibel(at)decibel(dot)org> wrote:
>
> On Mon, Jan 28, 2008 at 11:14:05PM +0000, Christopher Browne wrote:
> > On Dec 6, 2007 6:28 PM, Decibel! <decibel(at)decibel(dot)org> wrote:
> > > FWIW, I've never seen anything but a performance increase or no change
> > > when going from 10 to 100. In most cases there's a noticeable
> > > improvement since it's common to have over 100k rows in a table, and
> > > there's just no way to capture any kind of a real picture of that with
> > > only 10 buckets.
> >
> > I'd be more inclined to try to do something that was at least somewhat
> > data aware.
> >
> > The "interesting theory" that I'd like to verify if I had a chance
> > would be to run through a by-column tuning using a set of heuristics.
> > My "first order approximation" would be:
> >
> > - If a column defines a unique key, then we know there will be no
> > clustering of values, so no need to increase the count...
> >
> > - If a column contains a datestamp, then the distribution of values is
> > likely to be temporal, so no need to increase the count...
> >
> > - If a column has a highly constricted set of values (e.g. - boolean),
> > then we might *decrease* the count.
> >
> > - We might run a query that runs across the table, looking at
> > frequencies of values, and if it finds a lot of repeated values, we'd
> > increase the count.
> >
> > That's a bit "hand-wavy," but it could lead to both increases and
> > decreases in the histogram sizes. Given that, we can expect that the
> > overall stats sizes won't necessarily grow *enormously*, because we
> > can hope for there to be cases of shrinkage.
>
> I think that before doing any of that you'd be much better off
> investigating how much performance penalty there is for maxing out
> default_statistics_target. If, as I suspect, it's essentially 0 on
> modern hardware, then I don't think it's worth any more effort.
>
> BTW, that investigation wouldn't just be academic either; if we could
> convince ourselves that there normally wasn't any cost associated with a
> high default_statistics_target, we could increase the default, which
> would reduce the amount of traffic we'd see on -performance about bad
> query plans.

There seems to be *plenty* of evidence out there that the performance
penalty would NOT be "essentially zero."

Tom points out:
eqjoinsel(), for one, is O(N^2) in the number of MCV values kept.

It seems to me that there are cases where we can *REDUCE* the
histogram width, and if we do that, and then pick and choose the
columns where the width increases, the performance penalty may be
"yea, verily *actually* 0."

This fits somewhat with Simon Riggs' discussion earlier in the month
about Segment Exclusion; these both represent cases where it is quite
likely that there is emergent data in our tables that can help us to
better optimize our queries.
--
http://linuxfinances.info/info/linuxdistributions.html
"The definition of insanity is doing the same thing over and over and
expecting different results." -- assortedly attributed to Albert
Einstein, Benjamin Franklin, Rita Mae Brown, and Rudyard Kipling


From: Decibel! <decibel(at)decibel(dot)org>
To: Christopher Browne <cbbrowne(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] Better default_statistics_target
Date: 2008-01-31 07:28:24
Message-ID: 20080131072824.GG1212@decibel.org
Lists: pgsql-hackers pgsql-patches

On Wed, Jan 30, 2008 at 09:13:37PM -0500, Christopher Browne wrote:
> There seems to be *plenty* of evidence out there that the performance
> penalty would NOT be "essentially zero."
>
> Tom points out:
> eqjoinsel(), for one, is O(N^2) in the number of MCV values kept.
>
> It seems to me that there are cases where we can *REDUCE* the
> histogram width, and if we do that, and then pick and choose the
> columns where the width increases, the performance penalty may be
> "yea, verily *actually* 0."
>
> This fits somewhat with Simon Riggs' discussion earlier in the month
> about Segment Exclusion; these both represent cases where it is quite
> likely that there is emergent data in our tables that can help us to
> better optimize our queries.

This is all still hand-waving until someone actually measures what the
impact of the stats target is on planner time. I would suggest actually
measuring that before trying to invent more machinery. Besides, I think
you'll need that data for the machinery to make an intelligent decision
anyway...

BTW, with autovacuum I don't really see why we should care about how
long ANALYZE takes, though perhaps it should have a throttle à la
vacuum_cost_delay.
--
Decibel!, aka Jim C. Nasby, Database Architect decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Decibel! <decibel(at)decibel(dot)org>
Cc: Christopher Browne <cbbrowne(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] Better default_statistics_target
Date: 2008-01-31 12:31:45
Message-ID: 20080131123145.GF5145@alvh.no-ip.org
Lists: pgsql-hackers pgsql-patches

Decibel! wrote:

> BTW, with autovacuum I don't really see why we should care about how
> long ANALYZE takes, though perhaps it should have a throttle à la
> vacuum_cost_delay.

Analyze already has vacuum delay points (i.e. it is already throttled).

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Decibel!" <decibel(at)decibel(dot)org>, "Christopher Browne" <cbbrowne(at)gmail(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCHES] Better default_statistics_target
Date: 2008-01-31 14:55:35
Message-ID: 47A18D07.EE98.0025.0@wicourts.gov
Lists: pgsql-hackers pgsql-patches

>>> On Wed, Jan 30, 2008 at 8:13 PM, in message
<d6d6637f0801301813n64fa58eu76385cf8a621907(at)mail(dot)gmail(dot)com>, "Christopher
Browne" <cbbrowne(at)gmail(dot)com> wrote:

> There seems to be *plenty* of evidence out there that the performance
> penalty would NOT be "essentially zero."

I can confirm that I have had performance tank because of boosting
the statistics target for selected columns. It appeared to be time
spent in the planning phase, not a bad plan choice. Reducing the
numbers restored decent performance.

-Kevin


From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: "Decibel!" <decibel(at)decibel(dot)org>, "Christopher Browne" <cbbrowne(at)gmail(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCHES] Better default_statistics_target
Date: 2008-01-31 16:55:48
Message-ID: 47A1FD94.2070109@enterprisedb.com
Lists: pgsql-hackers pgsql-patches

Kevin Grittner wrote:
>>>> On Wed, Jan 30, 2008 at 8:13 PM, in message
> <d6d6637f0801301813n64fa58eu76385cf8a621907(at)mail(dot)gmail(dot)com>, "Christopher
> Browne" <cbbrowne(at)gmail(dot)com> wrote:
>
>> There seems to be *plenty* of evidence out there that the performance
>> penalty would NOT be "essentially zero."
>
> I can confirm that I have had performance tank because of boosting
> the statistics target for selected columns. It appeared to be time
> spent in the planning phase, not a bad plan choice. Reducing the
> numbers restored decent performance.

One idea I've been thinking about is to add a step after the analyze, to
look at the statistics that were gathered. If it looks like the
distribution is pretty flat, reduce the data to a smaller set before
storing it in pg_statistic.

You would still get the hit of longer ANALYZE time, but at least you
would avoid the hit on query performance where the higher statistics are
not helpful. We could also print an INFO line along the lines of "you
might as well lower the statistics target for this table, because it's
not helping".

No, I don't know how to determine when you could reduce the data, or how
to reduce it...
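
For what it's worth, such a step would have the gathered distribution
available in pg_stats form anyway; e.g., with an invented table name:

  SELECT attname, n_distinct, most_common_vals, most_common_freqs,
         histogram_bounds
    FROM pg_stats
   WHERE tablename = 'orders';

If most_common_freqs is nearly uniform and histogram_bounds are close to
evenly spaced, the extra entries are probably not buying anything.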

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Cc: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>, "Decibel!" <decibel(at)decibel(dot)org>, "Christopher Browne" <cbbrowne(at)gmail(dot)com>
Subject: Re: [PATCHES] Better default_statistics_target
Date: 2008-02-01 04:19:59
Message-ID: 200801312319.59723.xzilla@users.sourceforge.net
Lists: pgsql-hackers pgsql-patches

On Thursday 31 January 2008 09:55, Kevin Grittner wrote:
> >>> On Wed, Jan 30, 2008 at 8:13 PM, in message
> <d6d6637f0801301813n64fa58eu76385cf8a621907(at)mail(dot)gmail(dot)com>,
> "Christopher Browne" <cbbrowne(at)gmail(dot)com> wrote:
> > There seems to be *plenty* of evidence out there that the performance
> > penalty would NOT be "essentially zero."
>
> I can confirm that I have had performance tank because of boosting
> the statistics target for selected columns. It appeared to be time
> spent in the planning phase, not a bad plan choice. Reducing the
> numbers restored decent performance.
>

Bad plans from boosting to 100 or less? Or something much higher?

--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: <pgsql-hackers(at)postgresql(dot)org>, "Robert Treat" <xzilla(at)users(dot)sourceforge(dot)net>
Cc: "Decibel!" <decibel(at)decibel(dot)org>, "Christopher Browne" <cbbrowne(at)gmail(dot)com>
Subject: Re: [PATCHES] Better default_statistics_target
Date: 2008-02-01 15:18:00
Message-ID: 47A2E3C8.EE98.0025.0@wicourts.gov
Lists: pgsql-hackers pgsql-patches

>>> On Thu, Jan 31, 2008 at 10:19 PM, in message
<200801312319(dot)59723(dot)xzilla(at)users(dot)sourceforge(dot)net>, Robert Treat
<xzilla(at)users(dot)sourceforge(dot)net> wrote:
> On Thursday 31 January 2008 09:55, Kevin Grittner wrote:
>>
>> I can confirm that I have had performance tank because of boosting
>> the statistics target for selected columns. It appeared to be time
>> spent in the planning phase, not a bad plan choice. Reducing the
>> numbers restored decent performance.
>
> Bad plans from boosting to 100 or less? Or something much higher?

I boosted the statistics targets on a large number of columns, based
on their domains. County number columns (present in most tables) were
set to 80. Some columns were set all the way to 1000. When performance
tanked, we didn't have time to experiment, so we just backed it all out.
Perhaps I could do some more controlled testing soon against 8.3,
to narrow it down and confirm the current status of the issue. I
do seem to recall that simple queries weren't suffering, it was
those which joined many tables which had multiple indexes.
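
For reference, what we ran was along these lines (the real table and
column names differ):

  ALTER TABLE cases ALTER COLUMN county_no SET STATISTICS 80;
  ANALYZE cases;
  -- backing it out: -1 reverts the column to default_statistics_target
  ALTER TABLE cases ALTER COLUMN county_no SET STATISTICS -1;
  ANALYZE cases;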

-Kevin


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: pgsql-hackers(at)postgresql(dot)org, "Robert Treat" <xzilla(at)users(dot)sourceforge(dot)net>, "Decibel!" <decibel(at)decibel(dot)org>, "Christopher Browne" <cbbrowne(at)gmail(dot)com>
Subject: Re: [PATCHES] Better default_statistics_target
Date: 2008-02-01 15:50:16
Message-ID: 24107.1201881016@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
> On Thu, Jan 31, 2008 at 10:19 PM, in message
> <200801312319(dot)59723(dot)xzilla(at)users(dot)sourceforge(dot)net>, Robert Treat
> <xzilla(at)users(dot)sourceforge(dot)net> wrote:
>> Bad plans from boosting to 100 or less? Or something much higher?

> I boosted the statistics targets on a large number of columns, based
> on their domains. County number columns (present in most tables) were
> set to 80. Some columns were set all the way to 1000. When performance
> tanked, we didn't have time to experiment, so we just backed it all out.
> Perhaps I could do some more controlled testing soon against 8.3,
> to narrow it down and confirm the current status of the issue. I
> do seem to recall that simple queries weren't suffering, it was
> those which joined many tables which had multiple indexes.

That fits with the idea that eqjoinsel() is a main culprit.

regards, tom lane


From: "Greg Sabino Mullane" <greg(at)turnstep(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: Better default_statistics_target
Date: 2008-02-03 18:52:33
Message-ID: ef95a1e9cc91b61000e72c2b941b0312@biglumber.com
Lists: pgsql-hackers pgsql-patches



> As Tom stated earlier, the ANALYZE slowdown is far from being the
> only consequence. The planner will also have more work to do, and
> that's the hard point IMHO.
>
> Without studying the impacts of this change on a large set of queries
> in different cases, it's quite hard to know for sure that it won't
> have a negative impact in a lot of cases.
>
> It's a bit too late in the cycle to change that IMHO, especially
> without any numbers.

The decision to add the magic "99/100" number was made without any
such analysis either, and I can assure you it has caused lots of real-world
problems. Going from 10 to 100 adds a small amount of planner overhead. The
99/100 change makes an order-of-magnitude speed difference for SELECT queries.
I still cannot see that as anything other than a major performance regression.

--
Greg Sabino Mullane greg(at)turnstep(dot)com
PGP Key: 0x14964AC8 200802032259
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8