Maximum statistics target

From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Maximum statistics target
Date: 2008-03-07 18:25:25
Message-ID: 200803071925.26532.peter_e@gmx.net
Lists: pgsql-hackers

Related to the concurrent discussion about selectivity estimations ...

What is the reason the statistics target is limited to 1000? I've seen more
than one case where increasing the statistics target to 1000 improved results
and one would have wanted to increase it further.

What's the problem with setting it to ten million if I have ten million values
in the table and I am prepared to spend the resources to maintain those
statistics?
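
For reference, a minimal sketch of how the target is raised today (table and
column names are only illustrative); at the time of writing both forms stop at
1000, larger values being capped or rejected:

-- per column
ALTER TABLE mytable ALTER COLUMN myfield SET STATISTICS 1000;
ANALYZE mytable;

-- or as the default for columns without a per-column setting
SET default_statistics_target = 1000;
ANALYZE mytable;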


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Maximum statistics target
Date: 2008-03-07 18:41:57
Message-ID: 20080307184157.GA23644@svana.org
Lists: pgsql-hackers

On Fri, Mar 07, 2008 at 07:25:25PM +0100, Peter Eisentraut wrote:
> What's the problem with setting it to ten million if I have ten million values
> in the table and I am prepared to spend the resources to maintain those
> statistics?

That it'll probably take 10 million seconds to calculate the plans
using it? I think Tom pointed out that there are a few places that are
O(n^2) in the number of entries...

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Maximum statistics target
Date: 2008-03-07 21:48:45
Message-ID: 28138.1204926525@sss.pgh.pa.us
Lists: pgsql-hackers

Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> On Fri, Mar 07, 2008 at 07:25:25PM +0100, Peter Eisentraut wrote:
>> What's the problem with setting it to ten million if I have ten million values
>> in the table and I am prepared to spend the resources to maintain those
>> statistics?

> That it'll probably take 10 million seconds to calculate the plans
> using it? I think Tom pointed out that there are a few places that are
> O(n^2) in the number of entries...

I'm not wedded to the number 1000 in particular --- obviously that's
just a round number. But it would be good to see some performance tests
with larger settings before deciding that we don't need a limit.

IIRC, eqjoinsel is one of the weak spots, so tests involving planning of
joins between two tables with large MCV lists would be a good place to
start.
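
A sketch of such a test, with illustrative names; the point is to use a
skewed, non-unique distribution so that both join columns actually end up
with full MCV lists:

CREATE TABLE t1 AS
  SELECT (random() * random() * 10000)::int AS a
  FROM generate_series(1, 1000000);
CREATE TABLE t2 AS
  SELECT (random() * random() * 10000)::int AS b
  FROM generate_series(1, 1000000);

ALTER TABLE t1 ALTER COLUMN a SET STATISTICS 1000;
ALTER TABLE t2 ALTER COLUMN b SET STATISTICS 1000;
ANALYZE t1;
ANALYZE t2;

-- planning time, not execution time, is what matters, so EXPLAIN is enough
\timing on
EXPLAIN SELECT * FROM t1, t2 WHERE t1.a = t2.b;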

regards, tom lane


From: "Stephen Denne" <Stephen(dot)Denne(at)datamail(dot)co(dot)nz>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Martijn van Oosterhout" <kleptog(at)svana(dot)org>
Cc: "Peter Eisentraut" <peter_e(at)gmx(dot)net>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Maximum statistics target
Date: 2008-03-09 21:24:18
Message-ID: F0238EBA67824444BC1CB4700960CB4804D2F049@dmpeints002.isotach.com
Lists: pgsql-hackers

Tom Lane wrote:
> Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> > On Fri, Mar 07, 2008 at 07:25:25PM +0100, Peter Eisentraut wrote:
> >> What's the problem with setting it to ten million if I have ten million values
> >> in the table and I am prepared to spend the resources to maintain those
> >> statistics?
>
> > That it'll probably take 10 million seconds to calculate the plans
> > using it? I think Tom pointed out that there are a few places that are
> > O(n^2) in the number of entries...
>
> I'm not wedded to the number 1000 in particular --- obviously that's
> just a round number. But it would be good to see some performance tests
> with larger settings before deciding that we don't need a limit.

I recently encountered a situation where I would have liked to be able to try a larger limit (amongst other ideas for improving my situation):

I have a field whose distribution of frequencies of values is roughly geometric, rather than flat.
Total rows = 36 million
relpages=504864
Distinct field values in use = 169
10 values account for 50% of the rows.
41 values account for 90% of the rows.

After setting the statistics target to 1000 for that field and analyzing the table, the statistics row for that field had 75 most frequent values and a histogram with 77 entries in it, estimating 152 values in total.
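
For anyone wanting to look at the same numbers, they can be read back from
pg_stats (the mytable/myfield names here are placeholders):

SELECT n_distinct, most_common_vals, most_common_freqs, histogram_bounds
FROM pg_stats
WHERE tablename = 'mytable' AND attname = 'myfield';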

For this situation, using a larger statistics target should result in more pages being read and more accurate statistics. It shouldn't result in significantly more work for the planner.

It wouldn't solve my problem though, which is frequent over-estimation of rows when restricting by this field with values not known at plan time.

Regards,
Stephen Denne.



From: "Stephen Denne" <Stephen(dot)Denne(at)datamail(dot)co(dot)nz>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: Estimating geometric distributions
Date: 2008-03-09 22:41:16
Message-ID: F0238EBA67824444BC1CB4700960CB4804D2F0EE@dmpeints002.isotach.com
Lists: pgsql-hackers

I wrote:
> I have a field whose distribution of frequencies of values is
> roughly geometric, rather than flat.
> Total rows = 36 million
> relpages=504864
> Distinct field values in use = 169
> 10 values account for 50% of the rows.
> 41 values account for 90% of the rows.
>
> After setting statistics target to 1000 for that field, and
> analyzing the table, the statistics row for that field had 75
> most frequent values and a histogram with 77 entries in it.
> Estimating 152 values in total.

"public";"mytable";"myfield";0;4;152;"{202,179,8,181,173,207,6,118,107,205,182,4,54,247,168,77,169,53,120,159,149,174,167,156,148,150,56,108,66,119,5,99,96,175,97,208,1,130,10,102,228,101,121,50,11,152,32,12,78,221,55,244,241,252,203,116,103,184,154,153,238,65,49,220,83,98,111,85,139,242,240,260,7,109,114}";"{0.0836433,0.0781667,0.0738367,0.0598533,0.04629,0.04447,0.0359833,0.0314267,0.0278333,0.0268,0.0251433,0.0244867,0.02438,0.0223433,0.0207567,0.0189667,0.0168833,0.01582,0.0150267,0.0141767,0.0130467,0.0128933,0.0125767,0.0123567,0.0116567,0.0114967,0.01048,0.01037,0.00994667,0.00987667,0.00977667,0.00965333,0.00916333,0.00828667,0.00732667,0.00712,0.00629,0.00624,0.00576667,0.00558667,0.00477667,0.00475333,0.00410333,0.00405667,0.00371667,0.00334667,0.00334,0.00312667,0.00312667,0.00302,0.00300333,0.00295,0.00287333,0.00271,0.00267,0.00240667,0.00224,0.00221333,0.00215333,0.0021,0.00205667,0.00202667,0.00197333,0.00197333,0.00168667,0.00166,0.00159333,0.00159,0.00154667,0.00150333,0.00149,0.00133333,0.00132,0.00112667,0.00104}";"{2,9,9,9,67,76,84,84,86,87,87,88,95,100,100,100,104,105,105,110,112,112,128,137,137,138,143,144,144,144,151,155,155,155,157,157,158,171,171,183,185,185,185,185,187,194,199,199,200,200,201,204,204,209,209,214,214,214,214,215,217,225,225,225,229,239,239,249,250,250,253,253,255,257,261,262,266}";0.449246

My problem is frequent over-estimation of rows when restricting by this field
with values not known at plan time.

examples:
select * from mytable where myfield = ?;
select * from mytable where myfield in (subquery);

My arithmetic mean of the frequencies is 214200.
My geometric mean is 13444.

However, ANALYZE didn't find all my values and thinks there are only 152 of them, so it uses a mean of 238046.
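
For what it's worth, both means can be computed directly from the table (same
illustrative mytable/myfield names); ln() is safe here because GROUP BY only
produces positive counts:

SELECT avg(cnt)          AS arithmetic_mean,
       exp(avg(ln(cnt))) AS geometric_mean
FROM (SELECT count(*) AS cnt
      FROM mytable
      GROUP BY myfield) AS per_value;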

When the subquery is estimated to return three myfield values, the query estimates 714138 rows, and chooses a sequential scan over mytable (myfield is indexed).

explain select * from mytable where myfield in (values (1),(2),(3));

 Hash IN Join  (cost=0.08..1009521.37 rows=714138 width=86)
   Hash Cond: (mytable.myfield = "*VALUES*".column1)
   ->  Seq Scan on mytable  (cost=0.00..866693.76 rows=36182976 width=86)
   ->  Hash  (cost=0.04..0.04 rows=3 width=4)
         ->  Values Scan on "*VALUES*"  (cost=0.00..0.04 rows=3 width=4)

I think this query is much more likely to return around 40000 rows, and a Bitmap Index Scan should be used.

explain select * from mytable where myfield in (values (1),(2));

 Nested Loop  (cost=4445.11..931383.93 rows=476092 width=86)
   ->  Unique  (cost=0.04..0.04 rows=2 width=4)
         ->  Sort  (cost=0.04..0.04 rows=2 width=4)
               Sort Key: "*VALUES*".column1
               ->  Values Scan on "*VALUES*"  (cost=0.00..0.03 rows=2 width=4)
   ->  Bitmap Heap Scan on mytable  (cost=4445.08..462716.37 rows=238046 width=86)
         Recheck Cond: (mytable.myfield = "*VALUES*".column1)
         ->  Bitmap Index Scan on myindex  (cost=0.00..4385.56 rows=238046 width=0)
               Index Cond: (mytable.myfield = "*VALUES*".column1)

The expected number of loops (2 here, 3 above) through the Bitmap Heap Scan, times 462716.37, is what gets compared against 1009521.37, but that per-loop cost estimate is far too high in the general case. Adjusted for my expectation of the number of rows per value (13444 per loop instead of 238046), it should be closer to 26000 per loop. At that rate the subquery would need to return close to 40 myfield values before a sequential scan becomes the cheaper choice.

Is there any facility already in PostgreSQL to help me here?

Hopefully an index type that I don't know about yet? (Geometric distributions are similar to those found in word count distributions).

If not... is there any merit in this idea:

During the analyze process, calculate the geometric mean of the sampled values' frequencies and, if it is significantly different from the arithmetic mean, store it in a new stats column. When estimating the number of rows that will be returned by queries of the form shown above, use the geometric mean instead of the arithmetic mean whenever one is stored.

Regards,
Stephen Denne.



From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject: Re: Maximum statistics target
Date: 2008-03-10 10:36:03
Message-ID: 200803101136.06204.peter_e@gmx.net
Lists: pgsql-hackers

On Friday, March 7, 2008, Tom Lane wrote:
> I'm not wedded to the number 1000 in particular --- obviously that's
> just a round number. But it would be good to see some performance tests
> with larger settings before deciding that we don't need a limit.

Well, I'm not saying we should raise the default statistics target. But
setting an arbitrary limit on the grounds that larger values might slow the
system is like limiting the size of tables because larger tables will cause
slower queries. Users should have the option of finding out the best balance
for themselves. If there are concerns with larger statistics targets, we
should document them. I find nothing about this in the documentation at the
moment.

> IIRC, eqjoinsel is one of the weak spots, so tests involving planning of
> joins between two tables with large MCV lists would be a good place to
> start.

I have run tests with joining two and three tables with 10 million rows each,
and the planning times seem to be virtually unaffected by the statistics
target, for values between 10 and 800000. They all look more or less like
this:

test=# explain select * from test1, test2 where test1.a = test2.b;
QUERY PLAN
-----------------------------------------------------------------------------
 Hash Join  (cost=308311.00..819748.00 rows=10000000 width=16)
   Hash Cond: (test1.a = test2.b)
   ->  Seq Scan on test1  (cost=0.00..144248.00 rows=10000000 width=8)
   ->  Hash  (cost=144248.00..144248.00 rows=10000000 width=8)
         ->  Seq Scan on test2  (cost=0.00..144248.00 rows=10000000 width=8)
(5 rows)

Time: 132,350 ms

and with indexes

test=# explain select * from test1, test2 where test1.a = test2.b;
QUERY PLAN
--------------------------------------------------------------------------------------------
 Merge Join  (cost=210416.65..714072.26 rows=10000000 width=16)
   Merge Cond: (test1.a = test2.b)
   ->  Index Scan using test1_index1 on test1  (cost=0.00..282036.13 rows=10000000 width=8)
   ->  Index Scan using test2_index1 on test2  (cost=0.00..282036.13 rows=10000000 width=8)
(4 rows)

Time: 168,455 ms

The time to analyze is also quite constant, just before you run out of
memory. :) The MaxAllocSize is the limiting factor in all this. In my
example, statistics targets larger than about 800000 created pg_statistic
rows that would have been larger than 1GB, so they couldn't be stored.
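
A rough way to watch the row sizes approach that limit (relation name taken
from the test above, so treat it as a placeholder; TOASTed values are measured
as stored, so this is approximate):

SELECT a.attname, pg_column_size(s.*) AS stats_row_bytes
FROM pg_statistic s
JOIN pg_attribute a ON a.attrelid = s.starelid AND a.attnum = s.staattnum
WHERE s.starelid = 'test1'::regclass;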

I suggest that we get rid of the limit of 1000, adequately document whatever
issues might exist with large values (possibly not many, see above), and add
an error message more user-friendly than "invalid memory alloc request size"
for the cases where the value is too large to be storable.


From: Cédric Villemain <cedric(dot)villemain(at)dalibo(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject: Re: Maximum statistics target
Date: 2008-03-10 10:58:37
Message-ID: 200803101158.42022.cedric.villemain@dalibo.com
Lists: pgsql-hackers

On Monday, March 10, 2008, Peter Eisentraut wrote:
> On Friday, March 7, 2008, Tom Lane wrote:
> > I'm not wedded to the number 1000 in particular --- obviously that's
> > just a round number. But it would be good to see some performance tests
> > with larger settings before deciding that we don't need a limit.
>
> Well, I'm not saying we should raise the default statistics target. But
> setting an arbitrary limit on the grounds that larger values might slow the
> system is like limiting the size of tables because larger tables will cause
> slower queries. Users should have the option of finding out the best
> balance for themselves. If there are concerns with larger statistics
> targets, we should document them. I find nothing about this in the
> documentation at the moment.

I find two things:
«Increasing the target causes a proportional increase in the time and space
needed to do ANALYZE.»
in http://www.postgresql.org/docs/current/static/sql-analyze.html
and
«... at the price of consuming more space in pg_statistic and slightly more
time to compute the estimates»
in http://www.postgresql.org/docs/current/static/planner-stats.html

But these are probably not clear enough about the time impact on query planning.

>
> > IIRC, eqjoinsel is one of the weak spots, so tests involving planning of
> > joins between two tables with large MCV lists would be a good place to
> > start.
>
> I have run tests with joining two and three tables with 10 million rows
> each, and the planning times seem to be virtually unaffected by the
> statistics target, for values between 10 and 800000. They all look more or
> less like this:
>
> test=# explain select * from test1, test2 where test1.a = test2.b;
> QUERY PLAN
> -----------------------------------------------------------------------------
>  Hash Join  (cost=308311.00..819748.00 rows=10000000 width=16)
>    Hash Cond: (test1.a = test2.b)
>    ->  Seq Scan on test1  (cost=0.00..144248.00 rows=10000000 width=8)
>    ->  Hash  (cost=144248.00..144248.00 rows=10000000 width=8)
>          ->  Seq Scan on test2  (cost=0.00..144248.00 rows=10000000 width=8)
> (5 rows)
>
> Time: 132,350 ms
>
> and with indexes
>
> test=# explain select * from test1, test2 where test1.a = test2.b;
> QUERY PLAN
> --------------------------------------------------------------------------------------------
>  Merge Join  (cost=210416.65..714072.26 rows=10000000 width=16)
>    Merge Cond: (test1.a = test2.b)
>    ->  Index Scan using test1_index1 on test1  (cost=0.00..282036.13 rows=10000000 width=8)
>    ->  Index Scan using test2_index1 on test2  (cost=0.00..282036.13 rows=10000000 width=8)
> (4 rows)
>
> Time: 168,455 ms
>
> The time to analyze is also quite constant, just before you run out of
> memory. :) The MaxAllocSize is the limiting factor in all this. In my
> example, statistics targets larger than about 800000 created pg_statistic
> rows that would have been larger than 1GB, so they couldn't be stored.
>
> I suggest that we get rid of the limit of 1000, adequately document
> whatever issues might exist with large values (possibly not many, see
> above), and add an error message more user-friendly than "invalid memory
> alloc request size" for the cases where the value is too large to be
> storable.

--
Cédric Villemain
Administrateur de Base de Données
Cel: +33 (0)6 74 15 56 53
http://dalibo.com - http://dalibo.org


From: "Guillaume Smet" <guillaume(dot)smet(at)gmail(dot)com>
To: "Peter Eisentraut" <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Martijn van Oosterhout" <kleptog(at)svana(dot)org>
Subject: Re: Maximum statistics target
Date: 2008-03-10 12:28:59
Message-ID: 1d4e0c10803100528w5b994471ja0253bd898e9a0d6@mail.gmail.com
Lists: pgsql-hackers

On Mon, Mar 10, 2008 at 11:36 AM, Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
> The time to analyze is also quite constant, just before you run out of
> memory. :) The MaxAllocSize is the limiting factor in all this. In my
> example, statistics targets larger than about 800000 created pg_statistic
> rows that would have been larger than 1GB, so they couldn't be stored.

From my experience with real-life examples, the time to analyze is far
from constant when you raise the statistics target, but that may be
related to the schema of our tables.

cityvox=# \timing
Timing is on.
cityvox=# show default_statistics_target ;
default_statistics_target
---------------------------
10
(1 row)

Time: 0.101 ms
cityvox=# ANALYZE evenement;
ANALYZE
Time: 406.069 ms
cityvox=# ANALYZE evenement;
ANALYZE
Time: 412.355 ms
cityvox=# set default_statistics_target = 30;
SET
Time: 0.165 ms
cityvox=# ANALYZE evenement;
ANALYZE
Time: 1419.161 ms
cityvox=# ANALYZE evenement;
ANALYZE
Time: 1381.754 ms
cityvox=# set default_statistics_target = 100;
SET
Time: 1.853 ms
cityvox=# ANALYZE evenement;
ANALYZE
Time: 5211.785 ms
cityvox=# ANALYZE evenement;
ANALYZE
Time: 5178.764 ms

That said, I totally agree that it's not a good idea to have a strict
maximum value if we don't have technical reasons for one.

--
Guillaume


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject: Re: Maximum statistics target
Date: 2008-03-10 12:46:34
Message-ID: 12262.1205153194@sss.pgh.pa.us
Lists: pgsql-hackers

Peter Eisentraut <peter_e(at)gmx(dot)net> writes:
> On Friday, March 7, 2008, Tom Lane wrote:
>> IIRC, eqjoinsel is one of the weak spots, so tests involving planning of
>> joins between two tables with large MCV lists would be a good place to
>> start.

> I have run tests with joining two and three tables with 10 million rows each,
> and the planning times seem to be virtually unaffected by the statistics
> target, for values between 10 and 800000.

It's not possible to believe that you'd not notice O(N^2) behavior for N
approaching 800000 ;-). Perhaps your join columns were unique keys, and
thus didn't have any most-common-values?
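
Whether the join columns got MCV lists at all is easy to check (table and
column names as in the earlier test, so illustrative; a unique key shows NULL
here):

SELECT tablename, attname, most_common_vals IS NOT NULL AS has_mcv_list
FROM pg_stats
WHERE tablename IN ('test1', 'test2') AND attname IN ('a', 'b');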

regards, tom lane


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Peter Eisentraut" <peter_e(at)gmx(dot)net>, <pgsql-hackers(at)postgresql(dot)org>, "Martijn van Oosterhout" <kleptog(at)svana(dot)org>
Subject: Re: Maximum statistics target
Date: 2008-03-10 13:43:09
Message-ID: 87ejaibspe.fsf@oxford.xeocode.com
Lists: pgsql-hackers

"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Peter Eisentraut <peter_e(at)gmx(dot)net> writes:
>> On Friday, March 7, 2008, Tom Lane wrote:
>>> IIRC, eqjoinsel is one of the weak spots, so tests involving planning of
>>> joins between two tables with large MCV lists would be a good place to
>>> start.
>
>> I have run tests with joining two and three tables with 10 million rows each,
>> and the planning times seem to be virtually unaffected by the statistics
>> target, for values between 10 and 800000.
>
> It's not possible to believe that you'd not notice O(N^2) behavior for N
> approaching 800000 ;-). Perhaps your join columns were unique keys, and
> thus didn't have any most-common-values?

We could remove the hard limit on statistics target and impose the limit
instead on the actual size of the arrays. Ie, allow people to specify larger
sample sizes and discard unreasonably large excess data (possibly warning them
when that happens).

That would remove the screw case the original poster had where he needed to
scan a large portion of the table to see at least one of every value even
though there were only 169 distinct values.
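
Rough arithmetic shows why, assuming the usual rule of thumb that ANALYZE
samples about 300 times the statistics target in rows (treat that constant as
an assumption):

-- ~300,000 sampled rows out of ~36.2 million means roughly 1 row in 120 is
-- seen, so values with fewer than ~120 rows are likely to be missed entirely.
SELECT 300 * 1000                       AS approx_sample_rows,
       round(36182976.0 / (300 * 1000)) AS table_rows_per_sampled_row;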

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Martijn van Oosterhout" <kleptog(at)svana(dot)org>
Subject: Re: Maximum statistics target
Date: 2008-03-10 18:26:43
Message-ID: 200803101926.43786.peter_e@gmx.net
Lists: pgsql-hackers

On Monday, March 10, 2008, Gregory Stark wrote:
> > It's not possible to believe that you'd not notice O(N^2) behavior for N
> > approaching 800000 ;-). Perhaps your join columns were unique keys, and
> > thus didn't have any most-common-values?
>
> We could remove the hard limit on statistics target and impose the limit
> instead on the actual size of the arrays. Ie, allow people to specify
> larger sample sizes and discard unreasonably large excess data (possibly
> warning them when that happens).

I have run some more useful tests now with more distinct values. The planning
times do increase, but this is not the primary worry. If you want to spend
20 seconds of planning to speed up your query by 40 seconds, this could
surely be a win in some scenarios, and not a catastrophic loss if not. The
practical problems lie with memory usage in ANALYZE, in two ways. First, at
some point it will try to construct pg_statistic rows that don't fit into the
1GB limit, as mentioned upthread. You get a funny error message and it
aborts. This is fixable with some cosmetics. Second, ANALYZE appears to
temporarily leak memory (it probably doesn't bother to free things along the
way, as most of the code does), and so some not so large statistics targets
(say, 40000) can get your system swapping like crazy. A crafty user could
probably kill the system that way, perhaps even with the restricted settings
we have now. I haven't inspected the code in detail yet, but I imagine a few
pfree() calls and/or a counter that checks the current memory usage against
maintenance_work_mem could provide additional safety. If we could get
ANALYZE under control, then I imagine this would provide a more natural upper
bound for the statistics targets, and it would be controllable by the
administrator.


From: "Stephen Denne" <Stephen(dot)Denne(at)datamail(dot)co(dot)nz>
To: "Gregory Stark" <stark(at)enterprisedb(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Peter Eisentraut" <peter_e(at)gmx(dot)net>, <pgsql-hackers(at)postgresql(dot)org>, "Martijn van Oosterhout" <kleptog(at)svana(dot)org>
Subject: Re: Maximum statistics target
Date: 2008-03-10 20:57:15
Message-ID: F0238EBA67824444BC1CB4700960CB4804DD8FAB@dmpeints002.isotach.com
Lists: pgsql-hackers

> We could remove the hard limit on statistics target and impose the limit
> instead on the actual size of the arrays. Ie, allow people to specify larger
> sample sizes and discard unreasonably large excess data (possibly warning
> them when that happens).
>
> That would remove the screw case the original poster had where he needed to
> scan a large portion of the table to see at least one of every value even
> though there were only 169 distinct values.
>
> --
> Gregory Stark

That was my use case, but I wasn't the OP.

Your suggestion would satisfy what I was trying to do. However, a higher stats target wouldn't solve my root problem (how the planner uses the gathered stats), and the statistics gathered at 1000 (and indeed at 200) are quite a good representation of what is in the table.

I don't like the idea of changing one limit into two limits. Or are you suggesting changing the algorithm that determines how many, and which pages to analyze, perhaps so that it is adaptive to the results of the analysis as it progresses? That doesn't sound easy.

Regards,
Stephen Denne.



From: "Stephen Denne" <Stephen(dot)Denne(at)datamail(dot)co(dot)nz>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Estimating geometric distributions
Date: 2008-03-11 20:43:11
Message-ID: F0238EBA67824444BC1CB4700960CB4804DD9399@dmpeints002.isotach.com
Lists: pgsql-hackers

I wrote:
> > I have a field whose distribution of frequencies of values is
> > roughly geometric, rather than flat.

> My problem is frequent over-estimation of rows when restricting by this field
> with values not known at plan time.

> Is there any facility already in PostgreSQL to help me here?
>
> Hopefully an index type that I don't know about yet?
> (Geometric distributions are similar to those found in word
> count distributions).
>
> If not... is there any merit in this idea:
>
> During the analyze process, calculate the geometric mean of the sampled
> values' frequencies and, if it is significantly different from the
> arithmetic mean, store it in a new stats column. When estimating the
> number of rows that will be returned by queries of the form shown above,
> use the geometric mean instead of the arithmetic mean whenever one is
> stored.

I came up with another (much easier) means of adjusting the planner's estimate of how many rows will be returned:

Increase the number of distinct values in the statistics.
For example:
update pg_statistic set stadistinct=2691 where starelid=29323 and staattnum=2;

I can then pick a number of distinct values such that the effective arithmetic mean is equal to what I calculated the geometric mean to be.
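
The 2691 above is just the row count divided by the geometric mean quoted
earlier, i.e. the stadistinct for which rows divided by n_distinct lands on
the geometric mean; note that the next ANALYZE will overwrite a hand-edited
pg_statistic row:

-- 36182976 rows / 13444 (geometric mean of per-value counts) is about 2691
SELECT round(36182976 / 13444.0) AS effective_stadistinct;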

Stephen Denne.



From: Decibel! <decibel(at)decibel(dot)org>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org, Gregory Stark <stark(at)enterprisedb(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Martijn van Oosterhout" <kleptog(at)svana(dot)org>
Subject: Re: Maximum statistics target
Date: 2008-03-20 16:17:10
Message-ID: 588963E3-CCD4-4549-B46E-56F3365F014C@decibel.org
Lists: pgsql-hackers

On Mar 10, 2008, at 1:26 PM, Peter Eisentraut wrote:
> On Monday, March 10, 2008, Gregory Stark wrote:
>>> It's not possible to believe that you'd not notice O(N^2) behavior for N
>>> approaching 800000 ;-). Perhaps your join columns were unique keys, and
>>> thus didn't have any most-common-values?
>>
>> We could remove the hard limit on statistics target and impose the limit
>> instead on the actual size of the arrays. Ie, allow people to specify
>> larger sample sizes and discard unreasonably large excess data (possibly
>> warning them when that happens).
>
> I have run some more useful tests now with more distinct values. The
> planning times do increase, but this is not the primary worry. If you want
> to spend 20 seconds of planning to speed up your query by 40 seconds, this
> could surely be a win in some scenarios, and not a catastrophic loss if
> not. The practical problems lie with memory usage in ANALYZE, in two ways.
> First, at some point it will try to construct pg_statistic rows that don't
> fit into the 1GB limit, as mentioned upthread. You get a funny error
> message and it aborts. This is fixable with some cosmetics. Second,
> ANALYZE appears to temporarily leak memory (it probably doesn't bother to
> free things along the way, as most of the code does), and so some not so
> large statistics targets (say, 40000) can get your system swapping like
> crazy. A crafty user could probably kill the system that way, perhaps even
> with the restricted settings we have now. I haven't inspected the code in
> detail yet, but I imagine a few pfree() calls and/or a counter that checks
> the current memory usage against maintenance_work_mem could provide
> additional safety. If we could get ANALYZE under control, then I imagine
> this would provide a more natural upper bound for the statistics targets,
> and it would be controllable by the administrator.

At some point I think it makes a lot more sense to just have VACUUM
gather stats as it goes, rather than have ANALYZE generate a bunch of
random IO.

BTW, when it comes to the case of the OP, perhaps we can build enough
intelligence for the system to understand when the stats follow some
type of pattern (ie: a geometric distribution), and store the stats
differently.
--
Decibel!, aka Jim C. Nasby, Database Architect decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828


From: Kenneth Marshall <ktm(at)rice(dot)edu>
To: Decibel! <decibel(at)decibel(dot)org>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Gregory Stark <stark(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject: Re: Maximum statistics target
Date: 2008-03-20 19:48:20
Message-ID: 20080320194820.GF27394@it.is.rice.edu
Lists: pgsql-hackers

On Thu, Mar 20, 2008 at 11:17:10AM -0500, Decibel! wrote:
> On Mar 10, 2008, at 1:26 PM, Peter Eisentraut wrote:
>
> At some point I think it makes a lot more sense to just have VACUUM gather
> stats as it goes, rather than have ANALYZE generate a bunch of random IO.
>
> BTW, when it comes to the case of the OP, perhaps we can build enough
> intelligence for the system to understand when the stats follow some type
> of pattern (ie: a geometric distribution), and store the stats differently.
> --
> Decibel!, aka Jim C. Nasby, Database Architect decibel(at)decibel(dot)org
> Give your computer some brain candy! www.distributed.net Team #1828
>
+1 for opportunistically gathering stats during other I/O such as
vacuums and sequential scans. It would be interesting to have a hook
to allow processes to attach to the dataflow from other queries.

Cheers,
Ken