From: pgsql(at)mohawksoft(dot)com
To: pgsql-hackers(at)postgresql(dot)org
Subject: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-04 22:10:18
Message-ID: 16398.24.91.171.78.1107555018.squirrel@mail.mohawksoft.com
Lists: pgsql-hackers

Here's one:
I have the USA census TIGER database loaded, the WHOLE THING, the whole
country. It isn't the biggest database, but it is about 40G before
indexes. Every table is over a few million rows. I can quite honestly say
that a sequential scan is almost never the right thing to do.

Info below.

Here is the query:
select * from rt1, rt2 where rt1.tlid = rt2.tlid and (zipr = 2186 or
zipl=2186);

Well, when you look at the explain, obviously it looks like the sequential
scan is better, but trust me, it takes a good few minutes for the query to
respond with sequential scan enabled. I've run analyze and everything.

I suspect that analyze only samples a very small amount of the database
and gets the wrong idea about it. Is there a way to force analyze to
sample more rows?

tiger=# analyze verbose rt2 ;
INFO: analyzing "public.rt2"
INFO: "rt2": scanned 3000 of 1139825 pages, containing 84160 live rows
and 0 dead rows; 3000 rows in sample, 31975891 estimated total rows
ANALYZE
tiger=# analyze verbose rt1 ;
INFO: analyzing "public.rt1"
INFO: "rt1": scanned 3000 of 1527360 pages, containing 90951 live rows
and 0 dead rows; 3000 rows in sample, 46304973 estimated total rows
ANALYZE

tiger=# explain select * from rt1, rt2 where rt1.tlid = rt2.tlid and (zipr
= 2186 or zipl=2186);
QUERY PLAN
-----------------------------------------------------------------------------------------------------
Hash Join (cost=121978.81..3996190.89 rows=21118 width=520)
Hash Cond: ("outer".tlid = "inner".tlid)
-> Seq Scan on rt2 (cost=0.00..1459291.36 rows=31946636 width=218)
-> Hash (cost=120662.36..120662.36 rows=30579 width=302)
-> Index Scan using rt1_zipr, rt1_zipl on rt1
(cost=0.00..120662.36 rows=30579 width=302)
Index Cond: ((zipr = 2186) OR (zipl = 2186))
(6 rows)

tiger=# set enable_seqscan=no;
SET
tiger=# explain select * from rt1, rt2 where rt1.tlid = rt2.tlid and (zipr
= 2186 or zipl=2186);
QUERY PLAN
-----------------------------------------------------------------------------------------------
Nested Loop (cost=0.00..46868256.00 rows=21118 width=520)
-> Index Scan using rt1_zipr, rt1_zipl on rt1 (cost=0.00..120662.36
rows=30579 width=302)
Index Cond: ((zipr = 2186) OR (zipl = 2186))
-> Index Scan using rt2_tlid on rt2 (cost=0.00..1523.80 rows=396
width=218)
Index Cond: ("outer".tlid = rt2.tlid)

Table "public.rt1"
Column | Type | Modifiers
-----------+-------------------+-----------
tlid | integer |
side1 | character varying |
source | character varying |
fedirp | character varying |
fename | character varying |
fetype | character varying |
fedirs | character varying |
cfcc | character varying |
fraddl | character varying |
toaddl | character varying |
fraddr | character varying |
toaddr | character varying |
friaddl | character varying |
toiaddl | character varying |
friaddr | character varying |
toiaddr | character varying |
zipl | integer |
zipr | integer |
aianhhfpl | character varying |
aianhhfpr | character varying |
aihhtlil | character varying |
aihhtlir | character varying |
statel | character varying |
stater | character varying |
countyl | character varying |
countyr | character varying |
cousubl | character varying |
cousubr | character varying |
submcdl | character varying |
submcdr | character varying |
placel | character varying |
placer | character varying |
tractl | character varying |
tractr | character varying |
blockl | character varying |
blockr | character varying |
frlong | numeric |
frlat | numeric |
tolong | numeric |
tolat | numeric |
Indexes:
"rt1_fename" btree (fename)
"rt1_frlat" btree (frlat)
"rt1_frlong" btree (frlong)
"rt1_tlid" btree (tlid)
"rt1_tolat" btree (tolat)
"rt1_tolong" btree (tolong)
"rt1_zipl" btree (zipl)
"rt1_zipr" btree (zipr)

Table "public.rt2"
Column | Type | Modifiers
--------+---------+-----------
tlid | integer |
rtsq | integer |
long1 | numeric |
lat1 | numeric |
long2 | numeric |
lat2 | numeric |
long3 | numeric |
lat3 | numeric |
long4 | numeric |
lat4 | numeric |
long5 | numeric |
lat5 | numeric |
long6 | numeric |
lat6 | numeric |
long7 | numeric |
lat7 | numeric |
long8 | numeric |
lat8 | numeric |
long9 | numeric |
lat9 | numeric |
long10 | numeric |
lat10 | numeric |
Indexes:
"rt2_tlid" btree (tlid)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql(at)mohawksoft(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-04 22:17:35
Message-ID: 17961.1107555455@sss.pgh.pa.us
Lists: pgsql-hackers

pgsql(at)mohawksoft(dot)com writes:
> I suspect that analyze only samples a very small amount of the database
> and gets the wrong idea about it. Is there a way to force analyze to
> sample more rows?

default_statistics_target. But let's see the pg_stats rows for these
columns before assuming that analyze is getting it wrong.
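
For reference, a query along these lines pulls out the rows in question
(the column list is just a convenient subset of the pg_stats view):

select tablename, attname, null_frac, n_distinct, correlation
from pg_stats
where tablename in ('rt1', 'rt2')
and attname in ('tlid', 'zipl', 'zipr');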

regards, tom lane


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: pgsql(at)mohawksoft(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-04 22:23:27
Message-ID: 20050204222327.GK10437@ns.snowman.net
Lists: pgsql-hackers

* pgsql(at)mohawksoft(dot)com (pgsql(at)mohawksoft(dot)com) wrote:
> I have the USA census TIGER database loaded, the WHOLE THING, the whole
> country. It isn't the biggest database, but it is about 40G before
> indexes. Every table is over a few million rows. I can quite honestly say,
> a sequential scan is almost never the right thing to do.

Cool, I've got it loaded here too w/ PostGIS. It's good stuff. :) I'm
curious what all you're doing with it, and especially whether you're using
mapserver on it or have found a way to identify roads at a more
street/city/state/country level; I've had problems doing that.

> I suspect that analyze only samples a very small amount of the database
> and gets the wrong idea about it. Is there a way to force analyze to
> sample more rows?

Yup:

alter table <blah> alter column <blah> set statistics <blah>

As I recall, the default is 100 for statistics, I'd say increase that
number to like 200 or 300 or more and see if it helps.
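
For the zip columns in question that would look something like this (the
300 here is only an illustrative value, not a tested recommendation):

alter table rt1 alter column zipl set statistics 300;
alter table rt1 alter column zipr set statistics 300;
analyze rt1;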

Stephen


From: pgsql(at)mohawksoft(dot)com
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-04 23:50:30
Message-ID: 16409.24.91.171.78.1107561030.squirrel@mail.mohawksoft.com
Lists: pgsql-hackers

> pgsql(at)mohawksoft(dot)com writes:
>> I suspect that analyze only samples a very small amount of the database
>> and gets the wrong idea about it. Is there a way to force analyze to
>> sample more rows?
>
> default_statistics_target. But let's see the pg_stats rows for these
> columns before assuming that analyze is getting it wrong.

Some more info:

I did a select count(distinct(tlid)) from rt2, and updated the statistics
with the result:

tiger=# update pg_statistic set stadistinct = 23656799 where starelid =
17236 and staattnum = 1;
UPDATE 1

tiger=# explain select * from rt1, rt2 where rt1.tlid = rt2.tlid and (zipr
tiger(# = 2186 or zipl=2186);
QUERY PLAN
-----------------------------------------------------------------------------------------------
Nested Loop (cost=0.00..425517.95 rows=21315 width=520)
-> Index Scan using rt1_zipr, rt1_zipl on rt1 (cost=0.00..121893.93
rows=30835 width=302)
Index Cond: ((zipr = 2186) OR (zipl = 2186))
-> Index Scan using rt2_tlid on rt2 (cost=0.00..9.82 rows=2 width=218)
Index Cond: ("outer".tlid = rt2.tlid)
(5 rows)
tiger=# SELECT attname, n_distinct, most_common_vals FROM pg_stats
WHERE tablename = 'rt1';
attname | n_distinct |
most_common_vals
-----------+------------+-----------------------------------------------------------------------------------------------------------------------
tlid | -1 |
side1 | 1 | {1}
source | 9 | {B,J,A,K,L,N,O,M,C}
fedirp | 8 | {N,E,S,W,SW,NE,NW,SE}
fename | 2590 | {Main,1st,Oak,2nd,9th,"Burlington Northern Santa
Fe R",Park,4th,11th,8th}
fetype | 26 | {Rd,St,Ave,Dr}
fedirs | 9 | {NE,SE,N,E,NW,SW,W,S,O}
cfcc | 51 | {A41,H12,H11,F10,A74,A31,H01}
fraddl | 767 | {1,101,2,201,301,401,100,701,298,500}
toaddl | 805 | {199,399,299,499,98,198,99,1,100,2}
fraddr | 765 | {2,1,100,200,400,300,700,101,501,299}
toaddr | 772 | {198,398,298,498,99,98,199,101,1,1098}
friaddl | 3 | {0,1,2}
toiaddl | 3 | {0,1,2}
friaddr | 3 | {0}
toiaddr | 3 | {0}
zipl | 925 |
zipr | 899 |
aianhhfpl | 42 |
aianhhfpr | 43 |
aihhtlil | 2 |
aihhtlir | 2 |
statel | 55 | {48,06,12,37,29,42,17,36,13,39}
stater | 55 | {48,06,12,37,29,42,17,36,13,39}
countyl | 189 | {005,059,013,029,003,001,017,009,031,043}
countyr | 191 | {005,059,013,001,003,029,017,025,031,043}
cousubl | 2568 |
{92601,91800,90270,90240,90468,90572,91508,91750,60000,90324}
cousubr | 2598 |
{92601,91800,90270,90240,90468,90572,91248,91750,60000,90324}
submcdl | -1 |
submcdr | -1 |
placel | 778 |
{51000,65000,44000,12000,38000,60000,63460,07000,04000,22000}
placer | 787 |
{51000,65000,12000,44000,60000,07000,38000,55000,63460,04000}
tractl | 1370 |
{950200,950100,950300,000100,970100,960100,980100,950700,970300,990100}
tractr | 1354 |
{950200,950100,950300,000100,970100,960100,990100,950700,980100,970300}
blockl | 1050 | {1000,2000,1001,1003,1005,2001,1009,2006,1002,1004}
blockr | 1055 | {1000,2000,1001,1002,1003,1005,1004,1009,2004,2002}
frlong | 134476 |
{-120.214657,-113.074100,-106.494480,-103.306945,-100.184470,-100.083614,-99.476994,-98.420248,-97.325498,-93.349865}
frlat | 143222 |
{27.759896,29.251454,29.898585,30.093247,31.814071,31.950913,32.055726,32.377503,32.523607,32.607387}
tolong | 317744 |
{-123.330861,-111.673035,-107.596898,-103.164000,-100.945693,-100.080307,-99.576886,-99.492719,-97.743722,-93.870222}
tolat | 278079 |
{27.493816,27.904316,29.691644,32.731410,33.350429,34.490563,35.551053,35.868297,39.139185,40.068098}
(40 rows)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql(at)mohawksoft(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-05 00:03:46
Message-ID: 18635.1107561826@sss.pgh.pa.us
Lists: pgsql-hackers

pgsql(at)mohawksoft(dot)com writes:
> -> Index Scan using rt1_zipr, rt1_zipl on rt1 (cost=0.00..121893.93
> rows=30835 width=302)
> Index Cond: ((zipr = 2186) OR (zipl = 2186))

> zipl | 925 |
> zipr | 899 |

Those n_distinct values for zipl and zipr seem aberrant --- too low
compared to the estimated rowcount you're showing. What are the
true figures? Also, how about some EXPLAIN ANALYZEs and not just
EXPLAINs?
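
Concretely, that means running the same query as:

explain analyze
select * from rt1, rt2
where rt1.tlid = rt2.tlid and (zipr = 2186 or zipl = 2186);

which reports the actual row counts and timings next to the estimates.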

regards, tom lane


From: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>
To: pgsql(at)mohawksoft(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-05 03:21:25
Message-ID: 20050205032125.GB47995@decibel.org
Lists: pgsql-hackers

On Fri, Feb 04, 2005 at 05:23:27PM -0500, Stephen Frost wrote:
> As I recall, the default is 100 for statistics, I'd say increase that
> number to like 200 or 300 or more and see if it helps.

No, it's 10.
--
Jim C. Nasby, Database Consultant decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"


From: pgsql(at)mohawksoft(dot)com
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql(at)mohawksoft(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-05 12:29:33
Message-ID: 16560.24.91.171.78.1107606573.squirrel@mail.mohawksoft.com
Lists: pgsql-hackers

> pgsql(at)mohawksoft(dot)com writes:
>> -> Index Scan using rt1_zipr, rt1_zipl on rt1 (cost=0.00..121893.93
>> rows=30835 width=302)
>> Index Cond: ((zipr = 2186) OR (zipl = 2186))
>
>> zipl | 925 |
>> zipr | 899 |
>
> Those n_distinct values for zipl and zipr seem aberrant --- too low
> compared to the estimated rowcount you're showing. What are the
> true figures? Also, how about some EXPLAIN ANALYZEs and not just
> EXPLAINs?
>
>
tiger=# select count(distinct(zipr)), count(distinct(zipl)) from rt1 ;
count | count
-------+-------
35133 | 35393
(1 row)


From: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-06 17:29:52
Message-ID: 42065410.5010803@cheapcomplexdevices.com
Lists: pgsql-hackers

Short summary:

I had the same problem - since the sort orders of zip codes,
counties, city names, and states don't match each other, the
optimizer grossly overestimated the number of pages that would
be read.

I bet doing a CLUSTER by ZIP would solve that particular
query, but would break similar queries by county or by
city or by state.

I think
select attname,correlation from pg_stats where tablename = 'rt1';
will show you the same problem I had. My pg_stats
numbers are shown below.

I had a couple ugly hacks that worked around the problem
for myself.

Longer version:

One interesting property of the TIGER data (and most geospatial
databases) is that the data in most of the columns is
"locally correlated" but not globally. What I mean by that
is that even though the data for any given zip code probably
resides in only a few disk pages, the data was probably loaded
in order of TIGER file number (state, county), so the "correlation"
in pg_stats for zip code was very low. With the low correlation
the optimizer couldn't see that any given zip code's data was
all together on disk, and because of this it probably vastly
overestimated the number of pages that would be read.

Let me try a concrete example:
Zip | City | State | County | Street
------+---------------+----------+---------------+-------------
99501 | Anchorage | AK | Anchorage | 1st St.
[...]
94105 | San Francisco | CA | San Francisco | 1st St
94105 | San Francisco | CA | San Francisco | 2nd St
[... a few more disk pages for 94105 ...]
[... tens more disk pages for San Francisco ...]
[... thousands more disk pages for CA ...]
94501 | Alameda | CA | Alameda | 1st St
94501 | Alameda | CA | Alameda | 2nd St
[...]
02101 | Boston | MA | Suffolk | 1st St.

Note, that all the data for any geographical region (zip,
or city, or county) is located close together on disk.

If I do a query by City, or by State, or by Zip, I should
probably do an index scan.

But the correlation statistic only looks at the total
ordering; if we order the table by one column, that column's
statistic will look very good, but since the other columns
have a different sort order, the correlation statistic for
each of them will be very poor.

In my copy of the TIGER data, loaded in order of
TIGER file number (which is ordered by state and then
by county), you can see the various relevant correlation
values in my database.

fl=# select attname,correlation
from pg_stats
where tablename = 'tgr_rt1';
attname | correlation
-----------+--------------
tigerfile | 1
rt | 1
version | 0.968349
tlid | 0.151139
[...]
zipl | 0.381979
zipr | 0.376332
[...]
statel | 0.998373
stater | 0.99855
countyl | 0.111207
countyr | 0.115345
cousubl | -0.0450375
cousubr | -0.0486589
[...]
placel | 0.304117
placer | 0.306714
tractl | 0.141538
tractr | 0.134357
blockl | 0.00599286
blockr | -8.73298e-05
frlong | 0.0857337
frlat | 0.174396
tolong | 0.0857399
tolat | 0.174434
(45 rows)

Note that even though the TIGER data is sorted by
State/County, "countyl" and "countyr" have some of the
worst correlations (in the stats table), because of
the way the FIPS codes work: every state recycles
the county codes, starting with 1 and going up.
State FIPS code | County FIPS code
------------------+-----------------
06 (California) | 001 (Alameda)
06 (California) | 003 (Alpine)
...
25 (Massachusetts) | 001 (Barnstable)

I have a hack for 7.4 that sets the numbers hidden
in pg_statistic used for correlation to 0.75 for these
columns, and with that the planner started making more
reasonable guesses.

In the end, though, I just resorted to uglier hacks
that make the planner favor index scans like setting
random_page_cost artificially low.
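
For completeness, that last workaround is just a planner setting, per
session or in postgresql.conf; the value below is only illustrative:

set random_page_cost = 2;  -- default is 4; lower values favor index scans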

What I think I'd like to see is another statistic similar
to "correlation", but one that, rather than looking at the
total ordering of the table, looks at how correlated the
values within any single page are. If someone pointed me
in the right direction, I might try doing this.
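
As a rough sketch of what such a statistic would capture, something like
the query below counts how many distinct heap pages each zip code
actually touches (it abuses the tid-to-point cast, which may not work on
older releases, so treat it as illustration only):

select zipl, count(distinct (ctid::text::point)[0]) as heap_pages
from rt1
where zipl is not null
group by zipl
order by heap_pages desc
limit 10;

A column can have near-zero "correlation" and still show only a handful
of heap pages per value here, which is exactly the case the planner
misjudges.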

Ron

PS:

I think lots of other data has the same issues.
A very large name database ordered by
"lastname,firstname" will have all people
of a given "firstname" on a relatively small
set of pages, but the current correlation value
wouldn't see that.

Firstname | Lastname | Middlename
Adam | Brown | Albert
Adam | Brown | Alex
Adam | Brown | Bob
Bill | Brown | ....
...
Adam | Smith | Albert
Adam | Smith | Alex
Adam | Smith | Bob
...

Tom Lane wrote:
> pgsql(at)mohawksoft(dot)com writes:
>
>> -> Index Scan using rt1_zipr, rt1_zipl on rt1 (cost=0.00..121893.93
>>rows=30835 width=302)
>> Index Cond: ((zipr = 2186) OR (zipl = 2186))
>
>
>> zipl | 925 |
>> zipr | 899 |
>
>
> Those n_distinct values for zipl and zipr seem aberrant --- too low
> compared to the estimated rowcount you're showing. What are the
> true figures? Also, how about some EXPLAIN ANALYZEs and not just
> EXPLAINs?
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 8: explain analyze is your friend
>


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-06 18:10:42
Message-ID: 5307.1107713442@sss.pgh.pa.us
Lists: pgsql-hackers

Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com> writes:
> What I think I'd like to see is for there to be
> another statistic similar to "correlation" but rather
> than looking at the total-ordering of the table, to
> look how correlated values within any single page are.

Our current approach to correlation is surely far too simplistic; it was
just a quick hack to get *something* in there for this consideration.
I don't personally know enough about statistics to do much better :-(

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql(at)mohawksoft(dot)com
Cc: "Ron Mayer" <rm_pg(at)cheapcomplexdevices(dot)com>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-07 00:33:55
Message-ID: 17745.1107736435@sss.pgh.pa.us
Lists: pgsql-hackers

pgsql(at)mohawksoft(dot)com writes:
> One of the things that is disturbing to me about the analyze settings is
> that it wants to sample the same number of records from a table regardless
> of the size of that table.

The papers that I looked at say that this rule has a good solid
statistical foundation, at least for the case of estimating histograms.
See the references cited in analyze.c.

regards, tom lane


From: pgsql(at)mohawksoft(dot)com
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Ron Mayer" <rm_pg(at)cheapcomplexdevices(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-07 14:30:45
Message-ID: 16881.24.91.171.78.1107786645.squirrel@mail.mohawksoft.com
Lists: pgsql-hackers

> pgsql(at)mohawksoft(dot)com writes:
>> One of the things that is disturbing to me about the analyze settings is
>> that it wants to sample the same number of records from a table
>> regardless
>> of the size of that table.
>
> The papers that I looked at say that this rule has a good solid
> statistical foundation, at least for the case of estimating histograms.
> See the references cited in analyze.c.

Any and all random sampling assumes a degree of uniform distribution. This
is the basis of the model. It assumes that chunks of the whole will be
representative of the whole (to some degree). This works when normal
variations are more or less distributed uniformly. As variations and
trends become less uniformly distributed, more samples are required to
characterize them.

Douglas Adams had a great device called the "Total Perspective Vortex"
which inferred the whole of the universe from a piece of fairy cake. It was
a subtle play on the absurd notion that a very small sample could lead to
an understanding of an infinitely larger whole.

On a very basic level, why bother sampling the whole table at all? Why not
check one block and infer all information from that? Because we know that
isn't enough data. In a table of 4.6 million rows, can you say with any
mathematical certainty that a sample of 100 points can be, in any way,
representative?

Another problem with random sampling is trend analysis. Oftentimes there
are minor trends in the data. Ron pointed out the lastname/firstname trend:
although there seems to be no correlation between firstnames across the
table, there are clearly groups or clusters of ordered data, an ordering
that is missed by too small a sample.

I understand why you chose the Vitter algorithm, because it provides a
basically sound methodology for sampling without knowledge of the size of
the whole, but I think we can do better. I would suggest using the current
algorithm the first time through, then adjust the number of samples [n]
based on the previous estimate of the size of the table [N]. Each
successive ANALYZE will become more accurate. The Vitter algorithm is
still useful as [N] will always be an estimate.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql(at)mohawksoft(dot)com
Cc: "Ron Mayer" <rm_pg(at)cheapcomplexdevices(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-07 15:32:10
Message-ID: 27032.1107790330@sss.pgh.pa.us
Lists: pgsql-hackers

pgsql(at)mohawksoft(dot)com writes:
> On a very basic level, why bother sampling the whole table at all? Why not
> check one block and infer all information from that? Because we know that
> isn't enough data. In a table of 4.6 million rows, can you say with any
> mathmatical certainty that a sample of 100 points can be, in any way,
> representative?

This is a statistical argument, not a rhetorical one, and I'm not going
to bother answering handwaving. Show me some mathematical arguments for
a specific sampling rule and I'll listen.

regards, tom lane


From: pgsql(at)mohawksoft(dot)com
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql(at)mohawksoft(dot)com, "Ron Mayer" <rm_pg(at)cheapcomplexdevices(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-07 16:27:59
Message-ID: 16623.24.91.171.78.1107793679.squirrel@mail.mohawksoft.com
Lists: pgsql-hackers

> pgsql(at)mohawksoft(dot)com writes:
>> On a very basic level, why bother sampling the whole table at all? Why
>> not
>> check one block and infer all information from that? Because we know
>> that
>> isn't enough data. In a table of 4.6 million rows, can you say with any
>> mathmatical certainty that a sample of 100 points can be, in any way,
>> representative?
>
> This is a statistical argument, not a rhetorical one, and I'm not going
> to bother answering handwaving. Show me some mathematical arguments for
> a specific sampling rule and I'll listen.
>

Tom, I am floored by this response; I am shaking my head in disbelief.

It is inarguable that increasing the sample size increases the accuracy of
a study, especially when the diversity of the subject is unknown. It is
known that reducing a sample size increases the probability of error in any
poll or study. The required sample size depends on the variance of the
whole. It is mathematically unsound to ASSUME any sample size is valid
without understanding the standard deviation of the set.

http://geographyfieldwork.com/MinimumSampleSize.htm

Again, I understand why you used the Vitter algorithm, but it has been
proven insufficient (as used) with the US Census TIGER database. We
understand this because we have seen that the random sampling as
implemented has insufficient information to properly characterize the
variance in the data.


From: Bruno Wolff III <bruno(at)wolff(dot)to>
To: pgsql(at)mohawksoft(dot)com
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-07 17:15:55
Message-ID: 20050207171555.GA2179@wolff.to
Lists: pgsql-hackers

On Mon, Feb 07, 2005 at 11:27:59 -0500,
pgsql(at)mohawksoft(dot)com wrote:
>
> It is inarguable that increasing the sample size increases the accuracy of
> a study, especially when diversity of the subject is unknown. It is known
> that reducing a sample size increases probability of error in any poll or
> study. The required sample size depends on the variance of the whole. It
> is mathmatically unsound to ASSUME any sample size is valid without
> understanding the standard deviation of the set.

For large populations, the accuracy of estimates of statistics based on
random samples from that population is not very sensitive to population
size and depends primarily on the sample size. So you would not expect to
need to use larger sample sizes on larger data sets, for data sets over
some minimum size.


From: pgsql(at)mohawksoft(dot)com
To: "Bruno Wolff III" <bruno(at)wolff(dot)to>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Ron Mayer" <rm_pg(at)cheapcomplexdevices(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-07 18:28:04
Message-ID: 16805.24.91.171.78.1107800884.squirrel@mail.mohawksoft.com
Lists: pgsql-hackers

> On Mon, Feb 07, 2005 at 11:27:59 -0500,
> pgsql(at)mohawksoft(dot)com wrote:
>>
>> It is inarguable that increasing the sample size increases the accuracy
>> of
>> a study, especially when diversity of the subject is unknown. It is
>> known
>> that reducing a sample size increases probability of error in any poll
>> or
>> study. The required sample size depends on the variance of the whole. It
>> is mathmatically unsound to ASSUME any sample size is valid without
>> understanding the standard deviation of the set.
>
> For large populations the accuracy of estimates of statistics based on
> random
> samples from that population are not very sensitve to population size and
> depends primarily on the sample size. So that you would not expect to need
> to use larger sample sizes on larger data sets for data sets over some
> minimum size.

That assumes a fairly low standard deviation. If the standard deviation is
low, then a minimal sample size works fine. If there was zero deviation in
the data, then a sample of one works fine.

If the standard deviation is high, then you need more samples. If you have
a high standard deviation and a large data set, you need more samples than
you would need for a smaller data set.

In the current implementation of analyze.c, the default is 100 samples. On
a table of 10,000 rows, that is probably a good number to characterize the
data well enough for the query optimizer (a 1% sample). For a table with
4.6 million rows, that's about 0.002%.

Think about an irregularly occurring event, unevenly distributed throughout
the data set. A randomized sample strategy normalized across the whole
data set with too few samples will mischaracterize the event or even miss
it altogether.


From: Bruno Wolff III <bruno(at)wolff(dot)to>
To: pgsql(at)mohawksoft(dot)com
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-07 20:34:53
Message-ID: 20050207203453.GA7517@wolff.to
Lists: pgsql-hackers

On Mon, Feb 07, 2005 at 13:28:04 -0500,
> >
> > For large populations the accuracy of estimates of statistics based on
> > random
> > samples from that population are not very sensitve to population size and
> > depends primarily on the sample size. So that you would not expect to need
> > to use larger sample sizes on larger data sets for data sets over some
> > minimum size.
>
> That assumes a fairly low standard deviation. If the standard deviation is
> low, then a minimal sample size works fine. If there was zero deviation in
> the data, then a sample of one works fine.

This doesn't assume a low standard deviation. That wouldn't make sense
anyway since the standard deviation depends on the units used for
the measurements.

> In the current implementation of analyze.c, the default is 100 samples. On
> a table of 10,000 rows, that is probably a good number characterize the
> data enough for the query optimizer (1% sample). For a table with 4.6
> million rows, that's less than 0.002%

The fraction of rows isn't relevant unless it is a large fraction of the
total, in which case you can use a smaller sample than you might otherwise.

> Think about an iregularly occuring event, unevenly distributed throughout
> the data set. A randomized sample strategy normalized across the whole
> data set with too few samples will mischaracterize the event or even miss
> it altogether.

What you are saying here is that if you want more accurate statistics, you
need to sample more rows. That is true. However, the size of the sample
is essentially only dependent on the accuracy you need and not the size
of the population, for large populations.


From: Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>
To: pgsql(at)mohawksoft(dot)com
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-07 21:22:09
Message-ID: 4207DC01.9060102@coretech.co.nz
Lists: pgsql-hackers

Maybe I am missing something - ISTM that you can increase your
statistics target for those larger tables to obtain a larger (i.e.
better) sample.

regards

Mark

pgsql(at)mohawksoft(dot)com wrote:
>>pgsql(at)mohawksoft(dot)com writes:
> Any and all random sampling assumes a degree of uniform distribution. This
> is the basis of the model. It assumes that chunks of the whole will be
> representative of the whole (to some degree). This works when normal
> variations are more or less distributed uniformly. As variations and
> trends becomes less uniformly distributed, more samples are required to
> characterize it.
>
> Douglas Adams had a great device called the "Total Perspective Vortex"
> which infered the whole of the universe from a piece of fairy cake. It was
> a subtle play on the absurd notion that a very small sample could lead to
> an understanding of an infinitly larger whole.
>
> On a very basic level, why bother sampling the whole table at all? Why not
> check one block and infer all information from that? Because we know that
> isn't enough data. In a table of 4.6 million rows, can you say with any
> mathmatical certainty that a sample of 100 points can be, in any way,
> representative?
>
> Another problem with random sampling is trend analysis. Often times there
> are minor trends in data. Ron pointed out the lastname firstname trend.
> Although there seems to be no correlation between firstnames in the table,
> there are clearly groups or clusters of ordered data that is an ordering
> that is missed by too small a sample.
>
> I understand why you chose the Vitter algorithm, because it provides a
> basically sound methodology for sampling without knowledge of the size of
> the whole, but I think we can do better. I would suggest using the current
> algorithm the first time through, then adjust the number of samples [n]
> based on the previous estimate of the size of the table [N]. Each
> successive ANALYZE will become more accurate. The Vitter algorithm is
> still useful as [N] will always be an estimate.
>


From: pgsql(at)mohawksoft(dot)com
To: "Bruno Wolff III" <bruno(at)wolff(dot)to>
Cc: pgsql(at)mohawksoft(dot)com, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Ron Mayer" <rm_pg(at)cheapcomplexdevices(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-07 22:16:56
Message-ID: 16759.24.91.171.78.1107814616.squirrel@mail.mohawksoft.com
Lists: pgsql-hackers

> On Mon, Feb 07, 2005 at 13:28:04 -0500,
>
> What you are saying here is that if you want more accurate statistics, you
> need to sample more rows. That is true. However, the size of the sample
> is essentially only dependent on the accuracy you need and not the size
> of the population, for large populations.
>
That's nonsense.

If your total data size is 100 elements in a set, then a sample size of
100 elements will cover 100% of your data.

If your total data size is 10,000 elements in a set, then a sample size of
100 elements will cover 1% of your data.

In the case of the TIGER database, a base of 100 samples means about 0.002%
of the data is sampled. Think about that: that is an average of 1 sample
about every 50,000 records. You could have substantial but irregular
trends in the data that may never get detected, and this is EXACTLY what
we see. If we increase the sample size (targrows), the statistics suddenly
work better.

For instance, look at the data below.

The first analyze / select from pg_stats is with an analyze of 3000
samples. The zipl and zipr columns get calculated poorly and can cause the
planner to use a table scan instead of an index scan.

The second analyze / select from pg_stats is with an analyze of 10000
samples. The zipl and zipr n_distinct values are still off by a factor of
10, but close enough for the planner to deal with.

If the premise is that sample size doesn't make a difference, I think
we've proved that this is not true.

tiger=# analyze verbose rt1;
INFO: analyzing "public.rt1"
INFO: "rt1": scanned 3000 of 1527360 pages, containing 90978 live rows
and 0 dead rows; 3000 rows in sample, 46318719 estimated total rows
ANALYZE

tiger=# select * from pg_stats where tablename = 'rt1' and attname like
'zip%';
schemaname | tablename | attname | null_frac | avg_width | n_distinct |
most_common_vals |
most_common_freqs
| histogram_bounds
| correlation
------------+-----------+---------+-----------+-----------+------------+---------------------------------------------------------+------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------+-------------
public | rt1 | zipl | 0.672 | 4 | 960 |
{76240,52601,55746,71730,74604,92705,93117,95818} |
{0.00166667,0.00133333,0.00133333,0.00133333,0.00133333,0.00133333,0.00133333,0.00133333}
|
{1085,16652,28206,33412,43147,49428,58801,68110,77515,91340,99006} |
-0.119519
public | rt1 | zipr | 0.677 | 4 | 960 |
{76240,52601,55746,71730,74604,78577,92705,93117,95818} |
{0.00166667,0.00133333,0.00133333,0.00133333,0.00133333,0.00133333,0.00133333,0.00133333,0.00133333}
| {962,15613,28572,33606,43545,49428,60423,68064,77040,91340,99006} |
-0.104158
(2 rows)

Now this:
tiger=# analyze verbose rt1;
INFO: analyzing "public.rt1"
INFO: "rt1": scanned 10000 of 1527360 pages, containing 303419 live rows
and 0 dead rows; 10000 rows in sample, 46343004 estimated total rows
ANALYZE

tiger=# select * from pg_stats where tablename = 'rt1' and attname like
'zip%';
schemaname | tablename | attname | null_frac | avg_width | n_distinct |
most_common_vals |
most_common_freqs |
histogram_bounds | correlation
------------+-----------+---------+-----------+-----------+------------+---------------------------------------------------------------+-------------------------------------------------------------------------+-------------------------------------------------------------------+-------------
public | rt1 | zipl | 0.6807 | 4 | 2942 |
{61832,13090,17404,30907,31204,45342,47714,63050,80918,93726} |
{0.0008,0.0006,0.0005,0.0005,0.0005,0.0005,0.0005,0.0005,0.0005,0.0005} |
{654,15018,28208,33870,43006,49008,59741,68803,78640,92105,99687} |
-0.137744
public | rt1 | zipr | 0.684 | 4 | 2921 |
{13090,61832,30907,31204,45342,47714,63050,70122,80918,93726} |
{0.0006,0.0006,0.0005,0.0005,0.0005,0.0005,0.0005,0.0005,0.0005,0.0005} |
{731,14824,27871,33324,42276,48895,58401,68338,78575,92105,99654} |
-0.140663
(2 rows)


From: Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>
To: pgsql(at)mohawksoft(dot)com
Cc: Bruno Wolff III <bruno(at)wolff(dot)to>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-07 22:31:33
Message-ID: 20050207223133.GB30539@dcc.uchile.cl
Lists: pgsql-hackers

On Mon, Feb 07, 2005 at 05:16:56PM -0500, pgsql(at)mohawksoft(dot)com wrote:
> > On Mon, Feb 07, 2005 at 13:28:04 -0500,
> >
> > What you are saying here is that if you want more accurate statistics, you
> > need to sample more rows. That is true. However, the size of the sample
> > is essentially only dependent on the accuracy you need and not the size
> > of the population, for large populations.
> >
> That's nonsense.

Huh, have you studied any statistics?

--
Alvaro Herrera (<alvherre[(at)]dcc(dot)uchile(dot)cl>)
"At least to kernel hackers, who really are human, despite occasional
rumors to the contrary" (LWN.net)


From: pgsql(at)mohawksoft(dot)com
To: "Mark Kirkwood" <markir(at)coretech(dot)co(dot)nz>
Cc: pgsql(at)mohawksoft(dot)com, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Ron Mayer" <rm_pg(at)cheapcomplexdevices(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-07 22:36:27
Message-ID: 16397.24.91.171.78.1107815787.squirrel@mail.mohawksoft.com
Lists: pgsql-hackers

> Maybe I am missing something - ISTM that you can increase your
> statistics target for those larger tables to obtain a larger (i.e.
> better) sample.

No one is arguing that you can't manually do things, but I am not the
first to notice this. I saw the query planner doing something completely
stupid and set off to discover why.

Think about the person using PostgreSQL for the first time. He/she does
not know about this stuff. Even if they've read the FAQs and the manual
cover to cover, it will take them some time to figure out how it all works
together. PostgreSQL is a big system, and this is exactly why MySQL gets
better marks from newbies.

In this case, the behavior observed could be changed by altering the
sample size for a table. I submit that an arbitrary fixed sample size is
not a good base for the analyzer, but that the sample size should be based
on the size of the table or some calculation of its deviation.

There is no reason why old stats can't be used to create more accurate
stats. Using successive analyze operations, we could create better
statistics for the planner. We can increase the sample size based on the
table size. We could, I suppose, also calculate some sort of deviation
statistic so that "n_distinct" can be calculated better with a smaller
sample set.

The basic problem, though, is that PostgreSQL performed incorrectly on a
simple query after indexes were created and analyze performed. Yes, it can
be corrected, that's what led me to my conclusions, but shouldn't we try
to devise a better system in the future to improve PostgreSQL so it does
not need this sort of tuning?

>
> regards
>
> Mark
>
> pgsql(at)mohawksoft(dot)com wrote:
>>>pgsql(at)mohawksoft(dot)com writes:
>> Any and all random sampling assumes a degree of uniform distribution.
>> This
>> is the basis of the model. It assumes that chunks of the whole will be
>> representative of the whole (to some degree). This works when normal
>> variations are more or less distributed uniformly. As variations and
>> trends becomes less uniformly distributed, more samples are required to
>> characterize it.
>>
>> Douglas Adams had a great device called the "Total Perspective Vortex"
>> which infered the whole of the universe from a piece of fairy cake. It
>> was
>> a subtle play on the absurd notion that a very small sample could lead
>> to
>> an understanding of an infinitly larger whole.
>>
>> On a very basic level, why bother sampling the whole table at all? Why
>> not
>> check one block and infer all information from that? Because we know
>> that
>> isn't enough data. In a table of 4.6 million rows, can you say with any
>> mathmatical certainty that a sample of 100 points can be, in any way,
>> representative?
>>
>> Another problem with random sampling is trend analysis. Often times
>> there
>> are minor trends in data. Ron pointed out the lastname firstname trend.
>> Although there seems to be no correlation between firstnames in the
>> table,
>> there are clearly groups or clusters of ordered data that is an ordering
>> that is missed by too small a sample.
>>
>> I understand why you chose the Vitter algorithm, because it provides a
>> basically sound methodology for sampling without knowledge of the size
>> of
>> the whole, but I think we can do better. I would suggest using the
>> current
>> algorithm the first time through, then adjust the number of samples [n]
>> based on the previous estimate of the size of the table [N]. Each
>> successive ANALYZE will become more accurate. The Vitter algorithm is
>> still useful as [N] will always be an estimate.
>>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/docs/faq
>


From: Greg Stark <gsstark(at)mit(dot)edu>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-07 22:45:23
Message-ID: 87vf94gfb0.fsf@stark.xeocode.com
Lists: pgsql-hackers


Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> The papers that I looked at say that this rule has a good solid
> statistical foundation, at least for the case of estimating histograms.

Earlier I was discussing the issue of how to measure cross-column
"correlation" with someone who works on precisely these types of problems. He
immediately honed in on precisely this issue of what kinds of deductions can
be extrapolated from a constant sample like this and what kind of deductions
would require a linear sample, or worse: can't be reasonably extrapolated at
all.

After the discussion with him I'm left with a distinctly uneasy feeling about
some of Postgres's stats. I was planning to investigate further before I wrote
about this though, since I'm not sure exactly where they all stand. The
histograms are indeed on solid footing, but I'm not so certain that's true for
some of the other bits of statistics.

The statistical foundation for histograms is the traditional sampling from a
population that supports every opinion poll you see in the news. Those opinion
polls only need a fixed size sample to obtain a given confidence interval for
almost any population size. Postgres's histograms are just describing the
distribution in a similar way.

However for discrete values like the "top ten most common values" and the
"total number of distinct values" it's not so clear at all that you can
extrapolate from a sample at all. And it's certainly not clear that a fixed
size sample gives you at all the same confidence for a large population as it
does for a small population.

If you sampled across the country and found your sample of 600 people had 4
different top choices for president, how do you extrapolate that to calculate
the total number of top choices for president the 300 million+ people will
have across the country? You could multiply by 500k but that's not going to be
right. Hell you're probably better off just going with "4" but that's clearly
not right either.

The reality is that your sample hasn't really given you *any* information on
how frequently you'll find outliers in the distribution. There are probably a
couple hundred or a couple thousand values that occur only once or twice, but
the sample size you would have to use to get a good estimate of how many there
were will be proportional to the entire population, *unlike the histograms*.

The essential point here is that to count "discrete values" you only care
whether they occur at all, not how many times they occur. So you lose
the benefit of being able to model the population as "large", the values as
continuous, and the distribution as a normal curve. So you can't really talk
about a 95% confidence interval any more and every outlier becomes as
important as the more common values.

--
greg


From: pgsql(at)mohawksoft(dot)com
To: "Alvaro Herrera" <alvherre(at)dcc(dot)uchile(dot)cl>
Cc: "Bruno Wolff III" <bruno(at)wolff(dot)to>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Ron Mayer" <rm_pg(at)cheapcomplexdevices(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-07 23:29:18
Message-ID: 16769.24.91.171.78.1107818958.squirrel@mail.mohawksoft.com
Lists: pgsql-hackers

> On Mon, Feb 07, 2005 at 05:16:56PM -0500, pgsql(at)mohawksoft(dot)com wrote:
>> > On Mon, Feb 07, 2005 at 13:28:04 -0500,
>> >
>> > What you are saying here is that if you want more accurate statistics,
>> you
>> > need to sample more rows. That is true. However, the size of the
>> sample
>> > is essentially only dependent on the accuracy you need and not the
>> size
>> > of the population, for large populations.
>> >
>> That's nonsense.
>
> Huh, have you studied any statistics?

To what aspects of "statistics" are you referring? I was not a math
major, no, but I did have my obligatory classes as well as algorithms and
so on. I've only worked in the industry for over 20 years.

I've worked with statistical analysis of data on multiple projects,
ranging from medical instruments, compression, encryption, and web based
recommendations systems.

I assume "Huh, have you studied any statistics?" was a call for
qualifications. And yes, some real math major would be helpful in this
discussion because clearly there is a disconnect.

The basic problem with a fixed sample is that it assumes a normal
distribution. If data variation is evenly distributed across a set, then a
sample of sufficient size would be valid for almost any data set. That
isn't what I'm saying. If the data variation is NOT uniformly distributed
across the data set, the sample size has to be larger because there is
"more" data.

I think I can explain with a visual.

I started my career as an electrical engineer and took an experimental
class called "computer science." Sure, it was a long time ago, but bear
with me.

When you look at a sine wave on an oscilloscope, you can see it clear as
day. When you look at music on the scope, you know there are many waves
there, but it is difficult to make heads or tails of it. (use xmms or
winamp to see for yourself) The waves change in frequency, amplitude, and
duration over a very large scale. That's why you use a spectrum analyzer
to go from time domain to frequency domain. In frequency domain, you can
see the trends better.

This is the problem we are having. Currently, the analyze.c code assumes a
very regular data set, in which case almost any sample size would work
fine. What we see when we use it to analyze complex data sets is that it
can't characterize the data well.

The solution is to completely change the statistics model to handle
complex and unpredictably changing trends, or, increase the sample size.


From: Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>
To: pgsql(at)mohawksoft(dot)com
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-08 01:50:41
Message-ID: 42081AF1.8030403@coretech.co.nz
Lists: pgsql-hackers

pgsql(at)mohawksoft(dot)com wrote:
>
> In this case, the behavior observed could be changed by altering the
> sample size for a table. I submit that an arbitrary fixed sample size is
> not a good base for the analyzer, but that the sample size should be based
> on the size of the table or some calculation of its deviation.
>
I can see your point; however, I wonder if the issue is that the default
stats setting of '10' (3000 rows, 10 histogram buckets) is too low, and
maybe we should consider making a higher value (say '100') the default.
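
For scale, the sampler draws roughly 300 rows per unit of statistics
target, so a default of 100 would mean sampling about 30000 rows instead
of 3000. Trying it out per session is just:

set default_statistics_target = 100;
analyze rt1;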

> There is no reason why old stats can't be used to create more accurate
> stats. Using succesive analyze operations, we could create better
> statistics for the planner. We can increase the sample size based on the
> table size. We could, I suppose, also calculate some sort of deviation
> statistic so that "n_distinct" can be calculated better with a smaller
> sample set.

The idea of either automatically increasing sample size for large
tables, or doing a few more samplings with different sizes and examining
the stability of the estimates is rather nice, provided we can keep the
runtime for ANALYZE to reasonable limits, I guess :-)
>
> The basic problem, though, is that PostgreSQL performed incorrectly on a
> simple query after indexes were created and analyze performed. Yes, it can
> be corrected, that's what led me to my conclusions, but shouldn't we try
> to devise a better system in the future to improve PostgreSQL so it does
> not need this sort of tuning?
>
Thanks for clarifying.

best wishes

Mark


From: Greg Stark <gsstark(at)mit(dot)edu>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-08 02:02:54
Message-ID: 87pszbhkq9.fsf@stark.xeocode.com
Lists: pgsql-hackers

pgsql(at)mohawksoft(dot)com writes:

> The basic problem with a fixed sample is that is assumes a normal
> distribution.

That's sort of true, but not in the way you think it is.

What's under discussion here, w.r.t. the histograms and the validity of
sampling, is really basic statistics. You don't have to be a math major; in
fact math majors tend to avoid the classes that teach this because it's full
of social sciences students :)

It doesn't matter what distribution the variable you're sampling has. In fact
if you think of most applications of sampling, whether pre-election opinion
polls, or a survey about how often people brush their teeth, you very rarely
find a real normal distribution. Usually the whole point of the survey is to
determine what the distribution is. That's exactly what Postgres is recording
with its histograms.

Where the normal curve comes into play is in measuring the likelihood of the
actual population distribution varying from the sample distribution. The most
likely population distribution is exactly equal to the sample distribution.
Distributions slightly different from the sample distribution are slightly
less likely, and the further removed from the sample distribution, the less
and less likely they are. This drops off in precisely the shape of a normal
curve.

For large populations this is *always* true, provably; it doesn't matter
what distribution the raw data or your sample has. The likelihood of the
sample diverging from the population will look like a normal curve.

The stddev of this normal curve represents how precisely you can describe the
distribution of the population based on your sample. The smaller the sample
the larger the stddev of this normal curve, and the larger the probability
that the population diverges from the sample you drew.

To calculate the stddev there's a complex formula that most scientific
calculators have preprogrammed. This function does *not* depend on the
distribution of the raw data (which you don't know in any case).

There are two versions of this function though: one for "small" populations
where the sample is a significant percentage of the population. As you
point out, if you're sampling 50% of the entire population then yes, you
can have more confidence in your results than with 0.002% of the entire
population.

However, once you're talking about populations like an entire state or country
then it's simpler to just calculate the confidence as if the population is
"large". In that case it no longer matters whether the sample is .002% or
.0002% of the population. If it's 600 people then you have the "+/- 3% 19
times out of 20" confidence interval that all those opinion polls you see put
in their fine print. To achieve that they only have to poll 600 people,
whether they're polling a city, a state, or the entire country.
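
For anyone who wants to check the arithmetic, the textbook margin of error
for a sampled proportion p with sample size n (z is about 1.96 for 95%
confidence), and its finite-population variant for population size N, are:

  MOE \approx z \sqrt{ p(1-p) / n }

  MOE_{finite} = z \sqrt{ p(1-p) / n } \cdot \sqrt{ (N-n) / (N-1) }

The second factor is essentially 1 once N is much larger than n, which is
why the population size drops out for state- or country-sized N.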

> When you look at a sine wave on an oscilloscope, you can see it clear as
> day. When you look at music on the scope, you know there are many waves
> there, but it is difficult to make heads or tails of it. (use xmms or
> winamp to see for yourself) The waves change in frequency, amplitude, and
> duration over a very large scale. That's why you use a spectrum analyzer
> to go from time domain to frequency domain. In frequency domain, you can
> see the trends better.

That's not a bad analogy to many problems where you're measuring data that
has non-randomness in it, but where that structure is not visible in the
domain in which the statistics are being gathered. This seems to happen a
lot with geographic data, for instance.

If you find that increasing the stats targets improves things then this isn't
true. If you find that it doesn't then what's really needed is a cleverer set
of statistics to look for.

--
greg


From: Ron Mayer <ron(at)cheapcomplexdevices(dot)com>
To: pgsql(at)mohawksoft(dot)com
Cc: Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-08 03:15:50
Message-ID: 42082EE6.3070401@cheapcomplexdevices.com
Lists: pgsql-hackers

pgsql(at)mohawksoft(dot)com wrote:
>
> In this case, the behavior observed could be changed by altering the
> sample size for a table. I submit that an arbitrary fixed sample size is
> not a good base for the analyzer, but that the sample size should be based
> on the size of the table or some calculation of its deviation.
>

Mark,

Do you have any evidence that the Sample Size had anything to do
with the performance problem you're seeing?

I also do a lot with the complete Census/TIGER database.

Every problem I have with the optimizer comes down to the
fact that the data is loaded (and ordered on disk) by
State/County FIPS codes, and then queried by zip-code
or by city name. Like this:

Alabama 36101 [hundreds of pages with zip's in 36***]
Alaska 99686 [hundreds of pages with zip's in 9****]
Arizona 85701 [hundreds of pages with zip's in 855**]

Note that the zip codes are *NOT* sequential.

The "correlation" statistic sees that the Zip codes are not
sequential; so it makes the *HORRIBLE* assumption that they
are scattered randomly across the disk.

In reality, even though there's no total ordering of the
zip codes, any given zip code only exists on a couple of
disk pages, so index scans would be the right choice.

But the single correlation parameter is not sufficient
to let the optimizer know this.

No matter how large a sample size you choose, ANALYZE
will correctly see that Zip codes and State FIPS codes
are non-correlated, and the optimizer will overestimate
the # of pages an index scan will need.
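
(The statistic in question is easy to inspect; this just reads the
standard pg_stats view, using the TIGER rt1 columns from this thread:)

  SELECT tablename, attname, n_distinct, correlation
    FROM pg_stats
   WHERE tablename = 'rt1'
     AND attname IN ('zipl', 'zipr', 'statel', 'countyl');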

Ron

PS: I pointed out workarounds in my earlier posting
in this thread. Yes, I'm using the same TIGER data
you are.


From: pgsql(at)mohawksoft(dot)com
To: "Ron Mayer" <ron(at)cheapcomplexdevices(dot)com>
Cc: "Mark Kirkwood" <markir(at)coretech(dot)co(dot)nz>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Ron Mayer" <rm_pg(at)cheapcomplexdevices(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-08 13:48:02
Message-ID: 16552.24.91.171.78.1107870482.squirrel@mail.mohawksoft.com
Lists: pgsql-hackers

> pgsql(at)mohawksoft(dot)com wrote:
>>
>> In this case, the behavior observed could be changed by altering the
>> sample size for a table. I submit that an arbitrary fixed sample size is
>> not a good base for the analyzer, but that the sample size should be
>> based
>> on the size of the table or some calculation of its deviation.
>>
>
> Mark,
>
> Do you have any evidence that the Sample Size had anything to do
> with the performance problem you're seeing?

I have evidence: if you look through some of the messages in this thread,
you'll see how a sample size of 10000 provides enough data points to
create stats the planner can use.

>
> I also do a lot with the complete Census/TIGER database.

Cool, have any code for Mapserver?

>
> Every problem I have with the optimizer comes down to the
> fact that the data is loaded (and ordered on disk) by
> State/County FIPS codes, and then queried by zip-code
> or by city name. Like this:
>
> Alabama 36101 [hundreds of pages with zip's in 36***]
> Alaska 99686 [hundreds of pages with zip's in 9****]
> Arizona 85701 [hundreds of pages with zip's in 855**]
>
> Note that the zip codes are *NOT* sequential.
>
> The "correlation" statistic sees that the Zip codes are not
> sequential; so it makes the *HORRIBLE* assumption that they
> are scattered randomly across the disk.

It is my theory that this is because there are too few data points with
which to properly characterize the nature of the data.

>
> In reality, even though there's no total ordering of the
> zip codes; any given zip code only exists on a couple
> disk pages; so index scans would be the right choice.
I totally agree.
>
>
> But the single correlation parameter is not sufficient
> to let the optimizer known this.
>
> No matter how large a sample size you choose, ANALYZE
> will correctly see that Zip codes and State FIPS codes
> are non-correlated, and the optimizer will overestimate
> the # of pages an index scan will need.
>

I tried to create an analogy in another post, and TIGER is a perfect
example of the analogy.

Think of the difference between an oscilloscope and a spectrum analyzer.
The current sampling code works more like an oscilloscope. It assumes a
fairly normalized distribution of data. Given this, it works perfectly
fine.

When a scope is presented with an audio signal, it looks more like
gibberish, showing almost no correlation. When you view it in the
frequency domain, as with a spectrum analyzer, you can see clear patterns
in the signal.

Now, fortunately, we don't need any sort of absolute visualization of the
data in TIGER; we only need to see that the data has many subtle trends
rather than one fairly even overall distribution. That's why more samples
help.

If we could do anything, I would add more statistics to the database: a
standard deviation and maybe a sliding-window deviation. The standard
deviation might be pretty high, whereas a sliding-window deviation would
show that the variation is much smaller locally. Lower local variation
would favor index scans.

Anyway, like I said, I think the expectation that the data is fairly
normal or evenly distributed works very well for data acquired over time.
It is data like TIGER, which is ordered on multiple fields (i.e. state,
zipr, zipl), that has complex patterns in the secondary sort keys that
can't be detected with too small a sample.

>
> PS: I pointed out workarounds in my earlier posting
> in this thread. Yes, I'm using the same TIGER data
> you are.
>

Cool.
>


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>
Cc: pgsql(at)mohawksoft(dot)com, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-08 13:57:34
Message-ID: 20050208135734.GY10437@ns.snowman.net
Lists: pgsql-hackers

* Mark Kirkwood (markir(at)coretech(dot)co(dot)nz) wrote:
> I can see your point, however I wonder if the issue is that the default
> stats settings of '10' (3000 rows, 10 histogram buckets) is too low, and
> maybe we should consider making a higher value (say '100') the default.

Personally, I think that'd be reasonable.

> The idea of either automatically increasing sample size for large
> tables, or doing a few more samplings with different sizes and examining
> the stability of the estimates is rather nice, provided we can keep the
> runtime for ANALYZE to reasonable limits, I guess :-)

I also agree with this and personally don't mind *too* much if analyze
takes a little while on a large table to get decent statistics for it.
One thing I was wondering about, though, is whether we could use the
index to get some of the statistics information. Would it be possible, or
reasonable? Do we already? I dunno, just some thoughts there; I keep
hearing about the number of rows that are sampled, and I would have
thought it'd make sense to scan the index for things like the number of
distinct values...

Stephen


From: pgsql(at)mohawksoft(dot)com
To: "Greg Stark" <gsstark(at)mit(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-08 14:33:26
Message-ID: 16625.24.91.171.78.1107873206.squirrel@mail.mohawksoft.com
Lists: pgsql-hackers

> pgsql(at)mohawksoft(dot)com writes:
>
>> The basic problem with a fixed sample is that is assumes a normal
>> distribution.
>
> That's sort of true, but not in the way you think it is.
>
[snip]

Greg, I think you have an excellent ability to articulate stats, but I
think that the view that this is like election polling is incorrect.

Election polling assumes a very simple outcome: some standard distribution
of a limited number of options. I don't think that applies here.

>
>> When you look at a sine wave on an oscilloscope, you can see it clear as
>> day. When you look at music on the scope, you know there are many waves
>> there, but it is difficult to make heads or tails of it. (use xmms or
>> winamp to see for yourself) The waves change in frequency, amplitude,
>> and
>> duration over a very large scale. That's why you use a spectrum analyzer
>> to go from time domain to frequency domain. In frequency domain, you can
>> see the trends better.
>
> That's not a bad analogy to many problems where you're measuring data that
> has
> non-randomness in it but that are not visible in the domain that the
> statistics that are being analyzed. This seems to happen a lot with
> geographic
> data, for instance.

EXACTLY!!!

>
> If you find that increasing the stats targets improves things then this
> isn't true. If you find that it doesn't then what's really needed is a
> cleverer set of statistics to look for.

I will be the first to say that increasing the sample size is not perfect,
but it is a methodology that will help without major changes in Postgres.
Simply increasing the sample to a percentage of the estimated number of
rows (with some upper and lower limits, of course) will increase the
accuracy of the "n_distinct" and "correlation" statistics (at least a
little bit), and that will make a huge impact with very little work.
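
As a sketch of the kind of rule being suggested (the 0.1% fraction and the
clamp values are invented purely for illustration; greatest()/least() need
8.1, older servers would use a CASE expression):

  SELECT relname, reltuples,
         greatest(3000, least(300000, (reltuples * 0.001)::int))
           AS proposed_sample_rows
    FROM pg_class
   WHERE relname IN ('rt1', 'rt2');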

If we want to discuss improved statistics, then we should include a
standard deviation and a sliding-window deviation, or something like that.
Hell, maybe even an FFT.

The basic problem, I think, is that the sampling mechanism is more like an
oscilloscope looking for large trends instead of a spectrum analyzer
looking for the smaller ones.

We have to be able to tell the planner that adjacent values are less
random even though, as a whole, they are seemingly random.


From: Bruno Wolff III <bruno(at)wolff(dot)to>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-08 16:50:08
Message-ID: 20050208165008.GA1768@wolff.to
Lists: pgsql-hackers

On Mon, Feb 07, 2005 at 17:45:23 -0500,
Greg Stark <gsstark(at)mit(dot)edu> wrote:
>
> However for discrete values like the "top ten most common values" and the
> "total number of distinct values" it's not so clear at all that you can
> extrapolate from a sample at all. And it's certainly not clear that a fixed
> size sample gives you at all the same confidence for a large population as it
> does for a small population.

If you were to keep a complete histogram for the sample points (which may
or may not be practical), you should be able to estimate the number of
distinct values under some assumptions, such as that all values outside
of the top N have the same likelihood. I don't think this is
unreasonable.

>
> If you sampled across the country and found your sample of 600 people had 4
> different top choices for president, how do you extrapolate that to calculate
> the total number of top choices for president the 300 million+ people will
> have across the country? You could multiply by 500k but that's not going to be
> right. Hell you're probably better off just going with "4" but that's clearly
> not right either.

Well, you can put some bounds on the number. Since no 5th candidate was
picked by any of the 600 people, you would expect that the number of
people preferring other candidates is on the order of 500000 or less, and
so the number of distinct values is going to be 500000 or less.

I think the histogram idea will work well for estimating the number of
rows with particular values, but if you are interested in the number
of unique values, you are going to have problems with some data sets.
(Ones with a few very common values and lots of extremely rare items.)
In this case there may be some way to use information from indexes on
the data to get better results.
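
A quick sanity check along these lines is to compare what ANALYZE
estimated with the real value; the second query is a full scan, so it is
only for tables you can afford to read once (the table and column names
are from the TIGER data in this thread):

  SELECT n_distinct
    FROM pg_stats
   WHERE tablename = 'rt2' AND attname = 'tlid';

  SELECT count(DISTINCT tlid) FROM rt2;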


From: pgsql(at)mohawksoft(dot)com
To: "Ron Mayer" <ron(at)cheapcomplexdevices(dot)com>
Cc: "Mark Kirkwood" <markir(at)coretech(dot)co(dot)nz>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Ron Mayer" <rm_pg(at)cheapcomplexdevices(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-09 13:45:59
Message-ID: 16638.24.91.171.78.1107956759.squirrel@mail.mohawksoft.com
Lists: pgsql-hackers

I wrote a message called "One Big trend vs multiple smaller trends in
table statistics" that, I think, explains what we've been seeing.

> pgsql(at)mohawksoft(dot)com wrote:
>>
>> In this case, the behavior observed could be changed by altering the
>> sample size for a table. I submit that an arbitrary fixed sample size is
>> not a good base for the analyzer, but that the sample size should be
>> based
>> on the size of the table or some calculation of its deviation.
>>
>
> Mark,
>
> Do you have any evidence that the Sample Size had anything to do
> with the performance problem you're seeing?

Sample size is only a band-aid for the issue; however, more samples
always provide more information.

>
> I also do a lot with the complete Census/TIGER database.
>
> Every problem I have with the optimizer comes down to the
> fact that the data is loaded (and ordered on disk) by
> State/County FIPS codes, and then queried by zip-code
> or by city name. Like this:
>
> Alabama 36101 [hundreds of pages with zip's in 36***]
> Alaska 99686 [hundreds of pages with zip's in 9****]
> Arizona 85701 [hundreds of pages with zip's in 855**]
>
> Note that the zip codes are *NOT* sequential.

Again, read "One Big Trend..." and let me know what you think. I think it
describes exactly the problem that we see.

For now, the solution that works for me is to seriously raise the value of
"targrows" in analyze.c. It makes ANALYZE take longer, and while the stats
are still not "correct", because they are not designed to detect these
sorts of patterns, a larger sample allows them to be "less wrong", enough
to give a better hint to the planner.


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: pgsql(at)mohawksoft(dot)com
Cc: Ron Mayer <ron(at)cheapcomplexdevices(dot)com>, Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-13 17:14:58
Message-ID: Pine.GSO.4.62.0502132001240.21142@ra.sai.msu.su
Lists: pgsql-hackers

Probably off-topic, but I think it's worth seeing what astronomers are
doing with their very big spatial databases. For example, we are working
with a catalog of more than 500,000,000 rows, and we use a special
transformation of coordinates to integer numbers that preserves the
closeness of objects. I hope we can show that PostgreSQL is good enough to
be used in astronomy for very big catalogs; currently, MS SQL is in use.
See http://www.sdss.jhu.edu/htm/ for details (we use another technique).

Oleg
On Wed, 9 Feb 2005 pgsql(at)mohawksoft(dot)com wrote:

> I wrote a message caled "One Big trend vs multiple smaller trends in table
> statistics" that, I think, explains what we've been seeing.
>
>
>> pgsql(at)mohawksoft(dot)com wrote:
>>>
>>> In this case, the behavior observed could be changed by altering the
>>> sample size for a table. I submit that an arbitrary fixed sample size is
>>> not a good base for the analyzer, but that the sample size should be
>>> based
>>> on the size of the table or some calculation of its deviation.
>>>
>>
>> Mark,
>>
>> Do you have any evidence that the Sample Size had anything to do
>> with the performance problem you're seeing?
>
> Sample size is only a bandaid for the issue, however, more samples always
> provide more information.
>
>
>>
>> I also do a lot with the complete Census/TIGER database.
>>
>> Every problem I have with the optimizer comes down to the
>> fact that the data is loaded (and ordered on disk) by
>> State/County FIPS codes, and then queried by zip-code
>> or by city name. Like this:
>>
>> Alabama 36101 [hundreds of pages with zip's in 36***]
>> Alaska 99686 [hundreds of pages with zip's in 9****]
>> Arizona 85701 [hundreds of pages with zip's in 855**]
>>
>> Note that the zip codes are *NOT* sequential.
>
> Again, read "One Big Trend..." and let me know what you think. I think it
> describes exactly the problem that we see.
>
> For now, the solution that works for me is to seriously up the value of
> "targrows" in analyze.c. It makes it take longer, and while the stats are
> not "correct" because they are not designed to detect these sorts of
> patterns, a larger sample allows them to be "less wrong" enough to give a
> better hint to the planner.
>
>
>
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


From: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc: pgsql(at)mohawksoft(dot)com, Ron Mayer <ron(at)cheapcomplexdevices(dot)com>, Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-13 21:14:02
Message-ID: 20050213211402.GB52357@decibel.org
Lists: pgsql-hackers

What's the purpose of doing this transformation? Is it just a means to
sub-divide the dataset? It's very possible that PostGIS would do just as
good a job, without using HTM. Granted, GIS is designed more for working
in LAT/LONG, but I suspect it should work just as well in whatever
coordinate system astronomers use.

Something else to consider is that it would be relatively easy to
implement an HTM type in PostgreSQL, which would result in substantial
space savings. You need 3 bits for level 0, and each additional level
requires 2 bits. That is much smaller than storing the HTM id in a
varchar, and also smaller than using a bit to indicate N vs. S plus an
int (or using the sign of an int to indicate N/S).
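
A back-of-the-envelope check of the sizes (the depths here are arbitrary
examples, not anything taken from the SDSS scheme):

  SELECT depth,
         3 + 2 * depth AS bits_needed,
         CASE WHEN 3 + 2 * depth <= 32 THEN 'fits in an int4'
              WHEN 3 + 2 * depth <= 64 THEN 'fits in an int8'
              ELSE 'needs a custom type' END AS storage
    FROM (SELECT 8 AS depth UNION ALL SELECT 20 UNION ALL SELECT 31) AS d;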

On Sun, Feb 13, 2005 at 08:14:58PM +0300, Oleg Bartunov wrote:
> Probably off-topic, but I think it's worth to see what astronomers are
> doing with their very big spatial databases. For example, we are working
> with more than 500,000,000 rows catalog and we use some special
> transformation
> of coordinates to integer numbers with preserving objects closeness.
> I hope we could show postgresql is good enough to be used in astronomy
> for very big catalogs. Currently, MS SQL is in use.
> See http://www.sdss.jhu.edu/htm/ for details. We use another technique.
>
>
> Oleg
> On Wed, 9 Feb 2005 pgsql(at)mohawksoft(dot)com wrote:
>
> >I wrote a message caled "One Big trend vs multiple smaller trends in table
> >statistics" that, I think, explains what we've been seeing.
> >
> >
> >>pgsql(at)mohawksoft(dot)com wrote:
> >>>
> >>>In this case, the behavior observed could be changed by altering the
> >>>sample size for a table. I submit that an arbitrary fixed sample size is
> >>>not a good base for the analyzer, but that the sample size should be
> >>>based
> >>>on the size of the table or some calculation of its deviation.
> >>>
> >>
> >> Mark,
> >>
> >>Do you have any evidence that the Sample Size had anything to do
> >>with the performance problem you're seeing?
> >
> >Sample size is only a bandaid for the issue, however, more samples always
> >provide more information.
> >
> >
> >>
> >>I also do a lot with the complete Census/TIGER database.
> >>
> >>Every problem I have with the optimizer comes down to the
> >>fact that the data is loaded (and ordered on disk) by
> >>State/County FIPS codes, and then queried by zip-code
> >>or by city name. Like this:
> >>
> >> Alabama 36101 [hundreds of pages with zip's in 36***]
> >> Alaska 99686 [hundreds of pages with zip's in 9****]
> >> Arizona 85701 [hundreds of pages with zip's in 855**]
> >>
> >>Note that the zip codes are *NOT* sequential.
> >
> >Again, read "One Big Trend..." and let me know what you think. I think it
> >describes exactly the problem that we see.
> >
> >For now, the solution that works for me is to seriously up the value of
> >"targrows" in analyze.c. It makes it take longer, and while the stats are
> >not "correct" because they are not designed to detect these sorts of
> >patterns, a larger sample allows them to be "less wrong" enough to give a
> >better hint to the planner.
> >
> >
> >
> >
>
> Regards,
> Oleg
> _____________________________________________________________
> Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
> Sternberg Astronomical Institute, Moscow University (Russia)
> Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
> phone: +007(095)939-16-83, +007(095)939-23-83
>
>

--
Jim C. Nasby, Database Consultant decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"


From: Tzahi Fadida <tzahi_ml(at)myrealbox(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-14 03:13:41
Message-ID: 016b01c51243$3599ae20$0b00a8c0@llord
Lists: pgsql-hackers

Just my 2 cents.
I am not a super statistics guy, but besides increasing the sample size
and assuming things about the distribution, I understand you want to get
more information about what distribution the data actually follows.
Usually the problem with these things is that the data needs to be sorted
on the index key, and it can take a while, at least for the one time you
want to find out what the distribution is. Example: for cumulative
distributions you need to first sort the data (I am talking about scalars,
but other key types can probably work) and run it sequentially through a
KS (Kolmogorov-Smirnov) test. (There are other tests, but this one is good
for general cases.) The test can be run against all kinds of cumulative
distributions (normals, etc.). You then get a feel for how close the data
is to the selected distribution, with a parameter that can be rejected at
0.01, 0.05, 0.1, etc.
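
As a rough illustration only: the statistic itself is just the largest gap
between the empirical CDF and the reference CDF, which can be written
directly in SQL. This sketch needs window functions (so a much newer
server than 8.0), uses the rt1.zipl column from this thread, and compares
against a uniform reference, which is not what ANALYZE does today.

  SELECT max(abs(empirical_cdf - uniform_cdf)) AS ks_statistic
    FROM (SELECT cume_dist() OVER (ORDER BY zipl) AS empirical_cdf,
                 (zipl - min(zipl) OVER ())::float8
                   / nullif(max(zipl) OVER () - min(zipl) OVER (), 0)
                   AS uniform_cdf
            FROM rt1
           WHERE zipl IS NOT NULL) AS t;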

Anyway, it can be done; however, I am not sure how much better it is than
just plain histograms with random() and a uniform distribution, or than
simply increasing the sample size and being done with it. Again, I am
talking about the general/common cases.

Regards,
tzahi.

> -----Original Message-----
> From: pgsql-hackers-owner(at)postgresql(dot)org
> [mailto:pgsql-hackers-owner(at)postgresql(dot)org] On Behalf Of
> pgsql(at)mohawksoft(dot)com
> Sent: Wednesday, February 09, 2005 3:46 PM
> To: Ron Mayer
> Cc: Mark Kirkwood; Tom Lane; Ron Mayer; pgsql-hackers(at)postgresql(dot)org
> Subject: Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)
>
>
> I wrote a message caled "One Big trend vs multiple smaller
> trends in table statistics" that, I think, explains what
> we've been seeing.
>
>
> > pgsql(at)mohawksoft(dot)com wrote:
> >>
> >> In this case, the behavior observed could be changed by
> altering the
> >> sample size for a table. I submit that an arbitrary fixed
> sample size
> >> is not a good base for the analyzer, but that the sample
> size should
> >> be based on the size of the table or some calculation of its
> >> deviation.
> >>
> >
> > Mark,
> >
> > Do you have any evidence that the Sample Size had anything
> to do with
> > the performance problem you're seeing?
>
> Sample size is only a bandaid for the issue, however, more
> samples always provide more information.
>
>
> >
> > I also do a lot with the complete Census/TIGER database.
> >
> > Every problem I have with the optimizer comes down to the fact that
> > the data is loaded (and ordered on disk) by State/County
> FIPS codes,
> > and then queried by zip-code or by city name. Like this:
> >
> > Alabama 36101 [hundreds of pages with zip's in 36***]
> > Alaska 99686 [hundreds of pages with zip's in 9****]
> > Arizona 85701 [hundreds of pages with zip's in 855**]
> >
> > Note that the zip codes are *NOT* sequential.
>
> Again, read "One Big Trend..." and let me know what you
> think. I think it describes exactly the problem that we see.
>
> For now, the solution that works for me is to seriously up
> the value of "targrows" in analyze.c. It makes it take
> longer, and while the stats are not "correct" because they
> are not designed to detect these sorts of patterns, a larger
> sample allows them to be "less wrong" enough to give a better
> hint to the planner.
>
>
>
>
>


From: pgsql(at)mohawksoft(dot)com
To: "Oleg Bartunov" <oleg(at)sai(dot)msu(dot)su>
Cc: "Ron Mayer" <ron(at)cheapcomplexdevices(dot)com>, "Mark Kirkwood" <markir(at)coretech(dot)co(dot)nz>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Ron Mayer" <rm_pg(at)cheapcomplexdevices(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-14 14:11:10
Message-ID: 16718.24.91.171.78.1108390270.squirrel@mail.mohawksoft.com
Lists: pgsql-hackers

> Probably off-topic, but I think it's worth to see what astronomers are
> doing with their very big spatial databases. For example, we are working
> with more than 500,000,000 rows catalog and we use some special
> transformation
> of coordinates to integer numbers with preserving objects closeness.
> I hope we could show postgresql is good enough to be used in astronomy
> for very big catalogs. Currently, MS SQL is in use.
> See http://www.sdss.jhu.edu/htm/ for details. We use another technique.

You know, I don't think a lot of people "get" the issues I was describing,
or maybe they don't believe it, I don't know. But I think it would be a
useful contrib project to create an analyze_special('table', 'column',
'method') function that does a better job of calculating the stats for
tables that contain multiple trend waveforms. A separate function will
probably work well, as the trends within the data probably only apply to
specific rows.

It's interesting, because I don't think it needs to calculate a perfect
representation of the data so much as give the optimizer a better clue to
its nature.

When I get the time (or can get someone to pay me to do it) I'm going to
try it.

>
>
> Oleg
> On Wed, 9 Feb 2005 pgsql(at)mohawksoft(dot)com wrote:
>
>> I wrote a message caled "One Big trend vs multiple smaller trends in
>> table
>> statistics" that, I think, explains what we've been seeing.
>>
>>
>>> pgsql(at)mohawksoft(dot)com wrote:
>>>>
>>>> In this case, the behavior observed could be changed by altering the
>>>> sample size for a table. I submit that an arbitrary fixed sample size
>>>> is
>>>> not a good base for the analyzer, but that the sample size should be
>>>> based
>>>> on the size of the table or some calculation of its deviation.
>>>>
>>>
>>> Mark,
>>>
>>> Do you have any evidence that the Sample Size had anything to do
>>> with the performance problem you're seeing?
>>
>> Sample size is only a bandaid for the issue, however, more samples
>> always
>> provide more information.
>>
>>
>>>
>>> I also do a lot with the complete Census/TIGER database.
>>>
>>> Every problem I have with the optimizer comes down to the
>>> fact that the data is loaded (and ordered on disk) by
>>> State/County FIPS codes, and then queried by zip-code
>>> or by city name. Like this:
>>>
>>> Alabama 36101 [hundreds of pages with zip's in 36***]
>>> Alaska 99686 [hundreds of pages with zip's in 9****]
>>> Arizona 85701 [hundreds of pages with zip's in 855**]
>>>
>>> Note that the zip codes are *NOT* sequential.
>>
>> Again, read "One Big Trend..." and let me know what you think. I think
>> it
>> describes exactly the problem that we see.
>>
>> For now, the solution that works for me is to seriously up the value of
>> "targrows" in analyze.c. It makes it take longer, and while the stats
>> are
>> not "correct" because they are not designed to detect these sorts of
>> patterns, a larger sample allows them to be "less wrong" enough to give
>> a
>> better hint to the planner.
>>
>>
>>
>>
>
> Regards,
> Oleg
> _____________________________________________________________
> Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
> Sternberg Astronomical Institute, Moscow University (Russia)
> Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
> phone: +007(095)939-16-83, +007(095)939-23-83
>
>


From: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
To: pgsql(at)mohawksoft(dot)com
Cc: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-14 17:55:38
Message-ID: 4210E61A.7090503@cheapcomplexdevices.com
Lists: pgsql-hackers

pgsql(at)mohawksoft(dot)com wrote:
>
> You know, I don't think a lot of people "get" the issues I was describing,
> or maybe they don't believe it, I don't know, but, I think that it would
> be a useful contrib project to create an 'analyze_special('table',
> 'column', 'method')' function that does a better job at calculating the
> stats for table that contain multiple trend waveforms. A separate function
> will probably work well as the trends within the data probably only apply
> to specific rows.

I've done something similar, but simpler for the Census/TIGER data.

If you loaded each TIGER file sequentially, like I did, the data
was all grouped by county when it was loaded - so basically all
the geographical columns (zip, county, state, census-tract) are
actually grouped tightly on disk -- though ANALYZE can't see this
because they're not strictly ascending or descending.

Since I observed that the geospatial columns were all
clustered pretty well, I merely set the correlation
value to the same fairly large value for all the
geographic columns, with a bunch of statements like this:

update pg_statistic
set stanumbers3[1] = 0.8
where starelid = 31412043
and staattnum=3;

Instead of a complicated analyze function, how about just
letting the user "tell" the optimizer that a column is
clustered well, with a function like:

   force_correlation_stat('schema', 'table', 'column', 'value')

That would actually work well for your data. Since you
know your distinct values lie on a relatively small
number of pages, if you merely did:

   force_correlation_stat('tiger','rt1','zipl',0.8);
   force_correlation_stat('tiger','rt1','statel',0.8);
   force_correlation_stat('tiger','rt1','countyl',0.8);

the optimizer would then see that not many disk
pages would need to be hit for a single zip code.
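
A minimal sketch of what such a function could look like, assuming
superuser rights and assuming the correlation landed in statistic slot 3
the way ANALYZE normally leaves it (the next ANALYZE of the table will
simply overwrite the value again):

  CREATE OR REPLACE FUNCTION force_correlation_stat(text, text, text, real)
  RETURNS void AS $$
      UPDATE pg_statistic s
         SET stanumbers3[1] = $4
        FROM pg_namespace n, pg_class c, pg_attribute a
       WHERE n.nspname      = $1
         AND c.relnamespace = n.oid
         AND c.relname      = $2
         AND a.attrelid     = c.oid
         AND a.attname      = $3
         AND s.starelid     = c.oid
         AND s.staattnum    = a.attnum
         AND s.stakind3     = 3;   -- 3 = STATISTIC_KIND_CORRELATION
  $$ LANGUAGE sql;

  -- e.g.: SELECT force_correlation_stat('tiger', 'rt1', 'zipl', 0.8);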

> It's interesting, because I don't think it needs to calculate a perfect
> representation of the data so much as better clue to its nature for the
> optimizer.

Indeed. Using the very arbitrary number "0.8" for the
correlation, for all the geographic-related columns in the
tiger data, the optimizer guessed a good plan almost every
time on my company's 200GB geographical database.

> When I get the time (or can get someone to pay me to do it) I'm going to
> try it.

I still suspect that the correct way to do it would not be
to use the single "correlation", but 2 stats - one for estimating
how sequential/random accesses would be; and one for estimating
the number of pages that would be hit. I think the existing
correlation does well for the first estimate; but for many data
sets, poorly for the second type.

If you want to start a contrib project that looks into additional
stats that may help, I might be interested.

Ron


From: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>
To: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
Cc: pgsql(at)mohawksoft(dot)com, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-16 02:00:00
Message-ID: 20050216020000.GQ52357@decibel.org
Lists: pgsql-hackers

On Mon, Feb 14, 2005 at 09:55:38AM -0800, Ron Mayer wrote:

> I still suspect that the correct way to do it would not be
> to use the single "correlation", but 2 stats - one for estimating
> how sequential/random accesses would be; and one for estimating
> the number of pages that would be hit. I think the existing
> correlation does well for the first estimate; but for many data
> sets, poorly for the second type.

Should this be made a TODO? Is there some way we can estimate how much
this would help without actually building it?
--
Jim C. Nasby, Database Consultant decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>
Cc: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>, pgsql(at)mohawksoft(dot)com, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-20 19:36:43
Message-ID: 200502201936.j1KJahU15443@candle.pha.pa.us
Lists: pgsql-hackers

Jim C. Nasby wrote:
> On Mon, Feb 14, 2005 at 09:55:38AM -0800, Ron Mayer wrote:
>
> > I still suspect that the correct way to do it would not be
> > to use the single "correlation", but 2 stats - one for estimating
> > how sequential/random accesses would be; and one for estimating
> > the number of pages that would be hit. I think the existing
> > correlation does well for the first estimate; but for many data
> > sets, poorly for the second type.
>
> Should this be made a TODO? Is there some way we can estimate how much
> this would help without actually building it?

I guess I am confused how we would actually do that or if it is
possible.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>, Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>, pgsql(at)mohawksoft(dot)com, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-20 20:30:41
Message-ID: 4218F371.90905@coretech.co.nz
Lists: pgsql-hackers

Bruce Momjian wrote:
> Jim C. Nasby wrote:
>
>>On Mon, Feb 14, 2005 at 09:55:38AM -0800, Ron Mayer wrote:
>>
>>
>>>I still suspect that the correct way to do it would not be
>>>to use the single "correlation", but 2 stats - one for estimating
>>>how sequential/random accesses would be; and one for estimating
>>>the number of pages that would be hit. I think the existing
>>>correlation does well for the first estimate; but for many data
>>>sets, poorly for the second type.
>>
>>
>>Should this be made a TODO? Is there some way we can estimate how much
>>this would help without actually building it?
>
>
> I guess I am confused how we would actually do that or if it is
> possible.
>
I spent a while on the web looking for some known way to calculate
"local" correlation or "clumping" in some manner analogous to how we do
correlation currently. As yet I have only seen really specialized
examples that were tangentially relevant. We need a pet statistician to ask.
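
For what it's worth, a crude "clumping" number can be computed by hand
just to eyeball the problem. This is only a sketch (a full table scan, and
the ctid-through-point cast to get the block number is a hack, not
something ANALYZE could rely on): it counts how many distinct heap pages
each zip code touches and compares the average with the table's total page
count.

  SELECT avg(pages_per_zip) AS avg_pages_per_zip,
         (SELECT relpages FROM pg_class WHERE relname = 'rt1') AS total_pages
    FROM (SELECT zipl,
                 count(DISTINCT (ctid::text::point)[0]) AS pages_per_zip
            FROM rt1
           WHERE zipl IS NOT NULL
           GROUP BY zipl) AS z;

A tiny average next to a huge page count is the "clumped but uncorrelated"
case the current statistic misses.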

regards

Mark


From: pgsql(at)mohawksoft(dot)com
To: "Bruce Momjian" <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>, "Ron Mayer" <rm_pg(at)cheapcomplexdevices(dot)com>, pgsql(at)mohawksoft(dot)com, "Oleg Bartunov" <oleg(at)sai(dot)msu(dot)su>, "Mark Kirkwood" <markir(at)coretech(dot)co(dot)nz>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-21 05:15:22
Message-ID: 16508.24.91.171.78.1108962922.squirrel@mail.mohawksoft.com
Lists: pgsql-hackers

> Jim C. Nasby wrote:
>> On Mon, Feb 14, 2005 at 09:55:38AM -0800, Ron Mayer wrote:
>>
>> > I still suspect that the correct way to do it would not be
>> > to use the single "correlation", but 2 stats - one for estimating
>> > how sequential/random accesses would be; and one for estimating
>> > the number of pages that would be hit. I think the existing
>> > correlation does well for the first estimate; but for many data
>> > sets, poorly for the second type.
>>
>> Should this be made a TODO? Is there some way we can estimate how much
>> this would help without actually building it?
>
> I guess I am confused how we would actually do that or if it is
> possible.

Bruce, we didn't get much time to talk at Linux world (It was nice to meet
you).

I'm not sure how you would go about it, but in my post "One big trend .."
(I don't even remember the exact title anymore), I talk about tables that
are physically sorted on multiple keys, for example addresses:

streetname, typename, state, zip.

maple, st, ma, 02186
maple, st, ma, 02186
maple, rd, ma, 02186
maple, ave, ma, 02186
maple, st, me, ??????

Assuming the table is physically sorted by state, town (zip), streetname,
and streettype.

If one were to type:

select * from locations where streetname = 'maple';

The planner's analysis of that query improperly minimizes the correlation
of the street name, because it only looks at the table as a whole.
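
To make the setup concrete, that kind of physical ordering can be produced
like this; the locations table, its columns, and the index name are just
the example above, not real TIGER objects, and CLUSTER is written in the
old "indexname ON tablename" form that 8.0 uses:

  CREATE INDEX locations_order_idx
      ON locations (state, zip, streetname, streettype);
  CLUSTER locations_order_idx ON locations;
  ANALYZE locations;
  -- pg_stats.correlation for streetname alone now comes out low, even
  -- though all the 'maple' rows for any one town sit on a handful of
  -- adjacent pages.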


From: Sailesh Krishnamurthy <sailesh(at)cs(dot)berkeley(dot)edu>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-21 16:51:10
Message-ID: mjqzmxxc0w1.fsf@drones.CS.Berkeley.EDU
Lists: pgsql-hackers


Sounds a bit like multi-dimensional clustering ...

http://www.research.ibm.com/mdc/

After the ARC experience though ...

--
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh