Re: Query optimizer 8.0.1 (and 8.0)

From: pgsql(at)mohawksoft(dot)com
To: "Bruno Wolff III" <bruno(at)wolff(dot)to>
Cc: pgsql(at)mohawksoft(dot)com, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Ron Mayer" <rm_pg(at)cheapcomplexdevices(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-07 22:16:56
Message-ID: 16759.24.91.171.78.1107814616.squirrel@mail.mohawksoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On Mon, Feb 07, 2005 at 13:28:04 -0500,
>
> What you are saying here is that if you want more accurate statistics, you
> need to sample more rows. That is true. However, the size of the sample
> is essentially only dependent on the accuracy you need and not the size
> of the population, for large populations.
>
That's nonsense.

If your total data size is 100 elements in a set, then a sample size of
100 elements will cover 100% of your data.

If your total data size is 10,000 elements in a set, the a sample size of
100 elements will cover 1% of your data.

In the case of the TIGER database, the base of 100 samples is about .002%
0f the data is sampled. Think about that, that is an average of 1 sample
about every 50,000 records. You could have substantial but irregular
trends in the data that may never get detected, and this is EXACTLY what
we see. If we increase the sample size (targrows), the statistics suddenly
work better.

For instance, look at the data below.

The first analyze / select from pg_stats is with an analyze of 3000
samples. The zipl and zipr columns get calculated poorly and can cause the
planner to use a table scan instead of an index scan.

The second analyze / select from the pg_stats is with an analyse of 10000
samples. The zipl and zipr n_distinct values are still off by a factor of
10, but close enough for the planner to deal.

If the premise is that samples size doesn't make a difference, I think
we've proved that this is not true.

tiger=# analyze verbose rt1;
INFO: analyzing "public.rt1"
INFO: "rt1": scanned 3000 of 1527360 pages, containing 90978 live rows
and 0 dead rows; 3000 rows in sample, 46318719 estimated total rows
ANALYZE

tiger=# select * from pg_stats where tablename = 'rt1' and attname like
'zip%';
schemaname | tablename | attname | null_frac | avg_width | n_distinct |
most_common_vals |
most_common_freqs
| histogram_bounds
| correlation
------------+-----------+---------+-----------+-----------+------------+---------------------------------------------------------+------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------+-------------
public | rt1 | zipl | 0.672 | 4 | 960 |
{76240,52601,55746,71730,74604,92705,93117,95818} |
{0.00166667,0.00133333,0.00133333,0.00133333,0.00133333,0.00133333,0.00133333,0.00133333}
|
{1085,16652,28206,33412,43147,49428,58801,68110,77515,91340,99006} |
-0.119519
public | rt1 | zipr | 0.677 | 4 | 960 |
{76240,52601,55746,71730,74604,78577,92705,93117,95818} |
{0.00166667,0.00133333,0.00133333,0.00133333,0.00133333,0.00133333,0.00133333,0.00133333,0.00133333}
| {962,15613,28572,33606,43545,49428,60423,68064,77040,91340,99006} |
-0.104158
(2 rows)

Now this:
tiger=# analyze verbose rt1;
INFO: analyzing "public.rt1"
INFO: "rt1": scanned 10000 of 1527360 pages, containing 303419 live rows
and 0 dead rows; 10000 rows in sample, 46343004 estimated total rows
ANALYZE

tiger=# select * from pg_stats where tablename = 'rt1' and attname like
'zip%';
schemaname | tablename | attname | null_frac | avg_width | n_distinct |
most_common_vals |
most_common_freqs |
histogram_bounds | correlation
------------+-----------+---------+-----------+-----------+------------+---------------------------------------------------------------+-------------------------------------------------------------------------+-------------------------------------------------------------------+-------------
public | rt1 | zipl | 0.6807 | 4 | 2942 |
{61832,13090,17404,30907,31204,45342,47714,63050,80918,93726} |
{0.0008,0.0006,0.0005,0.0005,0.0005,0.0005,0.0005,0.0005,0.0005,0.0005} |
{654,15018,28208,33870,43006,49008,59741,68803,78640,92105,99687} |
-0.137744
public | rt1 | zipr | 0.684 | 4 | 2921 |
{13090,61832,30907,31204,45342,47714,63050,70122,80918,93726} |
{0.0006,0.0006,0.0005,0.0005,0.0005,0.0005,0.0005,0.0005,0.0005,0.0005} |
{731,14824,27871,33324,42276,48895,58401,68338,78575,92105,99654} |
-0.140663
(2 rows)

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2005-02-07 22:31:33 Re: Query optimizer 8.0.1 (and 8.0)
Previous Message Joshua D. Drake 2005-02-07 22:02:54 PHP/PDO Database Abstraction Layer