Re: Query optimizer 8.0.1 (and 8.0)

From: pgsql(at)mohawksoft(dot)com
To: "Alvaro Herrera" <alvherre(at)dcc(dot)uchile(dot)cl>
Cc: "Bruno Wolff III" <bruno(at)wolff(dot)to>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Ron Mayer" <rm_pg(at)cheapcomplexdevices(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-07 23:29:18
Message-ID: 16769.24.91.171.78.1107818958.squirrel@mail.mohawksoft.com
Lists: pgsql-hackers

> On Mon, Feb 07, 2005 at 05:16:56PM -0500, pgsql(at)mohawksoft(dot)com wrote:
>> > On Mon, Feb 07, 2005 at 13:28:04 -0500,
>> >
>> > What you are saying here is that if you want more accurate
>> > statistics, you need to sample more rows. That is true. However,
>> > the size of the sample is essentially only dependent on the
>> > accuracy you need and not the size of the population, for large
>> > populations.
>> >
>> That's nonsense.
>
> Huh, have you studied any statistics?

To what aspects of "statistics" are you referring? I was not a math
major, no, but I did take the obligatory classes, along with algorithms
and so on. I've only worked in the industry for over 20 years.

I've worked with statistical analysis of data on multiple projects,
ranging from medical instruments to compression, encryption, and web-based
recommendation systems.

I assume "Huh, have you studied any statistics?" was a call for
qualifications. And yes, a real math major would be helpful in this
discussion, because clearly there is a disconnect.

The basic problem with a fixed sample is that it assumes a normal
distribution. If the data variation is evenly distributed across a set,
then a sample of sufficient size would be valid for almost any data set.
That isn't what I'm saying. If the data variation is NOT uniformly
distributed across the data set, the sample size has to be larger, because
there is "more" data.

I think I can explain with a visual.

I started my career as an electrical engineer and took an experimental
class called "computer science." Sure, it was a long time ago, but bear
with me.

When you look at a sine wave on an oscilloscope, you can see it clear as
day. When you look at music on the scope, you know there are many waves
there, but it is difficult to make heads or tails of it. (Use xmms or
winamp to see for yourself.) The waves change in frequency, amplitude, and
duration over a very large scale. That's why you use a spectrum analyzer
to go from the time domain to the frequency domain: in the frequency
domain, you can see the trends better.

This is the problem we are having. Currently, the analyze.c code assumes a
very regular data set, in which case almost any sample size would work
fine. What we see when we use it to analyze complex data sets is that it
can't characterize the data well.

The solution is either to completely change the statistics model to handle
complex and unpredictably changing trends, or to increase the sample size.
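
The second option is easy to demonstrate with the same made-up skewed
column from the sketch above: the distinct-value estimate keeps climbing
as the sample grows, where a regular column would have converged almost
immediately.

import random

# Continuing the hypothetical sketch: on the skewed column, watch the
# distinct-value estimate change as the sample grows.
random.seed(0)
N = 1_000_000
skewed = [min(int(random.paretovariate(1.0)), 100_000) for _ in range(N)]
actual = len(set(skewed))

for sample_size in (3_000, 30_000, 300_000):
    seen = len(set(random.sample(skewed, sample_size)))
    print(f"sample={sample_size:>7}: distinct seen={seen:>6} (actual {actual})")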
