Re: Risk Estimation WAS: Planner hints in Postgresql

From: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Risk Estimation WAS: Planner hints in Postgresql
Date: 2014-03-18 18:21:47
Message-ID: CAOeZVieVa+V-6NYzLLofG6uXfs9Y7N4cUzeoYrYKaQg23nohuw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Mar 18, 2014 at 11:43 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:

>
> > On Mon, Mar 17, 2014 at 8:47 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> Yeah. I would like to see the planner's cost estimates extended to
> >> include some sort of uncertainty estimate, whereupon risk-averse people
> >> could ask it to prefer low-uncertainty plans over high-uncertainty ones
> >> (the plans we typically choose for ORDER BY ... LIMIT queries being
> great
> >> examples of the latter). But it's a long way from wishing that to
> making
> >> it so. Right now it's not even clear (to me anyway) how we'd measure or
> >> model such uncertainty.
>
>

I have been thinking of some ways to have a risk estimate of each
selectivity that our planner gives. I think a way to do it is as follows:

One of the factors that leads to bad estimates is that the histogram of the
values of a column maintained by the planner gets old by time and the data
in the column changes. So, the histogram is no longer a quite accurate view
of the data and it leads to bad selectivity.

One thing we can try to do is to add a factor of error that we feel the
selectivity given can have. This allows us to factor in the probability
that the data changed and the estimate of the difference of the current
histogram and the histogram of the actual data currently present in the
column in the table.

We can use Central Limit Theorem (
http://en.wikipedia.org/wiki/Central_limit_theorem). Essentially, what the
theorem says is that given a distribution that has finite variance and
finite mean, we can take random independent samples from the data and
calculate the standard deviation and the mean of the sample. If we have
large enough number of samples and if we plot the mean and SD, they would
follow a normal distribution.

What is interesting is that this can allow us to predict the SD of a given
dataset from the curve and the SD should be directly proportional to the
deviation it has from the given planner histogram.

I am no mathematician hence its hard for me to explain. I think this link
[1] will be more helpful.

So, we can have a probability value for the random variable and that shall
model the confidence we have in our estimate.

I may be wrong in some parts but I hope I have been able to convey the
general idea.

If this idea develops, I shall be happy to work on this but my hands are
full in ROLLUPS right now, so for my part it shall take time. I just want
to float the idea and get a general feel about the idea right now.

Please let me know your comments and feedback on this.

Regards,

Atri

[1]: http://www.theriac.org/DeskReference/viewDocument.php?id=177
--
Regards,

Atri
*l'apprenant*

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2014-03-18 18:27:33 Re: Portability issues in shm_mq
Previous Message Bruce Momjian 2014-03-18 18:18:50 Re: pg_archivecleanup bug