Re: Optimizer improvements: to do or not to do?

From: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Subject: Re: Optimizer improvements: to do or not to do?
Date: 2006-09-13 21:24:24
Message-ID: 45087708.4060306@cheapcomplexdevices.com
Lists: pgsql-hackers

Gregory Stark wrote:
> Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com> writes:
>> ...vastly overestimate the number of pages ... because postgresql guesses
>> the correlation to be practically 0, despite the fact that the distinct
>> values for any given column are closely packed on a few pages.
>
> I think we need a serious statistics jock to pipe up with some standard
> metrics that do what we need. Otherwise we'll never have a solid footing for
> the predictions we make and will never know how much we can trust them.

Do we know if any such people participate/lurk on this list, or
if the conversation should go elsewhere?

> That said I'm now going to do exactly what I just said we should stop doing
> and brain storm about an ad-hoc metric that might help:
>
> I wonder if what we need is something like: sort the sampled values by value
> and count up the average number of distinct blocks per value. That might let
> us predict how many pages a fetch of a specific value would retrieve. Or
> perhaps we need a second histogram where the quantities are of distinct pages
> rather than total records.

Either of these sounds like it might be an improvement over correlation
itself for estimating the number of pages a scan would need to read. Would
it be relatively easy or hard for a programmer not too familiar with the
code to experiment with these ideas? Where would be a good place to look?
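
Just to convince myself I understand the first idea, here's a throwaway
sketch (Python, nothing to do with the backend code; all the names and
numbers are made up) run against a fake sample of (value, block) pairs
like the ones ANALYZE visits. It shows how such a metric would separate
the "closely packed but uncorrelated" case I complained about from a
genuinely scattered one:

import random
from collections import defaultdict

def avg_distinct_blocks_per_value(sample):
    # sample: iterable of (value, block_number) pairs
    blocks = defaultdict(set)
    for value, block in sample:
        blocks[value].add(block)
    return sum(len(b) for b in blocks.values()) / len(blocks)

ROWS, ROWS_PER_BLOCK, DISTINCT = 2000, 20, 20   # 100 blocks, 100 rows/value
ROWS_PER_VALUE = ROWS // DISTINCT

# "Clustered but uncorrelated": each value fills one contiguous run of rows,
# but the runs appear in arbitrary order, so plain correlation is ~0.
order = list(range(DISTINCT))
random.shuffle(order)
clustered = [(order[i // ROWS_PER_VALUE], i // ROWS_PER_BLOCK) for i in range(ROWS)]

# "Scattered": the same values interleaved evenly across every block.
scattered = [(i % DISTINCT, i // ROWS_PER_BLOCK) for i in range(ROWS)]

print(avg_distinct_blocks_per_value(clustered))   # 5.0   (a few pages per value)
print(avg_distinct_blocks_per_value(scattered))   # 100.0 (every page per value)

Correlation alone would rate the first layout about as badly as the second,
even though a fetch of any single value only touches a handful of pages.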

> We might also need a separate "average number of n-block spans per value"
> metric to predict how sequential the i/o will be in addition to how many pages
> will be fetched.

I'm wildly guessing that the number of pages itself is a bigger factor
than the sequential/random nature. For example, if I do a query for data
from a particular small city, I'd only need dozens of pages, not many
thousands.
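
Back-of-envelope, assuming random I/O costs roughly 4x a sequential read
(in the spirit of the default random_page_cost = 4; the real cost model is
more involved than this), the page count seems to swamp the access pattern:

SEQ_COST, RANDOM_COST = 1.0, 4.0     # assumed ratio only, not the real model

def fetch_cost(pages, sequential):
    return pages * (SEQ_COST if sequential else RANDOM_COST)

print(fetch_cost(40, sequential=False))   #  160.0 -- dozens of pages, random
print(fetch_cost(40, sequential=True))    #   40.0 -- same pages, sequential: 4x better
print(fetch_cost(4000, sequential=True))  # 4000.0 -- thousands of pages: 100x worse

Getting the page count right buys two orders of magnitude here; getting the
seek pattern right buys a factor of four.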

OTOH, it'd be neat to know whether that guess actually holds up. Is there
any good way to make something like explain analyze show both the expected
and actual # of pages and # of seeks?
