Re: benchmarking the query planner

From: "Nathan Boley" <npboley(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Vladimir Sitnikov" <sitnikov(dot)vladimir(at)gmail(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>, "jd(at)commandprompt(dot)com" <jd(at)commandprompt(dot)com>, "Josh Berkus" <josh(at)agliodbs(dot)com>, "Greg Smith" <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: benchmarking the query planner
Date: 2008-12-12 01:35:18
Message-ID: 6fa3b6e20812111735s193b4d70lb2864abf279ede9b@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thanks for the response.

>> Well, ISTM there is a profound difference. For scalarineqsel we care
>> about the total number of values in a bucket. For eqsel we care about
>> the total number of *distinct* values in each bucket
>
> Really?
>

Well, we would obviously also care about the total number of values in
the buckets if we were trying to use the histogram in eqsel.

Isn't a selectivity estimate of x = v as ( the number of values in v's
histogram bucket ) / ( number of distinct values in v's histogram
bucket ) pretty rational? Thats currently what we do for non-mcv
values, except that we look at ndistinct over the whole table instead
of individual histogram buckets.

>> IMHO, the whole idea of increasing mcv's seems a mistake. Why not use
>> the limited storage in pg_statistic to try and estimate the
>> selectivity for ranges of values rather than a single value?
>
> MCVs are useful for questions that are not related to ranges of values
> --- an example of something highly useful we do with them is to try to
> estimate the fraction of a column that satisfies a LIKE or regex
> pattern.
>

Good point. I guess I was responding to the eqsel benchmarks. I should
remember that I don't necessarily know all the places that mcv's are
used.

> In fact, as I was just pointing out to Bruce, we can compute them and
> do useful things with them for datatypes that don't have a defined sort
> order and so the whole concept of "range" is meaningless.
>

Another good point. But don't we have bigger stat problems for
datatypes without an ordering relation? IIRC, analyze doesn't even
fully compute the mcv list, as that would be N^2 in the sample size.

> Now, if your point is that it'd be more flexible to not tie the MCV list
> length to the histogram length, you're right.

No, my point is just the opposite. I think the two should be *more*
tightly linked. It seems that ( at least for eqsel and scalarineqsel )
mcv's should be the values that the histogram can't deal with
effectively. ie, there's no ordering relation, there are too many
values to fit into a histogram bucket, the histogram eqsel estimate
does an especially bad job for a relatively common value, etc. Even
now, if there are 100 histogram buckets then any values that occupy
more than 1% of the table will be mcv's regardless - why force a value
to be an mcv if it only occupies 0.1% of the table? That just bloats
pg_statistic and slows down joins unnecessarily.

I'll submit a patch for 8.5 and then, hopefully, some simple
benchmarks can make my point..

-Nathan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2008-12-12 01:50:10 Re: benchmarking the query planner
Previous Message Tom Lane 2008-12-12 01:24:26 Re: Mostly Harmless: Welcoming our C++ friends