From: | "Nathan Boley" <npboley(at)gmail(dot)com> |
---|---|
To: | "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | "Vladimir Sitnikov" <sitnikov(dot)vladimir(at)gmail(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>, "jd(at)commandprompt(dot)com" <jd(at)commandprompt(dot)com>, "Josh Berkus" <josh(at)agliodbs(dot)com>, "Greg Smith" <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: benchmarking the query planner |
Date: | 2008-12-12 01:35:18 |
Message-ID: | 6fa3b6e20812111735s193b4d70lb2864abf279ede9b@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Thanks for the response.
>> Well, ISTM there is a profound difference. For scalarineqsel we care
>> about the total number of values in a bucket. For eqsel we care about
>> the total number of *distinct* values in each bucket
>
> Really?
>
Well, we would obviously also care about the total number of values in
the buckets if we were trying to use the histogram in eqsel.
Isn't a selectivity estimate of x = v as ( the number of values in v's
histogram bucket ) / ( number of distinct values in v's histogram
bucket ) pretty rational? Thats currently what we do for non-mcv
values, except that we look at ndistinct over the whole table instead
of individual histogram buckets.
>> IMHO, the whole idea of increasing mcv's seems a mistake. Why not use
>> the limited storage in pg_statistic to try and estimate the
>> selectivity for ranges of values rather than a single value?
>
> MCVs are useful for questions that are not related to ranges of values
> --- an example of something highly useful we do with them is to try to
> estimate the fraction of a column that satisfies a LIKE or regex
> pattern.
>
Good point. I guess I was responding to the eqsel benchmarks. I should
remember that I don't necessarily know all the places that mcv's are
used.
> In fact, as I was just pointing out to Bruce, we can compute them and
> do useful things with them for datatypes that don't have a defined sort
> order and so the whole concept of "range" is meaningless.
>
Another good point. But don't we have bigger stat problems for
datatypes without an ordering relation? IIRC, analyze doesn't even
fully compute the mcv list, as that would be N^2 in the sample size.
> Now, if your point is that it'd be more flexible to not tie the MCV list
> length to the histogram length, you're right.
No, my point is just the opposite. I think the two should be *more*
tightly linked. It seems that ( at least for eqsel and scalarineqsel )
mcv's should be the values that the histogram can't deal with
effectively. ie, there's no ordering relation, there are too many
values to fit into a histogram bucket, the histogram eqsel estimate
does an especially bad job for a relatively common value, etc. Even
now, if there are 100 histogram buckets then any values that occupy
more than 1% of the table will be mcv's regardless - why force a value
to be an mcv if it only occupies 0.1% of the table? That just bloats
pg_statistic and slows down joins unnecessarily.
I'll submit a patch for 8.5 and then, hopefully, some simple
benchmarks can make my point..
-Nathan
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2008-12-12 01:50:10 | Re: benchmarking the query planner |
Previous Message | Tom Lane | 2008-12-12 01:24:26 | Re: Mostly Harmless: Welcoming our C++ friends |