From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Nathan Boley" <npboley(at)gmail(dot)com>,"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Josh Berkus" <josh(at)agliodbs(dot)com>, "jd(at)commandprompt(dot)com" <jd(at)commandprompt(dot)com>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Vladimir Sitnikov" <sitnikov(dot)vladimir(at)gmail(dot)com>, "Greg Smith" <gsmith(at)gregsmith(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: benchmarking the query planner
Date: 2008-12-12 16:25:10
Message-ID: 49423C06.EE98.0025.0@wicourts.gov
Lists: pgsql-hackers

>>> "Nathan Boley" <npboley(at)gmail(dot)com> wrote:
> Can anyone suggest a good data set to test this sort of question on?

Where we have the biggest problem with bad estimates is on complex
searches involving many joins, where the main criterion limiting the
result set is a name. The estimate based on the histogram is often
very low (e.g., 2) when the actual result set is several hundred rows.
While several hundred rows is far short of 1% of the table, the best
plan for a result set of that size is very different from the best
plan for two rows.
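
As a rough sketch of how the mismatch shows up (the table names below
are invented for illustration; only the "searchName" column, described
further down, is from our actual schema), comparing estimated to
actual rows in EXPLAIN ANALYZE output for one of these searches makes
the problem visible:

  -- Hypothetical query shape; table names are made up.
  EXPLAIN ANALYZE
  SELECT c.case_no, p.full_name
    FROM party p
    JOIN case_party cp ON cp.party_id = p.party_id
    JOIN court_case c ON c.case_id = cp.case_id
   WHERE p."searchName" LIKE 'SMITHJOHN%';
  -- The estimate for the LIKE condition drives the join strategy; when
  -- the planner expects 2 rows but several hundred come back, the plan
  -- chosen around that node can be far from optimal.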

Some numbers follow to give an idea of the shape of the data where
current techniques sometimes do poorly. We use a "searchName" column
which puts the name components from various columns into a canonical
format; this is what is indexed and searched. The search is usually a
LIKE with the high-order portion being six to ten characters followed
by the wildcard.
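
To make the query shape concrete (the table name and the literal value
are hypothetical; varchar_pattern_ops is just one way to get index
support for a prefix LIKE, not necessarily what we use):

  -- Hypothetical sketch of the indexed prefix search; use
  -- text_pattern_ops instead if the column is of type text.
  CREATE INDEX party_search_name
      ON party ("searchName" varchar_pattern_ops);

  SELECT *
    FROM party
   WHERE "searchName" LIKE 'GRITTNERKEV%';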

Total rows in table: 32,384,830

There are 9,958,969 distinct values.

There is one value present in over 1% of the rows, with 433,578 rows.

There are ten values present in over 0.1% of the rows:
433578
140398
135489
112088
64069
63158
44656
36499
35896
35819

The 100th most common value is present in 4,847 rows.

There are 186 values present in over 0.01% of the rows.

Based on my experience, we would need better estimates for ranges of
200 to 300 rows to improve our plans for the problem cases. I'd be
happy to have it scan the whole table during our nightly VACUUM
ANALYZE if that would get me statistics that improved the estimates to
that degree without a huge increase in plan time.
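
For what it's worth, the knob we have today is the per-column
statistics target, which increases the sample ANALYZE takes and the
number of MCVs and histogram buckets kept (table name hypothetical
again):

  -- Raise the statistics target for just this column; 1000 is the
  -- upper limit on current releases, if I recall correctly.
  ALTER TABLE party ALTER COLUMN "searchName" SET STATISTICS 1000;
  ANALYZE party;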

Which raises the issue: if we could get better statistics by scanning
the whole table, why not do that when VACUUM ANALYZE is run?

-Kevin
