Re: tweaking NTUP_PER_BUCKET

From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: tweaking NTUP_PER_BUCKET
Date: 2014-07-19 22:12:56
Message-ID: 53CAED68.4090104@fuzzy.cz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 19.7.2014 23:07, Tomas Vondra wrote:
> On 19.7.2014 20:28, Tomas Vondra wrote:
> For the first case, a WARNING at the end of estimate_hash_bucketsize
> says this:
>
> WARNING: nbuckets=8388608.00 estfract=0.000001
> WARNING: nbuckets=65536.00 estfract=0.000267
>
> There are 4.3M rows in the big_table, so it says ~4 tuples (selectivity
>> = 1e-6), and ~10 tuples/bucket for the small_table (42k rows).
>
> For the second case, I get this:
>
> WARNING: nbuckets=16777216.00 estfract=0.000001
> WARNING: nbuckets=262144.00 estfract=0.000100
>
> i.e. ~11 tuples/bucket for big_table and ~20 tuples for small_table.
>
> That sounds really strange, because small_table in the second test case
> is almost perfectly unique. And in the first test case, there are only
> very few keys with >2 occurences.

FWIW, the "problem" seems to be this:

/*
* Adjust estimated bucketsize upward to account for skewed
* distribution. */

if (avgfreq > 0.0 && mcvfreq > avgfreq)
estfract *= mcvfreq / avgfreq;

Which is explained like in the estimate_hash_bucketsize comment like this:

* be nonuniformly occupied. If the other relation in the join has a key
* distribution similar to this one's, then the most-loaded buckets are
* exactly those that will be probed most often. Therefore, the "average"
* bucket size for costing purposes should really be taken as something *
close to the "worst case" bucket size. We try to estimate this by
* adjusting the fraction if there are too few distinct data values, and
* then scaling up by the ratio of the most common value's frequency to
* the average frequency.

The problem is that the two testcases don't behave like this, i.e. the
other relation does not have similar distribution (because there the
join key is unique). Actually, I wonder how frequently that happens and
I wouldn't be surprised if it was quite rare.

So maybe we shouldn't scale it the way we scale it now. For example we
could do this:

if (avgfreq > 0.0 && mcvfreq > avgfreq)
estfract *= sqrt(mcvfreq / avgfreq);

Or instead of using the first MCV frequency (i.e. the frequency of the
most common value, which causes the unexpectedly hight tuple/bucket
values), use an average or median of the MCV frequenfies.

Either of these three changes "fixed" the first test case, some fixed
the second one. However it's pure guesswork, and I have how many plans
will be hit negatively by these changes.

What I think might work better is lowering the cost of searching small
hash tables, when the hash table fits into L1/L2... caches. The table I
posted in the first message in this thread illustrates that:

kB NTUP=1 NTUP=2 NTUP=4 NTUP=5 NTUP=9 NTUP=10
1407 6753 7218 7890 9324 9488 12574
7032 9417 11586 17417 15820 25732 25402
35157 13179 17101 27225 24763 43323 43175
70313 14203 18475 29354 26690 46840 46676
175782 14319 17458 25325 34542 37342 60949
351563 16297 19928 29124 43364 43232 70014
703125 19027 23125 33610 50283 50165 81398

So a small hash table (~1.4MB), fitting into L2 (measured on CPU with
~4MB cache) is influenced much less than large tables. I don't know
whether we should detect the CPU cache size somehow, or whether we
should assume some conservative size (say, 4MB sounds OK), or what. But
I think something like this might work

if (hash_size < 4MB) {
/* take in only 10% of the difference */
estfract += (mcvfreq / avgfreq) * estfract * 0.1;
} else if (hash_size > 16MB) {
estfract *= (mcvfreq / avgfreq);
} else {
/* linear approximation between 10 and 100% */
estfract += (mcvfreq / avgfreq) * estfract
* (0.1 + 0.9 * (hash_size - 4MB) / 12MB);
}

Or maybe not, I'm not really sure. The problem is that the two test
cases really don't match the assumption that the outer relation has the
same distribution.

regards
Tomas

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Gavin Flower 2014-07-20 00:08:50 Re: Proposal for updating src/timezone
Previous Message Tomas Vondra 2014-07-19 21:07:41 Re: tweaking NTUP_PER_BUCKET