Re: Re: GROUP BY with reasonable timings in PLAN but unreasonable execution time

From: Clem Dickey <dickeycl(at)us(dot)ibm(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Re: GROUP BY with reasonable timings in PLAN but unreasonable execution time
Date: 2011-08-04 01:53:19
Message-ID: j1cu2g$1r76$1@news.hub.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On 08/03/2011 06:29 AM, Robert Haas wrote:
>> b. the Merge Join cost estimator did a poor job with the data it was given:
>>
>> In function eqjoinsel_inner there are two cases (1) ANALYZE data is
>> available for both sides of the join and (2) ANALYZE data is missing for one
>> or both sides. Due to the GROUP BY processing described above, ANALYZE data
>> was available for "t" but not for "SELECT * FROM t GROUP BY ...".
>>
>> The logic in that case is "use the column with the most distinct values" to
>> estimate selectivity. The default number of distinct values for a column
>> with no data (DEFAULT_NUM_DISTINCT) is 200. In my join the number of values
>> was:
>>
>> col in GROUP BY in table t
>> j 200 1
>> k 200 1
>> x 200 10
>> y 200 1000
>> z 200 30
>>
>> In 4 of the 5 columns the default value had more distinct values, and the
>> combined selectivity (chance that two arbitrary rows would have a join
>> match) was (1/200)^4 * 1/1000. Very small. The error is, IMO, that the code
>> does not distinguish known numbers from default numbers. A comment in the
>> code acknowledges this:

>
> I'm not sure I understand what you're getting at here, unless the idea
> is to make get_variable_numdistinct() somehow indicate to the caller
> whether it had to punt. That might be worth doing.

Yes, the first step is to make "punt" a separate indicator. The second
would be to make good use of that indicator. As it is now, with "punt"
being a possible data value, there two types of errors:

False negative (code treats DEFAULT_NUM_DISTINCT as ordinary case when
it is special):

I wanted eqjoinsel_inner() to treat "punt" specially: to use the value
from the known side of the JOIN when the other side is unknown. The
current behavior, although not ideal, is the expected use of a default
value.

False positive (code treats DEFAULT_NUM_DISTINCT as special case when it
is ordinary):

eqjoinsel_semi() and estimate_hash_bucketsize() treat
DEFAULT_NUM_DISTINCT specially. This behavior is less defensible than
false positive, since a valid numeric value is being re-used as a flag.

I suggest wrapping the value in a struct (to avoid accidental use) and
using macros for read access.

typedef struct {
double value; // negative means "unknown"
} num_distinct_t;

#define IS_NUM_DISTINCT_DEFINED(nd) ((nd).value >= 0)
#define NUM_DISTINCT_VALUE(nd) ((nd).value)

- Clem Dickey

P.S. Congratulations on displacing MySQL in Mac OS X Lion Server.

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Jayadevan M 2011-08-04 03:42:28 Re: Parameters for PostgreSQL
Previous Message Robert Klemme 2011-08-03 17:30:46 Re: Performance penalty when using WITH