Re: 8.4 optimization regression?

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
Cc: PGSQL Performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: 8.4 optimization regression?
Date: 2011-08-31 22:21:51
Message-ID: 5156.1314829311@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz> writes:
> [ assorted examples showing that commit
> 7f3eba30c9d622d1981b1368f2d79ba0999cdff2 has got problems ]

Thanks for the test cases. After playing with these for a bit I believe
I've figured out the error in my previous thinking. Clamping the
ndistinct value like that can improve matters when applied to the inside
relation of a semi or anti join, but in all other cases it's just wrong.
If you think about what is happening in eqjoinsel_inner with the patch,
we are reducing the ndistinct estimate for the join key column
proportionally to the selectivity of whatever baserel restrictions
apply. This then results in proportionally increasing the selectivity
number for the join condition --- in other words, we're more or less
cancelling out the effects of one or the other relation's base
restrictions. So that's pretty broken in general. The reason it is
important for semi/antijoin inner relations is that this is actually the
only way that restrictions applied to the inner rel get to impact the
join size estimate at all, since set_joinrel_size_estimates is not going
to factor the inner rel size into what it multiplies the join selectivity
against.

In short, I was mistakenly extrapolating from the observation that it
helped to hack the ndistinct estimate for a semijoin's inner rel, to
the conclusion that we should do that for all join input rels.

So, not only are you correct that we should revert the changes to
eqjoinsel_inner, but what's happening in eqjoinsel_semi is wrong too.
It should only be clamping the ndistinct value for the inner side.
And I think it needs to be taking that into account for the case where
it does have MCVs as well as the case where it doesn't.

So I'll go back to this with hopefully a clearer picture of what's
happening. Thanks again for the test cases.

regards, tom lane

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Robert Haas 2011-09-01 00:12:53 Re: issue related to logging facility of postgres
Previous Message Alan Hodgson 2011-08-31 20:05:16 Re: Slow performance