Josh Berkus <josh(at)agliodbs(dot)com> writes:
Overall, our formula is inherently conservative of n_distinct. That is, I
believe that it is actually computing the *smallest* number of distinct
values which would reasonably produce the given sample, rather than the
*median* one. This is contrary to the notes in analyze.c, which seem to
think that we're *overestimating* n_distinct.
Well, the notes are there because the early tests I ran on that formula
did show it overestimating n_distinct more often than not. Greg is
correct that this is inherently a hard problem :-(
I have nothing against adopting a different formula, if you can find
something with a comparable amount of math behind it ... but I fear
it'd only shift the failure cases around.