Re: B-Tree support function number 3 (strxfrm() optimization)

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Marti Raudsepp <marti(at)juffo(dot)org>, Stephen Frost <sfrost(at)snowman(dot)net>, Greg Stark <stark(at)mit(dot)edu>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thom Brown <thom(at)linux(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: B-Tree support function number 3 (strxfrm() optimization)
Date: 2014-09-02 20:41:29
Message-ID: CAM3SWZSxhpL56AEXSzWFVVyxpxT8cBQZVtF7nUgS0tmH-g8Yxw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Sep 2, 2014 at 12:22 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> < Various points on style>

Okay, fair enough.

> n_distinct is a cardinality estimate, but AFAIK not using hyperloglog.
> What is different about that problem vs. this one that makes each
> method correct for its respective case? Or should analyze be using
> hyperloglog too?

HyperLogLog isn't sample-based - it's useful for streaming a set and
accurately tracking its cardinality with fixed overhead. I think the
first place it found use was in things like network switches (as an
aside: it's a pretty recent, trendy algorithm, but FWIW I think the
attention is deserved - it's really quite impressive). Whereas
../commands/analyze.c uses an "Estimate [of] the number of distinct
values using the estimator proposed by Haas and Stokes in IBM Research
Report RJ 10025", which is based on sampling (I think there are
reasons to be skeptical of sample-based cardinality estimators in
general). Right now, given the way ANALYZE does random sampling, we
probably save little to no actual I/O compared to just reading
everything. I certainly hope that doesn't stay true forever; Simon has
expressed interest in working on block-based sampling. n_distinct is
only one output that we need to produce as part of ANALYZE, of course.
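
To make the contrast concrete, here's a minimal HyperLogLog-style
sketch in C. It only illustrates the streaming idea - it isn't the
patch's implementation, and the hash, constants and names are all
mine:

/*
 * Minimal HyperLogLog-style sketch, for illustration only - not the
 * patch's implementation.  Hash each element, use the low HLL_B bits
 * to pick a register, and remember the greatest number of trailing
 * zero bits seen in the rest of the hash.  The cardinality estimate
 * is then alpha * m^2 / sum(2^-register[i]).
 */
#include <math.h>
#include <stdint.h>
#include <stdio.h>

#define HLL_B   10              /* 2^10 = 1024 registers */
#define HLL_M   (1 << HLL_B)

static uint8_t registers[HLL_M];

/* splitmix64-style finalizer; any decent 64-bit hash will do */
static uint64_t
hash64(uint64_t x)
{
    x += UINT64_C(0x9E3779B97F4A7C15);
    x = (x ^ (x >> 30)) * UINT64_C(0xBF58476D1CE4E5B9);
    x = (x ^ (x >> 27)) * UINT64_C(0x94D049BB133111EB);
    return x ^ (x >> 31);
}

static void
hll_add(uint64_t value)
{
    uint64_t    h = hash64(value);
    uint32_t    idx = h & (HLL_M - 1);  /* which register */
    uint64_t    rest = h >> HLL_B;
    uint8_t     rank = 1;

    /* rank is one more than the number of trailing zero bits */
    while (rest != 0 && (rest & 1) == 0)
    {
        rank++;
        rest >>= 1;
    }
    if (rest == 0)
        rank = 64 - HLL_B + 1;  /* all remaining bits were zero */

    if (rank > registers[idx])
        registers[idx] = rank;
}

static double
hll_estimate(void)
{
    double      alpha = 0.7213 / (1.0 + 1.079 / HLL_M);
    double      sum = 0.0;
    int         i;

    for (i = 0; i < HLL_M; i++)
        sum += pow(2.0, -(double) registers[i]);

    return alpha * HLL_M * HLL_M / sum;
}

int
main(void)
{
    uint64_t    i;

    /* observe every element of a 100,000-value set, in fixed memory */
    for (i = 0; i < 100000; i++)
        hll_add(i);
    printf("estimated cardinality: %.0f\n", hll_estimate());
    return 0;
}

The whole input is streamed through a fixed-size array of registers,
which is exactly what makes it a poor fit for ANALYZE's sampling but a
good fit for a sort that sees every tuple anyway.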

So, the way we estimate n_distinct is currently not very good, but
maybe that's inevitable with random sampling. I was surprised that you
didn't seem to consider it much of a problem in your pgCon talk on the
planner; I ran into it a couple of times back when I was a consultant,
though admittedly not all that often. The main practical benefit of
HLL is that it isn't going to give you an answer that's wildly wrong:
there are guarantees around the accuracy of its estimates (typically a
relative error of about 1.04/sqrt(m) for m registers), and they're
very good guarantees. Its main disadvantage is that it expects to
observe every element in the set, which in this instance is no
disadvantage at all.

> Is it the right decision to suppress the abbreviated-key optimization
> unconditionally on 32-bit systems and on Darwin? There's certainly
> more danger, on those platforms, that the optimization could fail to
> pay off. But it could also win big, if in fact the first character or
> two of the string is enough to distinguish most rows, or if Darwin
> improves their implementation in the future. If the other defenses
> against pathological cases in the patch are adequate, I would think
> it'd be OK to remove the hard-coded checks here and let those cases
> use the optimization or not according to its merits in particular
> cases. We'd want to look at what the impact of that is, of course,
> but if it's bad, maybe those other defenses aren't adequate anyway.

I'm not sure. Perhaps suppressing this on Darwin is a bad idea,
because no one is using Macs to run real database servers anyway:
Apple hasn't had a server product in years, and typically people only
use Postgres on their Macs for development. We might as well have
coverage of the new code for the benefit of Postgres hackers who favor
Apple machines. Or, to look at it another way, the optimization is so
beneficial that it's probably worth the risk, even for more marginal
cases.

8 primary weights (the leading 8 bytes, frequently isomorphic to the
first 8 Latin characters, regardless of whether they have
accents/diacritics or punctuation/whitespace) is twice as many as 4.
But every byte of space added to the abbreviated representation that
can resolve a comparison, I imagine, considerably reduces the number
of comparisons that can't be resolved without a tiebreak. Basically, 8
bytes is much more than twice as good as 4 bytes in terms of its
effect on the proportion of comparisons that are resolved with
abbreviated keys alone. Even so, I suspect it's still worth applying
the optimization with only 4.
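
To spell out what "resolved with abbreviated keys alone" means, here's
a rough sketch of the comparison step in C. The names are invented and
don't correspond to anything in the patch; the abbreviated key is just
the leading bytes of the strxfrm() blob, NUL-padded, and ties fall
back to an authoritative full comparison:

#include <string.h>

typedef struct SortItem
{
    char        abbrev[8];      /* leading strxfrm() bytes, NUL padded */
    const char *original;       /* full original string, for tie-breaks */
} SortItem;

static int
compare_items(const SortItem *a, const SortItem *b)
{
    int         cmp = memcmp(a->abbrev, b->abbrev, sizeof(a->abbrev));

    /*
     * Fast path: most comparisons are resolved here, using only the
     * abbreviated keys.  Each additional abbreviated byte makes this
     * path proportionally more likely to settle the comparison.
     */
    if (cmp != 0)
        return cmp;

    /*
     * Possible tie: the abbreviated keys are equal, but the underlying
     * strings may still differ, so do the authoritative comparison.
     */
    return strcoll(a->original, b->original);
}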

You've seen plenty of suggestions from me on assessing the
applicability of the optimization. Perhaps you have a few of your own.

> Does the lengthy comment in btsortsupport_worker() survive pgindent?

I'll look into it.

> Reading from an uninitialized byte could provide a valgrind warning
> even if it's harmless, I think.

That wouldn't be harmless - it would probably result in incorrect
answers in practice, and would certainly be unspecified behavior.
However, I'm not reading uninitialized bytes: I call memset() so that,
in the event of the final strxfrm() blob being less than 8 bytes
(which can happen even on glibc with en_US.UTF-8), the remaining bytes
are zeroed. It cannot be harmful to memcmp() every Datum byte if the
remaining bytes are always initialized to NUL.
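
In code terms (reusing the invented SortItem layout from the sketch
above, not the patch's actual function), building the abbreviated key
looks roughly like this:

static void
make_abbrev(SortItem *item, const char *source)
{
    char        blob[1024];     /* assume large enough for this sketch;
                                 * real code must retry on truncation */
    size_t      len;
    size_t      n;

    /* transform once; the blob may well be shorter than 8 bytes */
    len = strxfrm(blob, source, sizeof(blob));

    /* zero the whole abbreviation first, then copy at most 8 bytes */
    memset(item->abbrev, 0, sizeof(item->abbrev));
    n = len < sizeof(item->abbrev) ? len : sizeof(item->abbrev);
    memcpy(item->abbrev, blob, n);

    item->original = source;
}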

--
Peter Geoghegan
