Re: B-Tree support function number 3 (strxfrm() optimization)

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: B-Tree support function number 3 (strxfrm() optimization)
Date: 2014-09-11 21:46:43
Message-ID: CAM3SWZRg0EGCJqb8peQ2F72sgo3sRGbWDrRnFhtuPGkOsjb_zA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Sep 11, 2014 at 1:50 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I think I said pretty clearly that it was.

I agree that you did, but it wasn't clear exactly what factors you
were asking me to simulate. It still isn't. Do you want me to compare
the same two strings a million times in a loop, with both strcoll()
and memcmp()? Should I copy them into buffers to add NUL bytes? Or
should it be a new string each time, with a cache miss expected some
proportion of the time? These considerations might significantly
influence the outcome here, and one variation might be significantly
less fair than another. Tell me what to do in a little more detail,
and I'll do it (plus let you know what I think of it). I honestly
don't know what you expect.
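
For illustration, here's the shape of harness I could write - a
minimal sketch, where the iteration count, string contents, and locale
are all my own assumptions, and where an optimizing compiler may need
persuading not to fold the comparisons away:

#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define ITERATIONS 1000000

int
main(void)
{
	/* equal-length strings differing only in the final byte */
	const char *a = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab";
	const char *b = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac";
	size_t		len = strlen(a);
	char		buf1[64];
	char		buf2[64];
	volatile long sink = 0;
	clock_t		start;
	int			i;

	setlocale(LC_COLLATE, "en_US.UTF-8");

	/* variant 1: strcoll() on already NUL-terminated strings */
	start = clock();
	for (i = 0; i < ITERATIONS; i++)
		sink += strcoll(a, b);
	printf("strcoll:        %.3f s\n",
		   (double) (clock() - start) / CLOCKS_PER_SEC);

	/* variant 2: copy into buffers to add the NUL, then strcoll() */
	start = clock();
	for (i = 0; i < ITERATIONS; i++)
	{
		memcpy(buf1, a, len);
		buf1[len] = '\0';
		memcpy(buf2, b, len);
		buf2[len] = '\0';
		sink += strcoll(buf1, buf2);
	}
	printf("copy + strcoll: %.3f s\n",
		   (double) (clock() - start) / CLOCKS_PER_SEC);

	/* variant 3: plain memcmp() */
	start = clock();
	for (i = 0; i < ITERATIONS; i++)
		sink += memcmp(a, b, len);
	printf("memcmp:         %.3f s (sink %ld)\n",
		   (double) (clock() - start) / CLOCKS_PER_SEC, (long) sink);

	return 0;
}

The "new string each time" variation would instead draw both strings
from a pool too large to cache; that choice alone could plausibly
swing the result, which is my point.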

>> So, yes, it looks like I might have just about regressed this case -
>> it's hard to be completely sure. However, this is still a very
>> unrealistic case, since invariably "len1 == len2" without the
>> optimization ever working out, whereas the case that benefits [2] is
>> quite representative. As I'm sure you were expecting, I still favor
>> pursuing this additional optimization.
>
> Well, I have to agree that doesn't look too bad, but your reluctance
> to actually do the microbenchmark worries me. Granted,
> macrobenchmarks are what actually matters, but they can be noisy and
> there can be other confounding factors.
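
For anyone following along without the patch in front of them, the
additional optimization has roughly this shape - a simplified sketch
of a varstr_cmp()-style comparator, with illustrative names and
fixed-size buffers rather than the real thing:

#include <string.h>

/*
 * Sketch only: when the lengths match, try a cheap memcmp() first and
 * fall back to the locale-aware comparison when the bytes differ.
 * Assumes both lengths fit the stack buffers; the real code has no
 * such restriction.
 */
int
varstr_cmp_sketch(const char *arg1, int len1, const char *arg2, int len2)
{
	char		buf1[1024];
	char		buf2[1024];
	int			result;

	/*
	 * Opportunistic shortcut: bytewise-equal strings of equal length
	 * are equal under any collation, so strcoll() can be skipped.
	 */
	if (len1 == len2 && memcmp(arg1, arg2, len1) == 0)
		return 0;

	/* copy to add NUL terminators, then do the real comparison */
	memcpy(buf1, arg1, len1);
	buf1[len1] = '\0';
	memcpy(buf2, arg2, len2);
	buf2[len2] = '\0';

	result = strcoll(buf1, buf2);

	/*
	 * Break strcoll() ties with a binary comparison, since some
	 * locales report distinct strings as equal.
	 */
	if (result == 0)
	{
		result = memcmp(arg1, arg2, len1 < len2 ? len1 : len2);
		if (result == 0 && len1 != len2)
			result = (len1 < len2) ? -1 : 1;
	}

	return result;
}

The case being benchmarked above is the one where that first memcmp()
always fails, making its cost pure overhead unless it can be hidden.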

Well, I've been quite open about the fact that I think we can and
should hide things in memory latency. I don't think my benchmark was
in any way noisy, since you saw three 100-second runs per test
set/case, with a very stable outcome throughout - plus the test case is
extremely unsympathetic/unrealistic to begin with. Hiding behind
memory latency is an increasingly influential trick that I've seen
crop up a few times in various papers. I think it's perfectly
legitimate to rely on that. But, honestly, I have little idea how much
I actually am relying on it. I think it's only fair to look at
representative cases (i.e., actual SQL queries). Anything else can
only be used as a guide. But, in this instance, a guide to what,
exactly? This is not a rhetorical question, and I'm not trying to be
difficult. If I thought there was something bad hiding here, I'd tell
you about it.
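
To make that concrete, here is the variant I'd consider most
informative, if we can agree it's fair - equal-length strings drawn
from a pool too large to cache, where the pool size, seed, and string
shape are all my own assumptions:

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define POOL_SIZE	(1 << 20)	/* ~32MB of strings: well past L3 */
#define STR_LEN		31
#define COMPARISONS	5000000

static char pool[POOL_SIZE][STR_LEN + 1];

int
main(void)
{
	long		sink = 0;
	clock_t		start;
	int			i, j;

	setlocale(LC_COLLATE, "en_US.UTF-8");

	srand(1234);
	for (i = 0; i < POOL_SIZE; i++)
	{
		for (j = 0; j < STR_LEN; j++)
			pool[i][j] = 'a' + rand() % 26;
		pool[i][STR_LEN] = '\0';
	}

	/*
	 * memcmp()-then-strcoll(): the "wasted" memcmp() touches the very
	 * cache lines strcoll() is about to stall on anyway, which is
	 * what I mean by hiding the cost behind memory latency.
	 */
	srand(42);				/* same index stream for both loops */
	start = clock();
	for (i = 0; i < COMPARISONS; i++)
	{
		const char *a = pool[rand() % POOL_SIZE];
		const char *b = pool[rand() % POOL_SIZE];

		if (memcmp(a, b, STR_LEN) != 0)
			sink += strcoll(a, b);
	}
	printf("memcmp + strcoll: %.3f s (sink %ld)\n",
		   (double) (clock() - start) / CLOCKS_PER_SEC, sink);

	/* baseline: strcoll() alone on the same access pattern */
	srand(42);
	start = clock();
	for (i = 0; i < COMPARISONS; i++)
		sink += strcoll(pool[rand() % POOL_SIZE],
						pool[rand() % POOL_SIZE]);
	printf("strcoll alone:    %.3f s (sink %ld)\n",
		   (double) (clock() - start) / CLOCKS_PER_SEC, sink);

	return 0;
}

If the memcmp()-first loop costs about the same as the baseline even
though virtually every memcmp() fails, that's the latency hiding at
work; if it's measurably slower, then I'm relying on it more than I
should be.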

--
Peter Geoghegan
