Re: sortsupport for text

From: Florian Pflug <fgp(at)phlo(dot)org>
To: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: sortsupport for text
Date: 2012-06-21 10:40:51
Message-ID: EC561C64-6A13-4FE6-87E1-B13B347A9DF6@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Jun21, 2012, at 11:55 , Peter Geoghegan wrote:
> On 21 June 2012 10:24, Florian Pflug <fgp(at)phlo(dot)org> wrote:
>> On Jun21, 2012, at 02:22 , Peter Geoghegan wrote:
>>> I've written a very small C++ program, which I've attached, that
>>> basically proves that this can still make a fairly large difference -
>>> I hope it's okay that that it's C++, but that allowed me to write the
>>> program quickly, with no dependencies for anyone who cares to try this
>>> out, other than a C++ compiler and the standard library.
>>
>> Uh, are you sure the results aren't swapped? string_wrapper takes a
>> parameter called use_strcoll, and it indeed uses strcoll() if that
>> parameter is true, and strxfm() otherwise. But then it passes *false*
>> to determine the speed of strcoll(), and *true* to determine the speed
>> of strxfm()...
>
> Egads, you're right. Apparently in my haste I gave the wrong boolean
> argument to the class constructor in each case. I am completely wrong
> about this. I apologise for the careless mistake. At least I advanced
> the debate, though - we don't want to do any ad-hoc generation of
> strxfrm() output within comparators, even when one side can have a
> cached value.

FWIW, I've played around a bit with a fixed version of your set_test.cpp.
I made it use the cached sort key of the RHS also, made it preserve sort
keys during copy construction, even used a boost::shared_ptr to avoid
actually copying the cached sort key. The strxfrm() version still performed
about 30% worse than the strcoll() version.

From that, I figured that tree inserts don't trigger enough comparisons to
make strxfrm() worthwhile. So I mangled set_test.cpp a bit more and instead
of a set::set, I created an (C-style) array of string_wrapper instances,
and sorted them with std::sort(). Which made strcoll() twice as fast as
strxfrm()...

At this point, my theory is that your choice of "random" strings prevents
strxfrm() from ever winning over strcoll(). The reason being that you pick
each letter uniformly distributed from a-z, resulting in a probability of
two string differing in the first of 1 - 1/26 =~ 96%. Thus, even for
extremely complex collation rules, strcoll() probably only needs to compare
a few characters to determine the order. Whereas strxfrm() has to compute
the whole sort key, no matter what.

The question is thus, how good a model are your "random" strings for the
input of a typical sorting step in postgres? My guess is, a quite good one
actually, since people probably don't deal with a lot of very similar strings
very often. Which makes we wonder if using strxfrm() during sorting wouldn't
be a net loss, all things considered…

My modified set_test.cpp (which should now be called array_test.cpp)
is attached.

best regards,
Florian Pflug

Attachment Content-Type Size
set_test.cpp application/octet-stream 5.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2012-06-21 11:15:31 Re: sortsupport for text
Previous Message Peter Geoghegan 2012-06-21 09:55:53 Re: sortsupport for text