Re: sortsupport for text

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Peter Geoghegan <peter(at)2ndquadrant(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Greg Stark <stark(at)mit(dot)edu>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: sortsupport for text
Date: 2012-06-20 17:35:33
Message-ID: CA+TgmoY7wSQpD82XF73b+DEv-nUobT4bPyap178Z+=FmhB+q8g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Jun 20, 2012 at 12:41 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> The fact is that this is likely to be a fairly significant
>> performance win, because strxfrm() is quite simply the way you're
>> supposed to do collation-aware sorting, and is documented as such. For
>> that reason, C standard library implementations should not be expected
>> to emphasize its performance - they assume that you're using strxfrm()
>> + their highly optimised strcmp()
>
> Have you got any evidence in support of this claim, or is it just
> wishful thinking about what's likely to be inside libc?  I'd also note
> that any comparisons you may have seen about this are certainly not
> accounting for the effects of data bloat from strxfrm (ie, possible
> spill to disk, more merge passes, etc).
>
> In any case, if you have to redefine the meaning of equality in order
> to justify a performance patch, I'm prepared to walk away at the start.
> The range of likely performance costs/benefits across different locales
> and different implementations is so wide that if you can't show it to be
> a win even with the strcmp tiebreaker, it's not likely to be a reliable
> win without that.

On the testing I've done, the strcmp() tie-breaking rarely gets run
anyway. Unless you're sorting data with only a few distinct values,
most comparisons are between values that are distinct under any choice
of collation, and therefore strcoll() returns 0 very rarely, and
therefore the additional runtime it consumes does not matter very
much. Also, it's quite a bit faster than strcoll() anyway, so even
when it does run it doesn't add much to the total time.

I think the elephant in the room here is that we're relying on the OS
to do everything for us, and the OS API we use (strcoll) requires an
extra memcpy and is also dreadfully slow. If we could solve that
problem, it would save us a lot more than worrying about the extra
strcmp(). Of course, solving that problem is hard: we either have to
get the glibc and FreeBSD libc folks to provide a better API (that
takes lengths for each input instead of relying on trailing nul
bytes), or reimplement locales within PG, or store trailing NUL bytes
that we don't really need in the index so we can apply strcoll
directly, none of which are very appealing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2012-06-20 17:38:25 Re: sortsupport for text
Previous Message Marc Mamin 2012-06-20 17:35:25 Re: Nasty, propagating POLA violation in COPY CSV HEADER