Re: sortsupport for text

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <stark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: sortsupport for text
Date: 2012-06-14 18:11:10
Message-ID: CA+TgmoYkRLDVORC5OP8xdtsPY4rE380cLw+EHfuxmJdcV4TrcQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jun 14, 2012 at 1:56 PM, Peter Geoghegan <peter(at)2ndquadrant(dot)com> wrote:
> On 14 June 2012 17:35, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> The problem with pre-detoasting to save comparison cycles is that you
>> can now fit many, many fewer tuples in work_mem.  There might be cases
>> where it wins (for example, because the entire data set fits even
>> after decompressing everything) but in most cases it seems like a
>> loser.
>
> If I had to guess, I'd say you're probably right about that -
> optimising sorting toasted text doesn't seem like a terribly sensible
> use of your time.
>
> What about the strxfrm suggestion of Greg's? You might find that the
> added benefit of being able to avail of a highly optimised strcmp()
> tipped the balance in favour of that idea, beyond the simple fact that
> there's only a linear number of what you might loosely call "strcoll_l
> units of work" rather than as many as O(n ^ 2). Furthermore, I'd
> speculate that if you were to interlace the strxfrm() calls with
> copying each text string, somewhere like within a specialised
> datumCopy(), that would make the approach more efficient still, as you
> specify a location for the blob in the just-palloc()'d leading-key
> private memory directly, rather than just using memcpy.

Well, it's still got the problem of blowing up memory usage. I just
can't get excited about optimizing for the case where we can consume
10x the memory and still fit in work_mem. If we've got that case, the
sort is gonna be pretty fast anyway. The case where preprocessing
wins is when there are going to be a large number of comparisons
against each tuple - i.e. lg(N) is large. But the cases where we
could pre-transform with strxfrm are those where the data fits in a
small percentage of work_mem - i.e. lg(N) is small. I'm open to
somebody showing up with a test result that demonstrates that it's
worthwhile, but to me it seems like it's chasing diminishing returns.
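To make the tradeoff concrete, here's a rough sketch of the strxfrm
pre-transform idea (purely illustrative - the function name and memory
handling are mine, not anything in the patch):

/*
 * Transform each key once up front so that later comparisons are plain
 * strcmp()/memcmp() on the transformed blobs.
 */
#include <stdlib.h>
#include <string.h>

static char *
prepare_sort_key(const char *str)
{
    /* With a zero-length buffer, strxfrm() reports the size it needs. */
    size_t  needed = strxfrm(NULL, str, 0) + 1;
    char   *blob = malloc(needed);      /* would be palloc() in the backend */

    if (blob != NULL)
        strxfrm(blob, str, needed);
    return blob;
}

That's one strxfrm() per tuple plus N*lg(N) cheap byte comparisons instead
of N*lg(N) strcoll() calls - but the blob is commonly several times as large
as the original string, which is exactly the memory blowup I'm worried about.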

The point of this patch isn't really to improve things for the
collation-aware case, although it's nice that it does. The point is
rather to shave a double-digit percentage off the time it takes to
do the sort required to build a C-collation index, which is what
people should be using when they don't care about < and >, which most
don't. Despite Tom's concerns, I don't think there's anything in this
patch that can't be fairly easily revised at a later date if we decide
we want a different API. I think it's worth picking the low-hanging
fruit in the meantime.
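
Just to illustrate the shape of the fast path involved: with the C
collation, the comparator can boil down to memcmp. A minimal sketch against
the SortSupport comparator signature (the function name is illustrative,
and this glosses over detoasting and cleanup):

#include "postgres.h"
#include "fmgr.h"
#include "utils/sortsupport.h"

static int
text_cmp_c_locale(Datum x, Datum y, SortSupport ssup)
{
    text   *tx = DatumGetTextPP(x);
    text   *ty = DatumGetTextPP(y);
    int     lenx = VARSIZE_ANY_EXHDR(tx);
    int     leny = VARSIZE_ANY_EXHDR(ty);
    int     result;

    /* In the C collation, byte order is the sort order. */
    result = memcmp(VARDATA_ANY(tx), VARDATA_ANY(ty), Min(lenx, leny));
    if (result == 0)
        result = lenx - leny;   /* shorter string sorts first on a tie */
    return result;
}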

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
