Re: sortsupport for text

From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Stark <stark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: sortsupport for text
Date: 2012-03-19 21:23:05
Message-ID: 20120319212304.GA15141@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Mar 19, 2012 at 12:19:53PM -0400, Robert Haas wrote:
> On Sat, Mar 17, 2012 at 6:58 PM, Greg Stark <stark(at)mit(dot)edu> wrote:
> > On Fri, Mar 2, 2012 at 8:45 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> >> 12789    28.2686  libc-2.13.so             strcoll_l
> >> 6802     15.0350  postgres                 text_cmp
> >
> > I'm still curious how it would compare to call strxfrm and sort the
> > resulting binary blobs. I don't think the sortsupport stuff actually
> > makes this any easier though. Since using it requires storing the
> > binary blob somewhere I think the support would have to be baked into
> > tuplesort (or hacked into the sortkey as an expr that was evaluated
> > earlier somehow).
>
> Well, the real problem here is that the strxfrm'd representations
> aren't just bigger - they are huge. On MacBook Pro, if the input
> representation is n characters, the strxfrm'd representation is 9x+3
> characters.

Ouch. I was holding out hope that you could get a meaningful
improvement if we could use the first X bytes of the strxfrm output so
you only need to do a strcoll on strings that actually nearly match.
But with an information density of 9 bytes for one 1 character it
doesn't seem worthwhile.

That and this gem in the strxfrm manpage:

RETURN VALUE
The strxfrm() function returns the number of bytes required to
store the transformed string in dest excluding the terminating
'\0' character. If the value returned is n or more, the
contents of dest are indeterminate.

Which means that you have to take the entire transformed string, you
can't just ask for the first bit. I think that kind of leaves the whole
idea dead in the water.

Just for interest I looked at the ICU API for this and they have the
same restriction. There is another function which you can use to
return partial sort keys (ucol_nextSortKeyPart) but it produces
"uncompressed sortkeys", which it seems is what Mac OSX is doing, which
seems useless for our purposes. Either this is a hard problem or we're
nowhere near a target use case.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.
-- Arthur Schopenhauer

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Merlin Moncure 2012-03-19 21:45:05 Re: Regarding column reordering project for GSoc 2012
Previous Message Tareq Aljabban 2012-03-19 21:14:53 Re: Storage Manager crash at mdwrite()