Re: sortsupport for text

From: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <stark(at)mit(dot)edu>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: sortsupport for text
Date: 2012-06-19 14:41:17
Message-ID: CAEYLb_VmhvFvgrvMq3Z43uGzWTejeEigtgH1+qyTZQaqX7dx6w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

So, just to give a bit more weight to my argument that we should
recognise that equivalent strings ought to be treated identically, I
direct your attention to conformance requirement C9 of Unicode 3.0:

http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode_3_0_0

This "requires that canonically equivalent text be treated
identically. See also C10 and several other places in The Unicode
Standard."

This page, "DETERMINISTIC SORTING", was also interesting:

http://www.unicode.org/notes/tn9/

The same hack that we use in varstr_cmp() appears as psuedo-code,
under "Forcing Deterministic Comparisons". I'm not sure in what
general sense what *they'd* call a non-deterministic comparison (i.e.
plain old strcoll()) is actually non-deterministic. Clearly a
"non-deterministic comparison" is deterministic to the extent that
given the same two binary strings and encoding and collation, the
comparison function will always give the same answer.

The article goes on to describe how we could bolt-on a strxfrm() type
value to a string, to ensure "deterministic comparison" (i.e.
identical behaviour to Postgres master) with pre-processing. Tom did
think that this might still be worth it. I'd rather avoid it.

The article goes on to say of so-called deterministic comparisons:
"However, a deterministic comparison is generally not best practice.
First, it has a certain performance cost in comparison, and a quite
substantial impact on sort key size. (For example, [ICU]
language-sensitive sort keys are generally about the size of the
original string, so appending a copy of the original string generally
doubles the size of the sort key.) A database using these sort keys
can have substantially increased disk footprint and memory footprint,
and consequently reduced performance."

I note also that the The Single UNIX Specification, Version 2 says of
strcoll(): "The strxfrm() and strcmp() functions should be used for
sorting large lists". Why has it taken us this long to research this?
I am aware that Greg Stark suggested it back in 2005, but I doubt that
that was the first time.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2012-06-19 14:45:42 Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Previous Message Tom Lane 2012-06-19 14:38:56 Re: Do we want a xmalloc or similar function in the Backend?