Re: sortsupport for text

From: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <stark(at)mit(dot)edu>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: sortsupport for text
Date: 2012-06-18 15:59:53
Message-ID: CAEYLb_Uw4_Z0Vfb8+VR1+iM2zJ-c7YCV=sQRT1J4CXK+sAkR+w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 18 June 2012 00:38, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> The only reason we test "a = b" and not "a < b || a > b"
> is that the latter is at least twice as expensive to evaluate.

Perhaps I've been unclear. I meant to write "(!(a < b) && !(b < a))",
and not "(a < b && b < a)". The former is not tautological when the
trichotomy law holds, but the latter is. My mistake - I was a little
tired when I wrote that (though you seemed to know what I meant).

You can sort things just fine if you only have a less than comparator
than returns a boolean, which is what I sought to illustrate with that
example: There is no requirement that the comparator indicate anything
more than whether one element is less than the other. In particular,
it need not tell the sort function that two elements are equal, though
the sort function will still (almost invariably in practice) put equal
elements beside each other. Consider std::set, a highly-generic
template class that enforces uniqueness, usually implemented using a
red-black tree (so its elements are sorted), that only requires that
the datatype it is instantiated with have an operator<. The datatype
used may not even have an operator== for std::set to consult if it
wanted to.

So, in principle, std::set could figure out if any two given elements
were equivalent. That is to say, it could determine:

if (!(a < b) && !(b < a))

It could not, even in principle given its implicit
interface/requirements for datatypes, know for sure that:

if (a == b)

Despite the fact that equivalency and equality can be expected to
coincide the vast majority of the time, they are not quite the same
thing. It's perfectly reasonable that it might actually not hold that
a and b are equal, even though they are equivalent, for certain
datatypes. Certainly not for integers. But, for example, it could make
sense to have a string datatype where with a certain locale a certain
Hungarian word was equivalent to the same string with a minor
variation in diacritical marks or whatever (an entirely culturally
defined equivalency), even though it was not equal according to this
datatype's own idiosyncratic notion of equality, which is bitwise
equality.

That would be a datatype that std::set would be entirely capable of
rolling with, because it doesn't care about equality. It might be
better if our text type had the same semantics as this hypothetical
string datatype, because:

1. A culture has gone so far as to make these variations of
diacritical marks or whatever equivalent. The Unicode consortium
consulted with the Hungarian government, or maybe the Hungarian
version of the Académie française, and they asked for this, which
seems pretty serious to me. Maybe they'd like to have a unique
constraint reject what they consider to be equivalent strings. If this
sounds pedantic to you, google "Han unification controversy", because
this sounds like that in reverse (Hungarian separation controversy).
The fact that when a unique constraint is violated, it isn't actually
necessary to check the equality operator function (equivalence will
do; no need to call texteq as when selecting with the same value in
the qual) suggests that this wouldn't be too hard to facilitate,
though again, I admit my understanding here is more shallow than I'd
like. Now, I'll grant you that I'm not currently losing any sleep
about our cultural insensitivity, and apparently neither are any
Hungarians; this just serves to illustrate that my position is The
Right Thing (I hope).

Here is a description of the Hungarian problem:

http://blogs.msdn.com/b/michkap/archive/2005/08/10/449909.aspx

It says " the feature is such that since dzs is a compression within
Hungarian, that if you see ddzs it should be treated as equivalent to
dzsdzs...Basically, the feature is such that since dzs is a
compression within Hungarian, that if you see ddzs it should be
treated as equivalent to dzsdzs.".

We're not the first people to have to deal with this:

http://bugs.mysql.com/bug.php?id=12519

Here is Tom's original analysis, that lead to this patch:

http://archives.postgresql.org/pgsql-general/2005-12/msg00803.php

2. This way, it's still possible to do the strxfrm() optimisation more
or less for free, without breaking hash methods on text. That is not
to be sniffed at - sorting text is very common and important. Most
people don't know about the C locale, and couldn't use it if they did.

> The last section of src/backend/access/nbtree/README has some notes
> that you might find relevant.

I assume that you're referring to "Notes to Operator Class
Implementors". It says:

"""
An = operator must be an equivalence relation; that is, for all non-null
values A,B,C of the datatype:

A = A is true reflexive law
if A = B, then B = A symmetric law
if A = B and B = C, then A = C transitive law
"""

This would still hold; we'd just have to change the = operators here
to something representing equivalency, I suppose (or, more likely,
don't bring up equivalency .Vs equality here for pedagogical reasons).
If it didn't hold, than it would be impossible to write a strxfrm()
implementation, since the two variations of the textually distinct
though equivalent Hungarian strings would produce identical strxfrm()
blobs that were on the one hand bitwise identical, but on the other
not reflexive, symmetric or transitive when bitwise compared. This is
obviously a logical contradiction.

So, we'd still be using a comparison that returns an integer under the
hood, but we'd interpret 0 as equivalency and not equality.

>> Simple question: if you were to just remove the strcmp tie-breaker for
>> () in varstr_cmp(), but not touch anything else, would Postgres
>> exhibit objectively incorrect behaviour?
>
> Yes, it would, and did, before we put that in; see the archives for the
> discussions that led up to the patch you mentioned earlier.

It wouldn't be the same though, because you also changed the
definition of texteq, textne, bpchareq, bpcharne in those commits (to
just use strncmp(), not varstr_cmp()), and I am not suggesting that
you consider changing them back too.

Incidentally, I should mention that I tried removing the strcmp()
tie-breaker in varstr_cmp() on master, and unsurprisingly the
regression tests don't break.

Now, there might be a better way of explaining all of this that is
more clear and succinct, and that's what I'll attempt to do:

tuplesort.c tells lies to the btree code: That two index tuples that
are logically equal are not. Now, the btree code doesn't "repeat"
those lies to anyone else, like the user, which is unsurprising
because that would lead to incorrect answers. Like all good liars, it
is consistent in the lies that it tells (because the lies are derived
from the item-pointers of index tuples, which are used as a
tie-breaker).

I'm tentatively suggesting that it might be to our advantage to tell a
similar though different lie to the btree code that it would never
have to know about, a lie originating higher-up and reverberating more
places, within the comparator proper ( that is, varstr_cmp() ), rather
than the index-tuple encapsulating comparator within tuplesort (that
is, comparetup_index_btree() ): That two index tuples are equal, when
in-fact it might just be that they're equivalent.

Now, for most people, this is exactly the same thing since every
datatype except text doesn't have a separate notion of equivalence,
and text basically only has that notion when using certain locales
like hu_HU.UTF-8 (hardly surprising that the regression tests didn't
fail on me).

The Hungarians get to have uniqueness enforced for textually distinct
variants that strcoll() returns 0 for (equivalent not equal values) -
in other words, they get what they asked for. However, they still must
specify the correct exact variation in predicates, since the equality
operator only cares about bitwise equality, which is consistent with
general Postgres convention. I believe that that could be the most
correct behaviour for them.

So, the original complainant saw this behaviour, a clear violation of
the trichotomy law:

mage=# select 'potyty'::varchar = 'potty'::varchar;
?column?
----------
f
(1 row)

mage=# select 'potyty'::varchar < 'potty'::varchar;
?column?
----------
f
(1 row)

mage=# select 'potyty'::varchar > 'potty'::varchar;
?column?
----------
f
(1 row)

The thing is that I think that the above user-visible behaviour might
actually be desirable. 'potyty'::varchar is* not* bitwise equal (our
general definition of text equality) to 'potty'::varchar. So, suppose
there was an equivalency operator, say ===, all would be right with
the world:

mage=# select 'potyty'::varchar === 'potty'::varchar;
?column?
----------
t
(1 row)

mage=# select 'potyty'::varchar < 'potty'::varchar;
?column?
----------
f
(1 row)

mage=# select 'potyty'::varchar > 'potty'::varchar;
?column?
----------
f
(1 row)

The reason everything wasn't right with the world, and this example
got screwed up for the complainant at the time was that texteq() still
returned false via the len1 != len2 fastpath - otherwise Tom's
testcase here would not have shown a problem. So at the time and under
the exact circumstances, the '=' operator was actually
equality-orientated (if only by accident, because len1 != len2 here),
whereas varstr_cmp() was entirely equivalency-orientated. So that
explains this test-case of Tom's. It does not explain the confusing
behaviour that led to the complaint though, since few people are in
the habit of verifying that the trichotomy law holds just for the fun
of it.

online=# select * from common_logins where username = 'potyty';
uid | username | password | lastlogin | status | usertype | loginnum
-----+----------+----------+-----------+--------+----------+----------
(0 rows)

online=# select * from common_logins where username like 'potyty';
uid | username | password | lastlogin | status | usertype | loginnum
--------+----------+----------+----------------------------+--------+----------+----------
155505 | potyty | board | 2004-08-16 17:45:55.723829 | A | S | 1
60067 | potyty | board | 2004-07-07 20:22:17.68699 | A | S | 3
174041 | potyty | board | 2005-02-17 00:00:13.706144 | A | S | 3
(3 rows)

online=# select username, username = 'potyty' from common_logins where
username like 'potyty';
username | ?column?
----------+----------
potyty | t
potyty | t
potyty | t
(3 rows)

That isn't something that I've been able to figure out yet. I think
that maybe an index got corrupted because

Incidentally, I should point out that right now this code appears in
like_match.c:

MatchText(char *t, int tlen, char *p, int plen,
pg_locale_t locale, bool locale_is_c)

The only way that the locale and locale_is_c are used within this
function is via a recursive call to MatchText() - they're entirely
vestigial, it seems.

Now, again I must admit that it isn't obvious to me that my white lie
to the btree code about something being equal that is actually just
equivalent won't come back to haunt me - lies have short legs. On the
face of it though, it looks like I might get away with it, since
equality in a qual is re-checked when I execute a select statement,
but isn't checked when I attempt to violate a unique constraint (a
behavioural difference which the Hungarians will be happy about, since
they took the unusual step of representing that the strings 'potyty'
and 'potty' were exactly equivalent to the appropriate authorities).

Perhaps more importantly, I cannot recreate any of these problems on
my Fedora 16 machine. Even with hu_HU on LATIN2, Tom's original test
case (from 2005, on a Fedora 4 machine) cannot be recreated. So it may
be that they've tightened these things up in some way. It's far from
clear why that should be.

It could be worth

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2012-06-18 16:04:39 Re: sortsupport for text
Previous Message Andres Freund 2012-06-18 15:52:28 Re: [PATCH 02/16] Add zeroRecPtr as a shortcut for initializing a local variable to {0, 0}