Re: [SQL] Comparison semantics of CHAR data type

From: Thomas Fanghaenel <tfanghaenel(at)salesforce(dot)com>
To: Kevin Grittner <kgrittn(at)ymail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [SQL] Comparison semantics of CHAR data type
Date: 2013-10-14 17:57:37
Message-ID: CAK+WP1x5NOXXoh36cEcT=JRZ=_a7=53cb66TfAzMM9DnzFqURw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-sql

On Fri, Oct 11, 2013 at 3:44 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> You have some good questions here, though there are two interrelated
> things going on here. First is collation, and the second is the
> trimming of spaces from char() comparisons.

Sorry, I should have probably mentioned more explicitly that I'm only
concerned about the latter. I might get yelled at for saying this, but
I think the rather ubiquitous right-trimming of bpchar values is wrong,
and does - in combination with memcmp() for comparison - lead to some
non-standard behavior.

> Now, on to the trailing space issue using the default TEXT value for
> strings, first in UTF8:
>
>
> test=> SHOW lc_collate;
> lc_collate
> -------------
> en_US.UTF-8
> (1 row)
>
> test=> select E'ab\n' < E'ab ';
> ?column?
> ----------
> f
> (1 row)
>
> then in "C":
>
> test2=> SHOW lc_collate;
> lc_collate
> ------------
> C
> (1 row)
>
> test2=> select E'ab\n' < E'ab ';
> ?column?
> ----------
> t
> (1 row)
>
> This matches the \n/space issue we saw above. Now, here is where
> CHAR() starts to show the unusual behavior you saw, first in UTF8:
>
> test=> SHOW lc_collate;
> lc_collate
> -------------
> en_US.UTF-8
> (1 row)
>
> test=> select E'ab\n'::CHAR(10) < E'ab '::CHAR(10);
> ?column?
> ----------
> f
> (1 row)
>
> then in C:
>
> test2=> SHOW lc_collate;
> lc_collate
> ------------
> C
> (1 row)
>
> test2=> select E'ab\n'::CHAR(10) < E'ab '::CHAR(10);
> ?column?
> ----------
> f
> (1 row)
>
> You can see the UTF8 case is fine because \n is considered greater
> than space, but in the C locale, where \n is less than space, the
> false return value shows the problem with
> internal_bpchar_pattern_compare() trimming the string and first
> comparing on lengths. This is exactly the problem you outline, where
> space trimming assumes everything is less than a space.

For collations other than C some of those issues that have to do with
string comparisons might simply be hidden, depending on how strcoll()
handles inputs off different lengths: If strcoll() applies implicit
space padding to the shorter value, there won't be any visible
difference in ordering between bpchar and varchar values. If strcoll()
does not apply such space padding, the right-trimming of bpchar values
causes very similar issues even in a en_US collation.

For example, this seems to be the case on OS X:

select 'ab '::char(10) collate "en_US" < E'ab\n'::char(10)
collate "en_US";
?column?
----------
t
(1 row)

select 'ab '::char(10) collate "C" < E'ab\n'::char(10) collate "C";
?column?
----------
t
(1 row)

select 'ab '::varchar(10) collate "en_US" <
E'ab\n'::varchar(10) collate "en_US";
?column?
----------
f
(1 row)

select 'ab '::varchar(10) collate "C" < E'ab\n'::varchar(10)
collate "C";
?column?
----------
f
(1 row)

So here there's actually not only the same \n/space issue as in the C
collation (which would go away if the bpchar value weren't trimmed).
It also shows that there might be slight differences in behavior,
depending which platform you're running on.

On Fri, Oct 11, 2013 at 4:58 PM, Kevin Grittner <kgrittn(at)ymail(dot)com> wrote:
> What matters in general isn't where the characters fall when comparing
> individual bytes, but how the strings containing them sort according
> to the applicable collation. That said, my recollection of the spec
> is that when two CHAR(n) values are compared, the shorter should be
> blank-padded before making the comparison.

Not necessarily. The SQL Standard actually ties this to the collation
sequence that is in use. Without a lot of context, this is from
Subclause 8.2, "<comparison predicate>", General Rule 3)b):

b) If the length in characters of X is not equal to the length in
characters of Y, then the shorter string is effectively replaced,
for the purposes of comparison, with a copy of itself that has been
extended to the length of the longer string by concatenation on the
right of one or more pad characters, where the pad character is
chosen based on CS. If CS has the NO PAD characteristic, then the
pad character is an implementation-dependent character different
from any character in the character set of X and Y that collates
less than any string under CS. Otherwise, the pad character is a
<space>.

In my opinion, that's just a lot of handwaving, to the extent that in
practice different vendors interpret this clause differently. It seems
that SQLServer and DB2 do PAD semantics across the board, whereas Oracle
has uses NO PAD semantics whenever there's a VARCHAR type involved in
the comparison.

But all that is actually a whole different can of worms, and slightly
besides the point of my original question. How to properly compare
strings with different lentgths has been discussed before, see for
instance the thread in [1]. My intention was not to get that started
again. As far as I can see, the consensus seems to be that when using
the C locale, string comparisons should be done using NO PAD semantics.
(It sure gives some strange semantics if you have varchars with trailing
spaces, but it's perfectly legal.)

The point is that my testcase deals with strings of the same length.
Thus, the above clause doesn't really apply. The standard, to my
understanding, says that fixed-length character values are padded when
the row is constructed. And once that happens, those spaces become part
of the value. It's invalid to strip them, unless done explicitly.

> Since we only have the CHAR(n) type to improve compliance with the SQL
> specification, and we don't generally encourage its use, I think we
> should fix any non-compliant behavior. That seems to mean that if you
> take two CHAR values and compare them, it should give the same result
> as comparing the same two values as VARCHAR using the same collation
> with the shorter value padded with spaces.
>
> So this is correct:
>
> test=# select 'ab'::char(3) collate "en_US" < E'ab\n'::char(3) collate "en_US";
> ?column?
> ----------
> t
> (1 row)
>
> ... because it matches:
>
> test=# select 'ab '::varchar(3) collate "en_US" < E'ab\n'::varchar(3) collate "en_US";
> ?column?
> ----------
> t
> (1 row)

Again, this touches on the same point as Bruce's example above. Right
now these two queries might produce identical results on Linux, because
of the way strcoll() behaves. On OS X you get different results:

select 'ab'::char(3) collate "en_US" < E'ab\n'::char(3) collate "en_US";
?column?
----------
t
(1 row)

select 'ab '::varchar(3) collate "en_US" < E'ab\n'::varchar(3)
collate "en_US";
?column?
----------
f
(1 row)

I have no idea who's right. But doesn't this count as evidence that
right-trimming pbchars is not even safe with the en_US collation?

-- Thomas

[1] http://www.postgresql.org/message-id/flat/A737B7A37273E048B164557ADEF4A58B0579A7AB(at)ntex2010a(dot)host(dot)magwien(dot)gv(dot)at#A737B7A37273E048B164557ADEF4A58B0579A7AB@ntex2010a.host.magwien.gv.at

On Fri, Oct 11, 2013 at 4:58 PM, Kevin Grittner <kgrittn(at)ymail(dot)com> wrote:
> Bruce Momjian <bruce(at)momjian(dot)us> wrote:
>> Thomas Fanghaenel wrote:
>
>>> I was wondering about the proper semantics of CHAR comparisons in some corner
>>> cases that involve control characters with values that are less than 0x20
>>> (space).
>
> What matters in general isn't where the characters fall when
> comparing individual bytes, but how the strings containing them
> sort according to the applicable collation. That said, my
> recollection of the spec is that when two CHAR(n) values are
> compared, the shorter should be blank-padded before making the
> comparison. *That* said, I think the general advice is to stay
> away from CHAR(n) in favor or VARCHAR(n) or TEXT, and I think that
> is good advice.
>
>> I am sorry for this long email, but I would be interested to see what
>> other hackers think about this issue.
>
> Since we only have the CHAR(n) type to improve compliance with the
> SQL specification, and we don't generally encourage its use, I
> think we should fix any non-compliant behavior. That seems to mean
> that if you take two CHAR values and compare them, it should give
> the same result as comparing the same two values as VARCHAR using
> the same collation with the shorter value padded with spaces.
>
> So this is correct:
>
> test=# select 'ab'::char(3) collate "en_US" < E'ab\n'::char(3) collate "en_US";
> ?column?
> ----------
> t
> (1 row)
>
> ... because it matches:
>
> test=# select 'ab '::varchar(3) collate "en_US" < E'ab\n'::varchar(3) collate "en_US";
> ?column?
> ----------
> t
> (1 row)
>
> But this is incorrect:
>
> test=# select 'ab'::char(3) collate "C" < E'ab\n'::char(3) collate "C";
> ?column?
> ----------
> t
> (1 row)
>
> ... because it doesn't match:
>
> test=# select 'ab '::varchar(3) collate "C" < E'ab\n'::varchar(3) collate "C";
> ?column?
> ----------
> f
> (1 row)
>
> Of course, I have no skin in the game, because it took me about two
> weeks after my first time converting a database with CHAR columns
> to PostgreSQL to change them all to VARCHAR, and do that as part of
> all future conversions.
>
> --
> Kevin Grittner
> EDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Josh Berkus 2013-10-14 18:02:59 Re: Patch for reserved connections for replication users
Previous Message Andres Freund 2013-10-14 17:51:30 Re: Patch for reserved connections for replication users

Browse pgsql-sql by date

  From Date Subject
Next Message Craig R. Skinner 2013-10-16 12:50:14 PostgreSQL 9.2 tstzrange null/infinity CONSTRAINT CHECK
Previous Message Kevin Grittner 2013-10-11 20:58:50 Re: [SQL] Comparison semantics of CHAR data type