Quick Links

Per-column collation, work in progress

Lists:	pgsql-hackers

From:	Peter Eisentraut <peter_e(at)gmx(dot)net>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Per-column collation, work in progress
Date:	2010-09-15 20:46:08
Message-ID:	1284583568.4696.20.camel@vanquo.pezone.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Following up on my previous patch [0], here is a fairly complete
implementation of this feature. The general description and
implementation outline of the previous message still apply. This patch
contains documentation and regression tests, which can serve as further
explanations.

As this patch touches pretty much everything in the system, there are
probably countless bugs and bogosities, some of which I have marked with
FIXME, TODO, etc. But all the functionality is basically there, so it's
time someone else gives this a serious examination.

Note: As previously, regression tests only work with "make check
MULTIBYTE=UTF8" and the feature overall only works on Linux/glibc.

[0]
http://archives.postgresql.org/message-id/1279045531.32647.14.camel@vanquo.pezone.net

Attachment	Content-Type	Size
collate-20100915.patch.gz	application/x-gzip	82.3 KB

From:	Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>
To:	Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Per-column collation, work in progress
Date:	2010-09-22 10:44:24
Message-ID:	AANLkTindnuQ1Yipo-u-O3=mimncrrrtdoZ85xRthhJfp@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Sep 16, 2010 at 5:46 AM, Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
> Following up on my previous patch [0], here is a fairly complete
> implementation of this feature. The general description and
> implementation outline of the previous message still apply. This patch
> contains documentation and regression tests, which can serve as further
> explanations.

I tested the patch on database with encoding=UTF8 and locale-C.
I have a couple of questions and comments.

* CREATE TABLE (LIKE table_with_collation) doesn't inherit collations.
We need to copy collations by default, or add INCLUDING COLLATE option.

postgres=# SELECT name, setting FROM pg_settings WHERE name IN
('lc_ctype', 'lc_collate');
name | setting
------------+---------
lc_collate | C
lc_ctype | C
(2 rows)

postgres=# SELECT upper(c) FROM tbl;
ERROR: invalid multibyte character for locale
HINT: The server's LC_CTYPE locale is probably incompatible with the
database encoding.
postgres=# SELECT upper(ja) FROM tbl;
ERROR: invalid multibyte character for locale
HINT: The server's LC_CTYPE locale is probably incompatible with the
database encoding

* Comparison of strings with different collations is forbidden,
but assignment is allowed, right?

postgres=# SELECT * FROM tbl WHERE c = ja;
ERROR: collation mismatch between implicit collations "C" and "ja_JP.utf8"
LINE 1: SELECT * FROM tbl WHERE c = ja;
^
HINT: You can override the collation by applying the COLLATE clause
to one or both expressions.
postgres=# INSERT INTO tbl(c, ja) SELECT ja, c FROM tbl;
INSERT 0 6

> the feature overall only works on Linux/glibc.

We could support it also on MSVC.
http://msdn.microsoft.com/en-us/library/a7cwbx4t(v=VS.90).aspx -- _strcoll_l
http://msdn.microsoft.com/en-us/library/45119yx3(v=VS.90).aspx -- _towupper_l

--
Itagaki Takahiro

From:	Peter Eisentraut <peter_e(at)gmx(dot)net>
To:	Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Per-column collation, work in progress
Date:	2010-09-22 13:29:30
Message-ID:	1285162170.15691.51.camel@vanquo.pezone.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On ons, 2010-09-22 at 19:44 +0900, Itagaki Takahiro wrote:
> * CREATE TABLE (LIKE table_with_collation) doesn't inherit collations.
> We need to copy collations by default, or add INCLUDING COLLATE option.

OK, should be easy to fix.

> * upper() doesn't work if a column has a collation.
> It still works if a column doesn't have a collation.

I think what you are observing is the result of mixing C and non-C
locales. Of course that should also be fixed, but it doesn't have much
to do with what upper() does. Note that there is a regression test case
for lower(), which works mostly the same way.

> * Comparison of strings with different collations is forbidden,
> but assignment is allowed, right?

Correct.

> * psql \d needs a separator between collate and not null modifiers.

OK.

> We could support it also on MSVC.
> http://msdn.microsoft.com/en-us/library/a7cwbx4t(v=VS.90).aspx -- _strcoll_l
> http://msdn.microsoft.com/en-us/library/45119yx3(v=VS.90).aspx -- _towupper_l

Great.

From:	Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To:	Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Per-column collation, work in progress
Date:	2010-09-23 08:12:32
Message-ID:	AANLkTimbum+D57BeLbRvq3t5LKa8iO4oLj=vjT+9mRLk@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hello

I am playing with your patch now. I found a few issues:

1. It's doesn't work with SQL 92 rules for sortby list. I can
understand so explicit COLLATE using doesn't work, but the implicit
using doesn't work too:

CREATE TABLE foo(a text, b text COLLATE "cs_CZ.UTF8")

SELECT * FROM foo ORDER BY 1 -- produce wrong order

2. Why default encoding for collate is static? There are latin2 for
czech, cs_CZ and cs_CZ.iso88592. So any user with UTF8 has to write
encoding explicitly. But the more used and preferred encoding is UTF8
now. I am thinking so cs_CZ on utf8 database should mean cs_CS.UTF8.

3. postgres=# select to_char(current_date,'tmday') collate "cs_CZ.utf8";
to_char
──────────
thursday -- bad result
(1 row)

4. is somewhere ToDo for collation implementation?

postgres=# create table xy(a text, b text collate "cs_CZ");
ERROR: collation "cs_CZ" for current database encoding "UTF8" does not exist

can be there some more friendly message or hint ? like "you cannot to
use a different encoding". This collate is in pg_collates table.

--
patch was applied cleanly and works in very well. Thank you.

Regards

Pavel Stehule

2010/9/15 Peter Eisentraut <peter_e(at)gmx(dot)net>:
> Following up on my previous patch [0], here is a fairly complete
> implementation of this feature. The general description and
> implementation outline of the previous message still apply. This patch
> contains documentation and regression tests, which can serve as further
> explanations.
>
> As this patch touches pretty much everything in the system, there are
> probably countless bugs and bogosities, some of which I have marked with
> FIXME, TODO, etc. But all the functionality is basically there, so it's
> time someone else gives this a serious examination.
>
> Note: As previously, regression tests only work with "make check
> MULTIBYTE=UTF8" and the feature overall only works on Linux/glibc.
>
> [0]
> http://archives.postgresql.org/message-id/1279045531.32647.14.camel@vanquo.pezone.net
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>
>

From:	Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>
To:	Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc:	Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Per-column collation, work in progress
Date:	2010-09-23 08:29:45
Message-ID:	AANLkTi=zza7phKqTRruB-Ju0z0_1oA=uVH3hq+hq+s2W@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Sep 23, 2010 at 5:12 PM, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> wrote:
> 3. postgres=# select to_char(current_date,'tmday') collate "cs_CZ.utf8";
> to_char
> ──────────
> thursday -- bad result
> (1 row)

COLLATE means "collation" rather than "locale", no?

> 5.
> postgres=# create table xy(a text, b text collate "cs_CZ");
> ERROR: collation "cs_CZ" for current database encoding "UTF8" does not exist
> can be there some more friendly message or hint ?

I hope Postgres automatically detects the omitted encoding
because it knows the database encoding is UTF8.

--
Itagaki Takahiro

From:	Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To:	Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>
Cc:	Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Per-column collation, work in progress
Date:	2010-09-23 08:48:05
Message-ID:	AANLkTi=LVy8_JxZG+PJ39yASQWHCF=z0E3qCndDco3pq@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

2010/9/23 Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>:
> On Thu, Sep 23, 2010 at 5:12 PM, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> wrote:
>> 3. postgres=# select to_char(current_date,'tmday') collate "cs_CZ.utf8";
>> to_char
>> ──────────
>> thursday -- bad result
>> (1 row)
>
> COLLATE means "collation" rather than "locale", no?

ok.

>
>> 5.
>> postgres=# create table xy(a text, b text collate "cs_CZ");
>> ERROR: collation "cs_CZ" for current database encoding "UTF8" does not exist
>> can be there some more friendly message or hint ?
>
> I hope Postgres automatically detects the omitted encoding
> because it knows the database encoding is UTF8.

I know what this issue means, but it needs some detail or hint I think

Regards

Pavel

>
> --
> Itagaki Takahiro
>

From:	Peter Eisentraut <peter_e(at)gmx(dot)net>
To:	Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Per-column collation, work in progress
Date:	2010-09-23 09:03:03
Message-ID:	1285232583.27917.11.camel@vanquo.pezone.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On tor, 2010-09-23 at 10:12 +0200, Pavel Stehule wrote:
> 1. It's doesn't work with SQL 92 rules for sortby list. I can
> understand so explicit COLLATE using doesn't work, but the implicit
> using doesn't work too:
>
> CREATE TABLE foo(a text, b text COLLATE "cs_CZ.UTF8")
>
> SELECT * FROM foo ORDER BY 1 -- produce wrong order

I can't reproduce that. Please provide more details.

> 2. Why default encoding for collate is static? There are latin2 for
> czech, cs_CZ and cs_CZ.iso88592. So any user with UTF8 has to write
> encoding explicitly. But the more used and preferred encoding is UTF8
> now. I am thinking so cs_CZ on utf8 database should mean cs_CS.UTF8.

That's tweakable. One idea I had is to strip the ".utf8" suffix from
locale names when populating the pg_collation catalog, or create both
versions. I agree that the current way is a bit cumbersome.

> 3. postgres=# select to_char(current_date,'tmday') collate "cs_CZ.utf8";
> to_char
> ──────────
> thursday -- bad result
> (1 row)

As was already pointed out, collation only covers lc_collate and
lc_ctype. (It could cover other things, for example an application to
the money type was briefly discussed, but that's outside the current
mandate.)

As a point of order, what you wrote above attaches a collation to the
result of the function call. To get the collation to apply to the
function call itself, you have to put the collate clause on one of the
arguments, e.g.,

select to_char(current_date,'tmday' collate "cs_CZ.utf8");

> 4. is somewhere ToDo for collation implementation?

At the moment it's mostly in the source code. I have a list of notes
locally that I can clean up and put in the wiki once we agree on the
general direction.

> 5.
>
> postgres=# create table xy(a text, b text collate "cs_CZ");
> ERROR: collation "cs_CZ" for current database encoding "UTF8" does not exist
>
> can be there some more friendly message or hint ? like "you cannot to
> use a different encoding". This collate is in pg_collates table.

That can surely be polished.

From:	Peter Eisentraut <peter_e(at)gmx(dot)net>
To:	Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>
Cc:	Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Per-column collation, work in progress
Date:	2010-09-23 09:10:01
Message-ID:	1285233001.27917.14.camel@vanquo.pezone.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On tor, 2010-09-23 at 17:29 +0900, Itagaki Takahiro wrote:
> On Thu, Sep 23, 2010 at 5:12 PM, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> wrote:
> > 5.
> > postgres=# create table xy(a text, b text collate "cs_CZ");
> > ERROR: collation "cs_CZ" for current database encoding "UTF8" does not exist
> > can be there some more friendly message or hint ?
>
> I hope Postgres automatically detects the omitted encoding
> because it knows the database encoding is UTF8.

I would rather not build too many expectations into this yet. The
collation names are chosen by the user, the locale names are from the
operating system. There is not necessarily a correspondence. The best
fix is probably what I described earlier, populate the pg_collation
table with the ".utf8" suffix stripped.

From:	Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To:	Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Per-column collation, work in progress
Date:	2010-09-23 09:55:21
Message-ID:	AANLkTinFt2U0NibM0UX=Pw-bSTTCpUp2o9pH9NdPj_+m@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

2010/9/23 Peter Eisentraut <peter_e(at)gmx(dot)net>:
> On tor, 2010-09-23 at 10:12 +0200, Pavel Stehule wrote:
>> 1. It's doesn't work with SQL 92 rules for sortby list. I can
>> understand so explicit COLLATE using doesn't work, but the implicit
>> using doesn't work too:
>>
>> CREATE TABLE foo(a text, b text COLLATE "cs_CZ.UTF8")
>>
>> SELECT * FROM foo ORDER BY 1 -- produce wrong order
>
> I can't reproduce that. Please provide more details.

sorry, it is ok - I was confused

>
>> 2. Why default encoding for collate is static? There are latin2 for
>> czech, cs_CZ and cs_CZ.iso88592. So any user with UTF8 has to write
>> encoding explicitly. But the more used and preferred encoding is UTF8
>> now. I am thinking so cs_CZ on utf8 database should mean cs_CS.UTF8.
>
> That's tweakable. One idea I had is to strip the ".utf8" suffix from
> locale names when populating the pg_collation catalog, or create both
> versions. I agree that the current way is a bit cumbersome.
>

yes. now almost all databases are in utf8

>> 3. postgres=# select to_char(current_date,'tmday') collate "cs_CZ.utf8";
>> to_char
>> ──────────
>> thursday -- bad result
>> (1 row)
>
> As was already pointed out, collation only covers lc_collate and
> lc_ctype. (It could cover other things, for example an application to
> the money type was briefly discussed, but that's outside the current
> mandate.)
>
ook

> As a point of order, what you wrote above attaches a collation to the
> result of the function call. To get the collation to apply to the
> function call itself, you have to put the collate clause on one of the
> arguments, e.g.,
>
> select to_char(current_date,'tmday' collate "cs_CZ.utf8");

I am thinking, collates can be used for this purpose too. I see some
impacts - this syntax changes a stable function to immutable and it
cannot be simple to solve.

>
>> 4. is somewhere ToDo for collation implementation?
>
> At the moment it's mostly in the source code. I have a list of notes
> locally that I can clean up and put in the wiki once we agree on the
> general direction.
>
>> 5.
>>
>> postgres=# create table xy(a text, b text collate "cs_CZ");
>> ERROR: collation "cs_CZ" for current database encoding "UTF8" does not exist
>>
>> can be there some more friendly message or hint ? like "you cannot to
>> use a different encoding". This collate is in pg_collates table.
>
> That can surely be polished.
>
>

Regards

Pavel Stehule

From:	Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>
To:	Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Per-column collation, work in progress
Date:	2010-09-24 00:32:41
Message-ID:	AANLkTikBjMBb70p=6+u2Fkvf=SysdEg60FQE=fZdLY3h@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Sep 22, 2010 at 10:29 PM, Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
>> We could support it also on MSVC.
>> http://msdn.microsoft.com/en-us/library/a7cwbx4t(v=VS.90).aspx -- _strcoll_l
>> http://msdn.microsoft.com/en-us/library/45119yx3(v=VS.90).aspx -- _towupper_l
>
> Great.

If we support both glibc and msvc, how to we handle CREATE TABLE DDLs
in pg_dump? Since collation names depend on platforms, a backup dumped
at UNIX cannot be reloaded to Windows. We might need to normalize
locale names to generate a portable dump.

--
Itagaki Takahiro

From:	Peter Eisentraut <peter_e(at)gmx(dot)net>
To:	Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Per-column collation, work in progress
Date:	2010-09-24 05:57:55
Message-ID:	1285307875.27917.26.camel@vanquo.pezone.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On fre, 2010-09-24 at 09:32 +0900, Itagaki Takahiro wrote:
> On Wed, Sep 22, 2010 at 10:29 PM, Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
> >> We could support it also on MSVC.
> >> http://msdn.microsoft.com/en-us/library/a7cwbx4t(v=VS.90).aspx -- _strcoll_l
> >> http://msdn.microsoft.com/en-us/library/45119yx3(v=VS.90).aspx -- _towupper_l
> >
> > Great.
>
> If we support both glibc and msvc, how to we handle CREATE TABLE DDLs
> in pg_dump? Since collation names depend on platforms, a backup dumped
> at UNIX cannot be reloaded to Windows. We might need to normalize
> locale names to generate a portable dump.

It is not necessary that the SQL collation names are the same as the OS
locale names. That is just a convenient way to initialize it. If you
need to transport dumps, you can create your own SQL collation entry at
the target that locally fits what you are trying to do (or perhaps
rename the collation at the source).

I don't think there is a universally applicable way to "normalize"
locale names, because we want to support user-defined OS locales.

From:	Peter Eisentraut <peter_e(at)gmx(dot)net>
To:	Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Per-column collation, work in progress
Date:	2010-09-24 06:00:59
Message-ID:	1285308059.27917.29.camel@vanquo.pezone.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On tor, 2010-09-23 at 11:55 +0200, Pavel Stehule wrote:
> > select to_char(current_date,'tmday' collate "cs_CZ.utf8");
>
> I am thinking, collates can be used for this purpose too. I see some
> impacts - this syntax changes a stable function to immutable and it
> cannot be simple to solve.

I don't understand how you come to that conclusion.

From:	Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To:	Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Per-column collation, work in progress
Date:	2010-09-24 06:39:53
Message-ID:	AANLkTikKF7tEvHbwfi123bX7KAcU7uhUTZgP1QrTPTjN@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

2010/9/24 Peter Eisentraut <peter_e(at)gmx(dot)net>:
> On tor, 2010-09-23 at 11:55 +0200, Pavel Stehule wrote:
>> > select to_char(current_date,'tmday' collate "cs_CZ.utf8");
>>
>> I am thinking, collates can be used for this purpose too. I see some
>> impacts - this syntax changes a stable function to immutable and it
>> cannot be simple to solve.
>
> I don't understand how you come to that conclusion.

sorry, I was wrong - it has sense for date output function. to_char is
immutable everywhere

Pavel

>
>
>

From:	Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To:	Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Per-column collation, work in progress
Date:	2010-09-26 12:15:25
Message-ID:	AANLkTi=2=RgAkBv9qZiucEFCgXO=dYgytxJkMdKbJtSn@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hello Peter

Is there any reason why you prohibit a different encodings per one
database? Actually people expect from collate per column a possibility
to store a two or more different encodings per one database. Without
this possibility - only UTF8 is possible for practical work - and for
other encodings only pairs (national locale + C). Yes - it is from my
perspective (as Czech programmer) - very typical situation and request
is mix latin2 and latin1. I can live with limit, but it is very hard
limit and should be documented.

Regards

Pavel

From:	Greg Stark <gsstark(at)mit(dot)edu>
To:	Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc:	Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Per-column collation, work in progress
Date:	2010-09-26 13:37:02
Message-ID:	AANLkTimb3+_E7=3o9u6_GEV7V3w=FmRMroSgw60SNWrJ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, Sep 26, 2010 at 1:15 PM, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> wrote:
> Is there any reason why you prohibit a different encodings per one
> database? Actually people expect from collate per column a possibility
> to store a two or more different encodings per one database.

These are two completely separate problems that only look related. The
main difference is that while collation is a property of the
comparison or sort you're performing encoding is actually a property
of the string itself. It doesn't make sense to specify a different
encoding than what the string actually contains.

You could actually do what you want now by using bytea columns and
convert_to/convert_from and it wouldn't be much easier if the support
were built into text since you would still have to keep track of the
encoding it's in and the encoding you want. We could have a
encoded_text data type which includes both the encoding and the string
and which any comparison function automatically handles conversion
based on the encoding of the collation requested -- but I wouldn't
want that to be the default text datatype. It would impose a lot of
overhead on the basic text operations and magnify the effects of
choosing the wrong collation.

--
greg

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	Greg Stark <gsstark(at)mit(dot)edu>
Cc:	Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Per-column collation, work in progress
Date:	2010-09-26 13:54:29
Message-ID:	4C9F5095.6020904@dunslane.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 09/26/2010 09:37 AM, Greg Stark wrote:
> We could have a
> encoded_text data type which includes both the encoding and the string
> and which any comparison function automatically handles conversion
> based on the encoding of the collation requested -- but I wouldn't
> want that to be the default text datatype. It would impose a lot of
> overhead on the basic text operations and magnify the effects of
> choosing the wrong collation.

Yeah, but it would be a nice gadget to have.

cheers

andrew

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc:	Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Per-column collation, work in progress
Date:	2010-10-13 23:15:18
Message-ID:	AANLkTimd4wFrdeB=UwHPdQvj3L-aGjgr5XWUuuAynUmd@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Sep 24, 2010 at 1:57 AM, Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
> On fre, 2010-09-24 at 09:32 +0900, Itagaki Takahiro wrote:
>> On Wed, Sep 22, 2010 at 10:29 PM, Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
>> >> We could support it also on MSVC.
>> >> http://msdn.microsoft.com/en-us/library/a7cwbx4t(v=VS.90).aspx -- _strcoll_l
>> >> http://msdn.microsoft.com/en-us/library/45119yx3(v=VS.90).aspx -- _towupper_l
>> >
>> > Great.
>>
>> If we support both glibc and msvc, how to we handle CREATE TABLE DDLs
>> in pg_dump? Since collation names depend on platforms, a backup dumped
>> at UNIX cannot be reloaded to Windows. We might need to normalize
>> locale names to generate a portable dump.
>
> It is not necessary that the SQL collation names are the same as the OS
> locale names. That is just a convenient way to initialize it. If you
> need to transport dumps, you can create your own SQL collation entry at
> the target that locally fits what you are trying to do (or perhaps
> rename the collation at the source).
>
> I don't think there is a universally applicable way to "normalize"
> locale names, because we want to support user-defined OS locales.

What's the status of this patch?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Peter Eisentraut <peter_e(at)gmx(dot)net>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Per-column collation, work in progress
Date:	2010-10-14 16:53:50
Message-ID:	1287075230.9532.6.camel@vanquo.pezone.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On ons, 2010-10-13 at 19:15 -0400, Robert Haas wrote:
> What's the status of this patch?

I would appreciate some more review of the basic architecture.

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc:	Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Per-column collation, work in progress
Date:	2010-10-15 02:54:40
Message-ID:	AANLkTikzfXX1jt4Noyi_PEQdE=vUvvnPhjW+hdMeg9xY@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Oct 14, 2010 at 12:53 PM, Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
> On ons, 2010-10-13 at 19:15 -0400, Robert Haas wrote:
>> What's the status of this patch?
>
> I would appreciate some more review of the basic architecture.

Wow, what a patch. On the whole, I think this looks pretty good. Of
course, the obvious thing to be dismayed about is how many parts of
the system this touches. To some extent, that's probably inevitable
and maybe not that bad, but I wonder if there is some preparatory
refactoring that could be done to trim it down a bit. I notice, for
example, that a lot of places that looked at <asc/desc, nulls
first/last> now look at <asc/desc, nulls first/last, collation>. In
particular, all the pathkey stuff is like this. And similarly places
that used to care about <type, typmod> now have to care about <type,
tymod, collation>. There might be ways to restructure some of this
code so that these things can be changed without having to touch quite
so many places. If we're extending these lists from two items to
three, do we need to worry about what happens when they grow to four
or five or six? I particularly think this is in issue for the type
information. We are still finding bugs where typemod isn't carried
through properly; this kind of thing is only going to make it much
worse. We need to encapsulate it in some future-proof way.

It seems you've falsified the header comment in
pathkeys_useful_for_merging(), although I guess it's already false
because it doesn't seem to have been updated for the NULLS ASC/DESC
stuff, and the interior comment in right_merge_direction() also needs
adjusting. But this might be more than a documentation problem,
because the choice of merge direction really *is* arbitrary in the
case of ASC/DESC and NULLS FIRST/LAST, but I'm not sure whether that's
actually true for collation. If collation affects the definition of
equality then it certainly isn't true.

It looks like you've define collations as objects that exist within
particular namespaces, but there's no CREATE COLLATION statement, so I
don't see what purpose this serves. I suppose we could leave that to
be added later, but is there actually a use case for having collations
in individual schemas, or should we treat them more like we do casts -
i.e. as database-global objects?

Why does the executor ever need to see collate clauses?

In the department of minor nits, the use of the word "respectively" in
the CREATE INDEX documentation doesn't make sense to me. The message
about "has a collation conflict" is not self-explanatory.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Peter Eisentraut <peter_e(at)gmx(dot)net>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Per-column collation, work in progress
Date:	2010-10-21 18:44:36
Message-ID:	1287686676.29719.7.camel@vanquo.pezone.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On tor, 2010-10-14 at 22:54 -0400, Robert Haas wrote:
> and maybe not that bad, but I wonder if there is some preparatory
> refactoring that could be done to trim it down a bit. I notice, for
> example, that a lot of places that looked at <asc/desc, nulls
> first/last> now look at <asc/desc, nulls first/last, collation>. In
> particular, all the pathkey stuff is like this. And similarly places
> that used to care about <type, typmod> now have to care about <type,
> tymod, collation>. There might be ways to restructure some of this
> code so that these things can be changed without having to touch quite
> so many places.

Yeah, I think that's what I'll try to do next.

We already have TypeName as a structure that contains type and typmod
(and collation, in my patch). We could make that a primnode instead of
a parsenode, and use it in more places, or we could make a new leaner
structure that only contains the numeric info.

We could then, for example, change things like this:

typedef struct Var
{
Expr xpr;
...
Oid vartype;
int32 vartypmod;
...
}

into this

typedef struct Var
{
Expr xpr;
...
TypeName/TypeFoo vartype;
...
}

This would save boatloads of duplicate code.

> It looks like you've define collations as objects that exist within
> particular namespaces, but there's no CREATE COLLATION statement, so I
> don't see what purpose this serves. I suppose we could leave that to
> be added later, but is there actually a use case for having collations
> in individual schemas, or should we treat them more like we do casts -
> i.e. as database-global objects?

The SQL standard defines it that way, and there should be a CREATE
COLLATION statement later. Application-specific collation sequences
might not be unreasonable in the future.

> Why does the executor ever need to see collate clauses?

Hmm, maybe not. I think it did in an earlier working draft.

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc:	Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Per-column collation, work in progress
Date:	2010-10-21 19:06:33
Message-ID:	AANLkTinLaCHcgRB5_KF_B_GAtX11+68k=eVBiNR9Fq7g@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Oct 21, 2010 at 2:44 PM, Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
> On tor, 2010-10-14 at 22:54 -0400, Robert Haas wrote:
>> and maybe not that bad, but I wonder if there is some preparatory
>> refactoring that could be done to trim it down a bit. I notice, for
>> example, that a lot of places that looked at <asc/desc, nulls
>> first/last> now look at <asc/desc, nulls first/last, collation>. In
>> particular, all the pathkey stuff is like this. And similarly places
>> that used to care about <type, typmod> now have to care about <type,
>> tymod, collation>. There might be ways to restructure some of this
>> code so that these things can be changed without having to touch quite
>> so many places.
>
> Yeah, I think that's what I'll try to do next.
>
> We already have TypeName as a structure that contains type and typmod
> (and collation, in my patch). We could make that a primnode instead of
> a parsenode, and use it in more places, or we could make a new leaner
> structure that only contains the numeric info.
>
> We could then, for example, change things like this:
>
> typedef struct Var
> {
> Expr xpr;
> ...
> Oid vartype;
> int32 vartypmod;
> ...
> }
>
> into this
>
> typedef struct Var
> {
> Expr xpr;
> ...
> TypeName/TypeFoo vartype;
> ...
> }
>
> This would save boatloads of duplicate code.

I think that the idea of having a node that represents "a type in all
its glory" is a very good one. I'm somewhat inclined not to reuse
TypeName, because I think we'll end up wanting to use this in places
where "names" and "location" are not available. In fact, judging by
some of the logic in LookupTypeNames(), we have some cases like that
already, which might be worth trying to clean up.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Per-column collation, work in progress
Date:	2010-10-21 20:28:52
Message-ID:	5685.1287692932@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Peter Eisentraut <peter_e(at)gmx(dot)net> writes:
> We already have TypeName as a structure that contains type and typmod
> (and collation, in my patch). We could make that a primnode instead of
> a parsenode, and use it in more places, or we could make a new leaner
> structure that only contains the numeric info.

TypeName per se is completely inappropriate for use beyond the first
stage of parsing, because it requires catalog lookups to make any sense
of. I think the post-parsing representation should still start with a
type OID. I can agree with replacing typmod with a struct, though.

regards, tom lane

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Peter Eisentraut <peter_e(at)gmx(dot)net>, Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Per-column collation, work in progress
Date:	2010-10-21 20:31:50
Message-ID:	AANLkTimozUaHHB3-q_4m3C3NgaWijTJkYm8aS+sqvq3q@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Oct 21, 2010 at 4:28 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Peter Eisentraut <peter_e(at)gmx(dot)net> writes:
>> We already have TypeName as a structure that contains type and typmod
>> (and collation, in my patch). We could make that a primnode instead of
>> a parsenode, and use it in more places, or we could make a new leaner
>> structure that only contains the numeric info.
>
> TypeName per se is completely inappropriate for use beyond the first
> stage of parsing, because it requires catalog lookups to make any sense
> of. I think the post-parsing representation should still start with a
> type OID. I can agree with replacing typmod with a struct, though.

I think we should have both the type OID and the typmod in the struct.
Carrying the type OID separately from the typmod has caused us enough
heartache already. No?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Peter Eisentraut <peter_e(at)gmx(dot)net>, Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Per-column collation, work in progress
Date:	2010-10-21 20:39:43
Message-ID:	5956.1287693583@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Thu, Oct 21, 2010 at 4:28 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> TypeName per se is completely inappropriate for use beyond the first
>> stage of parsing, because it requires catalog lookups to make any sense
>> of. I think the post-parsing representation should still start with a
>> type OID. I can agree with replacing typmod with a struct, though.

> I think we should have both the type OID and the typmod in the struct.
> Carrying the type OID separately from the typmod has caused us enough
> heartache already. No?

I think that that would probably involve a whole lot more notational
busywork than just replacing typmod with something more complicated.
However, we're talking about breaking vast amounts of code in either
case, so maybe making it even vaster isn't a real consideration.

regards, tom lane

From:	Josh Berkus <josh(at)agliodbs(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Per-column collation, work in progress
Date:	2010-10-21 20:44:33
Message-ID:	4CC0A631.1020708@agliodbs.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

> I think that that would probably involve a whole lot more notational
> busywork than just replacing typmod with something more complicated.
> However, we're talking about breaking vast amounts of code in either
> case, so maybe making it even vaster isn't a real consideration.

Gods, yes. Please let's not extend typemod any further without an overhaul.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

From:	Peter Eisentraut <peter_e(at)gmx(dot)net>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Per-column collation, work in progress
Date:	2010-11-24 20:37:54
Message-ID:	1290631074.23791.10.camel@vanquo.pezone.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On tor, 2010-10-14 at 22:54 -0400, Robert Haas wrote:
> It seems you've falsified the header comment in
> pathkeys_useful_for_merging(), although I guess it's already false
> because it doesn't seem to have been updated for the NULLS ASC/DESC
> stuff, and the interior comment in right_merge_direction() also needs
> adjusting. But this might be more than a documentation problem,
> because the choice of merge direction really *is* arbitrary in the
> case of ASC/DESC and NULLS FIRST/LAST, but I'm not sure whether that's
> actually true for collation. If collation affects the definition of
> equality then it certainly isn't true.

I did check that again and didn't arrive at the conclusion that the
comments would need updating either with respect to this patch or some
previous change. Could you check again and possibly provide a
suggestion?

From:	Peter Eisentraut <peter_e(at)gmx(dot)net>
To:	Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Per-column collation, work in progress
Date:	2010-11-24 20:42:00
Message-ID:	1290631320.23791.11.camel@vanquo.pezone.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On ons, 2010-09-22 at 19:44 +0900, Itagaki Takahiro wrote:
> * CREATE TABLE (LIKE table_with_collation) doesn't inherit collations.

This was fixed in the CF2010-11 patch.

> * psql \d needs a separator between collate and not null modifiers.

And this as well.

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc:	Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Per-column collation, work in progress
Date:	2010-11-25 03:57:31
Message-ID:	AANLkTik9wdgNvhEJfstc2gboza5KbMHq+WcA8tPCuGXj@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Nov 24, 2010 at 3:37 PM, Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
> On tor, 2010-10-14 at 22:54 -0400, Robert Haas wrote:
>> It seems you've falsified the header comment in
>> pathkeys_useful_for_merging(), although I guess it's already false
>> because it doesn't seem to have been updated for the NULLS ASC/DESC
>> stuff, and the interior comment in right_merge_direction() also needs
>> adjusting. But this might be more than a documentation problem,
>> because the choice of merge direction really *is* arbitrary in the
>> case of ASC/DESC and NULLS FIRST/LAST, but I'm not sure whether that's
>> actually true for collation. If collation affects the definition of
>> equality then it certainly isn't true.
>
> I did check that again and didn't arrive at the conclusion that the
> comments would need updating either with respect to this patch or some
> previous change. Could you check again and possibly provide a
> suggestion?

I think that you are right and that my previous comment was erroneous.
Sorry for the noise.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company