multibyte-character aware support for function "downcase_truncate_identifier()"

Lists: pgsql-hackers
From: Rajanikant Chirmade <rajanikant(dot)chirmade(at)enterprisedb(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: multibyte-character aware support for function "downcase_truncate_identifier()"
Date: 2010-07-07 04:35:29
Message-ID: AANLkTinsNCP3NA7TqomnPKU2OgvS7TW1B1pf1JtY50fv@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi All,

Every identifier is downcase & truncated by function
"downcase_truncate_identifier()"
before using it.

But since the function "downcase_truncate_identifier()" is not
multibyte-charecter aware,
it is not able to downcase some of special charecters in identifier like
"my_SchemÄ".

If schema is created of name "my_SchemÄ", pg_namespace shows entries as
"my_schemÄ" .

Example is as below :

postgres=# create schema my_SchemÄ;
CREATE SCHEMA
postgres=# select nspname from pg_namespace;
nspname
--------------------
pg_toast
pg_temp_1
pg_toast_temp_1
pg_catalog
public
information_schema
my_schemÄ
(7 rows)

postgres=#

Achually it should downcase as "my_schemä" as per multibyte-character aware
as lower()
works :

postgres=# select lower('my_SchemÄ');
lower
-----------
my_schemä
(1 row)

There is function str_tolower() which work as multibyte-character aware.
Need to use same function where ever downcase required. So, it will create
uniform down-casing at all places.

two places identified where need to add wide-character aware downcase :

1. downcase_truncate_identifier();
- Attaching patch for changes and small test case.

Following functions should also synchronise with
"downcase_truncate_identifier()" :

2. pg_strcasecmp();
3. pg_strncasecmp();

- to add fix at these functions (2,3) need to move str_tolower() from
formatting.c from backend to some common location (may be in src/port) from
where these can be used with client as well as server.

Thanks & Regards,
Rajanikant Chirmade.

Attachment Content-Type Size
fix.patch text/x-diff 1.3 KB
wide-charecter_aware_downcase.sql text/x-sql 312 bytes
wide-charecter_aware_downcase.out application/octet-stream 340 bytes

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Rajanikant Chirmade <rajanikant(dot)chirmade(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: multibyte-character aware support for function "downcase_truncate_identifier()"
Date: 2010-07-07 14:07:04
Message-ID: 5634.1278511624@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Rajanikant Chirmade <rajanikant(dot)chirmade(at)enterprisedb(dot)com> writes:
> Every identifier is downcase & truncated by function
> "downcase_truncate_identifier()"
> before using it.

> But since the function "downcase_truncate_identifier()" is not
> multibyte-charecter aware,
> it is not able to downcase some of special charecters in identifier like
> "my_Schem".

IIRC this is intentional. Please consult the archives for previous
discussions.

regards, tom lane


From: Rajanikant Chirmade <rajanikant(dot)chirmade(at)enterprisedb(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: multibyte-character aware support for function "downcase_truncate_identifier()"
Date: 2010-07-13 07:10:35
Message-ID: AANLkTinIEVpzizvdbisCByBb9b-ZMluj_p7FHmop8V8U@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Jul 7, 2010 at 7:37 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Rajanikant Chirmade <rajanikant(dot)chirmade(at)enterprisedb(dot)com> writes:
> > Every identifier is downcase & truncated by function
> > "downcase_truncate_identifier()"
> > before using it.
>
> > But since the function "downcase_truncate_identifier()" is not
> > multibyte-charecter aware,
> > it is not able to downcase some of special charecters in identifier like
> > "my_SchemÄ".
>
>

> IIRC this is intentional. Please consult the archives for previous
> discussions.
>
> regards, tom lane
>

I got one discussion thread on same issue. But it stopped without any
conclusion.

http://archives.postgresql.org/pgsql-bugs/2006-09/msg00128.php

Thanks & Regards,
Rajanikant Chirmade.


From: Rajanikant Chirmade <rajanikant(dot)chirmade(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: multibyte-character aware support for function "downcase_truncate_identifier()"
Date: 2010-07-26 04:40:45
Message-ID: AANLkTi=DUaK18c5evT99X0=+FPyhyR-3GUeHnj2wZVcj@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Since discussion stopped in discussion thread

http://archives.postgresql.org/pgsql-bugs/2006-09/msg00128.php

Are there any implications of this change in handling identifiers ?

Thanks & Regards,
Rajanikant Chirmade

On Tue, Jul 13, 2010 at 12:10 AM, Rajanikant Chirmade <
rajanikant(dot)chirmade(at)enterprisedb(dot)com> wrote:

>
>
> On Wed, Jul 7, 2010 at 7:37 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
>> Rajanikant Chirmade <rajanikant(dot)chirmade(at)enterprisedb(dot)com> writes:
>> > Every identifier is downcase & truncated by function
>> > "downcase_truncate_identifier()"
>> > before using it.
>>
>> > But since the function "downcase_truncate_identifier()" is not
>> > multibyte-charecter aware,
>> > it is not able to downcase some of special charecters in identifier like
>> > "my_SchemÄ".
>>
>>
>
>
>
>
>> IIRC this is intentional. Please consult the archives for previous
>> discussions.
>>
>> regards, tom lane
>>
>
>
>
> I got one discussion thread on same issue. But it stopped without any
> conclusion.
>
> http://archives.postgresql.org/pgsql-bugs/2006-09/msg00128.php
>
> Thanks & Regards,
> Rajanikant Chirmade.
>
>
>
>


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Rajanikant Chirmade <rajanikant(dot)chirmade(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: multibyte-character aware support for function "downcase_truncate_identifier()"
Date: 2010-07-26 12:20:00
Message-ID: AANLkTimEZfcfFS4ANKpaxK05xP+XN36ecws-B4pux11Z@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jul 26, 2010 at 12:40 AM, Rajanikant Chirmade
<rajanikant(dot)chirmade(at)enterprisedb(dot)com> wrote:
> Since discussion stopped in discussion thread
>
> http://archives.postgresql.org/pgsql-bugs/2006-09/msg00128.php
>
> Are there any implications of this change in handling identifiers ?
>
> Thanks & Regards,
> Rajanikant Chirmade

An even more relevant message appears to be this one:

http://archives.postgresql.org/pgsql-bugs/2006-09/msg00133.php

Both this and the comment in downcase_truncate_identifier() suggests
that the current method is attributable to lack of support for
Unicode-aware case normalization and is known not to work correctly in
all locales. Locale and encoding stuff isn't really my area of
expertise, but if now have support for Unicode-aware case
normalization, shouldn't we be using it here, too?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Rajanikant Chirmade <rajanikant(dot)chirmade(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: multibyte-character aware support for function "downcase_truncate_identifier()"
Date: 2010-11-21 21:14:48
Message-ID: AANLkTikG91524vucsG0NxaqgsSsZJw5NzG97V+jFq5_8@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Jul 7, 2010 at 10:07 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Rajanikant Chirmade <rajanikant(dot)chirmade(at)enterprisedb(dot)com> writes:
>> Every identifier is downcase & truncated by function
>> "downcase_truncate_identifier()"
>> before using it.
>
>> But since the function "downcase_truncate_identifier()" is not
>> multibyte-charecter aware,
>> it is not able to downcase some of special charecters in identifier like
>> "my_SchemÄ".
>
> IIRC this is intentional.  Please consult the archives for previous
> discussions.

Why would this be intentional?

One concern I have about this approach is that I am guessing that the
current implementation of str_tolower() is a lot slower than the
current implementation of downcase_truncate_identifier(). It would be
nice to have an implementation that is capable of handling wide
characters but doesn't actually incur the speed penalty unless a wide
character is actually present.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Rajanikant Chirmade <rajanikant(dot)chirmade(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: multibyte-character aware support for function "downcase_truncate_identifier()"
Date: 2010-11-21 21:41:35
Message-ID: 26799.1290375695@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Wed, Jul 7, 2010 at 10:07 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> IIRC this is intentional. Please consult the archives for previous
>> discussions.

> Why would this be intentional?

Well, it's intentional for lack of any infrastructure that would allow
a more spec-compliant approach. As you say, calling str_tolower here
is probably a non-starter for performance reasons. Another big problem
is that str_tolower produces a locale-specific downcasing conversion.
This (a) is going to create portability headaches of the first magnitude,
and (b) is not really an advance in terms of spec compliance. The SQL
spec says that identifier case folding should be done according to the
Unicode standard, but it's not safe to assume that any random
platform-specific locale is going to act that way. A specific example
of a locale that is known to NOT behave acceptably is Turkish: they have
weird ideas about i versus I, which in fact broke things back when we
used to use tolower for this purpose. See the archives from early 2004,
and in particular commit 59f9a0b9df0d224bb62ff8ec5b65e0b187655742, which
removed the exact same logic (though not wide-character-aware) that this
patch proposes to put back.

I think the given patch can be rejected out of hand. If the OP has any
ideas about doing non-locale-dependent case folding at an acceptable
speed, I'm happy to listen.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Rajanikant Chirmade <rajanikant(dot)chirmade(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: multibyte-character aware support for function "downcase_truncate_identifier()"
Date: 2010-11-21 23:09:14
Message-ID: AANLkTikweY9M4vfR0KmKwZiit-w8siSgsSk3x6iuj8Rz@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Nov 21, 2010 at 4:41 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Wed, Jul 7, 2010 at 10:07 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> IIRC this is intentional.  Please consult the archives for previous
>>> discussions.
>
>> Why would this be intentional?
>
> Well, it's intentional for lack of any infrastructure that would allow
> a more spec-compliant approach.  As you say, calling str_tolower here
> is probably a non-starter for performance reasons.  Another big problem
> is that str_tolower produces a locale-specific downcasing conversion.
> This (a) is going to create portability headaches of the first magnitude,
> and (b) is not really an advance in terms of spec compliance.  The SQL
> spec says that identifier case folding should be done according to the
> Unicode standard, but it's not safe to assume that any random
> platform-specific locale is going to act that way.  A specific example
> of a locale that is known to NOT behave acceptably is Turkish: they have
> weird ideas about i versus I, which in fact broke things back when we
> used to use tolower for this purpose.  See the archives from early 2004,
> and in particular commit 59f9a0b9df0d224bb62ff8ec5b65e0b187655742, which
> removed the exact same logic (though not wide-character-aware) that this
> patch proposes to put back.
>
> I think the given patch can be rejected out of hand.  If the OP has any
> ideas about doing non-locale-dependent case folding at an acceptable
> speed, I'm happy to listen.

I think that's fair. It actually doesn't seem like it should be that
hard if we knew that the server encoding were UTF8 - it's just a big
translation table somewhere, no? We use heuristics to copy as many
characters as possible without detailed examination and consult the
lookup table for the rest. However, that's not very practical in the
face of more than one encoding that must be handled. What sort of
infrastructure would actually be useful for dealing with this problem?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Rajanikant Chirmade <rajanikant(dot)chirmade(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: multibyte-character aware support for function "downcase_truncate_identifier()"
Date: 2010-11-21 23:22:31
Message-ID: 4CE9A9B7.1080707@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11/21/2010 06:09 PM, Robert Haas wrote:
> I think that's fair. It actually doesn't seem like it should be that
> hard if we knew that the server encoding were UTF8 - it's just a big
> translation table somewhere, no?

No, it's far more complex. See for example
<http://unicode.org/reports/tr21/tr21-3.html>, which says:

There are a number of complications to case mappings that occur once
the repertoire of characters is expanded beyond ASCII.

* Because of the inclusion of certain composite characters for
compatibility, such as 01F1 "DZ" /capital dz/, there is a
third case, called /titlecase/, which is used where the first
letter of a word is to be capitalized (e.g. Titlecase, vs.
UPPERCASE, or lowercase).
o For example, the title case of the example character is
01F2 "Dz" /capital d with small z/.
* Case mappings may produce strings of different length than the
original.
o For example, the German character 00DF "ß" /small letter
sharp s/ expands when uppercased to the sequence of two
characters "SS". This also occurs where there is no
precomposed character corresponding to a case mapping,
such as with 0149 "'n" /latin small letter n preceded by
apostrophe./
* Characters may also have different case mappings, depending on
the context.
o For example, 03A3 "?" /capital sigma/ lowercases to 03C3
"?" /small sigma/ if it is followed by another letter,
but lowercases to 03C2 "?" /small final sigma/ if it is not.
* Characters may have case mappings that depend on the locale.
o For example, in Turkish the letter 0049 "I" /capital
letter i/ lowercases to 0131 "?" /small dotless i/.
* Case mappings are not, in general, reversible.
o For example, once the string "McGowan" has been
uppercased, lowercased or titlecased, the original
cannot be recovered by applying another uppercase,
lowercase, or titlecase operation.

cheers

andrew


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Rajanikant Chirmade <rajanikant(dot)chirmade(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: multibyte-character aware support for function "downcase_truncate_identifier()"
Date: 2010-11-21 23:24:30
Message-ID: AANLkTik-6ThpaSrnbXvkjXfJfmtTCm3RadYqTuf-q_sp@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Nov 21, 2010 at 6:22 PM, Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:
>
>
> On 11/21/2010 06:09 PM, Robert Haas wrote:
>
> I think that's fair. It actually doesn't seem like it should be that
> hard if we knew that the server encoding were UTF8 - it's just a big
> translation table somewhere, no?
>
> No, it's far more complex. See for example
> <http://unicode.org/reports/tr21/tr21-3.html>, which says:
>
> There are a number of complications to case mappings that occur once the
> repertoire of characters is expanded beyond ASCII.
>
> Because of the inclusion of certain composite characters for compatibility,
> such as 01F1 "DZ" capital dz, there is a third case, called titlecase, which
> is used where the first letter of a word is to be capitalized (e.g.
> Titlecase, vs. UPPERCASE, or lowercase).
>
> For example, the title case of the example character is 01F2 "Dz" capital d
> with small z.
>
> Case mappings may produce strings of different length than the original.
>
> For example, the German character 00DF "ß" small letter sharp s expands when
> uppercased to the sequence of two characters "SS". This also occurs where
> there is no precomposed character corresponding to a case mapping, such as
> with 0149 "ʼn" latin small letter n preceded by apostrophe.
>
> Characters may also have different case mappings, depending on the context.
>
> For example, 03A3 "Σ" capital sigma lowercases to 03C3 "σ" small sigma if it
> is followed by another letter, but lowercases to 03C2 "ς" small final sigma
> if it is not.
>
> Characters may have case mappings that depend on the locale.
>
> For example, in Turkish the letter 0049 "I" capital letter i lowercases to
> 0131 "ı" small dotless i.
>
> Case mappings are not, in general, reversible.
>
> For example, once the string "McGowan" has been uppercased, lowercased or
> titlecased, the original cannot be recovered by applying another uppercase,
> lowercase, or titlecase operation.

Yikes. So what do people do about this?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Rajanikant Chirmade <rajanikant(dot)chirmade(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: multibyte-character aware support for function "downcase_truncate_identifier()"
Date: 2010-11-21 23:48:01
Message-ID: 29502.1290383281@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> On 11/21/2010 06:09 PM, Robert Haas wrote:
>> I think that's fair. It actually doesn't seem like it should be that
>> hard if we knew that the server encoding were UTF8 - it's just a big
>> translation table somewhere, no?

> No, it's far more complex. See for example
> <http://unicode.org/reports/tr21/tr21-3.html>, which says:

Yeah. I'm actually not sure that the SQL committee has thought very
hard about this, because the spec is worded as though they think that
"Unicode case normalization" is all they have to say to uniquely define
what to do. The Unicode guys recognize that case mapping is
locale-specific, which puts us right back at square one. But leaving
spec compliance aside, we know from bitter experience that we cannot use
a definition that lets the Turkish locale fool with the mapping of i/I.
I suspect that locale-dependent mappings of any other characters are
just as bad, we simply haven't had enough users burnt by such cases to
have an institutional memory of it. But for example do you really think
it's a good idea if pg_dump and reload into a DB with a different locale
results in changing the normalized form of SQL identifiers?

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Rajanikant Chirmade <rajanikant(dot)chirmade(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: multibyte-character aware support for function "downcase_truncate_identifier()"
Date: 2010-11-22 00:08:08
Message-ID: AANLkTiknTPwt+jfTigYCx8it7qxNLjiOku5H=2MLZxPo@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Nov 21, 2010 at 6:48 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>> On 11/21/2010 06:09 PM, Robert Haas wrote:
>>> I think that's fair.  It actually doesn't seem like it should be that
>>> hard if we knew that the server encoding were UTF8 - it's just a big
>>> translation table somewhere, no?
>
>> No, it's far more complex. See for example
>> <http://unicode.org/reports/tr21/tr21-3.html>, which says:
>
> Yeah.  I'm actually not sure that the SQL committee has thought very
> hard about this, because the spec is worded as though they think that
> "Unicode case normalization" is all they have to say to uniquely define
> what to do.  The Unicode guys recognize that case mapping is
> locale-specific, which puts us right back at square one.  But leaving
> spec compliance aside, we know from bitter experience that we cannot use
> a definition that lets the Turkish locale fool with the mapping of i/I.
> I suspect that locale-dependent mappings of any other characters are
> just as bad, we simply haven't had enough users burnt by such cases to
> have an institutional memory of it.  But for example do you really think
> it's a good idea if pg_dump and reload into a DB with a different locale
> results in changing the normalized form of SQL identifiers?

No, especially if it results in queries that used to work breaking,
which it well could. But I'm not sure where to go with it from there,
beyond throwing up my hands.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Rajanikant Chirmade <rajanikant(dot)chirmade(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: multibyte-character aware support for function "downcase_truncate_identifier()"
Date: 2010-11-22 00:38:48
Message-ID: 438.1290386328@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Sun, Nov 21, 2010 at 6:48 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> spec compliance aside, we know from bitter experience that we cannot use
>> a definition that lets the Turkish locale fool with the mapping of i/I.
>> I suspect that locale-dependent mappings of any other characters are
>> just as bad, we simply haven't had enough users burnt by such cases to
>> have an institutional memory of it. But for example do you really think
>> it's a good idea if pg_dump and reload into a DB with a different locale
>> results in changing the normalized form of SQL identifiers?

> No, especially if it results in queries that used to work breaking,
> which it well could. But I'm not sure where to go with it from there,
> beyond throwing up my hands.

Well, that's why there's been no movement on this since 2004 :-(. The
amount of work needed for a better solution seems far out of proportion
to the benefits.

regards, tom lane


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Rajanikant Chirmade <rajanikant(dot)chirmade(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: multibyte-character aware support for function "downcase_truncate_identifier()"
Date: 2010-11-23 16:14:13
Message-ID: AANLkTi=Vbk6f38acp9TL2AgHEeZ1tkjKgUoKN6Y=OdoF@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Nov 22, 2010 at 12:38 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> No, especially if it results in queries that used to work breaking,
>> which it well could.  But I'm not sure where to go with it from there,
>> beyond throwing up my hands.
>
> Well, that's why there's been no movement on this since 2004 :-(.  The
> amount of work needed for a better solution seems far out of proportion
> to the benefits.

We could extend the existing logic to handle multi-bytes characters
though, couldn't we? It's not going to fix all the problems but at
least it'll do something sane.

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Rajanikant Chirmade <rajanikant(dot)chirmade(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: multibyte-character aware support for function "downcase_truncate_identifier()"
Date: 2010-11-23 17:12:49
Message-ID: 11120.1290532369@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Stark <gsstark(at)mit(dot)edu> writes:
> On Mon, Nov 22, 2010 at 12:38 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Well, that's why there's been no movement on this since 2004 :-(. The
>> amount of work needed for a better solution seems far out of proportion
>> to the benefits.

> We could extend the existing logic to handle multi-bytes characters
> though, couldn't we? It's not going to fix all the problems but at
> least it'll do something sane.

Not easily, cheaply, or portably. The closest you could get in that
line would be to use towlower(), which doesn't exist everywhere
(though I grant probably most platforms have it by now). The much much
bigger problem though is that we don't know what character representation
towlower() deals in. We recently kluged the regex code to assume that
the wchar_t representation for UTF8 locales is the standardized Unicode
code point. I haven't heard of that breaking, but 9.0 hasn't been out
that long. In other multibyte encodings we have no idea how to use that
function, short of invoking mbstowcs/wcstombs or local equivalent, which
is expensive and doesn't readily allow a short-circuit for ASCII.

And, after you've hacked your way through all that, you still end up
with case-folding behavior that depends on the prevailing locale.
Which is dangerous for the previously cited reasons, and arguably not
spec-compliant.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Rajanikant Chirmade <rajanikant(dot)chirmade(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: multibyte-character aware support for function "downcase_truncate_identifier()"
Date: 2010-11-23 17:13:12
Message-ID: 4CEBF628.8060208@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11/23/2010 11:14 AM, Greg Stark wrote:
> On Mon, Nov 22, 2010 at 12:38 AM, Tom Lane<tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> No, especially if it results in queries that used to work breaking,
>>> which it well could. But I'm not sure where to go with it from there,
>>> beyond throwing up my hands.
>> Well, that's why there's been no movement on this since 2004 :-(. The
>> amount of work needed for a better solution seems far out of proportion
>> to the benefits.
> We could extend the existing logic to handle multi-bytes characters
> though, couldn't we? It's not going to fix all the problems but at
> least it'll do something sane.

What casing rules will you apply? How will you know what is an upper
case character and what its lower case character is? The sad, short
answer is that there are no simple rules beyond ASCII. See the URL I
posted upthread.

cheers

andrew


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Rajanikant Chirmade <rajanikant(dot)chirmade(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: multibyte-character aware support for function "downcase_truncate_identifier()"
Date: 2010-11-23 17:46:56
Message-ID: AANLkTi=r-VJOs6AjWWipbYuK98sVYxqXMKW5rh8yd29Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Nov 23, 2010 at 5:12 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> And, after you've hacked your way through all that, you still end up
> with case-folding behavior that depends on the prevailing locale.
> Which is dangerous for the previously cited reasons, and arguably not
> spec-compliant.
>

So I thought the problem with the Turkish locale definition was that
it redefined how a capital ascii character which was present in
standard SQL identifiers was lowercased. Resulting in standard SQL
syntax not working.

I'm not sure I understand the danger if a user creates an object in a
database with a particular encoding and locale using that locale for
downcasing in the future. We don't currently support changing the
locale of a database or using different locales in the future. Even
with Peter's patch I think we can reasonably require the user to
specify a single locale which controls how the SQL identifiers are
interpreted regardless of the collations used in the operations.

The points about the C API being limited and nonportable are a
different issue.I guess I would need to do research to see if we're
missing something which would help here. Otherwise I might be
beginning to see the value in that /other/ library which I've argued
against in the past.

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Rajanikant Chirmade <rajanikant(dot)chirmade(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: multibyte-character aware support for function "downcase_truncate_identifier()"
Date: 2010-11-23 17:51:55
Message-ID: 12028.1290534715@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Stark <gsstark(at)mit(dot)edu> writes:
> I'm not sure I understand the danger if a user creates an object in a
> database with a particular encoding and locale using that locale for
> downcasing in the future.

The case I was worried about is dumping from one database and reloading
into another one with a different locale. Although I suppose there are
enough *other* reasons why that might fail that adding changes of
downcasing behavior might not be a big deal.

regards, tom lane


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Rajanikant Chirmade <rajanikant(dot)chirmade(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: multibyte-character aware support for function "downcase_truncate_identifier()"
Date: 2010-11-23 18:21:43
Message-ID: AANLkTino65ZV+Pk43PPAaKQHQ7nY1W6i9yf8KMphdoma@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Nov 23, 2010 at 5:51 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> The case I was worried about is dumping from one database and reloading
> into another one with a different locale.  Although I suppose there are
> enough *other* reasons why that might fail that adding changes of
> downcasing behavior might not be a big deal.

If you dump the whole database then pg_dump would create the new
database with the correct encoding and locale. If you change it then
that can already cause it to fail if the data can't be converted to
the new encoding. And as you point out there are all kinds of ways
you can cause that to fail by making the context incompatible with the
definitions you're loading.

The lesson we learned in the past is that we have to ignore the locale
for all the characters present in the standard identifiers. Beyond
that I think this is just an implementation problem which may be a
show stopper in itself but if we can do anything with mulitbyte
characters it's probably an improvement over what we do now.

--
greg


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Rajanikant Chirmade <rajanikant(dot)chirmade(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: multibyte-character aware support for function "downcase_truncate_identifier()"
Date: 2010-11-23 19:27:42
Message-ID: 1290540462.24521.6.camel@vanquo.pezone.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On sön, 2010-11-21 at 18:48 -0500, Tom Lane wrote:
> Yeah. I'm actually not sure that the SQL committee has thought very
> hard about this, because the spec is worded as though they think that
> "Unicode case normalization" is all they have to say to uniquely
> define what to do. The Unicode guys recognize that case mapping is
> locale-specific, which puts us right back at square one. But leaving
> spec compliance aside, we know from bitter experience that we cannot
> use a definition that lets the Turkish locale fool with the mapping of
> i/I. I suspect that locale-dependent mappings of any other characters
> are just as bad, we simply haven't had enough users burnt by such
> cases to have an institutional memory of it.

The number of locale-dependent case mappings in the entire universe of
Unicode is actually limited to 7 cases for Lithuanian and 8 cases for
Turkish. (ftp://ftp.unicode.org/Public/UNIDATA/SpecialCasing.txt) So it
would be fair to say that there is a "default" case mapping, and that is
what the SQL standard presumably refers to.

One thing that we could do is let the user declare that he thinks his
current locale is consistent with the Unicode case normalization, and
apply the full Unicode conversion if so.