Re: Bringing PostgreSQL torwards the standard regarding

Lists: pgsql-hackers
From: Shachar Shemesh <psql(at)shemesh(dot)biz>
To: PostgreSQL development <pgsql-hackers(at)postgresql(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>, Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Subject: Bringing PostgreSQL torwards the standard regarding case folding
Date: 2004-04-25 14:44:23
Message-ID: 408BCEC7.4080503@shemesh.biz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I'm opening a new thread, as the previous one was too nested, and
contained too much emotions.

I'll start by my understanding of a summary of the thread so far. The
solution we are seeking would have to satisfy the following conditions:
1. Setting should be on a per-database level. A per-server option is not
good enough, and a per-session option is too difficult to implement,
with no apparent justifiable return.
2. Old applications already working with PG's lowercase folding should
have an option to continue working unmodified for the foreseeable future.

Solutions offered so far, and their status:
1. Tri-state. Folder upper, if failes, fold lower, if succeeds, warn.
2. Dual state. Fold lower or upper. Break if client is broken.
3. Create a database conversion tool to change existing case.

Solution 1:
As currently the case folding is performed disjointed from the actual
use of the identifier, this solution requires quite a big amount of
work. On the other hand, and on second thought, it's main benefit -
gradual transition of applications from one to the other, is not really
necessary once you declare the current behaviour as there to stay.
Existing applications can simply choose to continue using whatever
method they currently use. No need for migration.

As such, I think we can simply state that tri-state migration path
solution can be discarded for the time being.

Solution 2:
Obviously, this is the way to go. We will have a dabase attribute that
states whether things are lower or upper case there.

Solution 3:
(unrelated to the above)
There seems to be some ambiguity about how to handle the translation.
Such a tool seems to require guessing which identifiers are accessed
quoted, unquoted, or both. The last option, of course, will never work.

We may need such a tool, for some projects may wish to transform from
one way to the other. It seems to me, however, that such a tool can wait
a little.

Open issues:
1. What do we do with identifiers in "template1" upon database creation?
2. How do we handle queries to tables belonging the catalog that are
shared between databases?

Observation: on a lowercase folding DB, any identifier that is not
composed only of lowercase characters MUST can be automatically assumed
to be accessed only through quoted mode.

I therefor suggest the following path to a solution:
1. CreateDB will be able to create databases from either type.
2. template1 will be defined to be one or the other. For the sake of
this discussion, let's assume it's lowercase (current situation)
3. CreateDB, upon being asked to create a new DB that has uppercase
folding, will copy over template1, as it currently does.
4. While copying, it will check each identifier. If the identifier is
not lowercase only, it is safe to copy it verbatim.
5. If the identifier is lowercase only, convert it to uppercase only. I
am assuming here that the authors of the client code chose an
uppercase-folding database, so they should know what they are doing when
accessing stuff from the standard offering.
6. I'm not sure what are the shared tables from the catalog. I don't
think it so unreasonable to ask anyone doing catalog work to assume that
catalog entries are case-sensitive. As such, maybe it's best to just
leave the data as is.
7. Column headers, however, will have to have a solution. A point still
open in current design.

I'm hoping this summary helps in furthering the discussion.

--
Shachar Shemesh
Lingnu Open Source Consulting
http://www.lingnu.com/


From: Rod Taylor <pg(at)rbt(dot)ca>
To: Shachar Shemesh <psql(at)shemesh(dot)biz>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>, Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Subject: Re: Bringing PostgreSQL torwards the standard regarding
Date: 2004-04-25 15:11:33
Message-ID: 1082905892.6544.39.camel@jester
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> 5. If the identifier is lowercase only, convert it to uppercase only. I
> am assuming here that the authors of the client code chose an
> uppercase-folding database, so they should know what they are doing when
> accessing stuff from the standard offering.

You've just broken one of my databases.

In one project I quote nearly everything and do so in lower case only.
This was done to ensure portability between PostgreSQL, Oracle, etc. --
but with my preference of lower case names.

If someone copied this database with the wrong case folding option, it
would break a (reasonably) spec compliant application that is regularly
installed on environments where we have little to no control over the
database settings.

I think copied attributes need to be left alone. Train the PostgreSQL
utilities to always quote the identifiers instead.

If you want case to be folded, run an external utility to does a bunch
of ALTER ... RENAMEs.


From: Shachar Shemesh <psql(at)shemesh(dot)biz>
To: Rod Taylor <pg(at)rbt(dot)ca>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>, Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Subject: Re: Bringing PostgreSQL torwards the standard regarding
Date: 2004-04-25 16:03:17
Message-ID: 408BE145.90201@shemesh.biz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Rod Taylor wrote:

>>5. If the identifier is lowercase only, convert it to uppercase only. I
>>am assuming here that the authors of the client code chose an
>>uppercase-folding database, so they should know what they are doing when
>>accessing stuff from the standard offering.
>>
>>
>
>You've just broken one of my databases.
>
>In one project I quote nearly everything and do so in lower case only.
>This was done to ensure portability between PostgreSQL, Oracle, etc. --
>but with my preference of lower case names.
>
>
I'm not sure you understood me.

First, if we don't convert lower->upper, how can anyone expect the
following query to work:
select lower(id) from table;

Even if you quote everything, you'd still probably have:
select lower("id") from "table";

Noone can expect you to do:
select "lower"("id") from "table";

The problem is that "lower" is defined in template1. If we don't
uppercase it when we create the database, the above won't work. Then
again, I'm fairly sure that the identifiers you placed as lowercase in
your database are not defined by template1.

In short, I don't think this suggestion broke your database.

Shachar

--
Shachar Shemesh
Lingnu Open Source Consulting
http://www.lingnu.com/


From: Rod Taylor <pg(at)rbt(dot)ca>
To: Shachar Shemesh <psql(at)shemesh(dot)biz>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>, Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Subject: Re: Bringing PostgreSQL torwards the standard regarding
Date: 2004-04-25 16:08:30
Message-ID: 1082909309.6544.81.camel@jester
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


> I'm not sure you understood me.

Perhaps not, here is what we have:

\c template_db
CREATE TABLE "example" ( "col" integer);

CREATE DATABASE newdb WITH TEMPLATE template_db UPPERCASE IDENTIFIERS;

\c newdb
SELECT "col" FROM "example";

> In short, I don't think this suggestion broke your database.

I understood that since "example", when copied, would turn into
"EXAMPLE" -- but that if it was "Example" it would be copied as
"Example".


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: PostgreSQL development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bringing PostgreSQL torwards the standard regarding
Date: 2004-04-25 16:23:55
Message-ID: 408BE61B.7020308@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Shachar Shemesh wrote:

> I'm opening a new thread, as the previous one was too nested, and
> contained too much emotions.
>
> I'll start by my understanding of a summary of the thread so far. The
> solution we are seeking would have to satisfy the following conditions:
> 1. Setting should be on a per-database level. A per-server option is
> not good enough, and a per-session option is too difficult to
> implement, with no apparent justifiable return.

I am not convinced on this point. Why is per-server not good enough? The
obvious place to make these changes seems to me to be during or
immediatly after the bootstrap phase of initdb. It would avoid a host of
later troubles.

>
> 2. Old applications already working with PG's lowercase folding should
> have an option to continue working unmodified for the foreseeable future.
>
> Solutions offered so far, and their status:
> 1. Tri-state. Folder upper, if failes, fold lower, if succeeds, warn.
> 2. Dual state. Fold lower or upper. Break if client is broken.
> 3. Create a database conversion tool to change existing case.

I don't think we should rush at this. All of these solutions are based
on the existing structures. I have started thinking about a solution
that would involve keeping two versions of catalog names: a canonical
name and a "name as supplied at creation". There would be heaps of
wrinkles, but it might get us where we want to be. But I have not had
time to sort it out in my head yet, let alone make any experiments.
Let's keep getting more ideas.

cheers

andrew


From: Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: PostgreSQL development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bringing PostgreSQL torwards the standard regarding
Date: 2004-04-25 17:15:34
Message-ID: 20040425171534.GA5479@dcc.uchile.cl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Apr 25, 2004 at 12:23:55PM -0400, Andrew Dunstan wrote:

> I don't think we should rush at this. All of these solutions are based
> on the existing structures. I have started thinking about a solution
> that would involve keeping two versions of catalog names: a canonical
> name and a "name as supplied at creation".

Why do you want two names? Just keep the original casing, and a boolean
saying if it's quoted or not.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"No deja de ser humillante para una persona de ingenio saber
que no hay tonto que no le pueda enseñar algo." (Jean B. Say)


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: PostgreSQL development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bringing PostgreSQL torwards the standard regarding
Date: 2004-04-25 18:16:19
Message-ID: 408C0073.2000907@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alvaro Herrera wrote:

>On Sun, Apr 25, 2004 at 12:23:55PM -0400, Andrew Dunstan wrote:
>
>
>
>>I don't think we should rush at this. All of these solutions are based
>>on the existing structures. I have started thinking about a solution
>>that would involve keeping two versions of catalog names: a canonical
>>name and a "name as supplied at creation".
>>
>>
>
>Why do you want two names? Just keep the original casing, and a boolean
>saying if it's quoted or not.
>
>

Because I was not just considering the upper/lower issue - refer to the
thread that relates to unquoted names with case preserved.

cheers

andrew


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: PostgreSQL development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bringing PostgreSQL torwards the standard regarding
Date: 2004-04-25 20:21:52
Message-ID: 408C1DE0.8060907@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:

>
>
> Alvaro Herrera wrote:
>
>> On Sun, Apr 25, 2004 at 12:23:55PM -0400, Andrew Dunstan wrote:
>>
>>
>>
>>> I don't think we should rush at this. All of these solutions are
>>> based on the existing structures. I have started thinking about a
>>> solution that would involve keeping two versions of catalog names: a
>>> canonical name and a "name as supplied at creation".
>>>
>>
>>
>> Why do you want two names? Just keep the original casing, and a boolean
>> saying if it's quoted or not.
>>
>>
>
> Because I was not just considering the upper/lower issue - refer to
> the thread that relates to unquoted names with case preserved.

Sorry - brain malfunction - yes, original casing plus boolean would
work. In effect you could derive the canonical form from those two.

cheers

andrew


From: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: PostgreSQL development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bringing PostgreSQL torwards the standard regarding
Date: 2004-04-26 05:15:08
Message-ID: Pine.LNX.4.44.0404260653440.3157-100000@zigo.dhs.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, 25 Apr 2004, Andrew Dunstan wrote:

> >> Why do you want two names? Just keep the original casing, and a boolean
> >> saying if it's quoted or not.
>
> Sorry - brain malfunction - yes, original casing plus boolean would
> work. In effect you could derive the canonical form from those two.

Say that you have this in the table with the identifier

name quoted
---- ------
Foo False

Now you want to add the name "FOO"

FOO True

should you be allowed or is it a clash with the above?

What if you also add "foo"

foo True

One of these two should be forbidden. And what about a quoted "FOO":

FOO False
FOO True

This case says it is not enough with an expressional unique index on
(upper(name), quoted). It would be easier to enforce uniqueness if one
store both the converted name and the original name:

name orig_name
---- ---------
FOO NULL <-- quoted one
FOO FOO <-- unquoted one

and the first case

FOO Foo <-- unquoted
FOO NULL <-- clashes with the first, good
foo NULL <-- no clash, works fine

With this one can always use upper case translation as per sql spec and
psql can optionally show all unquoted identifiers as upper, lower or mixed
case.

Then we also have the INFORMATION_SCHEMA that should show the names in
UPPER CASE when not quoted, this since applications that are written for
the standard might depend on that (probably no application do today but it
would be a valid case of use of the information schema).

--
/Dennis Björklund


From: "Andrew Dunstan" <andrew(at)dunslane(dot)net>
To: <db(at)zigo(dot)dhs(dot)org>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bringing PostgreSQL torwards the standard regarding
Date: 2004-04-26 06:34:58
Message-ID: 3320.24.211.141.25.1082961298.squirrel@www.dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Dennis Bjorklund said:
> On Sun, 25 Apr 2004, Andrew Dunstan wrote:
>
>> >> Why do you want two names? Just keep the original casing, and a
>> >> boolean saying if it's quoted or not.
>>
>> Sorry - brain malfunction - yes, original casing plus boolean would
>> work. In effect you could derive the canonical form from those two.
>

Dennis,

Ideas still swirling a bit, but I was thinking that there would be a per
database flag (which could indeed be set at db creation time) which would
specify the flavor of canonical names being used - upper, or lower, or we
could also consider exact (i.e. full case sensitivity, which I seem to
recall is a mode that SQLServer allows, possibly even the default, but my
memory could be rusty).

The canonical form of an unquoted name is dictated by this setting, while
the canonical form of a quoted name is the name as supplied. Two names
clash if their canonical forms are identical, quoted or not.

Assuming that we have a database with the flag set to use upper case
canonical names, as per the standard, then ...

> Say that you have this in the table with the identifier
>
> name quoted
> ---- ------
> Foo False
>
> Now you want to add the name "FOO"
>
> FOO True
>
> should you be allowed or is it a clash with the above?

It's a clash. The canonical for of both is "FOO"

>
> What if you also add "foo"
>
> foo True
>

No clash - "FOO" <> "foo"

> One of these two should be forbidden. And what about a quoted "FOO":
>
> FOO False
> FOO True
>

clash

> This case says it is not enough with an expressional unique index on
> (upper(name), quoted). It would be easier to enforce uniqueness if one
> store both the converted name and the original name:
>

The constraint would in effect be on CASE WHEN quoted THEN name ELSE upper
(name) END.

The advantage of using a boolean is that a lot less work would need to be
done to use whatever flag was being used for the DB. Possibly a reindex
after the files are copied. It might fail on some highly pathological
cases, but should never fail on our standard template databases.

> name orig_name
> ---- ---------
> FOO NULL <-- quoted one
> FOO FOO <-- unquoted one
>
> and the first case
>
> FOO Foo <-- unquoted
> FOO NULL <-- clashes with the first, good foo
> NULL <-- no clash, works fine
>
> With this one can always use upper case translation as per sql spec and
> psql can optionally show all unquoted identifiers as upper, lower or
> mixed case.
>

My thought was that there would be a user setting that would allow
resultset labels to use either canonical or literal names.

> Then we also have the INFORMATION_SCHEMA that should show the names in
> UPPER CASE when not quoted, this since applications that are written
> for the standard might depend on that (probably no application do today
> but it would be a valid case of use of the information schema).
>

I see 2 possibilities: either use the upper case canonical setting I
envisioned above, or change the information schema setup to force upper
case labels via AS clauses in the views.

cheers

andrew


From: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bringing PostgreSQL torwards the standard regarding
Date: 2004-04-26 06:42:14
Message-ID: Pine.LNX.4.44.0404260840070.3157-100000@zigo.dhs.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, 26 Apr 2004, Andrew Dunstan wrote:

> Ideas still swirling a bit

Sure, I'm thinking in public as well. Not something you want to do if you
are afraid of being wrong and showing it :-) But I'm not.

> The constraint would in effect be on CASE WHEN quoted THEN name ELSE upper
> (name) END.

That's simple enough (and pretty straight forward).

--
/Dennis Björklund


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: PostgreSQL development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bringing PostgreSQL torwards the standard regarding
Date: 2004-05-13 22:57:46
Message-ID: 40A3FD6A.7070102@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jan Wieck wrote:

> This is the reason why the setting has to be at least per database and
> cannot be changed after DB creation.

I think there's overwhelming consensus that db creation time is the
latest you can specify the canonical name setting for it. There's
probably a good case to be made for it to be when you initdb, so that it
is set as expected for shared tables.

(Is anyone actually doing anything on this?)

cheers

andrew

> What has to change is the behaviour of the name type operators, which
> will automatically change the uniqueness behaviour of the catalog
> indexes.
>
> In an UPPERCASE database
>
> foo/Foo/FOO false = FOO true
>
> In a lowercase database
>
> foo/Foo/FOO false = foo true
>
> In both of them
>
> foo/Foo/FOO false <> Foo true
> foo/Foo/FOO false = foo/Foo/FOO false
>
>
> Jan
>
>
> Dennis Bjorklund wrote:
>
>> On Sun, 25 Apr 2004, Andrew Dunstan wrote:
>>
>>> >> Why do you want two names? Just keep the original casing, and a
>>> boolean
>>> >> saying if it's quoted or not.
>>>
>>> Sorry - brain malfunction - yes, original casing plus boolean would
>>> work. In effect you could derive the canonical form from those two.
>>
>>
>> Say that you have this in the table with the identifier
>>
>> name quoted
>> ---- ------
>> Foo False
>>
>> Now you want to add the name "FOO"
>>
>> FOO True
>>
>> should you be allowed or is it a clash with the above?
>>
>> What if you also add "foo"
>>
>> foo True
>>
>> One of these two should be forbidden. And what about a quoted "FOO":
>>
>> FOO False
>> FOO True
>>
>> This case says it is not enough with an expressional unique index on
>> (upper(name), quoted). It would be easier to enforce uniqueness if one
>> store both the converted name and the original name:
>>
>> name orig_name
>> ---- ---------
>> FOO NULL <-- quoted one
>> FOO FOO <-- unquoted one
>>
>> and the first case
>>
>> FOO Foo <-- unquoted
>> FOO NULL <-- clashes with the first, good
>> foo NULL <-- no clash, works fine
>>
>> With this one can always use upper case translation as per sql spec and
>> psql can optionally show all unquoted identifiers as upper, lower or
>> mixed
>> case.
>>
>> Then we also have the INFORMATION_SCHEMA that should show the names in
>> UPPER CASE when not quoted, this since applications that are written for
>> the standard might depend on that (probably no application do today
>> but it
>> would be a valid case of use of the information schema).
>>
>
>