TODO Item: Consider allowing control of upper/lower case folding of unquoted, identifiers

Lists: pgsql-hackers
From: Russell Smith <mr-russ(at)pws(dot)com(dot)au>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: TODO Item: Consider allowing control of upper/lower case folding of unquoted, identifiers
Date: 2008-03-24 00:08:45
Message-ID: 47E6F10D.5070601@pws.com.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

As an attempt at a first PostgreSQL patch, I'd like to see if I can do
anything about this issue.

I've read both the attached threads;

http://archives.postgresql.org/pgsql-hackers/2004-04/msg00818.php
http://archives.postgresql.org/pgsql-hackers/2006-10/msg01527.php

There seems no consensus about how to go about this. I have done some
initial changes and found some problems with my attempts. initdb always
creates pg_catalog relations with lowercase names, as does the function
list. eg count() with uppercased identifiers requires "count"(). All
of these can be altered on database copy. It shouldn't be a problem.
However I see shared relations as a big problem. The 2004 thread
suggests that we want a per database setting. I am unable to see how we
share shared relations between databases with different case folder.

pg_user is an example of this;

Lowercase database; CREATE ROLE mrruss LOGIN; results in -> mrruss
as data in pg_user
Uppercase database; CREATE ROLE mrruss LOGIN; resutls in -> MRRUSS as
data in pg_user

Now both of those can be accessed from any database. And you will get a
different user based on the source database.

Overall, I'd like to concentrate on the implementation as I'm a
beginner. But I see this being mainly a problem with nailing down the
actual requirement for the implementation. So I'll try to start the
discussion to allow me or somebody else to eventually develop a patch
for this.

The first question is, are all the requirements of the 2004 thread still
true now?

Setting case folder at initdb time seems the easiest method but I'm not
sure if that's what people want. Any GUC variables seem to play havoc
with the pg_catalog schema and the data in the catalogs.

Ideas and comments?

Thanks

Russell


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Russell Smith <mr-russ(at)pws(dot)com(dot)au>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: TODO Item: Consider allowing control of upper/lower case folding of unquoted, identifiers
Date: 2008-03-24 00:56:57
Message-ID: 47E6FC59.7090105@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Russell Smith wrote:
> The 2004 thread suggests that we want a per database setting. I am
> unable to see how we share shared relations between databases with
> different case folder.
>
> pg_user is an example of this;
>
> Lowercase database; CREATE ROLE mrruss LOGIN; results in -> mrruss
> as data in pg_user
> Uppercase database; CREATE ROLE mrruss LOGIN; resutls in -> MRRUSS
> as data in pg_user
>
> Now both of those can be accessed from any database. And you will get
> a different user based on the source database.

You could use a setting, in, say, the control file, for the global
tables. IIRC there are only three such tables.

>
> Overall, I'd like to concentrate on the implementation as I'm a beginner.

I'm not sure this is a very good project for a beginner - but that's
your choice.

> But I see this being mainly a problem with nailing down the actual
> requirement for the implementation. So I'll try to start the
> discussion to allow me or somebody else to eventually develop a patch
> for this.
>
> The first question is, are all the requirements of the 2004 thread
> still true now?
>
> Setting case folder at initdb time seems the easiest method but I'm
> not sure if that's what people want. Any GUC variables seem to play
> havoc with the pg_catalog schema and the data in the catalogs.
>
> Ideas and comments?
>
>

Just getting to standard compliance will satisfy some, but I suspect not
many. What a lot of people want is case sensitivity, with no folding. I
think you need to look at that as an option.

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Russell Smith <mr-russ(at)pws(dot)com(dot)au>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: TODO Item: Consider allowing control of upper/lower case folding of unquoted, identifiers
Date: 2008-03-24 03:30:03
Message-ID: 18465.1206329403@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Russell Smith <mr-russ(at)pws(dot)com(dot)au> writes:
> As an attempt at a first PostgreSQL patch, I'd like to see if I can do
> anything about this issue.

Trying that as a first patch is a recipe for failure... the short answer
is that no one can think of a solution that will be generally acceptable.

regards, tom lane


From: Russell Smith <mr-russ(at)pws(dot)com(dot)au>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: TODO Item: Consider allowing control of upper/lower case folding of unquoted, identifiers
Date: 2008-03-25 10:57:07
Message-ID: 47E8DA83.2030400@pws.com.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Russell Smith <mr-russ(at)pws(dot)com(dot)au> writes:
>
>> As an attempt at a first PostgreSQL patch, I'd like to see if I can do
>> anything about this issue.
>>
>
> Trying that as a first patch is a recipe for failure... the short answer
> is that no one can think of a solution that will be generally acceptable.
>
> regards, tom lane
>
What makes this change particularly challenging? The fact that nobody
has agreed on how it should work, or the actual coding?

regards

Russell


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Russell Smith <mr-russ(at)pws(dot)com(dot)au>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: TODO Item: Consider allowing control of upper/lower case folding of unquoted, identifiers
Date: 2008-03-25 13:09:38
Message-ID: 47E8F992.1030705@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Russell Smith wrote:
> Tom Lane wrote:
>> Russell Smith <mr-russ(at)pws(dot)com(dot)au> writes:
>>
>>> As an attempt at a first PostgreSQL patch, I'd like to see if I can
>>> do anything about this issue.
>>>
>>
>> Trying that as a first patch is a recipe for failure... the short answer
>> is that no one can think of a solution that will be generally
>> acceptable.
>>
>> regards, tom lane
>>
> What makes this change particularly challenging? The fact that nobody
> has agreed on how it should work, or the actual coding?
>
>

The widespread code impact of any change is a strong indicator against
doing this as a first patch. But newbie or not, as Tom says, nobody
should be starting to design, let alone cut code, until the actual
desired behaviour is agreed.

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Russell Smith <mr-russ(at)pws(dot)com(dot)au>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: TODO Item: Consider allowing control of upper/lower case folding of unquoted, identifiers
Date: 2008-03-25 15:39:33
Message-ID: 21051.1206459573@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Russell Smith <mr-russ(at)pws(dot)com(dot)au> writes:
> What makes this change particularly challenging? The fact that nobody
> has agreed on how it should work, or the actual coding?

Getting the right design is definitely challenging. The coding might or
might not be, depending on your design ...

regards, tom lane


From: Russell Smith <mr-russ(at)pws(dot)com(dot)au>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: TODO Item: Consider allowing control of upper/lower case folding of unquoted, identifiers
Date: 2008-03-27 16:30:21
Message-ID: 47EBCB9D.1070907@pws.com.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

It looks like most of the hard yards will be in getting some form of
consensus about what should be done for this TODO. I can't see a reason
not to get started on the design now. If a decision is not able to be
made after 4 years since the original discussion, is it worth removing
the TODO or letting it sit for another 4? But to the actual issue at hand.

Andrew Dunstan attempted to summarize the original 2004 thread
http://archives.postgresql.org/pgsql-hackers/2006-10/msg01545.php;

--
There was some discussion a couple of years ago on the -hackers list
about it, so you might like to review the archives. The consensus seemed
to be that behaviour would need to be set no later than createdb time.

The options I thought of were:

1. current postgres behaviour (we need to do this for legacy reasons, of
course, as well as to keep happy the legions who hate using upper case
for anything)

2. strictly spec compliant (same as current behaviour, but folding to
upper case for unquoted identifiers rather than lower)

3. fully case sensitive even for unquoted identifiers (not spec
compliant at all, but nevertheless possibly attractive especially for
people migrating from MS SQLServer, where it is an option, IIRC).
--

Supporting all 3 of these behaviours at initdb time is not too invasive
or complicated from my initial investigation. The steps appear to be;

1. parser has to parse incoming identifiers with the correct casing
changes. (currently downcase_truncate_identifier)
2. The output quoting needs to quote identifiers using the same rules as
the parser. (currently quote_identifier)
3. the client needs to know what quote rules are in place. (libpq:
PQfname, PQfnumber)
4. psql needs to \ commands to be taught about the fact that case can
mean different things to different servers.
5. bootstrap needs to correctly case the tables and insert values when
bootstrapping at initdb time. This is only really an issue for upper
case folding.

Many people appear advocate a 4th option to only want the column names
to be case preserved or upper cased. They expect other identifiers will
behave as they do now. This doesn't really bring us any closer to the
spec, it takes us away from it as Tom has suggested in the past. It
also appears to increase the complexity and invasiveness of a patch.
Being able to support case preservation/sensitivity for all identifiers
at initdb time appears to be no extra work than supporting the upper and
lower folding versions.

The discussions around having a name as supplied and a quoted version
allow lots of flexibility, probably even down to the session level.
However I personally am struggling to get my head around the corner
cases for that approach.

If this needs to be at createdb time, I think we add at least the
following complexities;

1. all relations cases must be altered when copied from the template
database or quoted when copied.
We have no idea what a template database might look like, all views
and functions would need to be parsed to ensure they point to valid tables.
2. shared relations must be able to be accessed using different names in
different databases, eg PG_DATABASE, pg_database.
3. The data in shared relations appears different to the same users in
different databases.
eg my unquoted username is MrRuss, in db1 (upper): MRRUSS, db2 (case
sensitive): MrRuss, db3 (lower): mrruss
My guts tells me that's going to lead to user confusion.

Dumping and restoring databases to different foldings can/will present
an interesting challenge and I'm not sure how to support that. I don't
even know if we want to support that officially.

I'm leaning towards initdb time, mainly because I think a patch can be
produced that isn't to invasive and is much easier to review and
actually get applied. I also think that adding the createdb time flags
will push this task beyond my ability to write up a patch. Either way
though, consensus on what implementation we actually want going forward
will enable some more skilled developer to do this without the pain of
having to flesh out the design.

In light of this email and the other comments Tom and Andrew have made,
it's very easy to say 'too hard, we can't get agreement'. I would have
thought that standards compliance would have been one good reason to
push forward with at least the upper case folding ability. Both of the
previous threads on this issue raised lots of questions about possible
options but there never seemed to be any knocking the ideas around and
getting consensus phase. I would like to at least nail down some of the
requirement, if not all. I have put forward my personal opinion, but I
expect that is not of significant value as there are many others with
much more experience than I.

Regards

Russell Smith


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Russell Smith <mr-russ(at)pws(dot)com(dot)au>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: TODO Item: Consider allowing control of upper/lower case folding of unquoted, identifiers
Date: 2008-03-27 20:27:34
Message-ID: 1307.1206649654@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Russell Smith <mr-russ(at)pws(dot)com(dot)au> writes:
> The options I thought of were:
> ...
> 3. fully case sensitive even for unquoted identifiers (not spec
> compliant at all, but nevertheless possibly attractive especially for
> people migrating from MS SQLServer, where it is an option, IIRC).

Actually, I think most of the complainers wish that we'd duplicate
mysql's behavior ... although some experimentation suggests that that
behavior is impossibly inconsistent. It looks to me like, at least for
myisam tables, table names are fully case-sensitive and column names are
fully not (but are stored and reported in the originally entered
casing). Function names also seem case-insensitive. I'm afraid to
check whether other table handlers might behave differently still :-(

> Supporting all 3 of these behaviours at initdb time is not too invasive
> or complicated from my initial investigation.

You are deliberately ignoring all the hard problems.

The issue here is not whether we can make the parser fold identifiers
in different ways. The issue is how we keep everything from breaking
afterwards. The problems mostly are faced by clients --- psql's \d
commands, pg_dump, etc. Consider as an example pg_dump's quote_ident
function, which has to decide if an identifier requires double-quoting
or not. If it doesn't know what case-folding rule will be used to read
the identifier, how can it make that decision?

Restricting the case folding choice to be frozen at initdb would
eliminate some of these problems, but hardly all of them.

IMHO this area bears a whole lot of similarity to the
backslashes-in-string-literals problem. We are moving extremely slowly
towards spec compliance in that area, but we all know that when we throw
the switch by making standard_conforming_strings default to ON, we
are going to hear howls of anguish from everywhere. Exposing apps to
different possible case-folding rules is going to be at least as painful.

regards, tom lane


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Russell Smith <mr-russ(at)pws(dot)com(dot)au>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: TODO Item: Consider allowing control of upper/lower case folding of unquoted, identifiers
Date: 2008-03-27 22:39:41
Message-ID: 20080327223941.GB15848@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Mar 28, 2008 at 03:30:21AM +1100, Russell Smith wrote:
> 3. the client needs to know what quote rules are in place. (libpq:
> PQfname, PQfnumber)

The question I want to see answered, is how something like DBD::Pg will
handle this. If I wrote code like this in Perl:

my %hash = $res->get_row_as_hash();
print $hash{mycolumn};

Will this change break my code? Perl hashes are case-sensetive, you
can't change that. And it seems impossibly complex to fix the above
code to work with all the possible combinations... Which starts leading
you down the path of folding in some places and not others, which is
madness.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Martijn van Oosterhout" <kleptog(at)svana(dot)org>
Cc: "Russell Smith" <mr-russ(at)pws(dot)com(dot)au>, <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: TODO Item: Consider allowing control of upper/lower case folding of unquoted, identifiers
Date: 2008-03-27 23:46:23
Message-ID: 871w5v20j4.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Martijn van Oosterhout" <kleptog(at)svana(dot)org> writes:

> On Fri, Mar 28, 2008 at 03:30:21AM +1100, Russell Smith wrote:
>> 3. the client needs to know what quote rules are in place. (libpq:
>> PQfname, PQfnumber)
>
> The question I want to see answered, is how something like DBD::Pg will
> handle this. If I wrote code like this in Perl:
>
> my %hash = $res->get_row_as_hash();
> print $hash{mycolumn};
>
> Will this change break my code? Perl hashes are case-sensetive, you
> can't change that. And it seems impossibly complex to fix the above
> code to work with all the possible combinations... Which starts leading
> you down the path of folding in some places and not others, which is
> madness.

Well, DBI already has to deal with this anyways because it tries to provide a
database-independent interface. So you can instruct DBI to upcase, downcase,
or leave the identifiers as the database provides them by setting this
property on your database connection:

"FetchHashKeyName" (string, inherited)

The "FetchHashKeyName" attribute is used to specify whether the
fetchrow_hashref() method should perform case conversion on the field
names used for the hash keys. For historical reasons it defaults to
'"NAME"' but it is recommended to set it to '"NAME_lc"' (convert to lower
case) or '"NAME_uc"' (convert to upper case) according to your preference.
It can only be set for driver and database handles. For statement handles
the value is frozen when prepare() is called.

So if you've always been using unquoted identifiers you can set
FetchHashKeyName to NAME_lc and it would continue to work. If you've been
using a mixture of quoted and unquoted identifiers things would be trickier
though.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: "Martijn van Oosterhout" <kleptog(at)svana(dot)org>, "Russell Smith" <mr-russ(at)pws(dot)com(dot)au>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: TODO Item: Consider allowing control of upper/lower case folding of unquoted, identifiers
Date: 2008-03-28 00:57:20
Message-ID: 20336.1206665840@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> "Martijn van Oosterhout" <kleptog(at)svana(dot)org> writes:
>> Will this change break my code?

> Well, DBI already has to deal with this anyways because it tries to provide a
> database-independent interface. So you can instruct DBI to upcase, downcase,
> or leave the identifiers as the database provides them by setting this
> property on your database connection:

That's not a solution, that's a kluge with very obvious failure modes.

Now admittedly it's probably not *likely* that someone would use
identifiers differing only in case in a single table definition.
But it's legal, and in fact we're required by spec to support it.

regards, tom lane