Re: 7.4.1 release status - Turkish Locale

Lists: pgsql-hackers
From: "Nicolai Tufar" <ntufar(at)pisem(dot)net>
To: <pgsql-hackers(at)postgreSQL(dot)org>
Cc: <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <devrim(at)tdmsoft(dot)com>
Subject: Re: 7.4.1 release status - Turkish Locale
Date: 2004-02-01 01:55:39
Message-ID: 000701c3e866$8584d890$1d00a8c0@ntufar

> We might think that the Turkish-locale problem Devrim Gunduz pointed out
> is a must-fix, too. But I'm not convinced yet what to do about it.

Here is a first try to fix what Devrim Gunduz talked about.

Please be patient with me for it is the first major patch
I submit and I realize that I blatantly violated many rules
of good style in PostgreSQL source code.

First, about the problem. The Turkish language has two letters "i".
One has a dot on top and the other does not. It is as simple as that.
The dotted one keeps its dot in both upper and lower case, and
the dotless one has no dot in either case, as opposed to English,
where "i" has a dot in lower case and no dot in upper case.

The problem arises when PostgreSQL, running with the "tr_TR" locale,
converts an identifier such as a table, index, or column name to
lower case. If the identifier contains a capital "I", tolower() with
'I' as its argument returns a Turkish-specific character,
'i'-without-a-dot, which I am afraid will not be shown correctly
in your e-mail readers.
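
The behaviour described above can be probed with a few lines of C. This is a minimal sketch, not code from the patch: the helper name lower_of_capital_i is hypothetical, and whether a Turkish locale is installed (and under which name) depends on the system.

```c
#include <ctype.h>
#include <locale.h>
#include <string.h>

/*
 * Hypothetical helper: report what tolower('I') returns under the named
 * LC_CTYPE locale.  Returns -1 if the locale cannot be selected.
 * The previous LC_CTYPE setting is restored before returning.
 */
int
lower_of_capital_i(const char *locale_name)
{
    char        saved[256];
    const char *current = setlocale(LC_CTYPE, NULL);
    int         result;

    /* Copy the current name: later setlocale() calls may overwrite it. */
    if (current == NULL || strlen(current) >= sizeof(saved))
        return -1;
    strcpy(saved, current);

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;              /* locale not available on this system */

    result = tolower((unsigned char) 'I');

    setlocale(LC_CTYPE, saved);
    return result;
}
```

Under the "C" locale the result is plain 'i'. On a system with a single-byte Turkish locale installed, passing its name is expected to give a different byte, but the locale name and the exact value vary by platform.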

Let me give some examples.

The initdb script runs the apparently innocent script in the file
src/backend/utils/mb/conversion_procs/conversion_create.sql
to create a couple of functions whose only fault is
to declare their return type as VOID. The backend
returns an error message that type "voıd" is not found, and
initdb fails.

An unsuspecting novice user was excited about the
SERIAL data type he was told is present in PostgreSQL.
It took Devrim and me a lot of time to explain why he
needs to type SERIAL as SERiAL for now, until a workaround
is developed.

Another case happened to me when I wanted to restore
a pg_dump dump. The restore failed because the script was creating
scripts that belong to PUBLIC.

As for the solution, after some research we found that
the offender is the tolower() call in the {identifier} section
of src/backend/parser/scan.l. tolower() works fine with any
locale and any character, save for the Turkish locale
and the capital 'I' character. So the obvious solution is
to add a check for the Turkish locale and the 'I' character.
Something like this:

    if( <locale is Turkish> && ident[i] == 'I' )
        ident[i] = 'i';
    else
        ident[i] = tolower((unsigned char) ident[i]);

It looks rather simple, but the hard part was figuring out
what the current locale is. To do this I added

const char *get_locale_category(const char *category);

to src/backend/utils/adt/pg_locale.c, which returns the
locale identifier for the specified category, or for LC_ALL
if category is NULL. I could not find any existing function
that returns what I need. Please help me find
one, because I would hate to introduce a new function.

I realize that the {identifier} section is very performance
critical, so I introduced a global variable

static int isturkishlocale = -1;

at the beginning of src/backend/parser/scan.l
It is set to -1 when not yet initialized, 0 if
locale is not Turkish and 1 if locale is Turkish.
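
The lazily initialized flag described above could look roughly like this. It is only a sketch under stated assumptions: the helper locale_name_is_turkish and the rule "the name starts with tr" are illustrative guesses rather than what the attached patch does, and the cached value goes stale if the locale is changed later.

```c
#include <locale.h>
#include <string.h>

/*
 * Hypothetical check: treat "tr", "tr_TR", "tr_TR.ISO-8859-9" and similar
 * names as Turkish.  Real locale names differ between platforms.
 */
int
locale_name_is_turkish(const char *name)
{
    if (name == NULL || strncmp(name, "tr", 2) != 0)
        return 0;
    return name[2] == '\0' || name[2] == '_' ||
           name[2] == '.' || name[2] == '@';
}

/* -1 = not yet initialized, 0 = not Turkish, 1 = Turkish */
static int  isturkishlocale = -1;

/* Query the LC_CTYPE locale once and cache the answer. */
int
is_turkish_locale(void)
{
    if (isturkishlocale < 0)
        isturkishlocale = locale_name_is_turkish(setlocale(LC_CTYPE, NULL));
    return isturkishlocale;
}
```

Caching keeps the per-identifier cost down to one integer comparison, which is the point of the -1/0/1 convention.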

It might not be the way it is usually done in PostgreSQL
source code. Could you please advise whether the name I chose
is appropriate and whether there is a more appropriate
place to put the declaration and initialization?

Best regards,
Nicolai Tufar & Devrim Gunduz

Attachment Content-Type Size
trpatch.diff application/octet-stream 2.7 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: ntufar(at)pisem(dot)net
Cc: pgsql-hackers(at)postgresql(dot)org, devrim(at)tdmsoft(dot)com
Subject: Re: 7.4.1 release status - Turkish Locale
Date: 2004-02-01 04:23:04
Message-ID: 11795.1075609384@sss.pgh.pa.us

"Nicolai Tufar" <ntufar(at)pisem(dot)net> writes:
>> We might think that the Turkish-locale problem Devrim Gunduz pointed out
>> is a must-fix, too. But I'm not convinced yet what to do about it.

> Here is a first try to fix what Devrim Gunduz talked about.

I still don't much like having a locale-specific wart in the parser
(and the code you give could not work anyway --- for starters, the
first argument of setlocale is not a pointer).

A possible compromise is to apply ASCII downcasing (same as in
keywords.c) for 7-bit-ASCII characters, and apply tolower() only
for character codes above 127. In other words

    unsigned char ch = (unsigned char) ident[i];

    if (ch >= 'A' && ch <= 'Z')
        ch += 'a' - 'A';
    else if (ch > 127 && isupper(ch))
        ch = tolower(ch);
    ident[i] = (char) ch;

In reasonably sane locales this will have the same effects as currently,
while in unsane locales it will ensure that basic-ASCII identifiers are
treated the way we want.
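
Wrapped into a stand-alone function, the compromise above might look as follows. This is a sketch only: the name downcase_ascii_first and the in-place, NUL-terminated-string interface are assumptions made for illustration, not the code that was eventually committed.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper: downcase an identifier in place.
 * Basic ASCII letters are folded arithmetically, so the current locale
 * cannot interfere; only bytes above 127 are handed to the locale-aware
 * isupper()/tolower().
 */
void
downcase_ascii_first(char *ident)
{
    size_t      i;

    assert(ident != NULL);

    for (i = 0; ident[i] != '\0'; i++)
    {
        unsigned char ch = (unsigned char) ident[i];

        if (ch >= 'A' && ch <= 'Z')
            ch += 'a' - 'A';
        else if (ch > 127 && isupper(ch))
            ch = tolower(ch);
        ident[i] = (char) ch;
    }
}
```

Because 'I' is handled by the arithmetic branch, "VOID" and "SERIAL" fold to "void" and "serial" whatever LC_CTYPE says.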

Comments?

regards, tom lane


From: "Nicolai Tufar" <ntufar(at)pisem(dot)net>
To: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgresql(dot)org>, <devrim(at)tdmsoft(dot)com>
Subject: Re: 7.4.1 release status - Turkish Locale
Date: 2004-02-01 07:56:29
Message-ID: 001801c3e899$294b8000$7a00a8c0@ntufar


> I still don't much like having a locale-specific wart in the parser
> (and the code you give could not work anyway --- for starters, the
> first argument of setlocale is not a pointer).

Aw, I see, my code is broken. I got confused by the locale_......_asign()
family of functions. Sure, the first argument needs to be an int. But as
you said, the code is a wart.

> A possible compromise is to apply ASCII downcasing (same as in
> keywords.c) for 7-bit-ASCII characters, and apply tolower() only
> for character codes above 127. In other words
>
>     unsigned char ch = (unsigned char) ident[i];
>
>     if (ch >= 'A' && ch <= 'Z')
>         ch += 'a' - 'A';
>     else if (ch > 127 && isupper(ch))
>         ch = tolower(ch);
>     ident[i] = (char) ch;
>
> In reasonably sane locales this will have the same effects as currently,
> while in unsane locales it will ensure that basic-ASCII identifiers are
> treated the way we want.

If we go this way, why not make a special case only for the 'I'
character and not for all 7-bit ASCII:

    unsigned char ch = (unsigned char) ident[i];

    if (ch == (unsigned char) 'I')
        ch = 'i';
    else
        ch = tolower(ch);
    ident[i] = (char) ch;

Will it break any locales?

>
> regards, tom lane

Regards
Nicolai


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: ntufar(at)pisem(dot)net
Cc: pgsql-hackers(at)postgresql(dot)org, devrim(at)tdmsoft(dot)com
Subject: Re: 7.4.1 release status - Turkish Locale
Date: 2004-02-01 08:17:26
Message-ID: 15123.1075623446@sss.pgh.pa.us

"Nicolai Tufar" <ntufar(at)pisem(dot)net> writes:
>> A possible compromise is to apply ASCII downcasing (same as in
>> keywords.c) for 7-bit-ASCII characters, and apply tolower() only
>> for character codes above 127. In other words

> If we go this way why not make a special case only and only for 'I'
> Character and not all 7-bit ASCII:

It seems to me that that's too narrow a definition of the problem.
I think we should state our goal as "we don't want bizarre locale
definitions to interfere with downcasing of the basic ASCII letters".
If we put in a special case for 'I' we will fix the known problem
with Turkish, but what other strange locales might be out there?
And if we don't trust tolower() for 'I', why should we trust it
for 'A'-'Z'?

What it comes down to is that by training and experience, I always
expect that any bug might be just one example of a whole class of bugs.
You have to look for the related cases that might happen in future,
not only fix the case that's under your nose.

regards, tom lane


From: "Nicolai Tufar" <ntufar(at)pisem(dot)net>
To: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgresql(dot)org>, <devrim(at)tdmsoft(dot)com>
Subject: Re: 7.4.1 release status - Turkish Locale
Date: 2004-02-01 09:50:01
Message-ID: 000001c3e8a8$cac51810$7a00a8c0@ntufar

"Tom Lane" tgl(at)sss(dot)pgh(dot)pa(dot)us wrote:
>
>"Nicolai Tufar" <ntufar(at)pisem(dot)net> writes:
>>> A possible compromise is to apply ASCII downcasing (same as in
>>> keywords.c) for 7-bit-ASCII characters, and apply tolower() only
>>> for character codes above 127. In other words
>
>> If we go this way why not make a special case only and only for 'I'
>> Character and not all 7-bit ASCII:
>
> It seems to me that that's too narrow a definition of the problem.
> I think we should state our goal as "we don't want bizarre locale
> definitions to interfere with downcasing of the basic ASCII letters".
> If we put in a special case for 'I' we will fix the known problem
> with Turkish, but what other strange locales might be out there?
> And if we don't trust tolower() for 'I', why should we trust it
> for 'A'-'Z'?

To my knowledge no other locale has similar problems. At least nobody
has complained so far, while Turkish users have been raising their voices
for many years now. Let us try putting in this very special case, together
with an extensive explanation in a comment, and see if someone complains.
And by the way, national characters in table, column, index or function
names are something that never happens in production databases.

As for 'A'-'Z', it was pointed out to me that the SQL99 standard states
that identifier names need to be downcased in a locale-dependent manner.

Would you like me to create a patch that would touch only
src/backend/parser/scan.l, introduce a special case for 'I'
and include an explanation in a comment?

> What it comes down to is that by training and experience, I always
> expect that any bug might be just one example of a whole class of bugs.
> You have to look for the related cases that might happen in future,
> not only fix the case that's under your nose.
>
> regards, tom lane

Regards,
Nicolai Tufar


From: "Nicolai Tufar" <ntufar(at)pisem(dot)net>
To: <pgsql-hackers(at)postgresql(dot)org>
Cc: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <devrim(at)tdmsoft(dot)com>
Subject: Turkish Locale in Identifiers (contd.)
Date: 2004-02-03 19:31:14
Message-ID: 000a01c3ea8d$4280ee20$5400a8c0@ntufar

> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> "Nicolai Tufar" <ntufar(at)pisem(dot)net> writes:
> >> A possible compromise is to apply ASCII downcasing (same as in
> >> keywords.c) for 7-bit-ASCII characters, and apply tolower() only
> >> for character codes above 127. In other words
>
> > If we go this way why not make a special case only and only for 'I'
> > Character and not all 7-bit ASCII:
>
> It seems to me that that's too narrow a definition of the problem.
> I think we should state our goal as "we don't want bizarre locale
> definitions to interfere with downcasing of the basic ASCII letters".
> If we put in a special case for 'I' we will fix the known problem
> with Turkish, but what other strange locales might be out there?
> And if we don't trust tolower() for 'I', why should we trust it
> for 'A'-'Z'?

Since nobody commented on the issue, I would like to suggest a patch that
implements the 'I' special-case solution. The 'A'-'Z' ASCII-only downcasing
idea was rejected before on the basis of SQL99 compliance. I hope I will
have more luck with this one, because PostgreSQL just does not work with
the Turkish locale, and it has been so since 7.4.0. initdb just chokes on
the VOID identifier and quits. Devrim Gunduz will second me on this, I am
sure.

With my knowledge of Russian, Arabic and, to some degree, Hebrew encodings,
I claim that this patch will not break them. If someone who uses Far
Eastern encodings would also check it, I think it would be pretty safe to
apply this patch to the source.

Thanks,
Nicolai Tufar


From: "Nicolai Tufar" <ntufar(at)pisem(dot)net>
To: <pgsql-hackers(at)postgresql(dot)org>
Cc: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <devrim(at)tdmsoft(dot)com>
Subject: Re: Turkish Locale in Identifiers (contd.)
Date: 2004-02-03 21:55:47
Message-ID: 000801c3eaa0$82d2dcf0$5400a8c0@ntufar

Oops, forgot the patch :)

Attachment Content-Type Size
tr20040203.diff application/octet-stream 1.5 KB

From: Greg Stark <gsstark(at)mit(dot)edu>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 7.4.1 release status - Turkish Locale
Date: 2004-02-09 19:16:44
Message-ID: 87isigm5b7.fsf@stark.xeocode.com


Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> It seems to me that that's too narrow a definition of the problem.
> I think we should state our goal as "we don't want bizarre locale
> definitions to interfere with downcasing of the basic ASCII letters".
> If we put in a special case for 'I' we will fix the known problem
> with Turkish, but what other strange locales might be out there?
> And if we don't trust tolower() for 'I', why should we trust it
> for 'A'-'Z'?

But then wouldn't it be a little weird for Turkish table and column names to
treat "I" and "İ" (I think that's a dotted capital I) as equivalent to "i"
instead of "ı" and "i" respectively? (I think that first one was a dotless i.)

Perhaps what really ought to be happening is that the downcasing should be
done separately for keywords, or postponed until the point where it's checked
to see if it's a keyword. Then it could be done using an entirely
ascii-centric bit-twiddling implementation.

If it matches an SQL keyword after being downcased the old fashioned way, then
it's an SQL keyword. If not then the locale-aware tolower() would be
appropriate for tables, columns, etc.

But then perhaps that's unnecessarily complex.

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 7.4.1 release status - Turkish Locale
Date: 2004-02-09 19:51:30
Message-ID: 22808.1076356290@sss.pgh.pa.us

Greg Stark <gsstark(at)mit(dot)edu> writes:
> If it matches an SQL keyword after being downcased the old fashioned way, then
> it's an SQL keyword. If not then the locale-aware tolower() would be
> appropriate for tables, columns, etc.

That's exactly what we do already. The complaint was that the
locale-aware downcasing is broken (not to put it too finely) in Turkish
locales, leading to unexpected/unwanted results for identifiers that are
not keywords. My own opinion is that the correct response is to fix the
Turkish locale tables, but I can see where that might be beyond the
skills of the average Postgres user. Thus I thought a reasonable
compromise would be to override the locale for the handling of A-Z,
allowing it to determine what happens to high-bit-set characters only.
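
The keyword half of this arrangement, matching keywords after a purely ASCII downcasing, can be sketched like this. The three-entry table and the function name is_sample_keyword are hypothetical stand-ins for illustration; they are not the contents of keywords.c.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical miniature keyword table, all lower case. */
static const char *const sample_keywords[] = {"from", "select", "where"};

/*
 * Return 1 if word matches a table entry after folding only 'A'-'Z'.
 * No locale function is involved, so the result is the same under
 * every LC_CTYPE setting.
 */
int
is_sample_keyword(const char *word)
{
    char        buf[32];
    size_t      len = strlen(word);
    size_t      i;

    if (len >= sizeof(buf))
        return 0;               /* too long to be one of our keywords */

    for (i = 0; i < len; i++)
    {
        char        ch = word[i];

        if (ch >= 'A' && ch <= 'Z')
            ch += 'a' - 'A';
        buf[i] = ch;
    }
    buf[len] = '\0';

    for (i = 0; i < sizeof(sample_keywords) / sizeof(sample_keywords[0]); i++)
    {
        if (strcmp(buf, sample_keywords[i]) == 0)
            return 1;
    }
    return 0;
}
```

A word such as SERIAL is not in the table, so it falls through to whatever downcasing is applied to ordinary identifiers, which is where the Turkish-locale problem shows up.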

regards, tom lane


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 7.4.1 release status - Turkish Locale
Date: 2004-02-09 21:45:21
Message-ID: 874qu0lyfi.fsf@stark.xeocode.com


Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Greg Stark <gsstark(at)mit(dot)edu> writes:
> > If it matches an SQL keyword after being downcased the old fashioned way, then
> > it's an SQL keyword. If not then the locale-aware tolower() would be
> > appropriate for tables, columns, etc.
>
> That's exactly what we do already. The complaint was that the
> locale-aware downcasing is broken (not to put it too finely) in Turkish
> locales, leading to unexpected/unwanted results for identifiers that are
> not keywords.

But the example given was "SERIAL". "serial" is an English word, not a Turkish
word. It shouldn't really be subject to Turkish locale effects at all. Perhaps
"keyword" wasn't the right word in my message.

I'm wondering if he really expects all identifiers to be subject to this ascii
downcasing. Like, if he had a GÜNAYDIN column he might be surprised when
günaydın (where ı is the lowercase dotless i) says column "günaydın" doesn't
exist.

Or is the real problem simply that both styles of i really ought to match all
the time, ie, that they should really be considered the same letter for
matches? I wonder if there are other locales where that's an issue.

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 7.4.1 release status - Turkish Locale
Date: 2004-02-09 22:10:01
Message-ID: 24974.1076364601@sss.pgh.pa.us

Greg Stark <gsstark(at)mit(dot)edu> writes:
> But the example given was "SERIAL". "serial" is an English word, not a
> Turkish word. It shouldn't really be subject to Turkish locale effects
> at all.

SERIAL is not a keyword according to the grammar. Neither are PUBLIC,
VOID, INT4, and numerous other examples. It's not appropriate to try to
fix this by making them all keywords --- that will just create other
problems. (And where do you draw the line, anyway? Should every
identifier present in the default system catalogs become a keyword?)

> I'm wondering if he really expects all identifiers to be subject to
> this ascii downcasing.

Without doubt it isn't ideal, but if we don't do something then a lot of
stuff starting with initdb is broken. We could perhaps work around the
problem by spelling everything in lower-case in all the commands we
issue, but I can't see that as an acceptable answer either. We can't
expect to control all the SQL sent to a database.

regards, tom lane


From: "Nicolai Tufar" <ntufar(at)pisem(dot)net>
To: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "'Greg Stark'" <gsstark(at)mit(dot)edu>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 7.4.1 release status - Turkish Locale
Date: 2004-02-19 20:31:08
Message-ID: 000601c3f727$5dbd7b30$6400a8c0@ntufar

Sorry for raising an old issue again, but the problem still persists,
and a database cluster still cannot be created with the Turkish locale.

If you have any doubts about how Turkish users will react to the fact
that both "I" and "I WITH DOT" will be treated as the same character, rest
assured that this behavior is the de facto standard when it comes to file
names, identifiers and commands. Greg Stark and Devrim Gunduz will confirm
that, no doubt. Please review and apply this patch, which I am sending for
the third time. You will not regret it, and many users will be grateful.
Please note that to my knowledge it will not break any other locales.

Best regards,
Nicolai

Attachment Content-Type Size
tr20040203.diff application/octet-stream 1.5 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: ntufar(at)pisem(dot)net
Cc: "'Greg Stark'" <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 7.4.1 release status - Turkish Locale
Date: 2004-02-21 00:37:50
Message-ID: 8092.1077323870@sss.pgh.pa.us

"Nicolai Tufar" <ntufar(at)pisem(dot)net> writes:
> Sorry for raising an old issue again, but the problem still persists,
> and a database cluster still cannot be created with the Turkish locale.

I've committed the attached fix, which I believe will solve this
problem. Could you test it?

(Patch is against 7.4 branch)

regards, tom lane

Attachment Content-Type Size
downcase.patch application/octet-stream 18.6 KB

From: "Nicolai Tufar" <ntufar(at)pisem(dot)net>
To: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "'Greg Stark'" <gsstark(at)mit(dot)edu>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 7.4.1 release status - Turkish Locale
Date: 2004-02-21 02:23:17
Message-ID: 000001c3f821$ac114cd0$6400a8c0@ntufar

> -----Original Message-----
> From: Tom Lane [mailto:tgl(at)sss(dot)pgh(dot)pa(dot)us]
>
> I've committed the attached fix, which I believe will solve this
> problem. Could you test it?

Thank you very much for your effort and attention!

I am not sure I am testing the right version. I am testing the
one from REL7_4_STABLE, with the downcase_truncate_identifier()
function added.

Under locale-ignorant FreeBSD it works fine.
But under Fedora Core 1, initdb crashes under all the
locales I tested (C, en_US, tr_TR) with the message given below.

I remember seeing this message before, when I messed with the downcasing
functions. See, it is downcasting "ISO" and getting "ıso" in return. Could
someone confirm the results I got?

Regards,
Nicolai

--------------------------------------------------------
fixing permissions on existing directory /pgdata... ok
creating directory /pgdata/base... ok
creating directory /pgdata/global... ok
creating directory /pgdata/pg_xlog... ok
creating directory /pgdata/pg_clog... ok
selecting default max_connections... 10
selecting default shared_buffers... 50
creating configuration files... ok
creating template1 database in /pgdata/base/1... FATAL: XX000: failed to initialize DateStyle to "ISO, MDY"
LOCATION: InitializeGUCOptions, guc.c:1866

initdb: failed


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: ntufar(at)pisem(dot)net
Cc: "'Greg Stark'" <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 7.4.1 release status - Turkish Locale
Date: 2004-02-21 04:36:47
Message-ID: 9106.1077338207@sss.pgh.pa.us

"Nicolai Tufar" <ntufar(at)pisem(dot)net> writes:
> Under locale-ignorant FreeBSD it works fine.
> But under Fedora Core 1, initdb crashes under all the
> locales I tested (C, en_US, tr_TR) with the message given below.

Hmm. It seems that tr_TR has problems much more extensive than you've
indicated previously. I was able to get through initdb with the attached
additional patch, but the regression tests fail in several places.
It looks to me like every use of strcasecmp in the backend has to be
questioned if we're going to make this work. I'm starting to lean in
the direction of "tr_TR is hopelessly broken" again...

regards, tom lane

*** src/backend/commands/variable.c~ Mon Jan 19 14:04:40 2004
--- src/backend/commands/variable.c Fri Feb 20 23:16:16 2004
***************
*** 82,103 ****

/* Ugh. Somebody ought to write a table driven version -- mjl */

! if (strcasecmp(tok, "ISO") == 0)
{
newDateStyle = USE_ISO_DATES;
scnt++;
}
! else if (strcasecmp(tok, "SQL") == 0)
{
newDateStyle = USE_SQL_DATES;
scnt++;
}
! else if (strncasecmp(tok, "POSTGRES", 8) == 0)
{
newDateStyle = USE_POSTGRES_DATES;
scnt++;
}
! else if (strcasecmp(tok, "GERMAN") == 0)
{
newDateStyle = USE_GERMAN_DATES;
scnt++;
--- 82,108 ----

/* Ugh. Somebody ought to write a table driven version -- mjl */

! /*
! * Note: SplitIdentifierString already downcased the input, so
! * we needn't use strcasecmp here.
! */
!
! if (strcmp(tok, "iso") == 0)
{
newDateStyle = USE_ISO_DATES;
scnt++;
}
! else if (strcmp(tok, "sql") == 0)
{
newDateStyle = USE_SQL_DATES;
scnt++;
}
! else if (strncmp(tok, "postgres", 8) == 0)
{
newDateStyle = USE_POSTGRES_DATES;
scnt++;
}
! else if (strcmp(tok, "german") == 0)
{
newDateStyle = USE_GERMAN_DATES;
scnt++;
***************
*** 105,129 ****
if (ocnt == 0)
newDateOrder = DATEORDER_DMY;
}
! else if (strcasecmp(tok, "YMD") == 0)
{
newDateOrder = DATEORDER_YMD;
ocnt++;
}
! else if (strcasecmp(tok, "DMY") == 0 ||
! strncasecmp(tok, "EURO", 4) == 0)
{
newDateOrder = DATEORDER_DMY;
ocnt++;
}
! else if (strcasecmp(tok, "MDY") == 0 ||
! strcasecmp(tok, "US") == 0 ||
! strncasecmp(tok, "NONEURO", 7) == 0)
{
newDateOrder = DATEORDER_MDY;
ocnt++;
}
! else if (strcasecmp(tok, "DEFAULT") == 0)
{
/*
* Easiest way to get the current DEFAULT state is to fetch
--- 110,134 ----
if (ocnt == 0)
newDateOrder = DATEORDER_DMY;
}
! else if (strcmp(tok, "ymd") == 0)
{
newDateOrder = DATEORDER_YMD;
ocnt++;
}
! else if (strcmp(tok, "dmy") == 0 ||
! strncmp(tok, "euro", 4) == 0)
{
newDateOrder = DATEORDER_DMY;
ocnt++;
}
! else if (strcmp(tok, "mdy") == 0 ||
! strcmp(tok, "us") == 0 ||
! strncmp(tok, "noneuro", 7) == 0)
{
newDateOrder = DATEORDER_MDY;
ocnt++;
}
! else if (strcmp(tok, "default") == 0)
{
/*
* Easiest way to get the current DEFAULT state is to fetch
***************
*** 474,480 ****
HasCTZSet = true;
}
}
! else if (strcasecmp(value, "UNKNOWN") == 0)
{
/*
* UNKNOWN is the value shown as the "default" for TimeZone in
--- 479,485 ----
HasCTZSet = true;
}
}
! else if (strcasecmp(value, "unknown") == 0)
{
/*
* UNKNOWN is the value shown as the "default" for TimeZone in
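
The concern about strcasecmp is that it folds case through the current locale, so "ISO" and "iso" can stop comparing equal under tr_TR. Where the input has not already been downcased, one way around this is a comparison that folds only the basic ASCII letters. The following is a sketch with a hypothetical name, ascii_strcasecmp, and is not part of the patch above.

```c
/* Fold 'A'-'Z' to lower case; leave every other byte untouched. */
static unsigned char
ascii_lower(unsigned char ch)
{
    if (ch >= 'A' && ch <= 'Z')
        ch += 'a' - 'A';
    return ch;
}

/*
 * Hypothetical locale-independent replacement for strcasecmp().
 * Returns a negative, zero or positive value like strcmp().
 */
int
ascii_strcasecmp(const char *s1, const char *s2)
{
    for (;;)
    {
        unsigned char c1 = ascii_lower((unsigned char) *s1++);
        unsigned char c2 = ascii_lower((unsigned char) *s2++);

        if (c1 != c2)
            return (c1 < c2) ? -1 : 1;
        if (c1 == '\0')
            return 0;
    }
}
```

Since no locale function is called, the result for option names such as "ISO" or "MDY" does not depend on LC_CTYPE.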


From: "Nicolai Tufar" <ntufar(at)pisem(dot)net>
To: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "'Greg Stark'" <gsstark(at)mit(dot)edu>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 7.4.1 release status - Turkish Locale
Date: 2004-02-21 07:52:32
Message-ID: 000001c3f84f$ab542870$6400a8c0@ntufar

> -----Original Message-----
> From: Tom Lane [mailto:tgl(at)sss(dot)pgh(dot)pa(dot)us]
> Hmm. It seems that tr_TR has problems much more extensive than you've
> indicated previously. I was able to get through initdb with the attached
> additional patch, but the regression tests fail in several places.
> It looks to me like every use of strcasecmp in the backend has to be
> questioned if we're going to make this work. I'm starting to lean in
> the direction of "tr_TR is hopelessly broken" again...

With this patch applied everything works fine. Thanks!
What do you plan to do? Will you apply it? And if so, will
you apply both of the modifications to 7.5devel as well?

Thanks again for your effort.

Best regards,
Nicolai


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: ntufar(at)pisem(dot)net
Cc: "'Greg Stark'" <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 7.4.1 release status - Turkish Locale
Date: 2004-02-21 16:12:29
Message-ID: 11996.1077379949@sss.pgh.pa.us

"Nicolai Tufar" <ntufar(at)pisem(dot)net> writes:
>> It looks to me like every use of strcasecmp in the backend has to be
>> questioned if we're going to make this work. I'm starting to lean in
>> the direction of "tr_TR is hopelessly broken" again...

> With this patch applied everything works fine. Thanks!

Did you try running the regression tests under tr_TR locale? It seems
a few bricks short of "fine" yet :-(

regards, tom lane


From: "Nicolai Tufar" <ntufar(at)pisem(dot)net>
To: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "'Greg Stark'" <gsstark(at)mit(dot)edu>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 7.4.1 release status - Turkish Locale
Date: 2004-02-21 22:40:26
Message-ID: 000001c3f8cb$b5a0f520$6400a8c0@ntufar

> -----Original Message-----
> From: Tom Lane
> Did you try running the regression tests under tr_TR locale? It seems
> a few bricks short of "fine" yet :-(

I ran the regression tests under the tr_TR locale. To do this I hardcoded
the Turkish locale in the initdb call in pg_regress.sh. Three tests failed;
I attached the resulting diff.

With days of the week, the same downcasing problem occurs. I
think it is not that crucial, but the rest of the differences in the
file seem to be important. I was not able to interpret them.

Thanks,
Nicolai

>
> regards, tom lane

Attachment Content-Type Size
regression.diffs application/octet-stream 23.4 KB