Bug in UTF8-Validation Code?

Lists: pgsql-hackers
From: Mario Weilguni <mweilguni(at)sime(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Bug in UTF8-Validation Code?
Date: 2007-03-13 11:00:58
Message-ID: 200703131200.58918.mweilguni@sime.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

I've a problem with a database, I can dump the database to a file, but
restoration fails, happens with 8.1.4.

Steps to reproduce:
create database testdb with encoding='UTF8';
\c testdb
create table test(x text);
insert into test values ('\244'); ==> Is akzepted, even if not UTF8.

pg_dump testdb -f testdb.dump -Fc
pg_restore -f testdb.dump -d testdb => fails with an error:
ERROR: invalid byte sequence for encoding "UTF8": 0xa4

The problem itself comes from a CSV file, which is imported with \copy without
proper quoting (so I have to fix this anyway), but I still think this is an
error, making restoration very complicated in such cases...

Or am I doing something completly wrong here?

Best regards,
Mario Weilguni


From: "Albe Laurenz" <all(at)adv(dot)magwien(dot)gv(dot)at>
To: "Mario Weilguni *EXTERN*" <mweilguni(at)sime(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-03-13 13:46:26
Message-ID: AFCCBB403D7E7A4581E48F20AF3E5DB201A87833@EXADV1.host.magwien.gv.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Mario Weilguni wrote:

> Steps to reproduce:
> create database testdb with encoding='UTF8';
> \c testdb
> create table test(x text);
> insert into test values ('\244'); ==> Is akzepted, even if not UTF8.

This is working as expected, see the remark in
http://www.postgresql.org/docs/current/static/sql-syntax-lexical.html#SQ
L-SYNTAX-STRINGS

"It is your responsibility that the byte sequences you create
are valid characters in the server character set encoding."

Yours,
Laurenz Albe


From: Mario Weilguni <mweilguni(at)sime(dot)com>
To: "Albe Laurenz" <all(at)adv(dot)magwien(dot)gv(dot)at>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-03-13 13:55:45
Message-ID: 200703131455.45209.mweilguni@sime.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Am Dienstag, 13. März 2007 14:46 schrieb Albe Laurenz:
> Mario Weilguni wrote:
> > Steps to reproduce:
> > create database testdb with encoding='UTF8';
> > \c testdb
> > create table test(x text);
> > insert into test values ('\244'); ==> Is akzepted, even if not UTF8.
>
> This is working as expected, see the remark in
> http://www.postgresql.org/docs/current/static/sql-syntax-lexical.html#SQ
> L-SYNTAX-STRINGS
>
> "It is your responsibility that the byte sequences you create
> are valid characters in the server character set encoding."

In that case, pg_dump is doing wrong here and should quote the output. IMO it
cannot be defined as working as expected, when this makes any database dumps
worthless, without any warnings at dump-time.

pg_dump should output \244 itself in that case.

Best regards,
Mario Weilguni


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Mario Weilguni <mweilguni(at)sime(dot)com>
Cc: Albe Laurenz <all(at)adv(dot)magwien(dot)gv(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-03-13 14:12:55
Message-ID: 45F6B167.8070401@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Mario Weilguni wrote:
> Am Dienstag, 13. März 2007 14:46 schrieb Albe Laurenz:
>
>> Mario Weilguni wrote:
>>
>>> Steps to reproduce:
>>> create database testdb with encoding='UTF8';
>>> \c testdb
>>> create table test(x text);
>>> insert into test values ('\244'); ==> Is akzepted, even if not UTF8.
>>>
>> This is working as expected, see the remark in
>> http://www.postgresql.org/docs/current/static/sql-syntax-lexical.html#SQ
>> L-SYNTAX-STRINGS
>>
>> "It is your responsibility that the byte sequences you create
>> are valid characters in the server character set encoding."
>>
>
> In that case, pg_dump is doing wrong here and should quote the output. IMO it
> cannot be defined as working as expected, when this makes any database dumps
> worthless, without any warnings at dump-time.
>
> pg_dump should output \244 itself in that case.
>
>

The sentence quoted from the docs is perhaps less than a model of
clarity. I would take it to mean that no client-encoding ->
server-encoding translation will take place. Does it really mean that
the server will happily accept any escaped byte sequence, whether or not
it is valid for the server encoding? If so that seems ... odd.

cheers

andrew


From: Mario Weilguni <mweilguni(at)sime(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Albe Laurenz <all(at)adv(dot)magwien(dot)gv(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-03-13 14:24:01
Message-ID: 200703131524.01929.mweilguni@sime.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Am Dienstag, 13. März 2007 15:12 schrieb Andrew Dunstan:
> The sentence quoted from the docs is perhaps less than a model of
> clarity. I would take it to mean that no client-encoding ->
> server-encoding translation will take place. Does it really mean that
> the server will happily accept any escaped byte sequence, whether or not
> it is valid for the server encoding? If so that seems ... odd.

Yes, \octal sequences are accepted even if invalid. The problem is, pgdump
will happily dump those sequences as is, so in that case a char ascii 0xa4 is
emitted, and so the dump cannot be restored with pg_restore.

A dangerous feature IMO, and will make a lot of DB admins very unhappy if they
have to validate every day if the precious database dumps can be restored in
case of an error.

Best regards,
Mario Weilguni


From: "Albe Laurenz" <all(at)adv(dot)magwien(dot)gv(dot)at>
To: "Mario Weilguni *EXTERN*" <mweilguni(at)sime(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-03-13 14:50:25
Message-ID: AFCCBB403D7E7A4581E48F20AF3E5DB201AC06DC@EXADV1.host.magwien.gv.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Mario Weilguni wrote:
>>> Steps to reproduce:
>>> create database testdb with encoding='UTF8';
>>> \c testdb
>>> create table test(x text);
>>> insert into test values ('\244'); ==> Is akzepted, even if not UTF8.
>>
>> This is working as expected, see the remark in
>>
>> http://www.postgresql.org/docs/current/static/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS
>>
>> "It is your responsibility that the byte sequences you create
>> are valid characters in the server character set encoding."
>
> In that case, pg_dump is doing wrong here and should quote the output. IMO it
> cannot be defined as working as expected, when this makes any database dumps
> worthless, without any warnings at dump-time.
>
> pg_dump should output \244 itself in that case.

True. Here is a test case on 8.2.3
(OS, database and client all use UTF8):

test=> CREATE TABLE test(x text);
CREATE TABLE
test=> INSERT INTO test VALUES ('correct: ä');
INSERT 0 1
test=> INSERT INTO test VALUES (E'incorrect: \244');
INSERT 0 1
test=> \q
laurenz:~> pg_dump -d -t test -f test.sql

Here is an excerpt from 'od -c test.sql':

0001040 e n z \n - - \n \n I N S E R T I
0001060 N T O t e s t V A L U E S
0001100 ( ' c o r r e c t : 303 244 ' ) ;
0001120 \n I N S E R T I N T O t e s
0001140 t V A L U E S ( ' i n c o r
0001160 r e c t : 244 ' ) ; \n \n \n - - \n

The invalid character (octal 244) is in the INSERT statement!

This makes psql gag:

test=> DROP TABLE test;
DROP TABLE
test=> \i test.sql
SET
SET
SET
SET
SET
SET
SET
SET
CREATE TABLE
ALTER TABLE
INSERT 0 1
psql:test.sql:33: ERROR: invalid byte sequence for encoding "UTF8": 0xa4
HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".

A fix could be either that the server checks escape sequences for validity
or that pg_dump outputs invalid bytes as escape sequences.
Or pg_dump could stop with an error.
I think that the cleanest way would be the first.

Yours,
Laurenz Albe


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Albe Laurenz <all(at)adv(dot)magwien(dot)gv(dot)at>
Cc: Mario Weilguni *EXTERN* <mweilguni(at)sime(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-03-13 15:12:33
Message-ID: 45F6BF61.3060601@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Albe Laurenz wrote:
> A fix could be either that the server checks escape sequences for validity
>

This strikes me as essential. If the db has a certain encoding ISTM we
are promising that all the text data is valid for that encoding.

The question in my mind is how we help people to recover from the fact
that we haven't done that.

cheers

andrew


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Albe Laurenz <all(at)adv(dot)magwien(dot)gv(dot)at>, Mario Weilguni *EXTERN* <mweilguni(at)sime(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-03-13 15:38:01
Message-ID: 45F6C559.7060909@commandprompt.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Dunstan wrote:
> Albe Laurenz wrote:
>> A fix could be either that the server checks escape sequences for
>> validity
>>
>
> This strikes me as essential. If the db has a certain encoding ISTM we
> are promising that all the text data is valid for that encoding.
>
> The question in my mind is how we help people to recover from the fact
> that we haven't done that.

Is this any different than the issues of moving 8.0.x to 8.1 UTF8? Where
we had to use iconv?

Joshua D. Drake

>
> cheers
>
> andrew
>
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings
>

--

=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/


From: Mario Weilguni <mweilguni(at)sime(dot)com>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Albe Laurenz <all(at)adv(dot)magwien(dot)gv(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-03-13 15:42:35
Message-ID: 200703131642.36079.mweilguni@sime.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Am Dienstag, 13. März 2007 16:38 schrieb Joshua D. Drake:
> Andrew Dunstan wrote:
> > Albe Laurenz wrote:
> >> A fix could be either that the server checks escape sequences for
> >> validity
> >
> > This strikes me as essential. If the db has a certain encoding ISTM we
> > are promising that all the text data is valid for that encoding.
> >
> > The question in my mind is how we help people to recover from the fact
> > that we haven't done that.
>
> Is this any different than the issues of moving 8.0.x to 8.1 UTF8? Where
> we had to use iconv?
>

What issues? I've upgraded several 8.0 database to 8.1. without having to use
iconv. Did I miss something?

Regards,
Mario Weilguni


From: Michael Paesold <mpaesold(at)gmx(dot)at>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Albe Laurenz <all(at)adv(dot)magwien(dot)gv(dot)at>, Mario Weilguni *EXTERN* <mweilguni(at)sime(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-03-14 07:01:53
Message-ID: 45F79DE1.1070700@gmx.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Dunstan wrote:
> Albe Laurenz wrote:
>> A fix could be either that the server checks escape sequences for
>> validity
>>
>
> This strikes me as essential. If the db has a certain encoding ISTM we
> are promising that all the text data is valid for that encoding.
>
> The question in my mind is how we help people to recover from the fact
> that we haven't done that.

I would also say that it's a bug that escape sequences can get characters
into the database that are not valid in the specified encoding. If you
compare the encoding to table constraints, there is no way to simply
"escape" a constraint check.

This seems to violate the principle of consistency in ACID. Additionally,
if you include pg_dump into ACID, it also violates durability, since it
cannot restore what it wrote itself.
Is there anything in the SQL spec that asks for such a behaviour? I guess not.

A DBA will usually not even learn about this issue until they are presented
with a failing restore.

Best Regards,
Michael Paesold


From: Michael Fuhr <mike(at)fuhr(dot)org>
To: Mario Weilguni <mweilguni(at)sime(dot)com>
Cc: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Albe Laurenz <all(at)adv(dot)magwien(dot)gv(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-03-14 07:29:49
Message-ID: 20070314072949.GA4862@winnie.fuhr.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Mar 13, 2007 at 04:42:35PM +0100, Mario Weilguni wrote:
> Am Dienstag, 13. März 2007 16:38 schrieb Joshua D. Drake:
> > Is this any different than the issues of moving 8.0.x to 8.1 UTF8? Where
> > we had to use iconv?
>
> What issues? I've upgraded several 8.0 database to 8.1. without having to use
> iconv. Did I miss something?

http://www.postgresql.org/docs/8.1/interactive/release-8-1.html

"Some users are having problems loading UTF-8 data into 8.1.X. This
is because previous versions allowed invalid UTF-8 byte sequences
to be entered into the database, and this release properly accepts
only valid UTF-8 sequences. One way to correct a dumpfile is to run
the command iconv -c -f UTF-8 -t UTF-8 -o cleanfile.sql dumpfile.sql."

--
Michael Fuhr


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Michael Paesold <mpaesold(at)gmx(dot)at>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Albe Laurenz <all(at)adv(dot)magwien(dot)gv(dot)at>, Mario Weilguni *EXTERN* <mweilguni(at)sime(dot)com>
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-03-14 09:05:31
Message-ID: 200703141005.33119.peter_e@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Am Mittwoch, 14. März 2007 08:01 schrieb Michael Paesold:
> Is there anything in the SQL spec that asks for such a behaviour? I guess
> not.

I think that the octal escapes are a holdover from the single-byte days where
they were simply a way to enter characters that are difficult to find on a
keyboard. In today's multi-encoding world, it would make more sense if there
were an escape sequence for a *codepoint* which is then converted to the
actual encoding (if possible and valid) in the server. The meaning of
codepoint is, however, character set dependent as well.

The SQL standard supports escape sequences for Unicode codepoints, which I
think would be a very useful feature (try entering a UTF-8 character
bytewise ...), but it's a bit weird to implement and it's not clear how to
handle character sets other than Unicode.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/


From: Mario Weilguni <mweilguni(at)sime(dot)com>
To: Michael Paesold <mpaesold(at)gmx(dot)at>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Albe Laurenz <all(at)adv(dot)magwien(dot)gv(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-03-16 11:17:14
Message-ID: 200703161217.15110.mweilguni@sime.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Am Mittwoch, 14. März 2007 08:01 schrieb Michael Paesold:
> Andrew Dunstan wrote:
> >
> > This strikes me as essential. If the db has a certain encoding ISTM we
> > are promising that all the text data is valid for that encoding.
> >
> > The question in my mind is how we help people to recover from the fact
> > that we haven't done that.
>
> I would also say that it's a bug that escape sequences can get characters
> into the database that are not valid in the specified encoding. If you
> compare the encoding to table constraints, there is no way to simply
> "escape" a constraint check.
>
> This seems to violate the principle of consistency in ACID. Additionally,
> if you include pg_dump into ACID, it also violates durability, since it
> cannot restore what it wrote itself.
> Is there anything in the SQL spec that asks for such a behaviour? I guess
> not.
>
> A DBA will usually not even learn about this issue until they are presented
> with a failing restore.

Is there anything I can do to help with this problem? Maybe implementing a new
GUC variable that turns off accepting wrong encoded sequences (so DBAs still
can turn it on if they really depend on it)?

For me,

Best regards,
Mario Weilguni


From: "Albe Laurenz" <all(at)adv(dot)magwien(dot)gv(dot)at>
To: "Mario Weilguni *EXTERN*" <mweilguni(at)sime(dot)com>, "Michael Paesold" <mpaesold(at)gmx(dot)at>
Cc: "Andrew Dunstan" <andrew(at)dunslane(dot)net>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-03-16 12:25:06
Message-ID: AFCCBB403D7E7A4581E48F20AF3E5DB201B95600@EXADV1.host.magwien.gv.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Mario Weilguni wrote:
> Is there anything I can do to help with this problem? Maybe
implementing a new
> GUC variable that turns off accepting wrong encoded sequences (so DBAs
still
> can turn it on if they really depend on it)?

I think that this should be done away with unconditionally.
Or does anybody have a good point for allowing corrupt data
in text columns?
Maybe it is the way it is now because nobody could be bothered
to add the appropriate checks...

Yours,
Laurenz Albe


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Albe Laurenz <all(at)adv(dot)magwien(dot)gv(dot)at>
Cc: Mario Weilguni *EXTERN* <mweilguni(at)sime(dot)com>, Michael Paesold <mpaesold(at)gmx(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-03-16 13:06:08
Message-ID: 45FA9640.6090201@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Albe Laurenz wrote:
> Mario Weilguni wrote:
>
>> Is there anything I can do to help with this problem? Maybe
>>
> implementing a new
>
>> GUC variable that turns off accepting wrong encoded sequences (so DBAs
>>
> still
>
>> can turn it on if they really depend on it)?
>>
>
> I think that this should be done away with unconditionally.
> Or does anybody have a good point for allowing corrupt data
> in text columns?
> Maybe it is the way it is now because nobody could be bothered
> to add the appropriate checks...
>
>

I agree. It's more or less an integrity violation, IMNSHO.

cheers

andrew


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Mario Weilguni <mweilguni(at)sime(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-03-16 18:14:22
Message-ID: 1174068862.23455.331.camel@dogma.v10.wvs
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 2007-03-13 at 12:00 +0100, Mario Weilguni wrote:
> Hi,
>
> I've a problem with a database, I can dump the database to a file, but
> restoration fails, happens with 8.1.4.

I reported the same problem a while back:

http://archives.postgresql.org/pgsql-bugs/2006-10/msg00246.php

Some people think it's a bug, some people don't. It is technically
documented behavior, but I don't think the documentation is clear
enough. I think it is a bug that should be fixed, and here's another
message in the thread that expresses my opinion:

http://archives.postgresql.org/pgsql-bugs/2006-11/msg00033.php

If you look at that email, it includes some examples of surprising
behaviors caused by that bug, particularly with bytea.

In some applications (for which it's impractical to change the source
code), I actually use a CHECK constraint (which raises an exception on
invalid utf8 data) on every text column so that some binary data doesn't
break my slony replication.

I'd like to see this fixed.

Regards,
Jeff Davis


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Michael Fuhr <mike(at)fuhr(dot)org>
Cc: Mario Weilguni <mweilguni(at)sime(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Albe Laurenz <all(at)adv(dot)magwien(dot)gv(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-03-16 18:23:08
Message-ID: 1174069388.23455.337.camel@dogma.v10.wvs
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, 2007-03-14 at 01:29 -0600, Michael Fuhr wrote:
> On Tue, Mar 13, 2007 at 04:42:35PM +0100, Mario Weilguni wrote:
> > Am Dienstag, 13. März 2007 16:38 schrieb Joshua D. Drake:
> > > Is this any different than the issues of moving 8.0.x to 8.1 UTF8? Where
> > > we had to use iconv?
> >
> > What issues? I've upgraded several 8.0 database to 8.1. without having to use
> > iconv. Did I miss something?
>
> http://www.postgresql.org/docs/8.1/interactive/release-8-1.html
>
> "Some users are having problems loading UTF-8 data into 8.1.X. This
> is because previous versions allowed invalid UTF-8 byte sequences
> to be entered into the database, and this release properly accepts
> only valid UTF-8 sequences. One way to correct a dumpfile is to run
> the command iconv -c -f UTF-8 -t UTF-8 -o cleanfile.sql dumpfile.sql."
>

If the above quote were actually true, then Mario wouldn't be having a
problem. Instead, it's half-true: Invalid byte sequences are rejected in
some situations and accepted in others. If postgresql consistently
rejected or consistently accepted invalid byte sequences, that would not
cause problems with COPY (meaning problems with pg_dump, slony, etc.).

Regards,
Jeff Davis


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Michael Fuhr <mike(at)fuhr(dot)org>, Mario Weilguni <mweilguni(at)sime(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Albe Laurenz <all(at)adv(dot)magwien(dot)gv(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-03-17 15:46:01
Message-ID: 45FC0D39.8090802@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jeff Davis wrote:
> On Wed, 2007-03-14 at 01:29 -0600, Michael Fuhr wrote:
>
>> On Tue, Mar 13, 2007 at 04:42:35PM +0100, Mario Weilguni wrote:
>>
>>> Am Dienstag, 13. März 2007 16:38 schrieb Joshua D. Drake:
>>>
>>>> Is this any different than the issues of moving 8.0.x to 8.1 UTF8? Where
>>>> we had to use iconv?
>>>>
>>> What issues? I've upgraded several 8.0 database to 8.1. without having to use
>>> iconv. Did I miss something?
>>>
>> http://www.postgresql.org/docs/8.1/interactive/release-8-1.html
>>
>> "Some users are having problems loading UTF-8 data into 8.1.X. This
>> is because previous versions allowed invalid UTF-8 byte sequences
>> to be entered into the database, and this release properly accepts
>> only valid UTF-8 sequences. One way to correct a dumpfile is to run
>> the command iconv -c -f UTF-8 -t UTF-8 -o cleanfile.sql dumpfile.sql."
>>
>>
>
> If the above quote were actually true, then Mario wouldn't be having a
> problem. Instead, it's half-true: Invalid byte sequences are rejected in
> some situations and accepted in others. If postgresql consistently
> rejected or consistently accepted invalid byte sequences, that would not
> cause problems with COPY (meaning problems with pg_dump, slony, etc.).
>
>
>

How can we fix this? Frankly, the statement in the docs warning about
making sure that escaped sequences are valid in the server encoding is a
cop-out. We don't accept invalid data elsewhere, and this should be no
different IMNSHO. I don't see why this should be any different from,
say, date or numeric data. For years people have sneered at MySQL
because it accepted dates like Feb 31st, and rightly so. But this seems
to me to be like our own version of the same problem.

Last year Jeff suggested adding something like:

pg_verifymbstr(string,strlen(string),0);

to each relevant input routine. Would that be an acceptable solution? If
not, what would be?

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Michael Fuhr <mike(at)fuhr(dot)org>, Mario Weilguni <mweilguni(at)sime(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Albe Laurenz <all(at)adv(dot)magwien(dot)gv(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-03-17 16:10:06
Message-ID: 11744.1174147806@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Last year Jeff suggested adding something like:
> pg_verifymbstr(string,strlen(string),0);
> to each relevant input routine. Would that be an acceptable solution?

The problem with that is that it duplicates effort: in many cases
(especially COPY IN) the data's already been validated. I'm not sure
how to fix that, but I think you'll get some push-back if you double
the encoding verification work in COPY for nothing.

Given that we are moving away from backslash-enabled literals, I'm
not as convinced as some that this must be fixed...

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Michael Fuhr <mike(at)fuhr(dot)org>, Mario Weilguni <mweilguni(at)sime(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Albe Laurenz <all(at)adv(dot)magwien(dot)gv(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-03-17 16:23:07
Message-ID: 45FC15EB.8030508@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>
>> Last year Jeff suggested adding something like:
>> pg_verifymbstr(string,strlen(string),0);
>> to each relevant input routine. Would that be an acceptable solution?
>>
>
> The problem with that is that it duplicates effort: in many cases
> (especially COPY IN) the data's already been validated. I'm not sure
> how to fix that, but I think you'll get some push-back if you double
> the encoding verification work in COPY for nothing.
>
> Given that we are moving away from backslash-enabled literals, I'm
> not as convinced as some that this must be fixed...
>
>
>
>

They will still be available in E'\nn' form, won't they?

One thought I had was that it might make sense to have a flag that would
inhibit the check, that could be set (and reset) by routines that check
for themselves, such as COPY IN. Then bulk load performance should not
be hit much.

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Michael Fuhr <mike(at)fuhr(dot)org>, Mario Weilguni <mweilguni(at)sime(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Albe Laurenz <all(at)adv(dot)magwien(dot)gv(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-03-17 17:29:48
Message-ID: 12375.1174152588@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Tom Lane wrote:
>> The problem with that is that it duplicates effort: in many cases
>> (especially COPY IN) the data's already been validated.

> One thought I had was that it might make sense to have a flag that would
> inhibit the check, that could be set (and reset) by routines that check
> for themselves, such as COPY IN. Then bulk load performance should not
> be hit much.

Actually, I have to take back that objection: on closer look, COPY
validates the data only once and does so before applying its own
backslash-escaping rules. So there is a risk in that path too.

It's still pretty annoying to be validating the data twice in the
common case where no backslash reduction occurred, but I'm not sure
I see any good way to avoid it. I don't much want to add another
argument to input functions, and the global flag that you suggest
above seems too ugly/risky.

Would someone do some performance checking on the cost of adding
mbverify to textin()? If it could be shown that it adds only
negligible overhead to COPY, on say hundred-byte-wide text fields,
then we could decide that this isn't worth worrying about.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Michael Fuhr <mike(at)fuhr(dot)org>, Mario Weilguni <mweilguni(at)sime(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Albe Laurenz <all(at)adv(dot)magwien(dot)gv(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-03-17 17:51:02
Message-ID: 12557.1174153862@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> Actually, I have to take back that objection: on closer look, COPY
> validates the data only once and does so before applying its own
> backslash-escaping rules. So there is a risk in that path too.

> It's still pretty annoying to be validating the data twice in the
> common case where no backslash reduction occurred, but I'm not sure
> I see any good way to avoid it.

Further thought here: if we put encoding verification into textin()
and related functions, could we *remove* it from COPY IN, in the common
case where client and server encodings are the same? Currently, copy.c
forces a trip through pg_client_to_server for multibyte encodings
even when the encodings are the same, so as to perform validation.
But I'm wondering whether we'd still need that. There's no risk of
SQL injection in COPY data. Bogus input encoding could possibly
make for confusion about where the field boundaries are, but bad
data is bad data in any case.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Michael Fuhr <mike(at)fuhr(dot)org>, Mario Weilguni <mweilguni(at)sime(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Albe Laurenz <all(at)adv(dot)magwien(dot)gv(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-03-17 20:28:53
Message-ID: 45FC4F85.7090804@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> I wrote:
>
>> Actually, I have to take back that objection: on closer look, COPY
>> validates the data only once and does so before applying its own
>> backslash-escaping rules. So there is a risk in that path too.
>>
>
>
>> It's still pretty annoying to be validating the data twice in the
>> common case where no backslash reduction occurred, but I'm not sure
>> I see any good way to avoid it.
>>
>
> Further thought here: if we put encoding verification into textin()
> and related functions, could we *remove* it from COPY IN, in the common
> case where client and server encodings are the same? Currently, copy.c
> forces a trip through pg_client_to_server for multibyte encodings
> even when the encodings are the same, so as to perform validation.
> But I'm wondering whether we'd still need that. There's no risk of
> SQL injection in COPY data. Bogus input encoding could possibly
> make for confusion about where the field boundaries are, but bad
> data is bad data in any case.
>
> regards, tom lane
>
>

Here are some timing tests in 1m rows of random utf8 encoded 100 char
data. It doesn't look to me like the saving you're suggesting is worth
the trouble.

baseline:

Time: 28228.325 ms
Time: 25987.740 ms
Time: 25950.707 ms
Time: 25756.371 ms
Time: 27589.719 ms
Time: 25774.417 ms

after adding suggested extra test to textin():

Time: 26722.376 ms
Time: 28343.226 ms
Time: 26529.364 ms
Time: 28020.140 ms
Time: 24836.853 ms
Time: 24860.530 ms

Script is:

\timing
create table xyz (x text);
copy xyz from '/tmp/utf8.data';
truncate xyz;
copy xyz from '/tmp/utf8.data';
truncate xyz;
copy xyz from '/tmp/utf8.data';
truncate xyz;
copy xyz from '/tmp/utf8.data';
truncate xyz;
copy xyz from '/tmp/utf8.data';
truncate xyz;
copy xyz from '/tmp/utf8.data';
drop table xyz;

Test platform: FC6, Athlon64.

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Michael Fuhr <mike(at)fuhr(dot)org>, Mario Weilguni <mweilguni(at)sime(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Albe Laurenz <all(at)adv(dot)magwien(dot)gv(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-03-17 21:58:21
Message-ID: 18930.1174168701@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Here are some timing tests in 1m rows of random utf8 encoded 100 char
> data. It doesn't look to me like the saving you're suggesting is worth
> the trouble.

Hmm ... not sure I believe your numbers. Using a test file of 1m lines
of 100 random latin1 characters converted to utf8 (thus, about half and
half 7-bit ASCII and 2-byte utf8 characters), I get this in SQL_ASCII
encoding:

regression=# \timing
Timing is on.
regression=# create temp table test(f1 text);
CREATE TABLE
Time: 5.047 ms
regression=# copy test from '/home/tgl/zzz1m';
COPY 1000000
Time: 4337.089 ms

and this in UTF8 encoding:

utf8=# \timing
Timing is on.
utf8=# create temp table test(f1 text);
CREATE TABLE
Time: 5.108 ms
utf8=# copy test from '/home/tgl/zzz1m';
COPY 1000000
Time: 7776.583 ms

The numbers aren't super repeatable, but it sure looks to me like the
encoding check adds at least 50% to the runtime in this example; so
doing it twice seems unpleasant.

(This is CVS HEAD, compiled without assert checking, on an x86_64
Fedora Core 6 box.)

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Michael Fuhr <mike(at)fuhr(dot)org>, Mario Weilguni <mweilguni(at)sime(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Albe Laurenz <all(at)adv(dot)magwien(dot)gv(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-03-17 23:09:07
Message-ID: 45FC7513.8040206@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>
>> Here are some timing tests in 1m rows of random utf8 encoded 100 char
>> data. It doesn't look to me like the saving you're suggesting is worth
>> the trouble.
>>
>
> Hmm ... not sure I believe your numbers. Using a test file of 1m lines
> of 100 random latin1 characters converted to utf8 (thus, about half and
> half 7-bit ASCII and 2-byte utf8 characters), I get this in SQL_ASCII
> encoding:
>
> regression=# \timing
> Timing is on.
> regression=# create temp table test(f1 text);
> CREATE TABLE
> Time: 5.047 ms
> regression=# copy test from '/home/tgl/zzz1m';
> COPY 1000000
> Time: 4337.089 ms
>
> and this in UTF8 encoding:
>
> utf8=# \timing
> Timing is on.
> utf8=# create temp table test(f1 text);
> CREATE TABLE
> Time: 5.108 ms
> utf8=# copy test from '/home/tgl/zzz1m';
> COPY 1000000
> Time: 7776.583 ms
>
> The numbers aren't super repeatable, but it sure looks to me like the
> encoding check adds at least 50% to the runtime in this example; so
> doing it twice seems unpleasant.
>
> (This is CVS HEAD, compiled without assert checking, on an x86_64
> Fedora Core 6 box.)
>
>

Are you comparing apples with apples? The db is utf8 in both of my cases.

cheers

andrew


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Michael Fuhr <mike(at)fuhr(dot)org>, Mario Weilguni <mweilguni(at)sime(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Albe Laurenz <all(at)adv(dot)magwien(dot)gv(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-03-18 04:03:58
Message-ID: 45FCBA2E.7010303@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>
>> Here are some timing tests in 1m rows of random utf8 encoded 100 char
>> data. It doesn't look to me like the saving you're suggesting is worth
>> the trouble.
>>
>
> Hmm ... not sure I believe your numbers. Using a test file of 1m lines
> of 100 random latin1 characters converted to utf8 (thus, about half and
> half 7-bit ASCII and 2-byte utf8 characters), I get this in SQL_ASCII
> encoding:
>
> regression=# \timing
> Timing is on.
> regression=# create temp table test(f1 text);
> CREATE TABLE
> Time: 5.047 ms
> regression=# copy test from '/home/tgl/zzz1m';
> COPY 1000000
> Time: 4337.089 ms
>
> and this in UTF8 encoding:
>
> utf8=# \timing
> Timing is on.
> utf8=# create temp table test(f1 text);
> CREATE TABLE
> Time: 5.108 ms
> utf8=# copy test from '/home/tgl/zzz1m';
> COPY 1000000
> Time: 7776.583 ms
>
> The numbers aren't super repeatable, but it sure looks to me like the
> encoding check adds at least 50% to the runtime in this example; so
> doing it twice seems unpleasant.
>
> (This is CVS HEAD, compiled without assert checking, on an x86_64
> Fedora Core 6 box.)
>
>
>

Here are some test results that are closer to yours. I used a temp table
and had cassert off and fsync off, and tried with several encodings.

The additional load from the test isn't 50%, (I think you have added the
cost of going from ascii to utf8 to the cost of the test to get that
50%) but it is nevertheless appreciable.

I agree that we should look at not testing if the client and server
encodings are the same, so we can reduce the difference.

cheers

andrew

Run SQL_ASCII LATIN1 UTF8

1 4659.38 4766.07 9134.53

2 7999.64 4003.13 6231.41

3 4178.46 6178.89 7266.39

Without test 4 4201.7 3930.84 10154.38

5 4092.44 4444.52 9438.24

6 3977.34 4197.09 8866.56

Average 4851.49 4586.76 8515.25

1 11993.86 12625.8 10109.89

2 4647.16 9192.53 11251.27

With test 3 4211.02 9903.77 10097.37

4 9203.62 7045.06 10372.25

5 4121.39 4138.78 10386.92

6 3722.73 4552.09 7432.56

Average 6316.63 7909.67 9941.71


From: Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>
To: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-03-18 09:32:29
Message-ID: C5179912-B158-4F69-BFB7-45A506708296@pointblue.com.pl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

<evil mode1>
Maybe we should add as resurce intensive check to ascii encoding(s),
that would even the score ;p
</evil mode1>

<evil mode 2>
let's test mysql on this, and see how worse does it perform.
</evil mode 2>

--
Grzegorz 'the evil' Jaskiewicz

evil C/evil C++ developer for hire


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Michael Fuhr <mike(at)fuhr(dot)org>, Mario Weilguni <mweilguni(at)sime(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Albe Laurenz <all(at)adv(dot)magwien(dot)gv(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-03-18 11:36:22
Message-ID: 20070318113622.GA5722@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Mar 17, 2007 at 11:46:01AM -0400, Andrew Dunstan wrote:
> How can we fix this? Frankly, the statement in the docs warning about
> making sure that escaped sequences are valid in the server encoding is a
> cop-out. We don't accept invalid data elsewhere, and this should be no
> different IMNSHO. I don't see why this should be any different from,
> say, date or numeric data. For years people have sneered at MySQL
> because it accepted dates like Feb 31st, and rightly so. But this seems
> to me to be like our own version of the same problem.

It seems to me that the easiest solution would be to forbid \x?? escape
sequences where it's greater than \x7F for UTF-8 server encodings.
Instead introduce a \u escape for specifying the unicode character
directly. Under the basic principle that any escape sequence still has
to represent a single character. The result can be multiple bytes, but
you don't have to check for consistancy anymore.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Michael Fuhr <mike(at)fuhr(dot)org>, Mario Weilguni <mweilguni(at)sime(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Albe Laurenz <all(at)adv(dot)magwien(dot)gv(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-03-18 12:25:56
Message-ID: 45FD2FD4.8070406@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Martijn van Oosterhout wrote:
> On Sat, Mar 17, 2007 at 11:46:01AM -0400, Andrew Dunstan wrote:
>
>> How can we fix this? Frankly, the statement in the docs warning about
>> making sure that escaped sequences are valid in the server encoding is a
>> cop-out. We don't accept invalid data elsewhere, and this should be no
>> different IMNSHO. I don't see why this should be any different from,
>> say, date or numeric data. For years people have sneered at MySQL
>> because it accepted dates like Feb 31st, and rightly so. But this seems
>> to me to be like our own version of the same problem.
>>
>
> It seems to me that the easiest solution would be to forbid \x?? escape
> sequences where it's greater than \x7F for UTF-8 server encodings.
> Instead introduce a \u escape for specifying the unicode character
> directly. Under the basic principle that any escape sequence still has
> to represent a single character. The result can be multiple bytes, but
> you don't have to check for consistancy anymore.
>
> Have a nice day,
>

The escape processing is actually done in the lexer in the case of
literals. We have to allow for bytea literals there too, regardless of
encoding. The lexer naturally has no notion of the intended destination
of the literal, So we need to defer the validity check to the *in
functions for encoding-aware types. And it as Tom has noted, COPY does
its own escape processing but does it before the transcoding.

So ISTM that any solution other than something like I have proposed will
probably involve substantial surgery.

It does also seem from my test results that transcoding to MB charsets
(or at least to utf-8) is surprisingly expensive, and that this would be
a good place to look at optimisation possibilities. The validity tests
can also be somewhat expensive.

But correctness matters most, IMNSHO.

cheers

andrew


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Michael Fuhr <mike(at)fuhr(dot)org>, Mario Weilguni <mweilguni(at)sime(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Albe Laurenz <all(at)adv(dot)magwien(dot)gv(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-03-18 15:52:08
Message-ID: 20070318155208.GB5722@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Mar 18, 2007 at 08:25:56AM -0400, Andrew Dunstan wrote:
> It does also seem from my test results that transcoding to MB charsets
> (or at least to utf-8) is surprisingly expensive, and that this would be
> a good place to look at optimisation possibilities. The validity tests
> can also be somewhat expensive.

Hmm, I just noticed that the verify string works one character at a
time, at least that part could be dramatically optimised.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Jeff Davis <pgsql(at)j-davis(dot)com>, Michael Fuhr <mike(at)fuhr(dot)org>, Mario Weilguni <mweilguni(at)sime(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Albe Laurenz <all(at)adv(dot)magwien(dot)gv(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-03-18 21:07:37
Message-ID: 45FDAA19.1010309@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
>
> The escape processing is actually done in the lexer in the case of
> literals. We have to allow for bytea literals there too, regardless of
> encoding. The lexer naturally has no notion of the intended
> destination of the literal, So we need to defer the validity check to
> the *in functions for encoding-aware types. And it as Tom has noted,
> COPY does its own escape processing but does it before the transcoding.
>
> So ISTM that any solution other than something like I have proposed
> will probably involve substantial surgery.

Below is a list of the input routines in the adt directory, courtesy of
grep.

I'm thinking we will need to put checks in:

varcharin
bpcharin
textin
unknownin (?)
namein (?)

Any others?

cheers

andrew

acl.c:aclitemin
bool.c:boolin
char.c:charin
date.c:timetypmodin
date.c:timetztypmodin
float.c:dasin
float.c:dsin
nabstime.c:abstimein
nabstime.c:reltimein
nabstime.c:tintervalin
name.c:namein
not_in.c:oidnotin
numeric.c:numerictypmodin
oid.c:oidin
oid.c:oidvectorin
regproc.c:regprocin
regproc.c:regprocedurein
regproc.c:regoperin
regproc.c:regoperatorin
regproc.c:regclassin
regproc.c:regtypein
tid.c:tidin
timestamp.c:timestamptypmodin
timestamp.c:timestamptztypmodin
timestamp.c:intervaltypmodin
varbit.c:bittypmodin
varbit.c:varbittypmodin
varchar.c:bpcharin
varchar.c:bpchartypmodin
varchar.c:varcharin
varchar.c:varchartypmodin
varlena.c:byteain
varlena.c:textin
varlena.c:unknownin
xid.c:xidin
xid.c:cidin


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Andrew Dunstan" <andrew(at)dunslane(dot)net>
Cc: "Martijn van Oosterhout" <kleptog(at)svana(dot)org>, "Jeff Davis" <pgsql(at)j-davis(dot)com>, "Michael Fuhr" <mike(at)fuhr(dot)org>, "Mario Weilguni" <mweilguni(at)sime(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, "Albe Laurenz" <all(at)adv(dot)magwien(dot)gv(dot)at>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-03-18 23:33:21
Message-ID: 87lkhujfem.fsf@stark.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


"Andrew Dunstan" <andrew(at)dunslane(dot)net> writes:

> Below is a list of the input routines in the adt directory, courtesy of grep.

Grep isn't a good way to get these, your list missed a bunch.

postgres=# select distinct prosrc from pg_proc where oid in (select typinput from pg_type);
prosrc
---------------------
abstimein
aclitemin
any_in
anyarray_in
anyelement_in
array_in
bit_in
boolin
box_in
bpcharin
byteain
cash_in
charin
cidin
cidr_in
circle_in
cstring_in
date_in
domain_in
float4in
float8in
inet_in
int2in
int2vectorin
int4in
int8in
internal_in
interval_in
language_handler_in
line_in
lseg_in
macaddr_in
namein
numeric_in
oidin
oidvectorin
opaque_in
path_in
point_in
poly_in
record_in
regclassin
regoperatorin
regoperin
regprocedurein
regprocin
regtypein
reltimein
smgrin
textin
tidin
time_in
timestamp_in
timestamptz_in
timetz_in
tintervalin
trigger_in
unknownin
uuid_in
varbit_in
varcharin
void_in
xidin
xml_in
(64 rows)

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Jeff Davis <pgsql(at)j-davis(dot)com>, Michael Fuhr <mike(at)fuhr(dot)org>, Mario Weilguni <mweilguni(at)sime(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Albe Laurenz <all(at)adv(dot)magwien(dot)gv(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-03-19 00:35:05
Message-ID: 45FDDAB9.6060009@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Gregory Stark wrote:
> "Andrew Dunstan" <andrew(at)dunslane(dot)net> writes:
>
>
>> Below is a list of the input routines in the adt directory, courtesy of grep.
>>
>
> Grep isn't a good way to get these, your list missed a bunch.
>
> postgres=# select distinct prosrc from pg_proc where oid in (select typinput from pg_type);
>
>
[snip]
> (64 rows)
>
>

Ok, good point. Now, which of those need to have a check for valid encoding?

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Jeff Davis <pgsql(at)j-davis(dot)com>, Michael Fuhr <mike(at)fuhr(dot)org>, Mario Weilguni <mweilguni(at)sime(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Albe Laurenz <all(at)adv(dot)magwien(dot)gv(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-03-19 03:40:35
Message-ID: 20879.1174275635@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Ok, good point. Now, which of those need to have a check for valid encoding?

The vast majority will barf on any non-ASCII character anyway ... only
the ones that don't will need a check.

regards, tom lane


From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Mario Weilguni <mweilguni(at)sime(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-03-19 05:00:23
Message-ID: 20070319134928.6B5E.ITAGAKI.TAKAHIRO@oss.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Jeff Davis <pgsql(at)j-davis(dot)com> wrote:

> Some people think it's a bug, some people don't. It is technically
> documented behavior, but I don't think the documentation is clear
> enough. I think it is a bug that should be fixed, and here's another
> message in the thread that expresses my opinion:

Agreed. I think it is a bug, too.
Insertion of invalid characters makes read queries busted.

$ initdb --encoding=utf8
# CREATE TABLE test (t text);
# INSERT INTO test VALUES('A');
# SELECT * FROM test;
t
---
A
(1 row)

# INSERT INTO test VALUES(E'\200');
# SELECT * FROM test;
ERROR: invalid byte sequence for encoding "UTF8": 0x80
HINT: This error can also happen if the byte sequence does not match the
encoding expected by the server, which is cont rolled by "client_encoding".

Could it lead to DoS?
http://www.postgresql.org/support/security
| [D] A vulnerability that is exploitable for denial-of-service,
| but requiring a valid prior login.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center


From: Mario Weilguni <mweilguni(at)sime(dot)com>
To: pgsql-hackers(at)postgresql(dot)org, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Jeff Davis <pgsql(at)j-davis(dot)com>, Michael Fuhr <mike(at)fuhr(dot)org>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Albe Laurenz <all(at)adv(dot)magwien(dot)gv(dot)at>
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-03-19 12:42:35
Message-ID: 200703191342.36135.mweilguni@sime.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Am Sonntag, 18. März 2007 12:36 schrieb Martijn van Oosterhout:
>
> It seems to me that the easiest solution would be to forbid \x?? escape
> sequences where it's greater than \x7F for UTF-8 server encodings.
> Instead introduce a \u escape for specifying the unicode character
> directly. Under the basic principle that any escape sequence still has
> to represent a single character. The result can be multiple bytes, but
> you don't have to check for consistancy anymore.

Would'nt the best solution to change the order of
validation/convert_backslash_sequences better? First convert the sequences,
and reject them in the validation stage?

Regards
Mario Weilguni


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Mario Weilguni <mweilguni(at)sime(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-03-22 20:54:57
Message-ID: 200703222054.l2MKsvq08654@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Added to TODO:

* Fix cases where invalid byte encodings are accepted by the database,
but throw an error on SELECT

http://archives.postgresql.org/pgsql-hackers/2007-03/msg00767.php

Is anyone working on fixing this bug?

---------------------------------------------------------------------------

Mario Weilguni wrote:
> Hi,
>
> I've a problem with a database, I can dump the database to a file, but
> restoration fails, happens with 8.1.4.
>
> Steps to reproduce:
> create database testdb with encoding='UTF8';
> \c testdb
> create table test(x text);
> insert into test values ('\244'); ==> Is akzepted, even if not UTF8.
>
> pg_dump testdb -f testdb.dump -Fc
> pg_restore -f testdb.dump -d testdb => fails with an error:
> ERROR: invalid byte sequence for encoding "UTF8": 0xa4
>
> The problem itself comes from a CSV file, which is imported with \copy without
> proper quoting (so I have to fix this anyway), but I still think this is an
> error, making restoration very complicated in such cases...
>
> Or am I doing something completly wrong here?
>
> Best regards,
> Mario Weilguni
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
> choose an index scan if your joining column's datatypes do not
> match

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Mark Dilger <pgsql(at)markdilger(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-01 00:04:01
Message-ID: 460EF6F1.7000006@markdilger.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruce Momjian wrote:
> Added to TODO:
>
> * Fix cases where invalid byte encodings are accepted by the database,
> but throw an error on SELECT
>
> http://archives.postgresql.org/pgsql-hackers/2007-03/msg00767.php
>
> Is anyone working on fixing this bug?

Hi, has anyone volunteered to fix this bug? I did not see any reply on the
mailing list to your question above.

mark


From: Mark Dilger <pgsql(at)markdilger(dot)com>
To: Mark Dilger <pgsql(at)markdilger(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-01 02:47:21
Message-ID: 460F1D39.8010709@markdilger.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Mark Dilger wrote:
> Bruce Momjian wrote:
>> Added to TODO:
>>
>> * Fix cases where invalid byte encodings are accepted by the
>> database,
>> but throw an error on SELECT
>>
>> http://archives.postgresql.org/pgsql-hackers/2007-03/msg00767.php
>>
>> Is anyone working on fixing this bug?
>
> Hi, has anyone volunteered to fix this bug? I did not see any reply on
> the mailing list to your question above.
>
> mark

OK, I can take a stab at fixing this. I'd like to state some assumptions so
people can comment and reply:

I assume that I need to fix *all* cases where invalid byte encodings get into
the database through functions shipped in the core distribution.

I assume I do not need to worry about people getting bad data into the system
through their own database extensions.

I assume that the COPY problem discussed up-thread goes away once you eliminate
all the paths by which bad data can get into the system. However, existing
database installations with bad data already loaded will not be magically fixed
with these code patches.

Do any of the string functions (see
http://www.postgresql.org/docs/8.2/interactive/functions-string.html) run the
risk of generating invalid utf8 encoded strings? Do I need to add checks? Are
there known bugs with these functions in this regard?

If not, I assume I can add mbverify calls to the various input routines (textin,
varcharin, etc) where invalid utf8 could otherwise enter the system.

I assume that this work can be limited to HEAD and that I don't need to
back-patch it. (I suspect this assumption is a contentious one.)

Advice and comments are welcome,

mark


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Mark Dilger <pgsql(at)markdilger(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Bruce Momjian <bruce(at)momjian(dot)us>
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-01 10:30:51
Message-ID: 20070401103051.GB15919@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Mar 31, 2007 at 07:47:21PM -0700, Mark Dilger wrote:
> OK, I can take a stab at fixing this. I'd like to state some assumptions
> so people can comment and reply:
>
> I assume that I need to fix *all* cases where invalid byte encodings get
> into the database through functions shipped in the core distribution.

Yes.

> I assume I do not need to worry about people getting bad data into the
> system through their own database extensions.

That'd be rather difficult :)

> I assume that the COPY problem discussed up-thread goes away once you
> eliminate all the paths by which bad data can get into the system.
> However, existing database installations with bad data already loaded will
> not be magically fixed with these code patches.

Correct.

> Do any of the string functions (see
> http://www.postgresql.org/docs/8.2/interactive/functions-string.html) run
> the risk of generating invalid utf8 encoded strings? Do I need to add
> checks? Are there known bugs with these functions in this regard?

I don't think so. They'd be bugs if they were...

> If not, I assume I can add mbverify calls to the various input routines
> (textin, varcharin, etc) where invalid utf8 could otherwise enter the
> system.

The only hard part is handling where the escaping and unescaping is
happening...

> I assume that this work can be limited to HEAD and that I don't need to
> back-patch it. (I suspect this assumption is a contentious one.)

At the very least I'd start with HEAD. Whether it gets backpatched
probably depends on how invasive it ends up being...

There's also the performance angle. The current mbverify is very
inefficient for encodings like UTF-8. You might need to refactor a bit
there...

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.


From: Andrew - Supernews <andrew+nonews(at)supernews(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-01 15:44:17
Message-ID: slrnf0vkqh.2i67.andrew+nonews@atlantis.supernews.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2007-04-01, Mark Dilger <pgsql(at)markdilger(dot)com> wrote:
> Do any of the string functions (see
> http://www.postgresql.org/docs/8.2/interactive/functions-string.html) run the
> risk of generating invalid utf8 encoded strings? Do I need to add checks?
> Are there known bugs with these functions in this regard?

The chr() function returns an octet, rather than a character; this is clearly
wrong and needs fixing.

--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services


From: Mark Dilger <pgsql(at)markdilger(dot)com>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-01 19:39:54
Message-ID: 46100A8A.5030006@markdilger.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Martijn van Oosterhout wrote:
> There's also the performance angle. The current mbverify is very
> inefficient for encodings like UTF-8. You might need to refactor a bit
> there...

There appears to be a lot of function call overhead in the current
implementation. In pg_verify_mbstr, the function pointer
pg_wchar_table.mbverify is called for each multibyte character in a multibyte
string.

Refactoring the way these table driven functions work would impact lots of other
code. Just grep for all files #including mb/pg_wchar.h for the list of them.
The list includes interfaces/libpq, and I'm wondering if software that links
against postgres might rely on these function prototypes?

mark


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Mark Dilger <pgsql(at)markdilger(dot)com>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-02 04:32:52
Message-ID: 24857.1175488372@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Mark Dilger <pgsql(at)markdilger(dot)com> writes:
> Refactoring the way these table driven functions work would impact
> lots of other code. Just grep for all files #including mb/pg_wchar.h
> for the list of them. The list includes interfaces/libpq, and I'm
> wondering if software that links against postgres might rely on these
> function prototypes?

No, we've never exported those with the intent that client code should
use 'em. Doing so would require importing non-public headers, and
anyone who does that can have no grounds for complaining if the headers
change incompatibly.

regards, tom lane


From: Tatsuo Ishii <ishii(at)postgresql(dot)org>
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: pgsql(at)markdilger(dot)com, kleptog(at)svana(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-02 04:49:58
Message-ID: 20070402.134958.58463924.t-ishii@sraoss.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Mark Dilger <pgsql(at)markdilger(dot)com> writes:
> > Refactoring the way these table driven functions work would impact
> > lots of other code. Just grep for all files #including mb/pg_wchar.h
> > for the list of them. The list includes interfaces/libpq, and I'm
> > wondering if software that links against postgres might rely on these
> > function prototypes?
>
> No, we've never exported those with the intent that client code should
> use 'em. Doing so would require importing non-public headers, and
> anyone who does that can have no grounds for complaining if the headers
> change incompatibly.

I thought PQescapeString() of 8.3 uses mbverify functions to make sure
that user supplied multibyte string is valid.
--
Tatsuo Ishii
SRA OSS, Inc. Japan


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Tatsuo Ishii <ishii(at)postgresql(dot)org>
Cc: pgsql(at)markdilger(dot)com, kleptog(at)svana(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-02 05:08:01
Message-ID: 25298.1175490481@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tatsuo Ishii <ishii(at)postgresql(dot)org> writes:
>> No, we've never exported those with the intent that client code should
>> use 'em.

> I thought PQescapeString() of 8.3 uses mbverify functions to make sure
> that user supplied multibyte string is valid.

Certainly --- but we can change PQescapeString to match whatever we do
with the pg_wchar functions. The question was whether we intend to
support client application code (outside libpq) using those functions.
That's definitely not the intent. exports.txt lists only PQmblen and
pg_utf_mblen as exported (and I have to wonder why the latter is
separately exported...), which means that client code on modern
platforms isn't even capable of getting at the others.

regards, tom lane


From: Mark Dilger <pgsql(at)markdilger(dot)com>
To: andrew(at)supernews(dot)com
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-02 20:35:20
Message-ID: 46116908.8040702@markdilger.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew - Supernews wrote:
> On 2007-04-01, Mark Dilger <pgsql(at)markdilger(dot)com> wrote:
>> Do any of the string functions (see
>> http://www.postgresql.org/docs/8.2/interactive/functions-string.html) run the
>> risk of generating invalid utf8 encoded strings? Do I need to add checks?
>> Are there known bugs with these functions in this regard?
>
> The chr() function returns an octet, rather than a character; this is clearly
> wrong and needs fixing.
>

Ok, I've altered the chr() function. I am including a transcript from psql
below. There are several design concerns:

1) In the current implementation, chr(0) returns a 5-byte text object (4-bytes
of overhead plus one byte of data) containing a null. In the new
implementation, this returns an error. I don't know, but it is possible that
people currently use things like "SELECT chr(0) || chr(0) || ..." to build up
strings of nulls.

2) Under utf8, chr(X) fails for X = 128..255. This may also break current users
expectations.

3) The implicit modulus operation that was being performed by chr() is now gone,
which might break some users.

4) You can't represent the high end of the astral plain with type INTEGER,
unless you pass in a negative value, which is somewhat unintuitive. Since chr()
expects an integer (and not a bigint) the user needs handle the sign bit correctly.

mark

---------------------

Welcome to psql 8.3devel, the PostgreSQL interactive terminal.

Type: \copyright for distribution terms
\h for help with SQL commands
\? for help with psql commands
\g or terminate with semicolon to execute query
\q to quit

pgsql=# select chr(0);
ERROR: character 0x00 of encoding "SQL_ASCII" has no equivalent in "UTF8"
pgsql=# select chr(65);
chr
-----
A
(1 row)

pgsql=# select chr(128);
ERROR: character 0x80 of encoding "SQL_ASCII" has no equivalent in "UTF8"
pgsql=# select chr(53398);
chr
-----
Ж
(1 row)

pgsql=# select chr(14989485);
chr
-----

(1 row)

pgsql=# select chr(4036005254);
ERROR: function chr(bigint) does not exist
LINE 1: select chr(4036005254);
^
HINT: No function matches the given name and argument types. You might need to
add explicit type casts.


From: Mark Dilger <pgsql(at)markdilger(dot)com>
To: Mark Dilger <pgsql(at)markdilger(dot)com>
Cc: andrew(at)supernews(dot)com
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-02 20:50:36
Message-ID: 46116C9C.80300@markdilger.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Mark Dilger wrote:
> Andrew - Supernews wrote:
>> On 2007-04-01, Mark Dilger <pgsql(at)markdilger(dot)com> wrote:
>>> Do any of the string functions (see
>>> http://www.postgresql.org/docs/8.2/interactive/functions-string.html)
>>> run the risk of generating invalid utf8 encoded strings? Do I need
>>> to add checks?
>>> Are there known bugs with these functions in this regard?
>>
>> The chr() function returns an octet, rather than a character; this is
>> clearly
>> wrong and needs fixing.
>>
>
> Ok, I've altered the chr() function. I am including a transcript from
> psql below. There are several design concerns:
>
> 1) In the current implementation, chr(0) returns a 5-byte text object
> (4-bytes of overhead plus one byte of data) containing a null. In the
> new implementation, this returns an error. I don't know, but it is
> possible that people currently use things like "SELECT chr(0) || chr(0)
> || ..." to build up strings of nulls.
>
> 2) Under utf8, chr(X) fails for X = 128..255. This may also break
> current users expectations.
>
> 3) The implicit modulus operation that was being performed by chr() is
> now gone, which might break some users.
>
> 4) You can't represent the high end of the astral plain with type
> INTEGER, unless you pass in a negative value, which is somewhat
> unintuitive. Since chr() expects an integer (and not a bigint) the user
> needs handle the sign bit correctly.
>
> mark
>
> ---------------------
>
>
>
>
> Welcome to psql 8.3devel, the PostgreSQL interactive terminal.
>
> Type: \copyright for distribution terms
> \h for help with SQL commands
> \? for help with psql commands
> \g or terminate with semicolon to execute query
> \q to quit
>
> pgsql=# select chr(0);
> ERROR: character 0x00 of encoding "SQL_ASCII" has no equivalent in "UTF8"
> pgsql=# select chr(65);
> chr
> -----
> A
> (1 row)
>
> pgsql=# select chr(128);
> ERROR: character 0x80 of encoding "SQL_ASCII" has no equivalent in "UTF8"
> pgsql=# select chr(53398);
> chr
> -----
> Ж
> (1 row)
>
> pgsql=# select chr(14989485);
> chr
> -----
> 中
> (1 row)
>
> pgsql=# select chr(4036005254);
> ERROR: function chr(bigint) does not exist
> LINE 1: select chr(4036005254);
> ^
> HINT: No function matches the given name and argument types. You might
> need to add explicit type casts.

Here's the code for the new chr() function:

Datum
chr(PG_FUNCTION_ARGS)
{
int32 cvalue = PG_GETARG_INT32(0);
text *result;

if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())
{
int encoding,
len,
byteoff;
uint32 buf[2];
const char *bufptr;

encoding = GetDatabaseEncoding();
buf[0] = htonl(cvalue);
buf[1] = 0;
bufptr = (const char *)&buf;
for (byteoff = 0; byteoff < sizeof(uint32) && 0 == *bufptr; ++byteoff,
++bufptr);
len = pg_encoding_mblen(encoding,bufptr);
if (byteoff + len != sizeof(uint32) || !pg_verify_mbstr(encoding,
bufptr, len, true /* noError */))
report_untranslatable_char(PG_SQL_ASCII, encoding, bufptr,
sizeof(int32));
result = (text *) palloc(VARHDRSZ + len);
SET_VARSIZE(result, VARHDRSZ + len);
memcpy(VARDATA(result),bufptr,len);
}
else
{
result = (text *) palloc(VARHDRSZ + 1);
SET_VARSIZE(result, VARHDRSZ + 1);
*VARDATA(result) = (char) cvalue;
}

PG_RETURN_TEXT_P(result);
}


From: Mark Dilger <pgsql(at)markdilger(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-02 22:02:21
Message-ID: 46117D6D.7050705@markdilger.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Mark Dilger <pgsql(at)markdilger(dot)com> writes:
>>> pgsql=# select chr(14989485);
>>> chr
>>> -----
>>> 中
>>> (1 row)
>
> Is there a principled rationale for this particular behavior as
> opposed to any other?
>
> In particular, in UTF8 land I'd have expected the argument of chr()
> to be interpreted as a Unicode code point, not as actual UTF8 bytes
> with a randomly-chosen endianness.
>
> Not sure what to do in other multibyte encodings.

"Not sure what to do in other multibyte encodings" was pretty much my rationale
for this particular behavior. I standardized on network byte order because
there are only two endianesses to choose from, and the other seems to be a more
surprising choice.

I looked around on the web for a standard for how to convert an integer into a
valid multibyte character and didn't find anything. Andrew, Supernews has said
upthread that chr() is clearly wrong and needs to be fixed. If so, we need some
clear definition what "fixed" means.

Any suggestions?

mark


From: Mark Dilger <pgsql(at)markdilger(dot)com>
To: Mark Dilger <pgsql(at)markdilger(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-02 22:05:27
Message-ID: 46117E27.9030703@markdilger.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Mark Dilger wrote:
> Tom Lane wrote:
>> Mark Dilger <pgsql(at)markdilger(dot)com> writes:
>>>> pgsql=# select chr(14989485);
>>>> chr
>>>> -----
>>>> 中
>>>> (1 row)
>>
>> Is there a principled rationale for this particular behavior as
>> opposed to any other?
>>
>> In particular, in UTF8 land I'd have expected the argument of chr()
>> to be interpreted as a Unicode code point, not as actual UTF8 bytes
>> with a randomly-chosen endianness.
>>
>> Not sure what to do in other multibyte encodings.
>
> "Not sure what to do in other multibyte encodings" was pretty much my
> rationale for this particular behavior. I standardized on network byte
> order because there are only two endianesses to choose from, and the
> other seems to be a more surprising choice.
>
> I looked around on the web for a standard for how to convert an integer
> into a valid multibyte character and didn't find anything. Andrew,
> Supernews has said upthread that chr() is clearly wrong and needs to be
> fixed. If so, we need some clear definition what "fixed" means.
>
> Any suggestions?
>
> mark

Another issue to consider when thinking about the corect definition of chr() is
that ascii(chr(X)) = X. This gets weird if X is greater than 255. If nothing
else, the name "ascii" is no longer appropriate.

mark


From: Mark Dilger <pgsql(at)markdilger(dot)com>
To: Mark Dilger <pgsql(at)markdilger(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-02 22:18:55
Message-ID: 4611814F.1070308@markdilger.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Mark Dilger wrote:
> Tom Lane wrote:
>> Mark Dilger <pgsql(at)markdilger(dot)com> writes:
>>>> pgsql=# select chr(14989485);
>>>> chr
>>>> -----
>>>> 中
>>>> (1 row)
>>
>> Is there a principled rationale for this particular behavior as
>> opposed to any other?
>>
>> In particular, in UTF8 land I'd have expected the argument of chr()
>> to be interpreted as a Unicode code point, not as actual UTF8 bytes
>> with a randomly-chosen endianness.
>>
>> Not sure what to do in other multibyte encodings.
>
> "Not sure what to do in other multibyte encodings" was pretty much my
> rationale for this particular behavior. I standardized on network byte
> order because there are only two endianesses to choose from, and the
> other seems to be a more surprising choice.
>
> I looked around on the web for a standard for how to convert an integer
> into a valid multibyte character and didn't find anything. Andrew,
> Supernews has said upthread that chr() is clearly wrong and needs to be
> fixed. If so, we need some clear definition what "fixed" means.
>
> Any suggestions?
>
> mark

Since chr() is defined in oracle_compat.c, I decided to look at what Oracle
might do. See
http://download-west.oracle.com/docs/cd/B10501_01/server.920/a96540/functions18a.htm

It looks to me like they are doing the same thing that I did, though I don't
have Oracle installed anywhere to verify that. Is there a difference?

mark


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Mark Dilger <pgsql(at)markdilger(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, andrew(at)supernews(dot)com
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-02 22:37:11
Message-ID: 29864.1175553431@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Mark Dilger <pgsql(at)markdilger(dot)com> writes:
>> pgsql=# select chr(14989485);
>> chr
>> -----
>>
>> (1 row)

Is there a principled rationale for this particular behavior as
opposed to any other?

In particular, in UTF8 land I'd have expected the argument of chr()
to be interpreted as a Unicode code point, not as actual UTF8 bytes
with a randomly-chosen endianness.

Not sure what to do in other multibyte encodings.

regards, tom lane


From: Mark Dilger <pgsql(at)markdilger(dot)com>
To: Mark Dilger <pgsql(at)markdilger(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-02 23:11:57
Message-ID: 46118DBD.2010005@markdilger.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Mark Dilger wrote:

> Since chr() is defined in oracle_compat.c, I decided to look at what
> Oracle might do. See
> http://download-west.oracle.com/docs/cd/B10501_01/server.920/a96540/functions18a.htm
>
>
> It looks to me like they are doing the same thing that I did, though I
> don't have Oracle installed anywhere to verify that. Is there a
> difference?

Reading that page again, I think I'd have to use mbrtowc() or similar in the
spot where I'm currently just using the literal utf8 string.


From: Andrew - Supernews <andrew+nonews(at)supernews(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-03 00:58:32
Message-ID: slrnf139lo.2i67.andrew+nonews@atlantis.supernews.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2007-04-02, Mark Dilger <pgsql(at)markdilger(dot)com> wrote:
> Here's the code for the new chr() function:
>
> if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())

Clearly wrong - this allows returning invalid UTF8 data in locale C, which
is not an uncommon setting to use.

Treating the parameter as bytes is wrong too - it should correspond to
whatever the natural character numbering for the encoding is; for utf8
that is the Unicode code point.

--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services


From: "Albe Laurenz" <all(at)adv(dot)magwien(dot)gv(dot)at>
To: "Mark Dilger *EXTERN*" <pgsql(at)markdilger(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-03 09:43:21
Message-ID: AFCCBB403D7E7A4581E48F20AF3E5DB201FF8372@EXADV1.host.magwien.gv.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Mark Dilger wrote:
>>> In particular, in UTF8 land I'd have expected the argument of chr()
>>> to be interpreted as a Unicode code point, not as actual UTF8 bytes
>>> with a randomly-chosen endianness.
>>>
>>> Not sure what to do in other multibyte encodings.
>>
>> "Not sure what to do in other multibyte encodings" was pretty much my

>> rationale for this particular behavior. I standardized on network
byte
>> order because there are only two endianesses to choose from, and the
>> other seems to be a more surprising choice.
>
> Since chr() is defined in oracle_compat.c, I decided to look
> at what Oracle might do. See
>
http://download-west.oracle.com/docs/cd/B10501_01/server.920/a96540/func
tions18a.htm
>
> It looks to me like they are doing the same thing that I did,
> though I don't have Oracle installed anywhere to verify that.
> Is there a difference?

This is Oracle 10.2.0.3.0 ("latest and greatest") with UTF-8 encoding
(actually, Oracle chooses to call this encoding AL32UTF8):

SQL> SELECT ASCII('EUR') AS DEC,
2 TO_CHAR(ASCII('EUR'), 'XXXXXX') AS HEX
3 FROM DUAL;

DEC HEX
---------- ----------------------------
14844588 E282AC

SQL> SELECT CHR(14844588) AS EURO FROM DUAL;

EURO
----
EUR

I don't see how endianness enters into this at all - isn't that just
the question of how a byte is stored physically?

According to RFC 2279, the Euro,
Unicode code point 0x20AC = 0010 0000 1010 1100,
will be encoded to 1110 0010 1000 0010 1010 1100 = 0xE282AC.

IMHO this is the only good and intuitive way for CHR() and ASCII().

Yours,
Laurenz Albe


From: Andrew - Supernews <andrew+nonews(at)supernews(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-03 13:43:08
Message-ID: slrnf14mfc.2i67.andrew+nonews@atlantis.supernews.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2007-04-03, "Albe Laurenz" <all(at)adv(dot)magwien(dot)gv(dot)at> wrote:
> According to RFC 2279, the Euro,
> Unicode code point 0x20AC = 0010 0000 1010 1100,
> will be encoded to 1110 0010 1000 0010 1010 1100 = 0xE282AC.
>
> IMHO this is the only good and intuitive way for CHR() and ASCII().

It is beyond ludicrous for functions like chr() or ascii() to convert a
Euro sign to 0xE282AC rather than 0x20AC. "Intuitive"? There is _NO SUCH
THING_ as 0xE282AC as a representation of a Unicode character - there is
either the code point, 0x20AC (which is a _number_), or the sequences of
_bytes_ that represent that code point in various encodings, of which the
three-byte sequence 0xE2 0x82 0xAC is the one used in UTF-8.

Functions like chr() and ascii() should be dealing with the _number_ of the
code point, not with its representation in transfer encodings.

--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Albe Laurenz <all(at)adv(dot)magwien(dot)gv(dot)at>
Cc: Mark Dilger *EXTERN* <pgsql(at)markdilger(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-03 14:36:18
Message-ID: 20070403143618.GA5405@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Apr 03, 2007 at 11:43:21AM +0200, Albe Laurenz wrote:
> IMHO this is the only good and intuitive way for CHR() and ASCII().

Hardly. The comment earlier about mbtowc was much closer to the mark.
And wide characters are defined as Unicode points.

Basically, CHR() takes a unicode point and returns that character
in a string appropriately encoded. ASCII() does the reverse.

Just about every multibyte encoding other than Unicode has the problem
of not distinguishing between the code point and the encoding of it.
Unicode is a collection of encodings based on the same set.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.


From: Mark Dilger <pgsql(at)markdilger(dot)com>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Albe Laurenz <all(at)adv(dot)magwien(dot)gv(dot)at>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-03 15:47:14
Message-ID: 46127702.8060100@markdilger.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Martijn van Oosterhout wrote:
> On Tue, Apr 03, 2007 at 11:43:21AM +0200, Albe Laurenz wrote:
>> IMHO this is the only good and intuitive way for CHR() and ASCII().
>
> Hardly. The comment earlier about mbtowc was much closer to the mark.
> And wide characters are defined as Unicode points.
>
> Basically, CHR() takes a unicode point and returns that character
> in a string appropriately encoded. ASCII() does the reverse.
>
> Just about every multibyte encoding other than Unicode has the problem
> of not distinguishing between the code point and the encoding of it.
> Unicode is a collection of encodings based on the same set.
>
> Have a nice day,

Thanks for the feedback. Would you say that the way I implemented things in the
example code would be correct for multibyte non Unicode encodings? I don't see
how to avoid the endianness issue for those encodings.

mark


From: "Albe Laurenz" <all(at)adv(dot)magwien(dot)gv(dot)at>
To: <andrew(at)supernews(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-03 15:47:27
Message-ID: AFCCBB403D7E7A4581E48F20AF3E5DB20203DD1F@EXADV1.host.magwien.gv.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew wrote:
>> According to RFC 2279, the Euro,
>> Unicode code point 0x20AC = 0010 0000 1010 1100,
>> will be encoded to 1110 0010 1000 0010 1010 1100 = 0xE282AC.
>>
>> IMHO this is the only good and intuitive way for CHR() and ASCII().
>
> It is beyond ludicrous for functions like chr() or ascii() to
> convert a Euro sign to 0xE282AC rather than 0x20AC. "Intuitive"? There
> is _NO SUCH THING_ as 0xE282AC as a representation of a Unicode
character
> - there is either the code point, 0x20AC (which is a _number_), or the
> sequences of _bytes_ that represent that code point in various
encodings,
> of which the three-byte sequence 0xE2 0x82 0xAC is the one used in
UTF-8.

Yes, 0xE2 0x82 0xAC is the representation in UTF-8, and UTF-8 is the
database encoding in use.

> Functions like chr() and ascii() should be dealing with the _number_
of the
> code point, not with its representation in transfer encodings.

I think that we have a fundamental difference.

As far as I know, the word "code point" is only used in UNICODE and
is the first column in the list
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

So, if I understand you correctly, you want CHR() and ASCII()
to convert between characters (in the current database encoding)
and UNICODE code points (independent of database encoding).

What I suggest (and what Oracle implements, and isn't CHR() and ASCII()
partly for Oracle compatibility?) is that CHR() and ASCII()
convert between a character (in database encoding) and
that database encoding in numeric form.

I think that what you suggest would be a useful function too,
but I certainly wouldn't call such a function ASCII() :^)

The current implementation seems closer to my idea of ASCII(),
only incomplete:

test=> select to_hex(ascii('EUR'));
to_hex
--------
e2
(1 row)

What do others think? Should the argument to CHR() be a Unicode
code point or the numeric representation of the database encoding?

Yours,
Laurenz Albe


From: Mark Dilger <pgsql(at)markdilger(dot)com>
To: Albe Laurenz <all(at)adv(dot)magwien(dot)gv(dot)at>
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-03 16:44:36
Message-ID: 46128474.5000507@markdilger.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Albe Laurenz wrote:
> What I suggest (and what Oracle implements, and isn't CHR() and ASCII()
> partly for Oracle compatibility?) is that CHR() and ASCII()
> convert between a character (in database encoding) and
> that database encoding in numeric form.

Looking at Oracle documentation, it appears that you get different
behavior from CHR(X [USING NCHAR_CS]) depending on whether you call it
with the argument USING NCHAR_CS. Oracle 9i and higher have an
additional function called NCHR(X) which is supposed to be the same as
CHR(X USING NCHAR_CS).

On http://www.oraclehome.co.uk/chr-function.htm it says that "To use
UTF8, you specify using nchar_cs in the argument list". Does this mean
that CHR(X) behaves as Tom Lane wants, and NCHR(X) behaves as Albe
Laurenz wants? Vice versa?

I'm not saying that Oracle compatibility is paramount. But if we can
get compatibility and a reasonable implementation at the same time, that
seems like a bonus.

Once again, I don't have Oracle installed and cannot test this :(

mark


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Mark Dilger <pgsql(at)markdilger(dot)com>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Albe Laurenz <all(at)adv(dot)magwien(dot)gv(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-03 17:06:38
Message-ID: 9805.1175619998@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Mark Dilger <pgsql(at)markdilger(dot)com> writes:
> Martijn van Oosterhout wrote:
>> Just about every multibyte encoding other than Unicode has the problem
>> of not distinguishing between the code point and the encoding of it.

> Thanks for the feedback. Would you say that the way I implemented things in the
> example code would be correct for multibyte non Unicode encodings?

I think it's probably defensible for non-Unicode encodings. To do
otherwise would require (a) figuring out what the equivalent concept to
"code point" is for each encoding, and (b) having a separate code path
for each encoding to perform the mapping. It's not clear that there
even is an answer to (a), and (b) seems like more work than chr() is
worth. But we know what the right way is for Unicode, so we should
special case that one.

Note the points made that in all cases ascii() and chr() should be
inverses, and that you shouldn't just fall back to the old behavior
in SQL_ASCII encoding. (My vote for SQL_ASCII would be to reject
values > 255.)

regards, tom lane


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Mark Dilger <pgsql(at)markdilger(dot)com>, Albe Laurenz <all(at)adv(dot)magwien(dot)gv(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-04 06:01:56
Message-ID: 20070404060156.GA22542@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Apr 03, 2007 at 01:06:38PM -0400, Tom Lane wrote:
> I think it's probably defensible for non-Unicode encodings. To do
> otherwise would require (a) figuring out what the equivalent concept to
> "code point" is for each encoding, and (b) having a separate code path
> for each encoding to perform the mapping. It's not clear that there
> even is an answer to (a), and (b) seems like more work than chr() is
> worth. But we know what the right way is for Unicode, so we should
> special case that one.

I dunno. I find it odd that if I want a pl/pgsql function to return a
Euro symbol, it has to know what encoding the DB is in. Though I
suppose that would call for a unicode_chr() function.

Is there any multibyte mapping other than unicode that distinguishes
between the character set and the encoding thereof?

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.


From: "Albe Laurenz" <all(at)adv(dot)magwien(dot)gv(dot)at>
To: "Mark Dilger *EXTERN*" <pgsql(at)markdilger(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-04 07:40:02
Message-ID: AFCCBB403D7E7A4581E48F20AF3E5DB20203E26F@EXADV1.host.magwien.gv.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Mark Dilger wrote:
>> What I suggest (and what Oracle implements, and isn't CHR() and
ASCII()
>> partly for Oracle compatibility?) is that CHR() and ASCII()
>> convert between a character (in database encoding) and
>> that database encoding in numeric form.
>
> Looking at Oracle documentation, it appears that you get different
> behavior from CHR(X [USING NCHAR_CS]) depending on whether you call it

> with the argument USING NCHAR_CS. Oracle 9i and higher have an
> additional function called NCHR(X) which is supposed to be the same as

> CHR(X USING NCHAR_CS).
>
> On http://www.oraclehome.co.uk/chr-function.htm it says that "To use
> UTF8, you specify using nchar_cs in the argument list". Does this
mean
> that CHR(X) behaves as Tom Lane wants, and NCHR(X) behaves as Albe
> Laurenz wants? Vice versa?

That web page is misleading at least, if not downright wrong.

It's just that an Oracle database has 2 character sets, a "database
character set" and a "national character set", the latter always being a
UNICODE encoding (the name "national character set" is somewhat
misleading).

This baroque concept is from those days when nobody had a UNICODE
database, but people still wanted to store characters not supported
by the "database character set" - in that case you could define a column
to be in the "national character set".

CHR(n) and CHR(n USING NCHAR_CS) = NCHR(n) are the same function, only
that the first one uses the "database character set" and the latter ones
the "national character set".

Nowadays this Oracle concept of "national character set" is nearly
obsolete, one normally uses a UNICODE "database character set".

Oracle has two things to say about CHR():

"For single-byte character sets, if n > 256, then Oracle Database
returns the binary equivalent of n mod 256. For multibyte character
sets, n must resolve to one entire code point. Invalid code points
are not validated, and the result of specifying invalid code points
is indeterminate."

It seems that Oracle means "encoding" when it says "code point" :^)
We should of course reject invalid arguments!
I don't know if I like this modulus thing for single byte encodings
or not...

"Use of the CHR function (either with or without the optional USING
NCHAR_CS clause) results in code that is not portable between ASCII-
and EBCDIC-based machine architectures."

There's one thing that strikes me as weird in your implementation:

> pgsql=# select chr(0);
> ERROR: character 0x00 of encoding "SQL_ASCII" has no equivalent in
"UTF8"

0x00 is a valid UNICODE code point and also a valid UTF-8 character!

To me (maybe only to me) CHR() and ASCII() have always had the look
and feel of "type casts" between "char" and integer, with all the lack
of portability this might imply.

Yours,
Laurenz Albe


From: "Zeugswetter Andreas ADI SD" <ZeugswetterA(at)spardat(dot)at>
To: "Albe Laurenz" <all(at)adv(dot)magwien(dot)gv(dot)at>, <andrew(at)supernews(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-04 08:12:35
Message-ID: E1539E0ED7043848906A8FF995BDA57901E7B67A@m0143.s-mxs.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


> What do others think? Should the argument to CHR() be a
> Unicode code point or the numeric representation of the
> database encoding?

When the database uses a single byte encoding, the chr function takes
the binary byte representation as an integer number between 0 and 255
(e.g. ascii code).
When the database encoding is one of the unicode encodings it takes a
unicode code point.
This is also what Oracle does.

Not sure what to do with other multibyte encodings.
Oracle only states that the numeric argument must resolve to one entire
code point,
whatever that is.

Andreas


From: "Albe Laurenz" <all(at)adv(dot)magwien(dot)gv(dot)at>
To: "Zeugswetter Andreas ADI SD *EXTERN*" <ZeugswetterA(at)spardat(dot)at>, <andrew(at)supernews(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-04 08:34:52
Message-ID: AFCCBB403D7E7A4581E48F20AF3E5DB20203E424@EXADV1.host.magwien.gv.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> When the database uses a single byte encoding, the chr function takes
> the binary byte representation as an integer number between 0 and 255
> (e.g. ascii code).
> When the database encoding is one of the unicode encodings it takes a
> unicode code point.
> This is also what Oracle does.

Sorry, but this is *NOT* what Oracle does.
At least if we can agree that the code point for the Euro sign
is 0x20AC.

SQL> SELECT ASCII('EUR') AS DEC,
2 TO_CHAR(ASCII('EUR'), 'XXXXXX') AS HEX
3 FROM DUAL;

DEC HEX
---------- ----------------------------
14844588 E282AC

The encoding in this example is AL32UTF8, which corresponds to
our UTF8.

Yours,
Laurenz Albe


From: "Zeugswetter Andreas ADI SD" <ZeugswetterA(at)spardat(dot)at>
To: "Albe Laurenz" <all(at)adv(dot)magwien(dot)gv(dot)at>, <andrew(at)supernews(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-04 10:02:07
Message-ID: E1539E0ED7043848906A8FF995BDA57901E7B6D4@m0143.s-mxs.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


> > When the database uses a single byte encoding, the chr function
takes
> > the binary byte representation as an integer number between 0 and
255
> > (e.g. ascii code).
> > When the database encoding is one of the unicode encodings it takes
a
> > unicode code point.
> > This is also what Oracle does.
>
> Sorry, but this is *NOT* what Oracle does.
> At least if we can agree that the code point for the Euro
> sign is 0x20AC.

yes

>
> SQL> SELECT ASCII('EUR') AS DEC,
> 2 TO_CHAR(ASCII('EUR'), 'XXXXXX') AS HEX
> 3 FROM DUAL;
>
> DEC HEX
> ---------- ----------------------------
> 14844588 E282AC
>
> The encoding in this example is AL32UTF8, which corresponds
> to our UTF8.

You are right, I am sorry. My test was broken.

To get the euro symbol in Oracle with a AL32UTF8 encoding you use
chr(14844588)

Andreas


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Mark Dilger <pgsql(at)markdilger(dot)com>, Albe Laurenz <all(at)adv(dot)magwien(dot)gv(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-04 13:26:38
Message-ID: 20070404132638.GB8549@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Martijn van Oosterhout wrote:
> On Tue, Apr 03, 2007 at 01:06:38PM -0400, Tom Lane wrote:
> > I think it's probably defensible for non-Unicode encodings. To do
> > otherwise would require (a) figuring out what the equivalent concept to
> > "code point" is for each encoding, and (b) having a separate code path
> > for each encoding to perform the mapping. It's not clear that there
> > even is an answer to (a), and (b) seems like more work than chr() is
> > worth. But we know what the right way is for Unicode, so we should
> > special case that one.
>
> I dunno. I find it odd that if I want a pl/pgsql function to return a
> Euro symbol, it has to know what encoding the DB is in. Though I
> suppose that would call for a unicode_chr() function.

Right -- IMHO what we should be doing is reject any input to chr() which
is beyond plain ASCII (or maybe > 255), and create a separate function
(unicode_char() sounds good) to get an Unicode character from a code
point, converted to the local client_encoding per conversion_procs.

So if I'm in Latin-1 and ask for the Euro sign, this should fail because
Latin-1 does not have the euro sign. If I'm in Latin-9 I should get the
Euro.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Andrew - Supernews <andrew+nonews(at)supernews(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-04 13:45:28
Message-ID: slrnf17avo.2i67.andrew+nonews@atlantis.supernews.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2007-04-04, Alvaro Herrera <alvherre(at)commandprompt(dot)com> wrote:
> Right -- IMHO what we should be doing is reject any input to chr() which
> is beyond plain ASCII (or maybe > 255), and create a separate function
> (unicode_char() sounds good) to get an Unicode character from a code
> point, converted to the local client_encoding per conversion_procs.

Thinking about this made me realize that there's another, ahem, elephant
in the room here: convert().

By definition convert() returns text strings which are not valid in the
server encoding. How can this be addressed?

--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Mark Dilger <pgsql(at)markdilger(dot)com>, Albe Laurenz <all(at)adv(dot)magwien(dot)gv(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-04 14:22:28
Message-ID: 21381.1175696548@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Right -- IMHO what we should be doing is reject any input to chr() which
> is beyond plain ASCII (or maybe > 255), and create a separate function
> (unicode_char() sounds good) to get an Unicode character from a code
> point, converted to the local client_encoding per conversion_procs.

Hm, I hadn't thought of that approach, but another idea is that the
argument of chr() is *always* a unicode code point, and it converts
to the current encoding. Do we really need a separate function?

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: andrew(at)supernews(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-04 15:11:20
Message-ID: 21914.1175699480@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew - Supernews <andrew+nonews(at)supernews(dot)com> writes:
> Thinking about this made me realize that there's another, ahem, elephant
> in the room here: convert().
> By definition convert() returns text strings which are not valid in the
> server encoding. How can this be addressed?

Remove convert(). Or at least redefine it as dealing in bytea not text.

regards, tom lane


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Mark Dilger <pgsql(at)markdilger(dot)com>, Albe Laurenz <all(at)adv(dot)magwien(dot)gv(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-04 15:21:41
Message-ID: 20070404152141.GD22542@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Apr 04, 2007 at 10:22:28AM -0400, Tom Lane wrote:
> Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> > Right -- IMHO what we should be doing is reject any input to chr() which
> > is beyond plain ASCII (or maybe > 255), and create a separate function
> > (unicode_char() sounds good) to get an Unicode character from a code
> > point, converted to the local client_encoding per conversion_procs.
>
> Hm, I hadn't thought of that approach, but another idea is that the
> argument of chr() is *always* a unicode code point, and it converts
> to the current encoding. Do we really need a separate function?

That's what I'd advocate, but then we're not Oracle compatable...

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.


From: Tatsuo Ishii <ishii(at)postgresql(dot)org>
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: alvherre(at)commandprompt(dot)com, kleptog(at)svana(dot)org, pgsql(at)markdilger(dot)com, all(at)adv(dot)magwien(dot)gv(dot)at, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-04 15:33:40
Message-ID: 20070405.003340.123426498.t-ishii@sraoss.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> > Right -- IMHO what we should be doing is reject any input to chr() which
> > is beyond plain ASCII (or maybe > 255), and create a separate function
> > (unicode_char() sounds good) to get an Unicode character from a code
> > point, converted to the local client_encoding per conversion_procs.
>
> Hm, I hadn't thought of that approach, but another idea is that the
> argument of chr() is *always* a unicode code point, and it converts
> to the current encoding. Do we really need a separate function?

To be honest, I don't really see why we need to rush to add such
Unicode(I assume we are reffering to "Unicode" as ISO 10646)
specialized functions at this point. Limiting chr() to ASCII range is
enough, I think.

BTW, every encoding has its own charset. However the relationship
between encoding and charset are not so simple as Unicode. For
example, encoding EUC_JP correponds to multiple charsets, namely
ASCII, JIS X 0201, JIS X 0208 and JIS X 0212. So a function which
returns a "code point" is not quite usefull since it lacks the charset
info. I think we need to continute design discussion, probably
targetting for 8.4, not 8.3.
--
Tatsuo Ishii
SRA OSS, Inc. Japan


From: Mark Dilger <pgsql(at)markdilger(dot)com>
To: Albe Laurenz <all(at)adv(dot)magwien(dot)gv(dot)at>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-04 15:41:19
Message-ID: 4613C71F.6010906@markdilger.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Albe Laurenz wrote:
> There's one thing that strikes me as weird in your implementation:
>
>> pgsql=# select chr(0);
>> ERROR: character 0x00 of encoding "SQL_ASCII" has no equivalent in
> "UTF8"
>
> 0x00 is a valid UNICODE code point and also a valid UTF-8 character!

It's not my code that rejects this. I'm passing the resultant string to
the pg_verify_mbstr(...) function and it is rejecting a null. I could
change that, of course, but if other parts of the system found it
reasonable to reject null, why should chr() be different?

This was also one of the questions I asked upthread. Does changing the
behavior of chr() break people's reasonable expectations?

mark


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tatsuo Ishii <ishii(at)postgresql(dot)org>
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, kleptog(at)svana(dot)org, pgsql(at)markdilger(dot)com, all(at)adv(dot)magwien(dot)gv(dot)at, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-04 15:50:32
Message-ID: 20070404155032.GH8549@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tatsuo Ishii wrote:

> BTW, every encoding has its own charset. However the relationship
> between encoding and charset are not so simple as Unicode. For
> example, encoding EUC_JP correponds to multiple charsets, namely
> ASCII, JIS X 0201, JIS X 0208 and JIS X 0212. So a function which
> returns a "code point" is not quite usefull since it lacks the charset
> info. I think we need to continute design discussion, probably
> targetting for 8.4, not 8.3.

Is Unicode complete as far as Japanese chars go? I mean, is there a
character in EUC_JP that is not representable in Unicode?

Because if Unicode is complete, ISTM it makes perfect sense to have a
unicode_char() (or whatever we end up calling it) that takes an Unicode
code point and returns a character in whatever JIS set you want
(specified by setting client_encoding to that). Because then you solved
the problem nicely.

One thing that I find confusing in your text above is whether EUC_JP is
an encoding or a charset? I would think that the various JIS X are
encodings, and EUC_JP is the charset; or is it the other way around?

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Mark Dilger <pgsql(at)markdilger(dot)com>
To: Tatsuo Ishii <ishii(at)postgresql(dot)org>
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, alvherre(at)commandprompt(dot)com, kleptog(at)svana(dot)org, all(at)adv(dot)magwien(dot)gv(dot)at, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-04 15:56:50
Message-ID: 4613CAC2.30807@markdilger.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tatsuo Ishii wrote:

> <SNIP>. I think we need to continute design discussion, probably
> targetting for 8.4, not 8.3.

The discussion came about because Andrew - Supernews noticed that chr()
returns invalid utf8, and we're trying to fix all the bugs with invalid
utf8 in the system. Something needs to be done, even if we just check
the result of the current chr() implementation and throw an error on
invalid results. But do we want to make this minor change for 8.3 and
then change it again for 8.4?

Here's an example of the current problem. It's an 8.2.3 database with
utf8.en_US encoding

mark=# create table testutf8 (t text);
CREATE TABLE
mark=# insert into testutf8 (t) (select chr(gs) from
generate_series(0,255) as gs);
INSERT 0 256
mark=# \copy testutf8 to testutf8.data
mark=# truncate testutf8;
TRUNCATE TABLE
mark=# \copy testutf8 from testutf8.data
ERROR: invalid byte sequence for encoding "UTF8": 0x80
HINT: This error can also happen if the byte sequence does not match
the encoding expected by the server, which is controlled by
"client_encoding".
CONTEXT: COPY testutf8, line 129


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Mark Dilger <pgsql(at)markdilger(dot)com>, Albe Laurenz <all(at)adv(dot)magwien(dot)gv(dot)at>
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-04 15:57:57
Message-ID: 200704041757.58574.peter_e@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Am Mittwoch, 4. April 2007 16:22 schrieb Tom Lane:
> Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> > Right -- IMHO what we should be doing is reject any input to chr() which
> > is beyond plain ASCII (or maybe > 255), and create a separate function
> > (unicode_char() sounds good) to get an Unicode character from a code
> > point, converted to the local client_encoding per conversion_procs.
>
> Hm, I hadn't thought of that approach, but another idea is that the
> argument of chr() is *always* a unicode code point, and it converts
> to the current encoding. Do we really need a separate function?

The SQL standard has a "Unicode character string literal", which looks like
this:

U&'The price is 100 \20AC.'

This is similar in spirit to our current escape mechanism available via E'...'
which, however, produces bytes. It has the advantage over a chr()-based
mechanism that the composition of strings doesn't require an ugly chain of
literals, functions, and concatenations.

Implementing this would, however, be a bit tricky because you don't have
access to the encoding conversion functions in the lexer. You would probably
have to map that to a function call an evaluate it later.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Mark Dilger <pgsql(at)markdilger(dot)com>
Cc: Albe Laurenz <all(at)adv(dot)magwien(dot)gv(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-04 17:01:18
Message-ID: 23020.1175706078@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Mark Dilger <pgsql(at)markdilger(dot)com> writes:
> Albe Laurenz wrote:
>> 0x00 is a valid UNICODE code point and also a valid UTF-8 character!

> It's not my code that rejects this. I'm passing the resultant string to
> the pg_verify_mbstr(...) function and it is rejecting a null. I could
> change that, of course, but if other parts of the system found it
> reasonable to reject null, why should chr() be different?

chr() really should reject that if it's going to do any error checking
at all, because text operations tend to misbehave on embedded nulls.

regards, tom lane


From: Tatsuo Ishii <ishii(at)postgresql(dot)org>
To: pgsql(at)markdilger(dot)com
Cc: ishii(at)postgresql(dot)org, tgl(at)sss(dot)pgh(dot)pa(dot)us, alvherre(at)commandprompt(dot)com, kleptog(at)svana(dot)org, all(at)adv(dot)magwien(dot)gv(dot)at, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-04 21:52:06
Message-ID: 20070405.065206.51277520.t-ishii@sraoss.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Tatsuo Ishii wrote:
>
> > <SNIP>. I think we need to continute design discussion, probably
> > targetting for 8.4, not 8.3.
>
> The discussion came about because Andrew - Supernews noticed that chr()
> returns invalid utf8, and we're trying to fix all the bugs with invalid
> utf8 in the system. Something needs to be done, even if we just check
> the result of the current chr() implementation and throw an error on
> invalid results. But do we want to make this minor change for 8.3 and
> then change it again for 8.4?

My opinion was in the snipped part by you in the previous mail --
Limiting chr() to ASCII range
--
Tatsuo Ishii
SRA OSS, Inc. Japan

> Here's an example of the current problem. It's an 8.2.3 database with
> utf8.en_US encoding
>
>
> mark=# create table testutf8 (t text);
> CREATE TABLE
> mark=# insert into testutf8 (t) (select chr(gs) from
> generate_series(0,255) as gs);
> INSERT 0 256
> mark=# \copy testutf8 to testutf8.data
> mark=# truncate testutf8;
> TRUNCATE TABLE
> mark=# \copy testutf8 from testutf8.data
> ERROR: invalid byte sequence for encoding "UTF8": 0x80
> HINT: This error can also happen if the byte sequence does not match
> the encoding expected by the server, which is controlled by
> "client_encoding".
> CONTEXT: COPY testutf8, line 129
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
> choose an index scan if your joining column's datatypes do not
> match


From: Tatsuo Ishii <ishii(at)postgresql(dot)org>
To: alvherre(at)commandprompt(dot)com
Cc: ishii(at)postgresql(dot)org, tgl(at)sss(dot)pgh(dot)pa(dot)us, kleptog(at)svana(dot)org, pgsql(at)markdilger(dot)com, all(at)adv(dot)magwien(dot)gv(dot)at, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-05 00:34:25
Message-ID: 20070405.093425.67003822.t-ishii@sraoss.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Tatsuo Ishii wrote:
>
> > BTW, every encoding has its own charset. However the relationship
> > between encoding and charset are not so simple as Unicode. For
> > example, encoding EUC_JP correponds to multiple charsets, namely
> > ASCII, JIS X 0201, JIS X 0208 and JIS X 0212. So a function which
> > returns a "code point" is not quite usefull since it lacks the charset
> > info. I think we need to continute design discussion, probably
> > targetting for 8.4, not 8.3.
>
> Is Unicode complete as far as Japanese chars go? I mean, is there a
> character in EUC_JP that is not representable in Unicode?

I don't think Unicode is "complete" in this case. Problems are: EUC_JP
allows user defined characters which are not mapped to Unicode. Also
some characters in EUC_JP corresponds to multiple Unicode points.

> Because if Unicode is complete, ISTM it makes perfect sense to have a
> unicode_char() (or whatever we end up calling it) that takes an Unicode
> code point and returns a character in whatever JIS set you want
> (specified by setting client_encoding to that). Because then you solved
> the problem nicely.

I'm not sure what kind of use case for unicode_char() you are thinking
about. Anyway if you want a "code point" from a character, we could
easily add such functions to all backend encodings currently we
support. Probably it would look like:

to_code_point(str TEXT) returns TEXT

An example outputs are:

ASCII - 41
ISO 10646 - U+0041
ISO 10646 - U+29E3D
ISO 8859-1 - a5
JIS X 0208 - 4141

It's a little bit too late for 8.2 though.

> One thing that I find confusing in your text above is whether EUC_JP is
> an encoding or a charset? I would think that the various JIS X are
> encodings, and EUC_JP is the charset; or is it the other way around?

No, EUC_JP is an encoding. JIS X are the charsets.
--
Tatsuo Ishii
SRA OSS, Inc. Japan


From: Tatsuo Ishii <ishii(at)postgresql(dot)org>
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: andrew(at)supernews(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-05 00:56:14
Message-ID: 20070405.095614.95827390.t-ishii@sraoss.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Andrew - Supernews <andrew+nonews(at)supernews(dot)com> writes:
> > Thinking about this made me realize that there's another, ahem, elephant
> > in the room here: convert().
> > By definition convert() returns text strings which are not valid in the
> > server encoding. How can this be addressed?
>
> Remove convert(). Or at least redefine it as dealing in bytea not text.

That would break some important use cases.

1) A user have UTF-8 database which contains various language
data. Each language has its own table. He wants to sort a SELECT
result by using ORDER BY. Since locale cannot handle multiple
languages, he uses C locale and do the SELECT something like this:

SELECT * FROM french_table ORDER BY convert(t, 'LATIN1');
SELECT * FROM japanese_table ORDER BY convert(t, 'EUC_JP');

2) A user has a UTF-8 database but unfortunately his OS's UTF-8 locale
is broken. He decided to use C locale and want to sort the result
from SELECT like this.

SELECT * FROM japanese_table ORDER BY convert(t, 'EUC_JP');

Note that sorting by UTF-8 physical order would produce random
results. So following would not help him in this case:

SELECT * FROM japanese_table ORDER BY t;

Also I don't understand what this is different to the problem when we
have a message catalogue which does not match the encoding.
--
Tatsuo Ishii
SRA OSS, Inc. Japan


From: Andrew - Supernews <andrew+nonews(at)supernews(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-05 01:35:19
Message-ID: slrnf18kin.2i67.andrew+nonews@atlantis.supernews.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2007-04-05, Tatsuo Ishii <ishii(at)postgresql(dot)org> wrote:
>> Andrew - Supernews <andrew+nonews(at)supernews(dot)com> writes:
>> > Thinking about this made me realize that there's another, ahem, elephant
>> > in the room here: convert().
>> > By definition convert() returns text strings which are not valid in the
>> > server encoding. How can this be addressed?
>>
>> Remove convert(). Or at least redefine it as dealing in bytea not text.
>
> That would break some important use cases.
>
> 1) A user have UTF-8 database which contains various language
> data. Each language has its own table. He wants to sort a SELECT
> result by using ORDER BY. Since locale cannot handle multiple
> languages, he uses C locale and do the SELECT something like this:
>
> SELECT * FROM french_table ORDER BY convert(t, 'LATIN1');
> SELECT * FROM japanese_table ORDER BY convert(t, 'EUC_JP');

That works without change if convert(text,text) returns bytea.
>
> 2) A user has a UTF-8 database but unfortunately his OS's UTF-8 locale
> is broken. He decided to use C locale and want to sort the result
> from SELECT like this.
>
> SELECT * FROM japanese_table ORDER BY convert(t, 'EUC_JP');

That also works without change if convert(text,text) returns bytea.

--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services


From: "Albe Laurenz" <all(at)adv(dot)magwien(dot)gv(dot)at>
To: <pgsql(at)markdilger(dot)com>
Cc: <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <alvherre(at)commandprompt(dot)com>, <kleptog(at)svana(dot)org>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-05 09:52:14
Message-ID: AFCCBB403D7E7A4581E48F20AF3E5DB202088849@EXADV1.host.magwien.gv.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tatsuo Ishii wrote:
> I think we need to continute design discussion, probably
> targetting for 8.4, not 8.3.

But isn't a simple fix for chr() and ascii(), which does not
require a redesign, a Good Thing for 8.3 if possible? Something
that maintains as much upward and/or Oracle compatibility as
possible while doing away with ascii('EUR') returning 226 in UTF-8?

And I think - correct me if I am wrong - that conversion between
character and integer representation of the character in the current
database encoding is exactly that.

I see Tom Lane's point in rejecting chr(0), though.

Yours,
Laurenz Albe


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Tatsuo Ishii <ishii(at)postgresql(dot)org>
Cc: alvherre(at)commandprompt(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, pgsql(at)markdilger(dot)com, all(at)adv(dot)magwien(dot)gv(dot)at, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-05 11:58:48
Message-ID: 20070405115848.GB17587@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Apr 05, 2007 at 09:34:25AM +0900, Tatsuo Ishii wrote:
> I'm not sure what kind of use case for unicode_char() you are thinking
> about. Anyway if you want a "code point" from a character, we could
> easily add such functions to all backend encodings currently we
> support. Probably it would look like:

I think the problem is that most encodings do not have the concept of a
code point anyway, so implementing it for them is fairly useless.

> An example outputs are:
>
> ASCII - 41
> ISO 10646 - U+0041
> ISO 10646 - U+29E3D
> ISO 8859-1 - a5
> JIS X 0208 - 4141

In every case other than Unicode you're doing the same thing as
encode/decode. Since we already have those functions, there's no need
to get chr/ascii to duplicate it. In the case of UTF-8 however, it does
something that is not done by encode/decode, hence the proposal to
simply extend chr/ascii to do that.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Albe Laurenz <all(at)adv(dot)magwien(dot)gv(dot)at>
Cc: pgsql(at)markdilger(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, alvherre(at)commandprompt(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-05 12:29:42
Message-ID: 20070405122942.GC17587@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Apr 05, 2007 at 11:52:14AM +0200, Albe Laurenz wrote:
> But isn't a simple fix for chr() and ascii(), which does not
> require a redesign, a Good Thing for 8.3 if possible? Something
> that maintains as much upward and/or Oracle compatibility as
> possible while doing away with ascii('EUR') returning 226 in UTF-8?

I think the earlier expressed idea of getting chr/ascii to bail on
non-ascii character isn't a bad one.

I think your idea is bad in the sense that I actually need to know the
encoding of the character I want to be able to use it. If I knew the
encoding already, I could just use encode().

What I was thinking of was something that, irrespective of the
encoding, gave me a string properly encoded with the character I want.
Since AFAIK Unicode is the only character set that actually numbers the
characters in a way not related to the encoding, so it would seem
useful to be able to give a unicode character number and get a string
with that character...

So your implemntation is simply:
1. Take number and make UTF-8 string
2. Convert it to database encoding.

> And I think - correct me if I am wrong - that conversion between
> character and integer representation of the character in the current
> database encoding is exactly that.

AFAIK there is no "integer representation" of a character in anything
other than Unicode. Unicode is the only case that cannot be handled by
a simple encode/decode.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Tatsuo Ishii <ishii(at)postgresql(dot)org>, alvherre(at)commandprompt(dot)com, pgsql(at)markdilger(dot)com, all(at)adv(dot)magwien(dot)gv(dot)at, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-05 14:47:18
Message-ID: 27120.1175784438@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> I think the problem is that most encodings do not have the concept of a
> code point anyway, so implementing it for them is fairly useless.

Yeah. I'm beginning to think that the right thing to do is

(a) make chr/ascii do the same thing as Oracle (ie, as in the discussed
patch)

(b) make a different pair of functions that translate Unicode code
points to/from the current database encoding.

regards, tom lane


From: "Albe Laurenz" <all(at)adv(dot)magwien(dot)gv(dot)at>
To: "Martijn van Oosterhout" <kleptog(at)svana(dot)org>
Cc: <pgsql(at)markdilger(dot)com>, <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <alvherre(at)commandprompt(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-04-06 06:56:53
Message-ID: AFCCBB403D7E7A4581E48F20AF3E5DB2020DE687@EXADV1.host.magwien.gv.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Martijn van Oosterhout wrote:
> So your implemntation is simply:
> 1. Take number and make UTF-8 string
> 2. Convert it to database encoding.

Aah, now I can spot where the misunderstanding is.
That's not what I mean.

I mean that chr() should simply 'typecast' to "char".

So when the database encoding is UTF8, I want
chr(14844588) to return a Euro sign, and when the encoding
is LATIN9, then chr(14844588) should either yield the 'not'
sign (UNICODE 0xAC) or an error message, depending on whether
we want chr() to operate mod 256 like Oracle has it for
single byte character sets or not, while chr(164) should
return the Euro sign for LATIN9 database encoding.

Yours,
Laurenz Albe


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Mark Dilger <pgsql(at)markdilger(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Bruce Momjian <bruce(at)momjian(dot)us>
Subject: Re: Bug in UTF8-Validation Code?
Date: 2007-06-13 23:35:55
Message-ID: 46707F5B.7080802@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


What is the state of play with this item? I think this is a must-fix bug
for 8.3. There was a flurry of messages back in April but since then I
don't recall seeing anything.

cheers

andrew

Mark Dilger wrote:
> Mark Dilger wrote:
>> Bruce Momjian wrote:
>>> Added to TODO:
>>>
>>> * Fix cases where invalid byte encodings are accepted by the
>>> database,
>>> but throw an error on SELECT
>>>
>>> http://archives.postgresql.org/pgsql-hackers/2007-03/msg00767.php
>>>
>>> Is anyone working on fixing this bug?
>>
>> Hi, has anyone volunteered to fix this bug? I did not see any reply
>> on the mailing list to your question above.
>>
>> mark
>
> OK, I can take a stab at fixing this. I'd like to state some
> assumptions so people can comment and reply:
>
> I assume that I need to fix *all* cases where invalid byte encodings
> get into the database through functions shipped in the core distribution.
>
> I assume I do not need to worry about people getting bad data into the
> system through their own database extensions.
>
> I assume that the COPY problem discussed up-thread goes away once you
> eliminate all the paths by which bad data can get into the system.
> However, existing database installations with bad data already loaded
> will not be magically fixed with these code patches.
>
> Do any of the string functions (see
> http://www.postgresql.org/docs/8.2/interactive/functions-string.html)
> run the risk of generating invalid utf8 encoded strings? Do I need to
> add checks? Are there known bugs with these functions in this regard?
>
> If not, I assume I can add mbverify calls to the various input
> routines (textin, varcharin, etc) where invalid utf8 could otherwise
> enter the system.
>
> I assume that this work can be limited to HEAD and that I don't need
> to back-patch it. (I suspect this assumption is a contentious one.)
>
> Advice and comments are welcome,
>
>