Quick Links

Re: plperlu problem with utf8

Lists:	pgsql-hackers

From:	Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To:	Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	plperlu problem with utf8
Date:	2010-12-08 15:13:01
Message-ID:	Pine.LNX.4.64.1012081803160.12632@sn.sai.msu.ru
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi there,

below is the problem, which I don't have when running in shell. The database is
in UTF-8 encoding.

CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar AS $$
use strict;
use URI::Escape;
return uri_unescape($_[0]);
$$ LANGUAGE plperlu;
CREATE FUNCTION
Time: 1.416 ms
select url_decode('comment%20passer%20le%20r%C3%A9veillon');
url_decode
------------------------------
comment passer le rveillon
^
non-printed

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc:	Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: plperlu problem with utf8
Date:	2010-12-08 16:07:48
Message-ID:	4CFFAD54.1010002@dunslane.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 12/08/2010 10:13 AM, Oleg Bartunov wrote:
> Hi there,
>
> below is the problem, which I don't have when running in shell. The
> database is in UTF-8 encoding.
>
> CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar AS $$
> use strict;
> use URI::Escape;
> return uri_unescape($_[0]); $$ LANGUAGE plperlu;
> CREATE FUNCTION
> Time: 1.416 ms
> select url_decode('comment%20passer%20le%20r%C3%A9veillon');
> url_decode ------------------------------
> comment passer le rveillon
> ^
> non-printed

I get: (platform is Fedora 13, git tip, perl 5.10.1):

andrew=# CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS
varchar AS $$
andrew$# use strict;
andrew$# use URI::Escape;
andrew$# return uri_unescape($_[0]); $$ LANGUAGE plperlu;
CREATE FUNCTION
andrew=# select url_decode('comment%20passer%20le%20r%C3%A9veillon');
url_decode
------------------------------
comment passer le rÃ©veillon
(1 row)

andrew=#

which makes it look like we might have some double escaping going on
here, but at least I don't get nothing :-)

Further experimentation shows even more weirdness. There's definitely
something odd about the utf8 handling. Will dig further.

cheers

andrew

From:	Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: plperlu problem with utf8
Date:	2010-12-08 16:13:08
Message-ID:	Pine.LNX.4.64.1012081912200.12632@sn.sai.msu.ru
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

adding utf8::decode($_[0]) solves the problem:

knn=# CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar AS $$
use strict;
use URI::Escape;
utf8::decode($_[0]);
return uri_unescape($_[0]);
$$ LANGUAGE plperlu;

Oleg
On Wed, 8 Dec 2010, Andrew Dunstan wrote:

>
>
> On 12/08/2010 10:13 AM, Oleg Bartunov wrote:
>> Hi there,
>>
>> below is the problem, which I don't have when running in shell. The
>> database is in UTF-8 encoding.
>>
>> CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar AS $$
>> use strict;
>> use URI::Escape;
>> return uri_unescape($_[0]); $$ LANGUAGE plperlu;
>> CREATE FUNCTION
>> Time: 1.416 ms
>> select url_decode('comment%20passer%20le%20r%C3%A9veillon');
>> url_decode ------------------------------
>> comment passer le rveillon
>> ^
>> non-printed
>
>
> I get: (platform is Fedora 13, git tip, perl 5.10.1):
>
> andrew=# CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS
> varchar AS $$
> andrew$# use strict;
> andrew$# use URI::Escape;
> andrew$# return uri_unescape($_[0]); $$ LANGUAGE plperlu;
> CREATE FUNCTION
> andrew=# select url_decode('comment%20passer%20le%20r%C3%A9veillon');
> url_decode
> ------------------------------
> comment passer le r?©veillon
> (1 row)
>
> andrew=#
>
> which makes it look like we might have some double escaping going on here,
> but at least I don't get nothing :-)
>
> Further experimentation shows even more weirdness. There's definitely
> something odd about the utf8 handling. Will dig further.
>
> cheers
>
> andrew
>
>
>

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: plperlu problem with utf8
Date:	2010-12-08 17:18:42
Message-ID:	7A864243-5506-4667-A25C-89AB426439D6@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Dec 8, 2010, at 8:13 AM, Oleg Bartunov wrote:

> adding utf8::decode($_[0]) solves the problem:
>
> knn=# CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar AS $$
> use strict;
> use URI::Escape;
> utf8::decode($_[0]);
> return uri_unescape($_[0]); $$ LANGUAGE plperlu;

Hrm. Ideally all strings passed to PL/Perl functions would be decoded.

Best,

David

From:	Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: plperlu problem with utf8
Date:	2010-12-08 21:15:53
Message-ID:	Pine.LNX.4.64.1012090015350.12632@sn.sai.msu.ru
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, 8 Dec 2010, David E. Wheeler wrote:

> On Dec 8, 2010, at 8:13 AM, Oleg Bartunov wrote:
>
>> adding utf8::decode($_[0]) solves the problem:
>>
>> knn=# CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar AS $$
>> use strict;
>> use URI::Escape;
>> utf8::decode($_[0]);
>> return uri_unescape($_[0]); $$ LANGUAGE plperlu;
>
> Hrm. Ideally all strings passed to PL/Perl functions would be decoded.

yes, this is what I expected.
>
> Best,
>
> David
>
>
>

From:	Alex Hunsaker <badalex(at)gmail(dot)com>
To:	Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: plperlu problem with utf8
Date:	2010-12-17 02:39:54
Message-ID:	AANLkTi=wEQAuw8V+YNB9XyeVHJxtYY0OUat_saB=HjP-@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Dec 8, 2010 at 14:15, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> wrote:
> On Wed, 8 Dec 2010, David E. Wheeler wrote:
>
>> On Dec 8, 2010, at 8:13 AM, Oleg Bartunov wrote:
>>
>>> adding utf8::decode($_[0]) solves the problem:
>> Hrm. Ideally all strings passed to PL/Perl functions would be decoded.
>
> yes, this is what I expected.

Erm... no. The in and out from perl AFAICT works fine (minus a caveat
I found, see the end of the mail).

The problem here is you have the url encoded utf8 bytes "%C3%A9".
URL::Encode basically does chr(hex("c3")) and chr(hex("a9"));. Perl,
generally, will treat that as two separate unicode code points. So
you end up with two characters (one with a code point of 0xc3, the
other with 0xa9) instead of the one character you expect. If you want
\xc3\xa9 to be treated as a utf8 byte sequence, you need to tell perl
those bytes are utf8 by decoding it. Heck for all we know instead of
it being a utf8 sequence, it could have been a utf16 sequence.

You might argue this is a bug with URI::Escape as I *think* all uri's
will be utf8 encoded. Anyway, I think postgres is doing the right
thing here.

In playing around I did find what I think is a postgres bug. Perl has
2 ways it can store things internally. per perldoc perlunicode:

Using Unicode in XS
... What the "UTF8" flag means is that the sequence of octets in the
representation of the scalar is the sequence of UTF-8 encoded code
points of the characters of a string. The "UTF8" flag being off means
that each octet in this representation encodes a single character with
code point 0..255 within the string.

Postgres always prints whatever the internal representation happens to
be ignoring the UTF8 flag and the server encoding.

# create or replace function chr(i int, i2 int) returns text as $$
return chr($_[0]).chr($_[1]); $$ language plperlu;
CREATE FUNCTION

# show server_encoding;
server_encoding
-----------------
SQL_ASCII

# SELECT length(chr(128, 33));
length
--------
2

# SELECT length(chr(128, 333));
length
--------
4

Grr that should error out with "Invalid server encoding", or worst
case should return a length of 3 (it utf8 encoded 128 into 2 bytes
instead of leaving it as 1). In this case the 333 causes perl store
it internally as utf8.

Now on a utf8 database:

# show server_encoding;
server_encoding
-----------------
UTF8

# SELECT length(chr(128, 33));
ERROR: invalid byte sequence for encoding "UTF8": 0x80
CONTEXT: PL/Perl function "chr"

# SELECT length(chr(128, 333));
CONTEXT: PL/Perl function "chr"
length
--------
2

Same thing here, we just end up using the internal format. In one
case it works in the other it does not. The main point being, most of
the time it *happens* to work. But its really just by chance.

I think what we should do is use SvPVutf8() when we are UTF8 instead
of SvPV in sv2text_mbverified(). SvPV gives us a pointer to a string
in perls current internal format (maybe unicode, maybe a utf8 byte
sequence). While SvPVutf8 will always give us utf8 (may or may not be
valid!) encoded string.

Something like the attached. Thoughts? Im not very happy with the non
utf8 case-- The elog(ERROR, "invalid byte sequence") is a total
cop-out yes. But I did not see a good solution short of hand rolling
our own version of sv_utf8_downgrade(). Is it worth it?

Attachment	Content-Type	Size
plperl_encoding.patch	text/x-patch	1.1 KB

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	Alex Hunsaker <badalex(at)gmail(dot)com>
Cc:	Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: plperlu problem with utf8
Date:	2010-12-17 03:24:46
Message-ID:	C9982425-2453-479A-88FB-D12B6F20839B@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Dec 16, 2010, at 6:39 PM, Alex Hunsaker wrote:

> You might argue this is a bug with URI::Escape as I *think* all uri's
> will be utf8 encoded. Anyway, I think postgres is doing the right
> thing here.

No, URI::Escape is fine. The issue is that if you don't decode text to Perl's internal form, it assumes that it's Latin-1.

> In playing around I did find what I think is a postgres bug. Perl has
> 2 ways it can store things internally. per perldoc perlunicode:
>
> Using Unicode in XS
> ... What the "UTF8" flag means is that the sequence of octets in the
> representation of the scalar is the sequence of UTF-8 encoded code
> points of the characters of a string. The "UTF8" flag being off means
> that each octet in this representation encodes a single character with
> code point 0..255 within the string.
>
> Postgres always prints whatever the internal representation happens to
> be ignoring the UTF8 flag and the server encoding.
>
> # create or replace function chr(i int, i2 int) returns text as $$
> return chr($_[0]).chr($_[1]); $$ language plperlu;
> CREATE FUNCTION
>
> # show server_encoding;
> server_encoding
> -----------------
> SQL_ASCII
>
> # SELECT length(chr(128, 33));
> length
> --------
> 2
>
> # SELECT length(chr(128, 333));
> length
> --------
> 4
>
> Grr that should error out with "Invalid server encoding", or worst
> case should return a length of 3 (it utf8 encoded 128 into 2 bytes
> instead of leaving it as 1). In this case the 333 causes perl store
> it internally as utf8.

Well with SQL_ASCII anything goes, no?

> Now on a utf8 database:
>
> # show server_encoding;
> server_encoding
> -----------------
> UTF8
>
> # SELECT length(chr(128, 33));
> ERROR: invalid byte sequence for encoding "UTF8": 0x80
> CONTEXT: PL/Perl function "chr"
>
> # SELECT length(chr(128, 333));
> CONTEXT: PL/Perl function "chr"
> length
> --------
> 2
>
> Same thing here, we just end up using the internal format. In one
> case it works in the other it does not. The main point being, most of
> the time it *happens* to work. But its really just by chance.
>
> I think what we should do is use SvPVutf8() when we are UTF8 instead
> of SvPV in sv2text_mbverified(). SvPV gives us a pointer to a string
> in perls current internal format (maybe unicode, maybe a utf8 byte
> sequence). While SvPVutf8 will always give us utf8 (may or may not be
> valid!) encoded string.
>
> Something like the attached. Thoughts? Im not very happy with the non
> utf8 case-- The elog(ERROR, "invalid byte sequence") is a total
> cop-out yes. But I did not see a good solution short of hand rolling
> our own version of sv_utf8_downgrade(). Is it worth it?
> <plperl_encoding.patch>

Maybe I'm misunderstanding, but it seems to me that:

* String arguments passed to PL/Perl functions should be decoded from the server encoding to Perl's internal representation before the function actually gets them.

* Values returned from PL/Perl functions that are in Perl's internal representation should be encoded into the server encoding before they're returned.

I didn't really follow all of the above; are you aiming for the same thing?

Best,

David

From:	Alex Hunsaker <badalex(at)gmail(dot)com>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: plperlu problem with utf8
Date:	2010-12-17 04:39:34
Message-ID:	AANLkTimZXAk3RVC7tSXkMtUCR9shYVAC7=uyPnwvQZK+@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Dec 16, 2010 at 20:24, David E. Wheeler <david(at)kineticode(dot)com> wrote:
> On Dec 16, 2010, at 6:39 PM, Alex Hunsaker wrote:
>
>> You might argue this is a bug with URI::Escape as I *think* all uri's
>> will be utf8 encoded. Anyway, I think postgres is doing the right
>> thing here.
>
> No, URI::Escape is fine. The issue is that if you don't decode text to Perl's internal form, it assumes that it's Latin-1.

So... you are saying "\xc3\xa9" eq "\xe9" or chr(233) ?

Im saying they are not, and if you want \xc3\xa9 to be treated as
chr(233) you need to tell perl what encoding the string is in (err
well actually decode it so its in "perl space" as unicode characters
correctly).

> Maybe I'm misunderstanding, but it seems to me that:
>
> * String arguments passed to PL/Perl functions should be decoded from the server encoding to Perl's internal representation before the function actually gets them.

Currently postgres has 2 behaviors:
1) If the database is utf8, turn on the utf8 flag. According to the
perldoc snippet I quoted this should mean its a sequence of utf8 bytes
and should interpret it as such.
2) its not utf8, so we just leave it as octets.

So in "perl space" length($_[0]) returns the number of characters when
you pass in a multibyte char *not* the number of bytes. Which is
correct, so um check we do that. Right?

In the URI::Escape example we have:

# CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar AS $$
use URI::Escape;
warn(length($_[0]));
return uri_unescape($_[0]); $$ LANGUAGE plperlu;

# select url_decode('comment%20passer%20le%20r%C3%A9veillon');
WARNING: 38 at line 2

Ok that length looks right, just for grins lets try add one multibyte char:

# SELECT url_decode('comment%20passer%20le%20r%C3%A9veillon☺');
WARNING: 39 CONTEXT: PL/Perl function "url_decode" at line 2.
url_decode
-------------------------------
comment passer le rÃ©veillon☺
(1 row)

Still right, now lets try the utf8::decode version that "works". Only
lets look at the length of the string we are returning instead of the
one we are passing in:

# CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar AS $$
use URI::Escape;
utf8::decode($_[0]);
my $str = uri_unescape($_[0]);
warn(length($str));
return $str;
$$ LANGUAGE plperlu;

# SELECT url_decode('comment%20passer%20le%20r%C3%A9veillon');
WARNING: 28 at line 5.
CONTEXT: PL/Perl function "url_decode"
url_decode
-----------------------------
comment passer le réveillon
(1 row)

Looks harmless enough...

# SELECT length(url_decode('comment%20passer%20le%20r%C3%A9veillon'));
WARNING: 28 at line 5.
CONTEXT: PL/Perl function "url_decode"
length
--------
27
(1 row)

Wait a minute... those lengths should match.

Post patch they do:
# SELECT length(url_decode('comment%20passer%20le%20r%C3%A9veillon'));
WARNING: 28 at line 5.
CONTEXT: PL/Perl function "url_decode"
length
--------
28
(1 row)

Still confused? Yeah me too. Maybe this will help:

#!/usr/bin/perl
use URI::Escape;
my $str = uri_unescape("%c3%a9");
die "first match" if($str =~ m/\xe9/);
utf8::decode($str);
die "2nd match" if($str =~ m/\xe9/);

gives:
$ perl t.pl
2nd match at t.pl line 6.

see? Either uri_unescape() should be decoding that utf8() or you need
to do it *after* you call uri_unescape(). Hence the maybe it could be
considered a bug in uri_unescape().

> * Values returned from PL/Perl functions that are in Perl's internal representation should be encoded into the server encoding before they're returned.
> I didn't really follow all of the above; are you aiming for the same thing?

Yeah, the patch address this part. Right now we just spit out
whatever the internal format happens to be.

Anyway its all probably clear as mud, this part of perl is one of the
hardest IMO.

From:	David Fetter <david(at)fetter(dot)org>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	Alex Hunsaker <badalex(at)gmail(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: plperlu problem with utf8
Date:	2010-12-17 15:30:52
Message-ID:	20101217153052.GA6259@fetter.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Dec 16, 2010 at 07:24:46PM -0800, David Wheeler wrote:
> On Dec 16, 2010, at 6:39 PM, Alex Hunsaker wrote:
>
> > Grr that should error out with "Invalid server encoding", or worst
> > case should return a length of 3 (it utf8 encoded 128 into 2 bytes
> > instead of leaving it as 1). In this case the 333 causes perl
> > store it internally as utf8.
>
> Well with SQL_ASCII anything goes, no?

Anything except a byte that's all 0s :(

Cheers,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

From:	Alex Hunsaker <badalex(at)gmail(dot)com>
To:	David Fetter <david(at)fetter(dot)org>
Cc:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: plperlu problem with utf8
Date:	2010-12-17 18:51:14
Message-ID:	AANLkTinVt6N42==UkVrhJ0rjza__LSwC4g027cd-damD@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Dec 17, 2010 at 08:30, David Fetter <david(at)fetter(dot)org> wrote:
> On Thu, Dec 16, 2010 at 07:24:46PM -0800, David Wheeler wrote:
>> On Dec 16, 2010, at 6:39 PM, Alex Hunsaker wrote:
>>
>> > Grr that should error out with "Invalid server encoding", or worst
>> > case should return a length of 3 (it utf8 encoded 128 into 2 bytes
>> > instead of leaving it as 1). In this case the 333 causes perl
>> > store it internally as utf8.
>>
>> Well with SQL_ASCII anything goes, no?
>
> Anything except a byte that's all 0s :(

Keyword being byte, right? Last time I checked 333 wont fit in a byte
:P. In this case perl stores "333" as the utf8 that represents the
unicode code point 333. Postgres uses whatever that internal
representation is, so in our SQL_ASCII database we actually end up
getting back utf8 which _is_ valid SQL_ASCII, but I wouldn't call it
"right". The behavior im aiming for is similar to what the built-in
chr does:
# SELECT chr(333);
ERROR: requested character too large for encoding: 333

Also note this is just a simple test case, perl *could* elect to store
completely ascii strings internally as utf8. In those cases we still
"do the wrong thing", that is we get back utf8ified bytes :( Although
obviously from the lack of bug reports its quite uncommon.

From:	Alex Hunsaker <badalex(at)gmail(dot)com>
To:	David Fetter <david(at)fetter(dot)org>
Cc:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: plperlu problem with utf8
Date:	2010-12-17 19:07:11
Message-ID:	AANLkTi=tkJxB5Ry=gT5ATbjqRgu7YCqYFFc0S7gY4YWL@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Dec 17, 2010 at 11:51, Alex Hunsaker <badalex(at)gmail(dot)com> wrote:
> Also note this is just a simple test case, perl *could* elect to store
> completely ascii strings internally as utf8. In those cases we still

Erm... not ascii I mean bytes >127

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	Alex Hunsaker <badalex(at)gmail(dot)com>
Cc:	Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: plperlu problem with utf8
Date:	2010-12-18 01:04:47
Message-ID:	ACAB08D1-A956-44F5-8200-36EB0404DD6A@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Dec 16, 2010, at 8:39 PM, Alex Hunsaker wrote:

>> No, URI::Escape is fine. The issue is that if you don't decode text to Perl's internal form, it assumes that it's Latin-1.
>
> So... you are saying "\xc3\xa9" eq "\xe9" or chr(233) ?

Not knowing what those mean, I'm not saying either one, to my knowledge. What I understand, however, is that Perl, given a scalar with bytes in it, will treat it as latin-1 unless the utf8 flag is turned on.

> Im saying they are not, and if you want \xc3\xa9 to be treated as
> chr(233) you need to tell perl what encoding the string is in (err
> well actually decode it so its in "perl space" as unicode characters
> correctly).

PostgreSQL should do everything it can to decode to Perl's internal format before passing arguments, and to decode from Perl's internal format on output.

>> Maybe I'm misunderstanding, but it seems to me that:
>>
>> * String arguments passed to PL/Perl functions should be decoded from the server encoding to Perl's internal representation before the function actually gets them.
>
> Currently postgres has 2 behaviors:
> 1) If the database is utf8, turn on the utf8 flag. According to the
> perldoc snippet I quoted this should mean its a sequence of utf8 bytes
> and should interpret it as such.

Well that works for me. I always use UTF8. Oleg, what was the encoding of your database where you saw the issue?

> 2) its not utf8, so we just leave it as octets.

Which mean's Perl will assume that it's Latin-1, IIUC.

> So in "perl space" length($_[0]) returns the number of characters when
> you pass in a multibyte char *not* the number of bytes. Which is
> correct, so um check we do that. Right?

Yeah. So I just wrote and tested this function on 9.0 with Perl 5.12.2:

CREATE OR REPLACE FUNCTION perlgets(
TEXT
) RETURNS TABLE(length INT, is_utf8 BOOL) LANGUAGE plperl AS $$
my $text = shift;
return_next {
length => length $text,
is_utf8 => utf8::is_utf8($text) ? 1 : 0
};
$$;

In a utf-8 database:

utf8=# select * from perlgets('foo');
length │ is_utf8
────────┼─────────
8 │ t
(1 row)

In a latin-1 database:

latin=# select * from perlgets('foo');
length │ is_utf8
────────┼─────────
8 │ f
(1 row)

I would argue that in the latter case, is_utf8 should be true, too. That is, PL/Perl should decode from Latin-1 to Perl's internal form.

Interestingly, when I created a function that takes a bytea argument, utf8 was *still* enabled in the utf-8 database. That doesn't seem right to me.

> In the URI::Escape example we have:
>
> # CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar AS $$
> use URI::Escape;
> warn(length($_[0]));
> return uri_unescape($_[0]); $$ LANGUAGE plperlu;
>
> # select url_decode('comment%20passer%20le%20r%C3%A9veillon');
> WARNING: 38 at line 2

What's the output? And what's the encoding of the database?

> Ok that length looks right, just for grins lets try add one multibyte char:
>
> # SELECT url_decode('comment%20passer%20le%20r%C3%A9veillon☺');
> WARNING: 39 CONTEXT: PL/Perl function "url_decode" at line 2.
> url_decode
> -------------------------------
> comment passer le rÃ©veillon☺
> (1 row)
>
> Still right,

The length is right, but the é is wrong. It looks like Perl thinks it's latin-1. Or, rather, unescape_uri() dosn't know that it should be returning utf-8 characters. That *might* be a bug in URI::Escape.

> now lets try the utf8::decode version that "works". Only
> lets look at the length of the string we are returning instead of the
> one we are passing in:
>
> # CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar AS $$
> use URI::Escape;
> utf8::decode($_[0]);
> my $str = uri_unescape($_[0]);
> warn(length($str));
> return $str;
> $$ LANGUAGE plperlu;
>
> # SELECT url_decode('comment%20passer%20le%20r%C3%A9veillon');
> WARNING: 28 at line 5.
> CONTEXT: PL/Perl function "url_decode"
> url_decode
> -----------------------------
> comment passer le réveillon
> (1 row)
>
> Looks harmless enough...

Looks far better, in fact. Interesting that URI::Escape does the right thing only if the utf8 flag has been turned on in the string passed to it. But in Perl it usually won't be, because the encoded string should generally have only ASCII characters.

> # SELECT length(url_decode('comment%20passer%20le%20r%C3%A9veillon'));
> WARNING: 28 at line 5.
> CONTEXT: PL/Perl function "url_decode"
> length
> --------
> 27
> (1 row)
>
> Wait a minute... those lengths should match.
>
> Post patch they do:
> # SELECT length(url_decode('comment%20passer%20le%20r%C3%A9veillon'));
> WARNING: 28 at line 5.
> CONTEXT: PL/Perl function "url_decode"
> length
> --------
> 28
> (1 row)
>
> Still confused? Yeah me too.

Yeah…

> Maybe this will help:
>
> #!/usr/bin/perl
> use URI::Escape;
> my $str = uri_unescape("%c3%a9");
> die "first match" if($str =~ m/\xe9/);
> utf8::decode($str);
> die "2nd match" if($str =~ m/\xe9/);
>
> gives:
> $ perl t.pl
> 2nd match at t.pl line 6.
>
> see? Either uri_unescape() should be decoding that utf8() or you need
> to do it *after* you call uri_unescape(). Hence the maybe it could be
> considered a bug in uri_unescape().

Agreed.

>> * Values returned from PL/Perl functions that are in Perl's internal representation should be encoded into the server encoding before they're returned.
>> I didn't really follow all of the above; are you aiming for the same thing?
>
> Yeah, the patch address this part. Right now we just spit out
> whatever the internal format happens to be.

Ah, excellent.

> Anyway its all probably clear as mud, this part of perl is one of the
> hardest IMO.

No question.

Best,

David

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc:	Alex Hunsaker <badalex(at)gmail(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Andrew Dunstan <andrew(at)dunslane(dot)net>
Subject:	Re: plperlu problem with utf8
Date:	2010-12-18 01:22:16
Message-ID:	B248634C-B273-4871-86F6-E809E9E4D41B@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Dec 17, 2010, at 5:04 PM, David E. Wheeler wrote:

>> see? Either uri_unescape() should be decoding that utf8() or you need
>> to do it *after* you call uri_unescape(). Hence the maybe it could be
>> considered a bug in uri_unescape().
>
> Agreed.

On second thought, no. You can in fact encode anything in a URI. URI::Escape can't know what to decode to. So *I believe* it just unescapes the raw bytes. It might be handy for it to have a new function, though, to complement its uri_escape_utf() function:

sub uri_unescape_utf8 { Encode::decode_utf8(uri_unescape(@_)) }

Just to make things a bit clearer.

But that's a separate issue from the, erm, inconsistency with which PL/Perl treats encoding and decoding of its inputs and outputs.

Best,

David

From:	David Christensen <david(at)endpoint(dot)com>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	Alex Hunsaker <badalex(at)gmail(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: plperlu problem with utf8
Date:	2010-12-18 05:32:24
Message-ID:	AB507853-2325-4B0A-AEE1-A989436BF021@endpoint.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Dec 17, 2010, at 7:04 PM, David E. Wheeler wrote:

> On Dec 16, 2010, at 8:39 PM, Alex Hunsaker wrote:
>
>>> No, URI::Escape is fine. The issue is that if you don't decode text to Perl's internal form, it assumes that it's Latin-1.
>>
>> So... you are saying "\xc3\xa9" eq "\xe9" or chr(233) ?
>
> Not knowing what those mean, I'm not saying either one, to my knowledge. What I understand, however, is that Perl, given a scalar with bytes in it, will treat it as latin-1 unless the utf8 flag is turned on.

This is a correct assertion as to Perl's behavior. As far as PostgreSQL is/should be concerned in this case, this is the correct handling for URI::Escape, as the input string to the function was all ASCII (= valid UTF-8), so the function author would need to be responsible for the proper conversion to the internal encoding. This would just be a simple decode_utf8() call in the case that you've URI-escaped UTF-8-encoded unicode, however since URI escaping is only defined for octets, the meaning of those unescaped octets is detached from their encoding. There are similar issues with using other modules which traditionally have not distinguished between characters and bytes (as an examples Digest::MD5); Digest::MD5 does not work on wide characters, as the algorithm only deals with octets, so you need to pick a target encoding for wide characters and encode the octets themselves rather than the characters.

>> Im saying they are not, and if you want \xc3\xa9 to be treated as
>> chr(233) you need to tell perl what encoding the string is in (err
>> well actually decode it so its in "perl space" as unicode characters
>> correctly).
>
> PostgreSQL should do everything it can to decode to Perl's internal format before passing arguments, and to decode from Perl's internal format on output.

+1 on the original sentiment, but only for the case that we're dealing with data that is passed in/out as arguments. In the case that the server_encoding is UTF-8, this is as trivial as a few macros on the underlying SVs for text-like types. If the server_encoding is SQL_ASCII (= byte soup), this is a trivial case of doing nothing with the conversion regardless of data type. For any other server_encoding, the data would need to be converted from the server_encoding to UTF-8, presumably using the built-in conversions before passing it off to the first code path. A similar handling would need to be done for the return values, again datatype-dependent.

This certainly seems like it could be inefficient in the case that we're using a non-utf8 server_encoding (there are people who do?? :-P), however from the standpoint of correctness, any plperl function that deals with this data as characters (using common things like regexes, length, ord, chr, substring, \w metachars) will have the potential of operating incorrectly when provided with data that is in a different encoding than perl's internal format. One thought I had was that we could expose the server_encoding to the plperl interpreters in a special variable to make it easy to explicitly decode if we needed, but the problems with this are: a) there's no guarantee that Encode.pm will have a alias/support for the specific server_encoding name as provided by Pg, and b) in the case of plperl (i.e., not u), there are horrendous issues when trying to deal with Safe.pm and character encoding. Recent upgrades of the Encode module included with perl 5.10+ have caused issues wherein circular dependencies between Encode and Encode::Alias have made it impossible to load in a Safe container without major pain. (There may be some better options than I'd had on a previous project, given that we're embedding our own interpreters and accessing more through the XS guts, so I'm not ruling out this possibility completely).

Perhaps we could set a function attribute or similar which indicated that we wanted to decode the input properly on input, whether or not this should be the default, or at the very least expose a function to the plperl[u]? runtime that would decode/upgrade on demand without the caller of the function needing to know the encoding of the database it is running in. This would solve issue a), because any supported server_encoding would have an internal conversion to utf8, and solves b) because we're avoiding the conversion from inside Safe and simply running our XS function on the input data. (As much as I hate the ugliness of it, if we decide the decoding behavior shouldn't be the default we could even use one of those ugly function pragmas in the function bodies.)

>>> Maybe I'm misunderstanding, but it seems to me that:
>>>
>>> * String arguments passed to PL/Perl functions should be decoded from the server encoding to Perl's internal representation before the function actually gets them.
>>
>> Currently postgres has 2 behaviors:
>> 1) If the database is utf8, turn on the utf8 flag. According to the
>> perldoc snippet I quoted this should mean its a sequence of utf8 bytes
>> and should interpret it as such.
>
> Well that works for me. I always use UTF8. Oleg, what was the encoding of your database where you saw the issue?

I'm not sure what the current plperl runtime does as far as marshaling this, but it would be fairly easy to ensure the parameters came in in perl's internal format given a server_encoding of UTF8 and some type introspection to identify the string-like types/text data. (Perhaps any type which had a binary cast to text would be a sufficient definition here. Do domains automatically inherit binary casts from their originating types?)

>> 2) its not utf8, so we just leave it as octets.
>
> Which mean's Perl will assume that it's Latin-1, IIUC.

This is sub-optimal for non-UTF-8-encoded databases, for reasons I pointed out earlier. This would produce bogus results for any non-UTF-8, non-ASCII, non latin-1 encoding, even if it did not generally bite most people in general usage.

>> So in "perl space" length($_[0]) returns the number of characters when
>> you pass in a multibyte char *not* the number of bytes. Which is
>> correct, so um check we do that. Right?
>
> Yeah. So I just wrote and tested this function on 9.0 with Perl 5.12.2:
>
> CREATE OR REPLACE FUNCTION perlgets(
> TEXT
> ) RETURNS TABLE(length INT, is_utf8 BOOL) LANGUAGE plperl AS $$
> my $text = shift;
> return_next {
> length => length $text,
> is_utf8 => utf8::is_utf8($text) ? 1 : 0
> };
> $$;
>
> In a utf-8 database:
>
> utf8=# select * from perlgets('foo');
> length │ is_utf8
> ────────┼─────────
> 8 │ t
> (1 row)

This example seems bogus; wouldn't length be 3 if this is the example text this was run with? Additionally, since all ASCII is trivially UTF-8, I think a better example would be using a string with hi-bit characters so if this was improperly handled the lengths wouldn't match; length($all_ascii) == length(encode_utf8($all_ascii)) vs length($hi_bit) < length(encode_utf8($hi_bit)). I don't see that this test shows us much with the test case as given. The is_utf8() function merely returns the state of the SV_utf8 flag, which doesn't speak to UTF-8 validity (i.e., this need not be set on ascii-only strings, which are still valid in the UTF-8 encoding), nor does it indicate that there are no hi-bit characters in the string (i.e., with encode_utf8($hi_bit_string)), the source string $hi_bit_string (in perl's internal format) with hi-bit characters will have the utf8 flag set, but the return value of encode_utf8 will not, even though the underlying data, as represented in perl will be identical).

> In a latin-1 database:
>
> latin=# select * from perlgets('foo');
> length │ is_utf8
> ────────┼─────────
> 8 │ f
> (1 row)
>
> I would argue that in the latter case, is_utf8 should be true, too. That is, PL/Perl should decode from Latin-1 to Perl's internal form.

See above for discussion of the is_utf8 flag; if we're dealing with latin-1 data or (more precisely in this case) data that has not been decoded from the server_encoding to perl's internal format, this would exactly be the expectation for the state of that flag.

> Interestingly, when I created a function that takes a bytea argument, utf8 was *still* enabled in the utf-8 database. That doesn't seem right to me.

I'm not sure what you mean here, but I do think that if bytea is identifiable as one of the input types, we should do no encoding on the data itself, which would indicate that the utf8 flag for that variable would be unset. If this is not currently handled this way, I'd be a bit surprised, as bytea should just be an array of bytes with no character semantics attached to it.

>> In the URI::Escape example we have:
>>
>> # CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar AS $$
>> use URI::Escape;
>> warn(length($_[0]));
>> return uri_unescape($_[0]); $$ LANGUAGE plperlu;
>>
>> # select url_decode('comment%20passer%20le%20r%C3%A9veillon');
>> WARNING: 38 at line 2
>
> What's the output? And what's the encoding of the database?
>
>> Ok that length looks right, just for grins lets try add one multibyte char:
>>
>> # SELECT url_decode('comment%20passer%20le%20r%C3%A9veillon☺');
>> WARNING: 39 CONTEXT: PL/Perl function "url_decode" at line 2.
>> url_decode
>> -------------------------------
>> comment passer le rÃ©veillon☺
>> (1 row)
>>
>> Still right,
>
> The length is right, but the é is wrong. It looks like Perl thinks it's latin-1. Or, rather, unescape_uri() dosn't know that it should be returning utf-8 characters. That *might* be a bug in URI::Escape.

I think this has been addressed by others in previous emails.

>> now lets try the utf8::decode version that "works". Only
>> lets look at the length of the string we are returning instead of the
>> one we are passing in:
>>
>> # CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar AS $$
>> use URI::Escape;
>> utf8::decode($_[0]);
>> my $str = uri_unescape($_[0]);
>> warn(length($str));
>> return $str;
>> $$ LANGUAGE plperlu;
>>
>> # SELECT url_decode('comment%20passer%20le%20r%C3%A9veillon');
>> WARNING: 28 at line 5.
>> CONTEXT: PL/Perl function "url_decode"
>> url_decode
>> -----------------------------
>> comment passer le réveillon
>> (1 row)
>>
>> Looks harmless enough...
>
> Looks far better, in fact. Interesting that URI::Escape does the right thing only if the utf8 flag has been turned on in the string passed to it. But in Perl it usually won't be, because the encoded string should generally have only ASCII characters.

I think you'll find that this "correct display" is actually an artifact of your terminal type being set to a UTF-8 compatible encoding and interpreting the raw output as the UTF-8 sequence in its output display; that returned count is actually the number of octets, compare:

$ perl -MURI::Escape -e'print length(uri_unescape(q{comment%20passer%20le%20r%C3%A9veillon}))'
28

$ perl -MEncode -MURI::Escape -e'print length(decode_utf8(uri_unescape(q{comment%20passer%20le%20r%C3%A9veillon})))'
27

>> # SELECT length(url_decode('comment%20passer%20le%20r%C3%A9veillon'));
>> WARNING: 28 at line 5.
>> CONTEXT: PL/Perl function "url_decode"
>> length
>> --------
>> 27
>> (1 row)
>>
>> Wait a minute... those lengths should match.
>>
>> Post patch they do:
>> # SELECT length(url_decode('comment%20passer%20le%20r%C3%A9veillon'));
>> WARNING: 28 at line 5.
>> CONTEXT: PL/Perl function "url_decode"
>> length
>> --------
>> 28
>> (1 row)
>>
>> Still confused? Yeah me too.
>
> Yeah…

As shown above, the character length for the example should be 27, while the octet length for the UTF-8 encoded version is 28. I've reviewed the source of URI::Escape, and can say definitively that: a) regular uri_escape does not handle > 255 code points in the encoding, but there exists a uri_escape_utf8 which will convert the source string to UTF8 first and then escape the encoded value, and b) uri_unescape has *no* logic in it to automatically decode from UTF8 into perl's internal format (at least as far as the version that I'm looking at, which came with 5.10.1).

>> Maybe this will help:
>>
>> #!/usr/bin/perl
>> use URI::Escape;
>> my $str = uri_unescape("%c3%a9");
>> die "first match" if($str =~ m/\xe9/);
>> utf8::decode($str);
>> die "2nd match" if($str =~ m/\xe9/);
>>
>> gives:
>> $ perl t.pl
>> 2nd match at t.pl line 6.
>>
>> see? Either uri_unescape() should be decoding that utf8() or you need
>> to do it *after* you call uri_unescape(). Hence the maybe it could be
>> considered a bug in uri_unescape().
>
> Agreed.

-1; if you need to decode from an octets-only encoding, it's your responsibility to do so after you've unescaped it. Perhaps later versions of the URI::Escape module contain a uri_unescape_utf8() function, but it's trivially: sub uri_unescape_utf8 { Encode::decode_utf8(uri_unescape(shift))}. This is definitely not a bug in uri_escape, as it is only defined to return octets.

>>> * Values returned from PL/Perl functions that are in Perl's internal representation should be encoded into the server encoding before they're returned.
>>> I didn't really follow all of the above; are you aiming for the same thing?
>>
>> Yeah, the patch address this part. Right now we just spit out
>> whatever the internal format happens to be.
>
> Ah, excellent.

I agree with the sentiments that: data (server_encoding) -> function parameters (-> perl internal) -> function return (-> server_encoding). This should be for any character-type data insofar as it is feasible, but ISTR there is already datatype-specific marshaling occurring.

>> Anyway its all probably clear as mud, this part of perl is one of the
>> hardest IMO.
>
> No question.

There is definitely a lot of confusion surrounding perl's handling of character data; I hope this was able to clear a few things up.

Regards,

David
--
David Christensen
End Point Corporation
david(at)endpoint(dot)com

From:	Alex Hunsaker <badalex(at)gmail(dot)com>
To:	David Christensen <david(at)endpoint(dot)com>
Cc:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: plperlu problem with utf8
Date:	2010-12-18 06:43:48
Message-ID:	AANLkTinch0U5CE5B8pNsSVu5bh7eOynsjKpugG4sfG92@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Dec 17, 2010 at 22:32, David Christensen <david(at)endpoint(dot)com> wrote:
>
> On Dec 17, 2010, at 7:04 PM, David E. Wheeler wrote:
>
>> On Dec 16, 2010, at 8:39 PM, Alex Hunsaker wrote:
>>
>>>> No, URI::Escape is fine. The issue is that if you don't decode text to Perl's internal form, it assumes that it's Latin-1.
>>>
>>> So... you are saying "\xc3\xa9" eq "\xe9" or chr(233) ?
>>
>> Not knowing what those mean, I'm not saying either one, to my knowledge. What I understand, however, is that Perl, given a scalar with bytes in it, will treat it as latin-1 unless the utf8 flag is turned on.
>
> This is a correct assertion as to Perl's behavior. As far as PostgreSQL is/should be concerned in this case, this is the correct handling for URI::Escape,

Right, so no postgres bug here.. Postgres showing Ã© instead of é is
right as far as its concerned.

>> PostgreSQL should do everything it can to decode to Perl's internal format before passing arguments, and to decode from Perl's internal format on output.
>
> +1 on the original sentiment, but only for the case that we're dealing with data that is passed in/out as arguments. In the case that the server_encoding is UTF-8, this is as trivial as a few macros on the underlying SVs for text-like types. If the server_encoding is SQL_ASCII (= byte soup), this is a trivial case of doing nothing with the conversion regardless of data type.

Right and thats what we do for the above. Minus some mis-handling of
non character datatypes like bytea in the UTF-8 case.

> For any other server_encoding, the data would need to be converted from the server_encoding to UTF-8, presumably using the built-in conversions before passing it off to the first code path. A similar handling would need to be done for the return values, again datatype-dependent.

Yeah, thats what we *should* do. Right now we just leave it as byte
soup for the user to decode/encode. :(

> [ correctness of perl character ops in the non utf8 case] One thought I had was that we could expose the server_encoding to the plperl interpreters in a special variable to make it easy to explicitly decode...

Should not need to do anything as complicated as that. Can just encode
the string to utf8 before we hand it off to perl.

[...]
> $ perl -MURI::Escape -e'print length(uri_unescape(q{comment%20passer%20le%20r%C3%A9veillon}))'
> 28
>
> $ perl -MEncode -MURI::Escape -e'print length(decode_utf8(uri_unescape(q{comment%20passer%20le%20r%C3%A9veillon})))'
> 27
[...]
> As shown above, the character length for the example should be 27, while the octet length for the UTF-8 encoded version is 28. I've reviewed the source of URI::Escape, and can say definitively that: a) regular uri_escape does not handle > 255 code points in the encoding, but there exists a uri_escape_utf8 which will convert the source string to UTF8 first and then escape the encoded value, and

And why should it? properly escaped URIs should have all those
escaped, I imagine. Anyway not really relevant for postgres.

> b) uri_unescape has *no* logic in it to automatically decode from UTF8 into perl's internal format (at least as far as the version that I'm looking at, which came with 5.10.1).

>>> Either uri_unescape() should be decoding that utf8() or you need
>>> to do it *after* you call uri_unescape(). Hence the maybe it could be
>>> considered a bug in uri_unescape().
>>
>> Agreed.
>
> -1; if you need to decode from an octets-only encoding, it's your responsibility to do so after you've unescaped it.

-1? thats basically what I said: "... you need to do it (decode the
utf8) *after* you call uri_unescape"

> Perhaps later versions of the URI::Escape module contain a uri_unescape_utf8() function, but it's trivially: sub uri_unescape_utf8 { Encode::decode_utf8(uri_unescape(shift))}. This is definitely not a bug in uri_escape, as it is only defined to return octets.

Ahh So -1 because I said maybe you could call it a bug in
uri_unescape(). Really, I was only saying you *might* be able to
consider it a bug-- or perhaps deficiency is a better word, in
uri_unescape iff URI's are defined to have escaped characters as a %
escaped utf8 sequence. I dont know that they do, so I don't know if
its a bug :)

From:	Alex Hunsaker <badalex(at)gmail(dot)com>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Andrew Dunstan <andrew(at)dunslane(dot)net>
Subject:	Re: plperlu problem with utf8
Date:	2010-12-18 06:46:54
Message-ID:	AANLkTi=AgYM4TFWNoW3FYmkEbVcqefm5EPEtN+_=8s1F@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Dec 17, 2010 at 18:22, David E. Wheeler <david(at)kineticode(dot)com> wrote:
> On Dec 17, 2010, at 5:04 PM, David E. Wheeler wrote:
>
>>> see? Either uri_unescape() should be decoding that utf8() or you need
>>> to do it *after* you call uri_unescape(). Hence the maybe it could be
>>> considered a bug in uri_unescape().
>>
>> Agreed.
>
> On second thought, no. You can in fact encode anything in a URI. URI::Escape can't know what to decode to. So *I believe* it just unescapes the raw bytes. It might be handy for it to have a new function, though, to complement its uri_escape_utf() function:
>
> sub uri_unescape_utf8 { Encode::decode_utf8(uri_unescape(@_)) }
>
> Just to make things a bit clearer.
>
> But that's a separate issue from the, erm, inconsistency with which PL/Perl treats encoding and decoding of its inputs and outputs.

Yay! So I think we can finally agree that for Oleg's original test
case postgres was getting right. I hope ? :)

From:	Alex Hunsaker <badalex(at)gmail(dot)com>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>, David Christensen <david(at)endpoint(dot)com>
Subject:	Re: plperlu problem with utf8
Date:	2010-12-18 06:53:36
Message-ID:	AANLkTinJTePYyVgFwj37dGL_b0xH8RBB4V3-cKzCemjN@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Dec 17, 2010 at 18:04, David E. Wheeler <david(at)kineticode(dot)com> wrote:
> On Dec 16, 2010, at 8:39 PM, Alex Hunsaker wrote:

> Yeah. So I just wrote and tested this function on 9.0 with Perl 5.12.2:
>
> CREATE OR REPLACE FUNCTION perlgets(
> TEXT
> ) RETURNS TABLE(length INT, is_utf8 BOOL) LANGUAGE plperl AS $$
> my $text = shift;
> return_next {
> length => length $text,
> is_utf8 => utf8::is_utf8($text) ? 1 : 0
> };
> $$;
>
> In a utf-8 database:
>
> utf8=# select * from perlgets('foo');
> length │ is_utf8
> ────────┼─────────
> 8 │ t
> (1 row)
>
>
> In a latin-1 database:
>
> latin=# select * from perlgets('foo');
> length │ is_utf8
> ────────┼─────────
> 8 │ f
> (1 row)
>
> I would argue that in the latter case, is_utf8 should be true, too. That is, PL/Perl should decode from Latin-1 to Perl's internal form.

Just to reiterate in a different way what David C. said, the flag is
irrelevant in this case. Begin set on that input string is the same as
it not being set.

per perldoc perlunicode:
The "UTF8" flag being on does not mean that there are any characters
of code points greater than 255 (or 127) in the scalar or that there
are even any characters in the scalar. What the "UTF8" flag means is
that the sequence of octets in the representation of the scalar is the
sequence of UTF-8 encoded code points of the characters of a string.
The "UTF8" flag being off means that each octet in this representation
encodes a single character with code point 0..255 within the string.

Basically perl has *2* internal forms and certain strings can be
represented in both.

> Interestingly, when I created a function that takes a bytea argument, utf8 was *still* enabled in the utf-8 database. That doesn't seem right to me.

Hrm, yeah that seems bogus. Ill have to play with that more.

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	Alex Hunsaker <badalex(at)gmail(dot)com>
Cc:	Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Andrew Dunstan <andrew(at)dunslane(dot)net>
Subject:	Re: plperlu problem with utf8
Date:	2010-12-19 01:22:03
Message-ID:	EEECFAAA-18DF-4303-A4A1-E2E3DF6A7CF0@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Dec 17, 2010, at 10:46 PM, Alex Hunsaker wrote:

>> But that's a separate issue from the, erm, inconsistency with which PL/Perl treats encoding and decoding of its inputs and outputs.
>
> Yay! So I think we can finally agree that for Oleg's original test
> case postgres was getting right. I hope ? :)

Yes, and so was Perl.

David

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	David Christensen <david(at)endpoint(dot)com>
Cc:	Alex Hunsaker <badalex(at)gmail(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: plperlu problem with utf8
Date:	2010-12-19 03:29:34
Message-ID:	BBD0B93C-7B24-4C36-A44E-4863EC98CE6A@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Dec 17, 2010, at 9:32 PM, David Christensen wrote:

> +1 on the original sentiment, but only for the case that we're dealing with data that is passed in/out as arguments. In the case that the server_encoding is UTF-8, this is as trivial as a few macros on the underlying SVs for text-like types. If the server_encoding is SQL_ASCII (= byte soup), this is a trivial case of doing nothing with the conversion regardless of data type. For any other server_encoding, the data would need to be converted from the server_encoding to UTF-8, presumably using the built-in conversions before passing it off to the first code path. A similar handling would need to be done for the return values, again datatype-dependent.

> Recent upgrades of the Encode module included with perl 5.10+ have caused issues wherein circular dependencies between Encode and Encode::Alias have made it impossible to load in a Safe container without major pain. (There may be some better options than I'd had on a previous project, given that we're embedding our own interpreters and accessing more through the XS guts, so I'm not ruling out this possibility completely).

Fortunately, thanks to Tim Bunce, PL/Perl no longer relies on Safe.pm.

>> Well that works for me. I always use UTF8. Oleg, what was the encoding of your database where you saw the issue?
>
> I'm not sure what the current plperl runtime does as far as marshaling this, but it would be fairly easy to ensure the parameters came in in perl's internal format given a server_encoding of UTF8 and some type introspection to identify the string-like types/text data. (Perhaps any type which had a binary cast to text would be a sufficient definition here. Do domains automatically inherit binary casts from their originating types?)

Their labels are TEXT. I believe that the only type that should not be treated as text is bytea.

>>> 2) its not utf8, so we just leave it as octets.
>>
>> Which mean's Perl will assume that it's Latin-1, IIUC.
>
> This is sub-optimal for non-UTF-8-encoded databases, for reasons I pointed out earlier. This would produce bogus results for any non-UTF-8, non-ASCII, non latin-1 encoding, even if it did not generally bite most people in general usage.

Agreed.

> This example seems bogus; wouldn't length be 3 if this is the example text this was run with? Additionally, since all ASCII is trivially UTF-8, I think a better example would be using a string with hi-bit characters so if this was improperly handled the lengths wouldn't match; length($all_ascii) == length(encode_utf8($all_ascii)) vs length($hi_bit) < length(encode_utf8($hi_bit)). I don't see that this test shows us much with the test case as given. The is_utf8() function merely returns the state of the SV_utf8 flag, which doesn't speak to UTF-8 validity (i.e., this need not be set on ascii-only strings, which are still valid in the UTF-8 encoding), nor does it indicate that there are no hi-bit characters in the string (i.e., with encode_utf8($hi_bit_string)), the source string $hi_bit_string (in perl's internal format) with hi-bit characters will have the utf8 flag set, but the return value of encode_utf8 will not, even though the underlying data, as represented in perl will be identical).

Sorry, I probably had a pasto there. how about this?

utf8=# SELECT * FROM perlgets('“hello”');
length │ is_utf8
────────┼─────────
7 │ t

latin=# SELECT * FROM perlgets('“hello”');
length │ is_utf8
────────┼─────────
11 │ f

(Yes I used Latin-1 curly quotes in that last example). I would argue that it should output the same as the first example. That is, PL/Perl should have decoded the latin-1 before passing the text to the Perl function.

>
>> In a latin-1 database:
>>
>> latin=# select * from perlgets('foo');
>> length │ is_utf8
>> ────────┼─────────
>> 8 │ f
>> (1 row)
>>
>> I would argue that in the latter case, is_utf8 should be true, too. That is, PL/Perl should decode from Latin-1 to Perl's internal form.
>
> See above for discussion of the is_utf8 flag; if we're dealing with latin-1 data or (more precisely in this case) data that has not been decoded from the server_encoding to perl's internal format, this would exactly be the expectation for the state of that flag.

Right. I think that it *should* be decoded.

>> Interestingly, when I created a function that takes a bytea argument, utf8 was *still* enabled in the utf-8 database. That doesn't seem right to me.
>
> I'm not sure what you mean here, but I do think that if bytea is identifiable as one of the input types, we should do no encoding on the data itself, which would indicate that the utf8 flag for that variable would be unset.

Right.

> If this is not currently handled this way, I'd be a bit surprised, as bytea should just be an array of bytes with no character semantics attached to it.

It looks as though it is not handled that way. The utf8 flag *is* set on a bytea string passed to a PL/Perl function in a UTF-8 database.

> As shown above, the character length for the example should be 27, while the octet length for the UTF-8 encoded version is 28. I've reviewed the source of URI::Escape, and can say definitively that: a) regular uri_escape does not handle > 255 code points in the encoding, but there exists a uri_escape_utf8 which will convert the source string to UTF8 first and then escape the encoded value, and b) uri_unescape has *no* logic in it to automatically decode from UTF8 into perl's internal format (at least as far as the version that I'm looking at, which came with 5.10.1).

Right.

> -1; if you need to decode from an octets-only encoding, it's your responsibility to do so after you've unescaped it. Perhaps later versions of the URI::Escape module contain a uri_unescape_utf8() function, but it's trivially: sub uri_unescape_utf8 { Encode::decode_utf8(uri_unescape(shift))}. This is definitely not a bug in uri_escape, as it is only defined to return octets.

Right, I think we're agreed on that count. I wouldn't mind seeing a uri_unescape_utf8() though, as it might prevent some confusion.

>>> Yeah, the patch address this part. Right now we just spit out
>>> whatever the internal format happens to be.
>>
>> Ah, excellent.
>
> I agree with the sentiments that: data (server_encoding) -> function parameters (-> perl internal) -> function return (-> server_encoding). This should be for any character-type data insofar as it is feasible, but ISTR there is already datatype-specific marshaling occurring.

Dunno about that.

> There is definitely a lot of confusion surrounding perl's handling of character data; I hope this was able to clear a few things up.

Yes, it helped, thanks!

David

From:	Alex Hunsaker <badalex(at)gmail(dot)com>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	David Christensen <david(at)endpoint(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: plperlu problem with utf8
Date:	2010-12-19 07:16:01
Message-ID:	AANLkTi=EPzHCxdfM_nSx8s0Vcdw7dcwf7dZnxFRG1dzt@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, Dec 18, 2010 at 20:29, David E. Wheeler <david(at)kineticode(dot)com> wrote:
> On Dec 17, 2010, at 9:32 PM, David Christensen wrote:
> latin=# SELECT * FROM perlgets('“hello”');
> length │ is_utf8
> ────────┼─────────
> 11 │ f
>
> (Yes I used Latin-1 curly quotes in that last example).

Erm, latin1 does not have curly quotes, Windows-1252 does. Those are
utf8 quotes AFAICT so 11 is actually right (thats 3 bytes per quote so
that where 11 comes from). If latin1 did have quotes and you used
them you would have gotten the same answer as latin1 is a single byte
encoding. :) I think your terminal tripping you up here.

Postgres also gives the same length:
latin1=# select length('“hello”');
length
--------
11
(1 row)

From:	Alex Hunsaker <badalex(at)gmail(dot)com>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	David Christensen <david(at)endpoint(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: plperlu problem with utf8
Date:	2010-12-19 08:20:27
Message-ID:	AANLkTi=ODW=o2R7i2NOfj0tv6dqGCAsBcKesAiWYWMw4@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, Dec 18, 2010 at 20:29, David E. Wheeler <david(at)kineticode(dot)com> wrote:
> ...
> I would argue that it should output the same as the first example. That is, PL/Perl should have decoded the latin-1 before passing the text to the Perl function.

Yeah, I don't think you will find anyone who disagrees :) PL/TCL and
PL/Python get this right FWIW. Anyway find attached a patch that does
just this.

With the attached we:
- for function arguments, convert (using pg_do_encoding_conversion) to
utf8 from the current database encoding. We also turn on the utf8
flag so perl knows it was given utf8. Pre patch things only really
worked for SQL_ASCII or PG_UTF8 databases. In practice everything
worked fine for single byte charsets. However things like uc() or
lc() on bytes with high bits set were probably broken.

- for output from perl convert from perls internal format to utf8
(using SvPVutf8()), and then convert that to the current database
encoding. This sounds unoptimized, but in the common case SvPVutf8()
should be a noop. Pre patch this was "random" (dependent on how perl
decided to represent the string internally) but it worked 99% of the
time (at least in the single byte charset or UTF8 cases).

- fix errors so they properly encode their message to the current
database encoding (pre patch we were doing no conversion at all,
similar to the output case were it worked most of the time)

- always do the utf8 hack so utf8.pm is loaded (fixes utf8 regexs in
plperl). Pre patch this only happened on a UTF8 database. That meant
multi-byte character regexs were broken on non utf8 databases.

-remove some old perl version checks for 5.6 and 5.8. We require
5.8.1 so these were nothing but clutter.

Something interesting to note is when we are SQL_ASCII,
pg_do_encoding_conversion() does nothing, yet we turn on the utf8
flag. This means if you pass in valid utf8 perl will treat it as
such. It also means on output it will hand utf8 back. Both PL/Tcl
and PL/Python do the same thing so I suppose its sensible to match
their behavior (and it was the lazy thing to do). The difference
being with PL/Python if you return said string you get "ERROR:
PL/Python: could not convert Python Unicode object to PostgreSQL
server encoding". While PL/Tcl and now Pl/perl give you back a utf8
version. For example:

(SQL_ASCII database)
=# select length('☺');
length
--------
3

=# CREATE FUNCTION tcl_len(text) returns text as $$ return [string
length $1] $$ language pltcl;
CREATE FUNCTION
postgres=# SELECT tcl_len('☺');
tcl_len
------------
1
(1 row)

=# CREATE FUNCTION py_len(str text) returns text as $$ return
len(str) $$ language plpython3;
=# SELECT py_len('☺');
py_len
--------
1
(1 row)

I wouldn't really say thats right, but its at least consistent...

This does *not* address the bytea issue where currently if you have
bytea input or output we try to encode that the same as any string. I
think thats going to be a bit more invasive and this patch should
stands on its own.

Attachment	Content-Type	Size
plperl_fix_enc.patch.gz	application/x-gzip	3.7 KB

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	Alex Hunsaker <badalex(at)gmail(dot)com>
Cc:	David Christensen <david(at)endpoint(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: plperlu problem with utf8
Date:	2010-12-20 00:56:55
Message-ID:	CA0DD9F0-67D9-4467-AF6F-73285AA61F7D@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Dec 19, 2010, at 12:20 AM, Alex Hunsaker wrote:

>> I would argue that it should output the same as the first example. That is, PL/Perl should have decoded the latin-1 before passing the text to the Perl function.
>
> Yeah, I don't think you will find anyone who disagrees :) PL/TCL and
> PL/Python get this right FWIW. Anyway find attached a patch that does
> just this.
>
> With the attached we:

+1 Awesome. Should this go into the next commitfest? Or might it be considered a bug fix?

Best,

David

From:	David Christensen <david(at)endpoint(dot)com>
To:	Alex Hunsaker <badalex(at)gmail(dot)com>
Cc:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: plperlu problem with utf8
Date:	2010-12-20 04:00:02
Message-ID:	A52064D4-3467-49B3-AE9E-14FF22B7A57D@endpoint.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Dec 19, 2010, at 2:20 AM, Alex Hunsaker wrote:

> On Sat, Dec 18, 2010 at 20:29, David E. Wheeler <david(at)kineticode(dot)com> wrote:
>> ...
>> I would argue that it should output the same as the first example. That is, PL/Perl should have decoded the latin-1 before passing the text to the Perl function.
>
> Yeah, I don't think you will find anyone who disagrees :) PL/TCL and
> PL/Python get this right FWIW. Anyway find attached a patch that does
> just this.

Cool, thanks for taking this on.

> With the attached we:
> - for function arguments, convert (using pg_do_encoding_conversion) to
> utf8 from the current database encoding. We also turn on the utf8
> flag so perl knows it was given utf8. Pre patch things only really
> worked for SQL_ASCII or PG_UTF8 databases. In practice everything
> worked fine for single byte charsets. However things like uc() or
> lc() on bytes with high bits set were probably broken.

How does this deal with input records of composite type?

> - for output from perl convert from perls internal format to utf8
> (using SvPVutf8()), and then convert that to the current database
> encoding. This sounds unoptimized, but in the common case SvPVutf8()
> should be a noop. Pre patch this was "random" (dependent on how perl
> decided to represent the string internally) but it worked 99% of the
> time (at least in the single byte charset or UTF8 cases).
>
> - fix errors so they properly encode their message to the current
> database encoding (pre patch we were doing no conversion at all,
> similar to the output case were it worked most of the time)

This sounds good; I imagine in practice most errors contain just 7-bit ascii which should be acceptable in any server_encoding, but in the case where something is returned that is unable to be represented in the server_encoding (thinking values defined/used in the function itself), does it degrade to the current behavior, as opposed to fail or eat the error message without outputting?

> - always do the utf8 hack so utf8.pm is loaded (fixes utf8 regexs in
> plperl). Pre patch this only happened on a UTF8 database. That meant
> multi-byte character regexs were broken on non utf8 databases.

This sounds good in general, but I think should be skipped if GetDatabaseEncoding() == SQL_ASCII.

> -remove some old perl version checks for 5.6 and 5.8. We require
> 5.8.1 so these were nothing but clutter.

+1. Can't complain about removing clutter :-).

> Something interesting to note is when we are SQL_ASCII,
> pg_do_encoding_conversion() does nothing, yet we turn on the utf8
> flag. This means if you pass in valid utf8 perl will treat it as
> such. It also means on output it will hand utf8 back. Both PL/Tcl
> and PL/Python do the same thing so I suppose its sensible to match
> their behavior (and it was the lazy thing to do). The difference
> being with PL/Python if you return said string you get "ERROR:
> PL/Python: could not convert Python Unicode object to PostgreSQL
> server encoding". While PL/Tcl and now Pl/perl give you back a utf8
> version. For example:
>
> (SQL_ASCII database)
> =# select length('☺');
> length
> --------
> 3
>
> =# CREATE FUNCTION tcl_len(text) returns text as $$ return [string
> length $1] $$ language pltcl;
> CREATE FUNCTION
> postgres=# SELECT tcl_len('☺');
> tcl_len
> ------------
> 1
> (1 row)
>
> =# CREATE FUNCTION py_len(str text) returns text as $$ return
> len(str) $$ language plpython3;
> =# SELECT py_len('☺');
> py_len
> --------
> 1
> (1 row)
>
> I wouldn't really say thats right, but its at least consistent...

I think this could/should be adequately handled by not calling the function when the DatabaseEncoding is SQL_ASCII. Since SQL_ASCII basically makes no assumptions about any representation other than arbitrary 8-bit encoding, this demonstrated behavior is more-or-less attaching incorrect semantics to values that are being returned, and is completely bunko IMHO. (Not that many people are hopefully running SQL_ASCII at this point, but you never know...) Also, I'd argue that pltcl and plpython should be fixed as well for the same reasons.

> This does *not* address the bytea issue where currently if you have
> bytea input or output we try to encode that the same as any string. I
> think thats going to be a bit more invasive and this patch should
> stands on its own.
> <plperl_fix_enc.patch.gz>

Yeah, I'm not sure how invasive that will end up being, or if there are other datatypes which should skip the text processing.

I noticed you moved the declaration of perl_sys_init_done; was that an independent bug fix, or did something in the patch require that?

Cheers,

David
--
David Christensen
End Point Corporation
david(at)endpoint(dot)com

From:	Alex Hunsaker <badalex(at)gmail(dot)com>
To:	David Christensen <david(at)endpoint(dot)com>
Cc:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: plperlu problem with utf8
Date:	2010-12-20 07:39:17
Message-ID:	AANLkTimKvnkFirhsAtQZ6aw35vw8HfzycZ0pQeiN9_n_@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, Dec 19, 2010 at 21:00, David Christensen <david(at)endpoint(dot)com> wrote:
>
> On Dec 19, 2010, at 2:20 AM, Alex Hunsaker wrote:
>

>> With the attached we:
>> - for function arguments, convert (using pg_do_encoding_conversion) to
>> utf8 from the current database encoding.
>
> How does this deal with input records of composite type?

Same way it worked before just encoded properly. :)

>> - fix errors so they properly encode their message to the current
>> database encoding
>
>[...] does it degrade to the current behavior, as opposed to fail or eat the error message without outputting?

No it fails with an "invalid encoding message". I think thats the
best we can do. It certainly seems wrong to send invalid bytes back
to the client.

>> - always do the utf8 hack so utf8.pm is loaded (fixes utf8 regexs in
>> plperl). [...]
>
> This sounds good in general, but I think should be skipped if GetDatabaseEncoding() == SQL_ASCII.

Why? What if you want to use a module that does utf8 things?

>> [ PLs handling strings as utf8 in SQL_ASCII databases ]

> I think this could/should be adequately handled by not calling the function when the DatabaseEncoding is SQL_ASCII. [...] > Also, I'd argue that pltcl and plpython should be fixed as well for the same reasons.

I don't disagree, but im not volunteering for it either. Without
having done any in depth analysis, the largest problem is they use
strings for most arguments. Strings need an encoding and pl/python
and pl/tcl want a utf8 string. So we need to convert from SQL_ASCII
to UTF8. Which like you say is nonsensical. So really in the
SQL_ASCII case we would need to switch to binary datatypes for those
languages. Perl is a bit more flexible so it should be easier to
'fix'.

>> This does *not* address the bytea issue where currently if you have
>> bytea input or output we try to encode that the same as any string. I
>> think thats going to be a bit more invasive and this patch should
>> stands on its own.
>> <plperl_fix_enc.patch.gz>
>
> Yeah, I'm not sure how invasive that will end up being, or if there are other datatypes which should skip the text processing.

Well, I don't think it will be too bad. As for examples of other
datatypes that we could skip text processing: int, float and numeric.
Right now we treat them as strings, which works fine for perl... but
its a tad inefficient. Ideally we could be using things like SvIV.

> I noticed you moved the declaration of perl_sys_init_done; was that an independent bug fix, or did something in the patch require that?

It just squashes an unused variable "perl_sys_init_done" warning when
perl is compiled with its own malloc. Of course that added another
warning "warning: ISO C90 forbids mixed declarations and code" :P

In further review over caffeine this morning I noticed there are a few
places I missed: plperl_build_tuple_result(), plperl_modify_tuple()
and Util.XS.

The first 2 pass perl hash keys for the column straight to
SPI_fnumber() without doing any conversion. In theory this means we
might not be able to look up columns with non ascii characters. I
missed this because those hash keys are declared as char* instead of
SV* and do not use any of the standard perl string macros.

Anyway I hope to have a fresh patch tomorrow.

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	Alex Hunsaker <badalex(at)gmail(dot)com>, David Christensen <david(at)endpoint(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: plperlu problem with utf8
Date:	2010-12-20 19:02:03
Message-ID:	AANLkTinB=Rqj+tGDNm52=Ee4=ag1j9eDjSYb8eUvf6ZY@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, Dec 19, 2010 at 7:56 PM, David E. Wheeler <david(at)kineticode(dot)com> wrote:
> +1 Awesome. Should this go into the next commitfest? Or might it be considered a bug fix?

CommitFest or no CommitFest, patches get applied when a committer
acquires enough round tuits. Putting then into the next CommitFest
just provides a backstop to make sure that we remember to think about
looking for some round tuits then if not sooner; it doesn't prevent
them from being applied sooner. And in fact typically there are 2-4
patches per CommitFest that are committed before the CommitFest
technically starts.

Unfortunately, I'm not going to be able to pick this patch up now or
then, due to lack of subject-matter expertise. I expect Andrew or Tom
would be the most likely candidates...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Alex Hunsaker <badalex(at)gmail(dot)com>
To:	David Christensen <david(at)endpoint(dot)com>
Cc:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: plperlu problem with utf8
Date:	2010-12-22 04:19:51
Message-ID:	AANLkTineVH=mX-H4_8nJSjZxssaWiZR4KCQwj9kKoRG6@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Dec 20, 2010 at 00:39, Alex Hunsaker <badalex(at)gmail(dot)com> wrote:

> In further review over caffeine this morning I noticed there are a few
> places I missed: plperl_build_tuple_result(), plperl_modify_tuple()
> and Util.XS.

And here is v3, fixes the above and also makes sure to properly
encode/decode SPI arguments. Tested on a latin1 database with latin1
columns and utf8 with utf8 columns. Also passes make installcheck (of
course) and changes one or two things to make plperl.c warning free.

Attachment	Content-Type	Size
plperl_enc_v3.patch.gz	application/x-gzip	8.6 KB

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	Alex Hunsaker <badalex(at)gmail(dot)com>
Cc:	David Christensen <david(at)endpoint(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: plperlu problem with utf8
Date:	2010-12-22 18:24:00
Message-ID:	EF32C258-514A-45C6-91F7-41FF1C599CE8@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Dec 21, 2010, at 8:19 PM, Alex Hunsaker wrote:

> And here is v3, fixes the above and also makes sure to properly
> encode/decode SPI arguments. Tested on a latin1 database with latin1
> columns and utf8 with utf8 columns. Also passes make installcheck (of
> course) and changes one or two things to make plperl.c warning free.

Awesome. Would you add it to https://commitfest.postgresql.org/action/commitfest_view?id=9 please?

Best,

David

From:	Alex Hunsaker <badalex(at)gmail(dot)com>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	David Christensen <david(at)endpoint(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: plperlu problem with utf8
Date:	2010-12-22 19:59:57
Message-ID:	AANLkTi=bOdU5+9W4e=KQ7TfNwt8=ZszAdn=+hW9PFAT=@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Dec 22, 2010 at 11:24, David E. Wheeler <david(at)kineticode(dot)com> wrote:
> On Dec 21, 2010, at 8:19 PM, Alex Hunsaker wrote:
>
>> And here is v3, [ ...]

> Awesome. Would you add it to https://commitfest.postgresql.org/action/commitfest_view?id=9 please?

Nah, I was willing to spend a couple of hours playing with different
encodings and thinking about ways to break it. But adding it to a
commit feast? Puhlease. In other news, Ive added it to the
commitfest.

From:	Andy Colson <andy(at)squeakycode(dot)net>
To:	Alex Hunsaker <badalex(at)gmail(dot)com>
Cc:	Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	plperlu problem with utf8 [REVIEW]
Date:	2011-01-15 21:20:38
Message-ID:	4D320FA6.3000005@squeakycode.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

This is a review of "plperl encoding issues"

https://commitfest.postgresql.org/action/patch_view?id=452

Purpose:
========
Your database uses one encoding, and passes data to perl in the same encoding, which perl is not prepared for (it assumes UTF-8). This patch makes sure data is encoded into UTF-8 before its passed to plperl then converts the response from UTF-8 back to the database encoding for storage.

My test:

ptest2=# create database ptest2 encoding 'EUC_JP' template template0;

I created a simple perl function that reverses the string. I don't know Japanese so I found a tattoo website that had sayings in Japanese... I picked: "I am awesome".

create or replace function preverse(x text) returns text as $$
my $tmp = reverse($_[0]);
return $tmp;
$$ LANGUAGE plperl;

Before the patch:

ptest2=#select preverse('私はよだれを垂らす');

preverse
--------------------
垢蕕眇鬚譴世茲呂篁
(1 row)

It is also possible to generate invalid characters. This function pulls off the last character in the string... assuming its UTF-8

create or replace function plastchar(x text) returns text as $$
my $tmp = substr($_[0], -1);
return $tmp;
$$ LANGUAGE plperl;

ptest2=# select plastchar('私はよだれを垂らす');

ERROR: invalid byte sequence for encoding "EUC_JP": 0xb9
CONTEXT: PL/Perl function "plastchar"

Because the string was not UTF-8, perl got confused and returned an invalid character.

After the patch:
The exact same plperl functions work fine:

ptest2=# select preverse('私はよだれを垂らす');

preverse
--------------------
すら垂をれだよは私
(1 row)

ptest2=# select plastchar('私はよだれを垂らす');

plastchar
-----------
す
(1 row)

Performance:
============
This is a bug fix, not for performance, however, as noted by the author, many encodings will be very UTF-8'ish and the overhead will be very small. For those encodings that would need converted, you'd need to do the same convert inside your perl function anyway before you could use the data. The processing has just moved from inside your perl func to inside PG.

The Patch:
==========
Applies clean to git head as of January 15 2011. PG built with --enable-cassert and --enable-debug seems to run fine with no errors.

I don't think regression tests cover plperl, so understandable there are no tests in the patch.

There is no manual updates in the patch either, and I think there should be. I think it should be made clear
that data (varchar, text, etc. but not bytea) will be passed to perl as UTF-8, regardless of database encoding. Also that "use utf8;" is always loaded and in use.

Code Review:
============
I am not qualified. Looking through the patch, I'm reminded of the old saying: "Any sufficently advanced perl XS code is indistinguishable from magic" :-)

Other Remarks:
==============
- Yes I know... it was a joke.
- I sure hope this posts to the news group ok
- My terminal (konsole) had a hard time displaying Japanese, so I used psql's \i and \o to read/write files that kwrite show'd/encoded correctly via EUC_JP

Summary:
========
Looks good. Looks needed. Needs manual updates.

From:	Alex Hunsaker <badalex(at)gmail(dot)com>
To:	Andy Colson <andy(at)squeakycode(dot)net>
Cc:	Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: plperlu problem with utf8 [REVIEW]
Date:	2011-01-17 01:14:09
Message-ID:	AANLkTik98uQNYr04_5WGfELZ1O9pSt4fi3VVfrTRdR+R@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, Jan 15, 2011 at 14:20, Andy Colson <andy(at)squeakycode(dot)net> wrote:
>
> This is a review of "plperl encoding issues"
>
> https://commitfest.postgresql.org/action/patch_view?id=452

Thanks for taking the time to review!

[...]
>
> The Patch:
> ==========
> Applies clean to git head as of January 15 2011. PG built with
> --enable-cassert and --enable-debug seems to run fine with no errors.
>
> I don't think regression tests cover plperl, so understandable there are no
> tests in the patch.

FWI there are plperl tests, you can do 'make installcheck' from the
plperl dir or installcheck-world from the top. However I did not add
any as AFAIK there is not a way to handle multiple locales with them
(at least for the automated case).

> There is no manual updates in the patch either, and I think there should be.
> I think it should be made clear
> that data (varchar, text, etc. but not bytea) will be passed to perl as
> UTF-8, regardless of database encoding

I don't disagree, but I dont see where to put it either. Maybe its
only release note material?

> Also that "use utf8;" is always loaded and in use.

Sorry, I probably mis-worded that in my original description. Its that
we always do the 'utf8fix' for plperl. Not that utf8 is loaded and in
use. This fix basically makes sure the unicode database and associated
modules are loaded. This is needed because perl will try to
dynamically load these when you need them. As we restrict 'require' in
the plperl case, things that depended on that would fail. Previously
we only did the utf8fix when we were a PG_UTF8 database. I don't
really think its worth documenting, its more a bug fix than anything
else.

From:	Andy Colson <andy(at)squeakycode(dot)net>
To:	Alex Hunsaker <badalex(at)gmail(dot)com>
Cc:	Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: plperlu problem with utf8 [REVIEW]
Date:	2011-01-17 04:03:48
Message-ID:	4D33BFA4.8060505@squeakycode.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 01/16/2011 07:14 PM, Alex Hunsaker wrote:
> On Sat, Jan 15, 2011 at 14:20, Andy Colson<andy(at)squeakycode(dot)net> wrote:
>>
>> This is a review of "plperl encoding issues"
>>
>> https://commitfest.postgresql.org/action/patch_view?id=452
>
> Thanks for taking the time to review!
>
> [...]
>>
>> The Patch:
>> ==========
>> Applies clean to git head as of January 15 2011. PG built with
>> --enable-cassert and --enable-debug seems to run fine with no errors.
>>
>> I don't think regression tests cover plperl, so understandable there are no
>> tests in the patch.
>
> FWI there are plperl tests, you can do 'make installcheck' from the
> plperl dir or installcheck-world from the top. However I did not add
> any as AFAIK there is not a way to handle multiple locales with them
> (at least for the automated case).

oh, cool. I'd kinda thought 'make check' was the one to run. I'll have to checkout 'make check' vs 'make installcheck'.

>> There is no manual updates in the patch either, and I think there should be.
>> I think it should be made clear
>> that data (varchar, text, etc. but not bytea) will be passed to perl as
>> UTF-8, regardless of database encoding
>
> I don't disagree, but I dont see where to put it either. Maybe its
> only release note material?
>

I think this page:
http://www.postgresql.org/docs/current/static/plperl-funcs.html

Right after:
"Arguments and results are handled as in any other Perl subroutine: arguments are passed in @_, and a result value is returned with return or as the last expression evaluated in the function."

Add:

Arguments will be converted from the databases encoding to UTF-8 for use inside plperl, and then converted from UTF-8 back to the database encoding upon return.

OR, that same sentence could be added to the next page:

http://www.postgresql.org/docs/current/static/plperl-data.html

However, this patch brings back DWIM to plperl. It should just work without having to worry about it. I'd be ok either way.

>> Also that "use utf8;" is always loaded and in use.
>
> Sorry, I probably mis-worded that in my original description. Its that
> we always do the 'utf8fix' for plperl. Not that utf8 is loaded and in
> use. This fix basically makes sure the unicode database and associated
> modules are loaded. This is needed because perl will try to
> dynamically load these when you need them. As we restrict 'require' in
> the plperl case, things that depended on that would fail. Previously
> we only did the utf8fix when we were a PG_UTF8 database. I don't
> really think its worth documenting, its more a bug fix than anything
> else.
>

Agreed.

-Andy