Quick Links

Re: Careful PL/Perl Release Not Required

Lists:	pgsql-hackers

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Careful PL/Perl Release Not Required
Date:	2011-02-10 23:28:31
Message-ID:	A0401CE6-F348-4F3A-AB91-1E6E5E88438A@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hackers,

With regard to this (very welcome) commit:

> commit 50d89d422f9c68a52a6964e5468e8eb4f90b1d95
> Author: Andrew Dunstan <andrew(at)dunslane(dot)net>
> Date: Sun Feb 6 17:29:26 2011 -0500
>
> Force strings passed to and from plperl to be in UTF8 encoding.
>
> String are converted to UTF8 on the way into perl and to the
> database encoding on the way back. This avoids a number of
> observed anomalies, and ensures Perl a consistent view of the
> world.
>
> Some minor code cleanups are also accomplished.
>
> Alex Hunsaker, reviewed by Andy Colson.

I just want to emphasize that this needs to be highlighted as a compatibility change in the release notes. As an example, I currently have this code in PGXN to process a TEXT param to a function:

my $dist_meta = JSON::XS->new->utf8->decode(shift);

After I upgrade to 9.0, I will have to change that to:

my $dist_meta = JSON::XS->new->utf8(0)->decode(shift);

The upshot is that in those cases where the raw bytes are what's actually wanted, users will have to modify their functions to turn off the utf8 flag.

This probably won't be that common, but Oleg, for example, will need to convert his fixed function from:

> CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar AS $$
> use strict;
> use URI::Escape;
> utf8::decode($_[0]);
> return uri_unescape($_[0]); $$ LANGUAGE plperlu;

To:

> CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar AS $$
> use strict;
> use URI::Escape;
> return uri_unescape($_[0]); $$ LANGUAGE plperlu;

So this needs to be highlighted in the release notes: If a PL/Perl function is currently relying on a parameter passed in bytes, it will need to be modified to deal with utf8 strings, instead.

Best,

David

From:	Alex Hunsaker <badalex(at)gmail(dot)com>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Careful PL/Perl Release Not Required
Date:	2011-02-11 01:28:24
Message-ID:	AANLkTimSZFNcVKK0gthChUZ-WcB2s+mPjHzc=U4Vn8Li@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Feb 10, 2011 at 16:28, David E. Wheeler <david(at)kineticode(dot)com> wrote:
> Hackers,
>
> With regard to this (very welcome) commit:
>
>> commit 50d89d422f9c68a52a6964e5468e8eb4f90b1d95
>> Author: Andrew Dunstan <andrew(at)dunslane(dot)net>
>> Date: Sun Feb 6 17:29:26 2011 -0500
>>
>> Force strings passed to and from plperl to be in UTF8 encoding.
>>
>> String are converted to UTF8 on the way into perl and to the
>> database encoding on the way back. This avoids a number of
>> observed anomalies, and ensures Perl a consistent view of the
>> world.
>>
>> Some minor code cleanups are also accomplished.
>>
>> Alex Hunsaker, reviewed by Andy Colson.
>
> I just want to emphasize that this needs to be highlighted as a compatibility change in the release notes. As an example, I currently have this code in PGXN to process a TEXT param to a function:
>
> my $dist_meta = JSON::XS->new->utf8->decode(shift);
>
> After I upgrade to 9.0, I will have to change that to:
>
> my $dist_meta = JSON::XS->new->utf8(0)->decode(shift);

Hrm? For UTF-8 databases, in practice, nothing should have changed--
we already passed strings in as utf8. What I fixed was some corner
cases where some strings did not always have character semantics. See
The "Unicode Bug" and "Forcing Unicode in Perl" in perldoc perlunicode
for the problem and more or less how I fixed it.

The other thing that changed is non UTF-8 databases now also get
character semantics. That is we convert from the database encoding
into utf8 and visa versa on output. That probably should be noted
somewhere...

If you do have to change your semantics/functions, could you post an
example? I'd like to make sure its because you were hitting one of
those nasty corner cases and not something new is broken.

> This probably won't be that common, but Oleg, for example, will need to convert his fixed function from:
> ...

Well assuming he fixed his bug by encoding uri_unescape's output he
should not have to do anything. IIRC the problem was basically double
encoded utf8, not a postgres bug.

[ he had %3A%4A or something, uri_decode() decodes that to _two_
characters because _it_ knows nothing about utf8. so you would need to
call utf8::decode() on the result to turn those two bytes into a
character ]

> So this needs to be highlighted in the release notes: If a PL/Perl function is currently relying on a parameter passed in bytes, it will >need to be modified to deal with utf8 strings, instead.

FYI Andrew did add some docs.

Thanks for keeping a sharp eye out.

[ P.S. This stuff is confusing as hell, im just glad I got a sucker to
commit it *waves* at Andrew :-) ]

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	Alex Hunsaker <badalex(at)gmail(dot)com>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Careful PL/Perl Release Not Required
Date:	2011-02-11 04:53:29
Message-ID:	D08A9F21-8162-4891-975D-C8F51737181A@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Feb 10, 2011, at 5:28 PM, Alex Hunsaker wrote:

> Hrm? For UTF-8 databases, in practice, nothing should have changed--
> we already passed strings in as utf8. What I fixed was some corner
> cases where some strings did not always have character semantics. See
> The "Unicode Bug" and "Forcing Unicode in Perl" in perldoc perlunicode
> for the problem and more or less how I fixed it.

Uh…

try=# create function is_utf8(text) returns boolean language plperl AS 'utf8::is_utf8(shift)';
CREATE FUNCTION

try=# select is_utf8('whatever');
is_utf8
─────────
t
(1 row)

try=# select is_utf8(U&'\0441\043B\043E\043D');
is_utf8
─────────
t
(1 row)

Damn, I guess you're right. How did I miss that?

> The other thing that changed is non UTF-8 databases now also get
> character semantics. That is we convert from the database encoding
> into utf8 and visa versa on output. That probably should be noted
> somewhere...

Oh. I see. And Oleg's database wasn't utf-8 then, I guess. I'll have to re-read the JSON docs, I guess. Erm…feh. Okay. I have to pass the false value to utf8() *now*. Okay, at least that's more consistent.

> If you do have to change your semantics/functions, could you post an
> example? I'd like to make sure its because you were hitting one of
> those nasty corner cases and not something new is broken.

I think that people who have non-utf-8 databases might be surprised.

>> This probably won't be that common, but Oleg, for example, will need to convert his fixed function from:
>> ...
>
> Well assuming he fixed his bug by encoding uri_unescape's output he
> should not have to do anything. IIRC the problem was basically double
> encoded utf8, not a postgres bug.

No, the problem was that the string was passed to his pl/perl function encoded in utf-8. He added a line to decode it to Perl's internal form. Once he goes to 9.1, unless the database is SQL_ASCII, he can dump the decode() line. I think.

> [ he had %3A%4A or something, uri_decode() decodes that to _two_
> characters because _it_ knows nothing about utf8. so you would need to
> call utf8::decode() on the result to turn those two bytes into a
> character ]

No, he had to add the decode line, IIRC:

CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar AS $$
use strict;
use URI::Escape;
utf8::decode($_[0]);
return uri_unescape($_[0]); $$ LANGUAGE plperlu;

Because uri_unescape() needs its argument to be decoded to Perl's internal form. On 9.1, it will be, so he won't need to call utf8::decode(). That is, in a latin-1 database:

latin=# create or replace function is_utf8(text) returns boolean language plperl AS 'utf8::is_utf8(shift) ? 1 : 0';
CREATE FUNCTION
Time: 1.934 ms
latin=# select is_utf8('whatever'); is_utf8
─────────
f
(1 row)

That will change, if I understand correctly.

>> So this needs to be highlighted in the release notes: If a PL/Perl function is currently relying on a parameter passed in bytes, it will >need to be modified to deal with utf8 strings, instead.
>
> FYI Andrew did add some docs.

Yeah, I was thinking of the release notes. Those who have non-uft-8 databases might be surprised if their PL/Perl functions expect strings to be passed as bytes.

> Thanks for keeping a sharp eye out.
>
> [ P.S. This stuff is confusing as hell, im just glad I got a sucker to
> commit it *waves* at Andrew :-) ]

Heh, well done. Frankly, though, this stuff isn't *that* hard. It's Perl's terminology that's really bad.

Best,

David

From:	Alex Hunsaker <badalex(at)gmail(dot)com>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Careful PL/Perl Release Not Required
Date:	2011-02-11 07:43:54
Message-ID:	AANLkTimp9yiGqAGLvwJifb1gvJ6xK0PUZh3td30BEU5C@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Feb 10, 2011 at 21:53, David E. Wheeler <david(at)kineticode(dot)com> wrote:
> On Feb 10, 2011, at 5:28 PM, Alex Hunsaker wrote:

>> The other thing that changed is non UTF-8 databases now also get
>> character semantics. That is we convert from the database encoding
>> into utf8 and visa versa on output. That probably should be noted
>> somewhere...
>
> Oh. I see. And Oleg's database wasn't utf-8 then, I guess. I'll have to re-read the JSON docs, I guess. Erm…feh. Okay. I have to pass the false value to utf8() *now*. Okay, at least that's more consistent.

I'd like to quibble with you over this point if I may. :-)
Per perldoc: JSON::XS
"utf8" flag disabled
When "utf8" is disabled (the default), then
"encode"/"decode" generate and expect Unicode strings ...

So
- If you are on < 9.1 and a utf8 database you want to pass
utf8(false), as you have a Unicode string.

- If you are on < 9.1 and on a non utf8 database you would want to
pass utf8(false) as the string is *not* Unicode, its byte soup. Its in
some _other_ encoding say EUC_JP. You would need to decode() it into
Unicode first.

- If you are on 9.1 and a utf8 database you still want to pass
utf8(false) as the string is still unicode.

- if you are on 9.1 and a non utf8 database you want to pass
utf8(false) as the string is _now_ unicode.

So... it seems you always want to pass false. The only case I can
where you would want to pass true is you are on < 9.1 with a SQL_ASCII
database and you know for a fact the string represents a utf8 byte
sequence.

Or am I missing something obvious?

>> If you do have to change your semantics/functions, could you post an
>> example? I'd like to make sure its because you were hitting one of
>> those nasty corner cases and not something new is broken.
>
> I think that people who have non-utf-8 databases might be surprised.

Yeah, surprised it does the right thing and its actually usable now ;).

>>> This probably won't be that common, but Oleg, for example, will need to convert his fixed function from:

> No, he had to add the decode line, IIRC:
>
> CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar AS $$
> use strict;
> use URI::Escape;
> utf8::decode($_[0]);
> return uri_unescape($_[0]); $$ LANGUAGE plperlu;
>
> Because uri_unescape() needs its argument to be decoded to Perl's internal form. On 9.1, it will be, so he won't need to call utf8::decode(). That is, in a latin-1 database:

Meh, no, not really. He will still need to call decode. The problem is
uri_unescape() does not assume an encoding on the URI. It could be
UTF-16 encoded for all it knows (UTF-8 is probably standard, but thats
not the point, it knows nothing about Unicode or encodings).

For example, lets say you have a latin-1 accented e "é" the byte
sequence is the one byte: 0xe9. If you were to uri_escape that you get
the 3 byte ascii string "%E9":
$ perl -E 'use URI::Escape; my $str = "\xe9"; say uri_escape($str)'
%E9

If you uri_unescape "%E9" you get 1 byte back with a hex value of 0xe9:
$ perl -E 'use URI::Escape; my $str = uri_unescape("%E9"); say
sprintf("chr: %s hex: %s, len: %s", $str, unpack("H*", $str), length
$str)'
chr: é hex: e9, len: 1

What if we want to uri_escape a UTF-16 accented e? Thats two hex bytes 0x00e9:
$ perl -E 'use URI::Escape; my $str = "\x00\xe9"; say uri_escape($str)'
%00%E9

What happens we uri_unescape that? Do we get back a Unicode string
that has one character? No. And why should we? How is uri_unescape
supposed to know what %00%E9 represent? All it knows is thats 2
separate bytes:
$ perl -E 'use URI::Escape; my $str = uri_unescape("%00%E9"); say
sprintf("chr: %s hex: %s, len: %s", $str, unpack("H*", $str), length
$str)'
chr: é hex: 00e9, len: 2

Now, lets say you want to uri_escape a utf8 accented e, thats the two
byte sequence: 0xc3 0xa9:
$ perl -E 'use URI::Escape; my $str = "\xc3\xa9"; say uri_escape($str)'
%C3%A9

Ok, what happens when we uri_unescape those?:
$ perl -E 'use URI::Escape; my $str = uri_unescape("%C3%A9"); say
sprintf("chr: %s hex: %s, len: %s", $str, unpack("H*", $str), length
$str)'
chr: é hex: c3a9, len: 2

So, plperl will also return 2 characters here.

In the the cited case he was passing "%C3%A9" to uri_unescape() and
expecting it to return 1 character. The additional utf8::decode() will
tell perl the string is in utf8 so it will then return 1 char. The
point being, decode is needed and with it, the function will work pre
and post 9.1.

In-fact on a latin-1 database it sure as heck better return two
characters, it would be a bug if it only returned 1 as that would mean
it would be treating a series of latin1 bytes as a series of utf8
bytes!

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	Alex Hunsaker <badalex(at)gmail(dot)com>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Careful PL/Perl Release Not Required
Date:	2011-02-11 17:16:36
Message-ID:	9C246AAB-E67A-4F23-AA5E-66B59BC4F876@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Feb 10, 2011, at 11:43 PM, Alex Hunsaker wrote:

> I'd like to quibble with you over this point if I may. :-)
> Per perldoc: JSON::XS
> "utf8" flag disabled
> When "utf8" is disabled (the default), then
> "encode"/"decode" generate and expect Unicode strings ...
>
> So
> - If you are on < 9.1 and a utf8 database you want to pass
> utf8(false), as you have a Unicode string.

Right. That's what I realized yesterday, thanks to our exchange. I updated my code for that. The use of the term "Unicode string" in the JSON::XS docs is really confusing, though. A scalar with the utf8 flag on is not a unicode string. It's Perl's representation of a string. It has no encoding (it's "decoded").

Like I said, the terminology is awful.

> - If you are on < 9.1 and on a non utf8 database you would want to
> pass utf8(false) as the string is *not* Unicode, its byte soup. Its in
> some _other_ encoding say EUC_JP. You would need to decode() it into
> Unicode first.

Or use utf8() or utf8(1). Then JSON::XS will decode it for you.

> - If you are on 9.1 and a utf8 database you still want to pass
> utf8(false) as the string is still unicode.
>
> - if you are on 9.1 and a non utf8 database you want to pass
> utf8(false) as the string is _now_ unicode.

Right.

> So... it seems you always want to pass false. The only case I can
> where you would want to pass true is you are on < 9.1 with a SQL_ASCII
> database and you know for a fact the string represents a utf8 byte
> sequence.
>
> Or am I missing something obvious?

Yes, that you can pass no value to utf8() or a true value and it will decode a utf-8-encoded string for you.

>>> If you do have to change your semantics/functions, could you post an
>>> example? I'd like to make sure its because you were hitting one of
>>> those nasty corner cases and not something new is broken.
>>
>> I think that people who have non-utf-8 databases might be surprised.
>
> Yeah, surprised it does the right thing and its actually usable now ;).

Yes, but they might need to change their code, is what I'm saying.

>
>>>> This probably won't be that common, but Oleg, for example, will need to convert his fixed function from:
>
>> No, he had to add the decode line, IIRC:
>>
>> CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar AS $$
>> use strict;
>> use URI::Escape;
>> utf8::decode($_[0]);
>> return uri_unescape($_[0]); $$ LANGUAGE plperlu;
>>
>> Because uri_unescape() needs its argument to be decoded to Perl's internal form. On 9.1, it will be, so he won't need to call utf8::decode(). That is, in a latin-1 database:
>
> Meh, no, not really. He will still need to call decode.

Why? In 9.1, won't params from passed to PL/Perl functions in non-SQL_ASCII databases already be decoded?

> The problem is
> uri_unescape() does not assume an encoding on the URI. It could be
> UTF-16 encoded for all it knows (UTF-8 is probably standard, but thats
> not the point, it knows nothing about Unicode or encodings).

Yes, but if you don't want surprises, I think you want to pass a decoded string to it.

> For example, lets say you have a latin-1 accented e "é" the byte
> sequence is the one byte: 0xe9. If you were to uri_escape that you get
> the 3 byte ascii string "%E9":
> $ perl -E 'use URI::Escape; my $str = "\xe9"; say uri_escape($str)'
> %E9
>
> If you uri_unescape "%E9" you get 1 byte back with a hex value of 0xe9:
> $ perl -E 'use URI::Escape; my $str = uri_unescape("%E9"); say
> sprintf("chr: %s hex: %s, len: %s", $str, unpack("H*", $str), length
> $str)'
> chr: é hex: e9, len: 1
>
> What if we want to uri_escape a UTF-16 accented e? Thats two hex bytes 0x00e9:
> $ perl -E 'use URI::Escape; my $str = "\x00\xe9"; say uri_escape($str)'
> %00%E9
>
> What happens we uri_unescape that? Do we get back a Unicode string
> that has one character? No. And why should we? How is uri_unescape
> supposed to know what %00%E9 represent? All it knows is thats 2
> separate bytes:
> $ perl -E 'use URI::Escape; my $str = uri_unescape("%00%E9"); say
> sprintf("chr: %s hex: %s, len: %s", $str, unpack("H*", $str), length
> $str)'
> chr: é hex: 00e9, len: 2

Yeah, this is why URI::Escape needs a uri_unescape_utf8() function to complement utf8_escape_utf8(). But to get around that, you would of course decode the return value yourself.

> Now, lets say you want to uri_escape a utf8 accented e, thats the two
> byte sequence: 0xc3 0xa9:
> $ perl -E 'use URI::Escape; my $str = "\xc3\xa9"; say uri_escape($str)'
> %C3%A9
>
> Ok, what happens when we uri_unescape those?:
> $ perl -E 'use URI::Escape; my $str = uri_unescape("%C3%A9"); say
> sprintf("chr: %s hex: %s, len: %s", $str, unpack("H*", $str), length
> $str)'
> chr: é hex: c3a9, len: 2
>
> So, plperl will also return 2 characters here.
>
> In the the cited case he was passing "%C3%A9" to uri_unescape() and
> expecting it to return 1 character. The additional utf8::decode() will
> tell perl the string is in utf8 so it will then return 1 char. The
> point being, decode is needed and with it, the function will work pre
> and post 9.1.

Why wouldn't the string be decoded already when it's passed to the function, as it would be in 9.0 if the database was utf-8, and should be in 9.1 if the database isn't sql_ascii?

> In-fact on a latin-1 database it sure as heck better return two
> characters, it would be a bug if it only returned 1 as that would mean
> it would be treating a series of latin1 bytes as a series of utf8
> bytes!

If it's a latin-1 database, in 9.1, the argument should be passed decoded. That's not a utf-8 string or bytes. It's Perl's internal representation.

If I understand the patch correctly, the decode() will no longer be needed. The string will *already* be decoded.

Best,

David

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	Alex Hunsaker <badalex(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Careful PL/Perl Release Not Required
Date:	2011-02-11 17:34:43
Message-ID:	4D557333.7040107@dunslane.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 02/11/2011 12:16 PM, David E. Wheeler wrote:

[long discussion elided]

Is there an action item here? Is the documentation inadequate or
inaccurate? If so, please make a suggestion for improved wording.

I certainly expect the change to be covered in the release notes. You
can check on the prominence given the item then and protest if you don't
think it's adequate.

cheers

andrew

From:	Alex Hunsaker <badalex(at)gmail(dot)com>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Careful PL/Perl Release Not Required
Date:	2011-02-11 17:44:56
Message-ID:	AANLkTimkORLgN6ib63rkZ9OjZv5jDpBpe8E+OEk-oXL-@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Feb 11, 2011 at 10:16, David E. Wheeler <david(at)kineticode(dot)com> wrote:
> On Feb 10, 2011, at 11:43 PM, Alex Hunsaker wrote:

> Like I said, the terminology is awful.

Yeah I use encode and decode to mean the same thing frequently :-(.

>> In the the cited case he was passing "%C3%A9" to uri_unescape() and
>> expecting it to return 1 character. The additional utf8::decode() will
>> tell perl the string is in utf8 so it will then return 1 char. The
>> point being, decode is needed and with it, the function will work pre
>> and post 9.1.
>
> Why wouldn't the string be decoded already when it's passed to the function, as it would be in 9.0 if the database was utf-8, and should be in 9.1 if the database isn't sql_ascii?

It is decoded... the input string "%C3%A9" actually is the _same_
string utf-8, latin1 and SQL_ASCII decoded or not. Those are all ascii
characters. Calling utf8::decode("%C3%A9") is essentially a noop.

>> In-fact on a latin-1 database it sure as heck better return two
>> characters, it would be a bug if it only returned 1 as that would mean
>> it would be treating a series of latin1 bytes as a series of utf8
>> bytes!
>
> If it's a latin-1 database, in 9.1, the argument should be passed decoded. That's not a utf-8 string or bytes. It's Perl's internal representation.

> If I understand the patch correctly, the decode() will no longer be needed. The string will *already* be decoded.

Ok, I think i figured out why we seem to be talking past each other, we have:
CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar AS $$
use strict;
use URI::Escape;
utf8::decode($_[0]);
return uri_unescape($_[0]); $$ LANGUAGE plperlu;

That *looks* like it is decoding the input string, which it is, but
actually that will double utf8 encode your string. It does not seem to
in this case because we are dealing with all ascii input. The trick
here is its also telling perl to decode/treat the *output* string as
utf8.

uri_unescape() returns the same string you passed in, which thanks to
the utf8::decode() above has the utf8 flag set. Meaning we end up
treating it as 1 character instead of two. Or basically that it has
the same effect as calling utf8::decode() on the return value.

The correct way to write that function pre 9.1 and post 9.1 would be
(in a utf8 database):
CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar AS $$
use strict;
use URI::Escape;
my $str = uri_unescape($_[0]);
utf8::decode($str);
return $str;
$$ LANGUAGE plperlu;

The last utf8::decode being optional (as we said, it might not be
utf8), but granting the sought behavior by the op.

From:	Alex Hunsaker <badalex(at)gmail(dot)com>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Careful PL/Perl Release Not Required
Date:	2011-02-11 18:01:37
Message-ID:	AANLkTinudXQeMZpqLMxfxZqjRPGPGDkLrwmaQFOxLrn7@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Feb 11, 2011 at 10:44, Alex Hunsaker <badalex(at)gmail(dot)com> wrote:
> On Fri, Feb 11, 2011 at 10:16, David E. Wheeler <david(at)kineticode(dot)com> wrote:

> That *looks* like it is decoding the input string, which it is, but
> actually that will double utf8 encode your string. It does not seem to
> in this case because we are dealing with all ascii input. The trick
> here is its also telling perl to decode/treat the *output* string as
> utf8.

Urp, this is a bit of a fib. The problem is actual in plperl not perl
persay. Pre 9.1 we always fetched perls internal string *ignoring* the
utf8 flag. So if you had octets that were utf8 things would work. The
utf8::decode($_[0]); uri_unescape($_[0]); happened to make the return
string internally be utf8 and so it would only return 1 char. Thats
what the op wanted and why it seemed to fix his problem. But thats
actually a bug! utf8::decode($_[0]) should not have changed anything
at all on the output side. It should still have returned 2 characters
instead of 1.

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	Alex Hunsaker <badalex(at)gmail(dot)com>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Careful PL/Perl Release Not Required
Date:	2011-02-11 18:04:57
Message-ID:	0DA44369-C0F1-4C9D-A158-48688D37A6CC@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Feb 11, 2011, at 9:44 AM, Alex Hunsaker wrote:

> It is decoded... the input string "%C3%A9" actually is the _same_
> string utf-8, latin1 and SQL_ASCII decoded or not. Those are all ascii
> characters. Calling utf8::decode("%C3%A9") is essentially a noop.

No, it's not decoded. It doesn't matter because they're ASCII bytes. But if the utf8 flag isn't set, it's not decoded. It's just byte soup as far as Perl is concerned. Unless I grossly misunderstand something, which is entirely possible.

> Ok, I think i figured out why we seem to be talking past each other, we have:
> CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar AS $$
> use strict;
> use URI::Escape;
> utf8::decode($_[0]);
> return uri_unescape($_[0]); $$ LANGUAGE plperlu;
>
> That *looks* like it is decoding the input string, which it is, but
> actually that will double utf8 encode your string. It does not seem to
> in this case because we are dealing with all ascii input. The trick
> here is its also telling perl to decode/treat the *output* string as
> utf8.
>
> uri_unescape() returns the same string you passed in, which thanks to
> the utf8::decode() above has the utf8 flag set. Meaning we end up
> treating it as 1 character instead of two. Or basically that it has
> the same effect as calling utf8::decode() on the return value.
>
> The correct way to write that function pre 9.1 and post 9.1 would be
> (in a utf8 database):
> CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar AS $$
> use strict;
> use URI::Escape;
> my $str = uri_unescape($_[0]);
> utf8::decode($str);
> return $str;
> $$ LANGUAGE plperlu;
>
> The last utf8::decode being optional (as we said, it might not be
> utf8), but granting the sought behavior by the op.

No. If the argument to PL/Perl has the utf8 flag set, then that's what you always get. The utf8::decode() isn't necessary because it's already decoded:

> perl -MURI::Escape -MEncode -E 'say utf8::is_utf8(uri_unescape(Encode::decode_utf8("“hi”")))'
1

Best,

David

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	Alex Hunsaker <badalex(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Careful PL/Perl Release Not Required
Date:	2011-02-11 18:06:29
Message-ID:	E590265D-45A0-49B0-AC05-5E1C19CF8CB9@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Feb 11, 2011, at 9:34 AM, Andrew Dunstan wrote:

> Is there an action item here? Is the documentation inadequate or inaccurate? If so, please make a suggestion for improved wording.
>
> I certainly expect the change to be covered in the release notes. You can check on the prominence given the item then and protest if you don't think it's adequate.

You can ignore Alex and me till we work it out, if you like, and then we'll let you know directly if something needs to be changed/updated.

Thanks,

David

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	Alex Hunsaker <badalex(at)gmail(dot)com>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Careful PL/Perl Release Not Required
Date:	2011-02-11 18:07:40
Message-ID:	8B4CC23A-6659-4EF2-856E-1B957E824EF1@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Feb 11, 2011, at 10:01 AM, Alex Hunsaker wrote:

>> That *looks* like it is decoding the input string, which it is, but
>> actually that will double utf8 encode your string. It does not seem to
>> in this case because we are dealing with all ascii input. The trick
>> here is its also telling perl to decode/treat the *output* string as
>> utf8.
>
> Urp, this is a bit of a fib. The problem is actual in plperl not perl
> persay. Pre 9.1 we always fetched perls internal string *ignoring* the
> utf8 flag. So if you had octets that were utf8 things would work.

In 9.0 in a utf-8 database, the utf8 flag is turned on.

> The
> utf8::decode($_[0]); uri_unescape($_[0]); happened to make the return
> string internally be utf8 and so it would only return 1 char. Thats
> what the op wanted and why it seemed to fix his problem. But thats
> actually a bug! utf8::decode($_[0]) should not have changed anything
> at all on the output side. It should still have returned 2 characters
> instead of 1.

I don't understand where the bug is. If a string is encoded in utf-8 Perl will not treat it as such unless the utf-8 flag is set.

Best,

David

From:	Alex Hunsaker <badalex(at)gmail(dot)com>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Careful PL/Perl Release Not Required
Date:	2011-02-11 20:57:26
Message-ID:	AANLkTi=+EpO9XBwhP++WuBgTvQ4jE4ywSM=p5xvE1QH1@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Feb 11, 2011 at 11:07, David E. Wheeler <david(at)kineticode(dot)com> wrote:

> I don't understand where the bug is. If a string is encoded in utf-8 Perl will not treat it as such unless the utf-8 flag is set.

Ok so I think we agreed this is right:
$ perl -E 'use URI::Escape; my $str = uri_unescape("%C3%A9"); say
sprintf("chr: %s hex: %s, len: %s", $str, unpack("H*", $str), length
$str)'
chr: é hex: c3a9, len: 2

Key part here is len = 2, or 2 characters.

Lets try that in a postgres 9.0 utf8 database:
=> create or replace function uri_decode(txt text, in_decode int,
out_decode int) returns text as $$
use URI::Escape;
my $str = shift;
utf8::decode($str) if(shift);
$str = uri_unescape($str);
utf8::decode($str) if(shift);
return $str;
$$ language plperlu;

-- For ease we are just going to look at the length as most terminals
will have utf8 and latin1 mapped.
=> SELECT length(uri_decode('%c3%a9', 0, 0));
length
--------
2
(1 row)

Looks right.

What happens if we decode after uri_unescape, we should get 1 character no?
-- decode after uri_unescape
=> SELECT length(uri_decode('%c3%a9', 0, 1));
length
--------
1

Ok thats right.

What happens if we decode before? Nothing should right? After all
'%c3%a9' is all asci. We should still get 2 characters.
=> SELECT length(uri_decode('%c3%a9', 1, 0));
length
--------
1

Whoa! 1? Does vanilla perl do that?:
perl <<'perl'
use URI::Escape;
my $str = '%c3%a9';
utf8::decode($str);
$str = uri_unescape($str);
print sprintf("chr: %s hex: %s, len: %s\n", $str, unpack("H*", $str), length
$str);
perl
chr: é hex: c3a9, len: 2

Nope, so postgres gets it wrong here. Thats the problem.

In 9.1 it does "the right thing":

=> SELECT length(uri_decode(0, 0));
length
--------
2

Yay! 2!

=> SELECT length(uri_decode(1, 0));
CONTEXT: PL/Perl function "uri_decode"
length
--------
2

Yay! also 2!

=> SELECT length(uri_decode(0, 1));
length
--------
1

Yay! 1

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	Alex Hunsaker <badalex(at)gmail(dot)com>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Careful PL/Perl Release Not Required
Date:	2011-02-11 23:42:40
Message-ID:	944528FC-AB0D-4576-BFEB-C5CA464BDB6B@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Feb 11, 2011, at 12:57 PM, Alex Hunsaker wrote:

> Yay! 1

Yes, this is all good. But it still seems to me that:

* If your database is neither utf-8 no sql_ascii
* And your PL/Perl functions expect arguments that are byte soup
* Once you upgrade to 9.1 they won't be
* So you'll need to encode them.

Unless I've completely misunderstood the change in how arguments are passed.

Best,

David

From:	Alex Hunsaker <badalex(at)gmail(dot)com>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Careful PL/Perl Release Not Required
Date:	2011-02-11 23:58:37
Message-ID:	AANLkTinQvn6K8=jAqoUqCtXcRp9F1vx2de=V_FA09V8Z@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Feb 11, 2011 at 16:42, David E. Wheeler <david(at)kineticode(dot)com> wrote:
> On Feb 11, 2011, at 12:57 PM, Alex Hunsaker wrote:
>
>> Yay! 1
>
> Yes, this is all good. But it still seems to me that:
>
> * If your database is neither utf-8 no sql_ascii

You mean... we have been talking past each other this whole time?
Olegs case _was_ a utf8 database.
>From his original bug:

>> Hi there, below is the problem, which I don't have when running in shell. The database is in UTF-8 encoding.

Thats why I have been fighting the notion that he can finally get rid
of the utf8::decode(). The utf8::decode() _before_ uri_unescape() was
wrong, it "fixed" his problem but that was really a bug. The
utf8::decode() after uri_unescape() is the right answer. And he will
still need that pre and post patch.

> * And your PL/Perl functions expect arguments that are byte soup
> * Once you upgrade to 9.1 they won't be
> * So you'll need to encode them.

Yeah, I think we all agree it should be mentioned in the incompatible
section of the release notes. :-)

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	Alex Hunsaker <badalex(at)gmail(dot)com>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Careful PL/Perl Release Not Required
Date:	2011-02-12 02:02:06
Message-ID:	E06422AC-497D-4A3A-9BBA-66699C38FD22@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Feb 11, 2011, at 3:58 PM, Alex Hunsaker wrote:

> You mean... we have been talking past each other this whole time?

Well, since my second post, I think. I was wrong in the first one.

> Olegs case _was_ a utf8 database.
> From his original bug:
>
>>> Hi there, below is the problem, which I don't have when running in shell. The database is in UTF-8 encoding.

Ah. Stupid of me not to have seen that part.

> Thats why I have been fighting the notion that he can finally get rid
> of the utf8::decode(). The utf8::decode() _before_ uri_unescape() was
> wrong, it "fixed" his problem but that was really a bug. The
> utf8::decode() after uri_unescape() is the right answer. And he will
> still need that pre and post patch.

Right.

>> * And your PL/Perl functions expect arguments that are byte soup
>> * Once you upgrade to 9.1 they won't be
>> * So you'll need to encode them.
>
> Yeah, I think we all agree it should be mentioned in the incompatible
> section of the release notes. :-)

Right, loud and clear. And the same for values returned from PL/Perl functions, right? They will no longer be returned as binary soup if you return a decoded value. I bet that's not at all common, though.

Best,

David