Re: Careful PL/Perl Release Not Required

From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Alex Hunsaker <badalex(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Careful PL/Perl Release Not Required
Date: 2011-02-11 18:04:57
Message-ID: 0DA44369-C0F1-4C9D-A158-48688D37A6CC@kineticode.com
Lists: pgsql-hackers

On Feb 11, 2011, at 9:44 AM, Alex Hunsaker wrote:

> It is decoded... the input string "%C3%A9" is actually the _same_
> string in utf-8, latin1 and SQL_ASCII, decoded or not. Those are all ASCII
> characters. Calling utf8::decode("%C3%A9") is essentially a no-op.

No, it's not decoded. It doesn't matter here because they're all ASCII bytes, but if the utf8 flag isn't set, the string isn't decoded: it's just byte soup as far as Perl is concerned. Unless I grossly misunderstand something, which is entirely possible.
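
To make the flag distinction concrete, here is a minimal sketch using only core Perl. It is an illustration rather than anything from the PL/Perl sources, and the variable names are made up for the example. It takes the two octets that encode U+00E9 in UTF-8 and inspects them before and after an in-place utf8::decode():

```perl
use strict;
use warnings;

# Two raw octets: the UTF-8 encoding of U+00E9 (e with acute accent).
# Written with \x escapes, so Perl sees a plain byte string.
my $bytes = "\xC3\xA9";

my $flag_before = utf8::is_utf8($bytes) ? 1 : 0;   # utf8 flag is off
my $len_before  = length $bytes;                   # counted as two octets

# Decode a copy in place: the octets become one character
# and the utf8 flag is turned on.
my $chars = $bytes;
utf8::decode($chars);

my $flag_after = utf8::is_utf8($chars) ? 1 : 0;
my $len_after  = length $chars;

printf "before: flag=%d length=%d\n", $flag_before, $len_before;
printf "after:  flag=%d length=%d\n", $flag_after,  $len_after;
```

The sketch deliberately uses non-ASCII octets: for an all-ASCII string such as "%C3%A9" the bytes and the characters are identical, which is why the difference is invisible in that case.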

> Ok, I think I figured out why we seem to be talking past each other. We have:
> CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar AS $$
> use strict;
> use URI::Escape;
> utf8::decode($_[0]);
> return uri_unescape($_[0]);
> $$ LANGUAGE plperlu;
>
> That *looks* like it is decoding the input string, which it is, but
> actually that will double utf8 encode your string. It does not seem to
> in this case because we are dealing with all-ASCII input. The trick
> here is that it's also telling perl to decode/treat the *output* string as
> utf8.
>
> uri_unescape() returns the same string you passed in, which thanks to
> the utf8::decode() above has the utf8 flag set. Meaning we end up
> treating it as 1 character instead of two. Or basically that it has
> the same effect as calling utf8::decode() on the return value.
>
> The correct way to write that function pre 9.1 and post 9.1 would be
> (in a utf8 database):
> CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar AS $$
> use strict;
> use URI::Escape;
> my $str = uri_unescape($_[0]);
> utf8::decode($str);
> return $str;
> $$ LANGUAGE plperlu;
>
> The last utf8::decode is optional (as we said, the result might not be
> utf8), but it gives the behavior the OP was after.
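
For readers following along outside the database, the unescape-then-decode order in the quoted function can be sketched in plain Perl. This is only an approximation under stated assumptions: unescape_sketch() is a hypothetical stand-in for URI::Escape's uri_unescape() (so the example needs no CPAN modules), and it handles only simple %XX escapes on a byte-string input.

```perl
use strict;
use warnings;

# Hypothetical stand-in for URI::Escape::uri_unescape: replaces each
# %XX escape with the single octet it names. Not the real module.
sub unescape_sketch {
    my ($s) = @_;
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;
    return $s;
}

# Step 1: unescape. The result is two octets, 0xC3 and 0xA9.
my $unescaped     = unescape_sketch('%C3%A9');
my $len_unescaped = length $unescaped;

# Step 2: decode those octets as UTF-8, giving a single character.
my $decoded = $unescaped;
utf8::decode($decoded);
my $len_decoded = length $decoded;
my $codepoint   = sprintf 'U+%04X', ord $decoded;

print "unescaped length: $len_unescaped\n";
print "decoded length:   $len_decoded ($codepoint)\n";
```

In this byte-string case the decode step is what turns the two unescaped octets into one character; whether that step is needed inside PL/Perl is exactly the point under discussion.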

No. If the argument to PL/Perl has the utf8 flag set, then that's what you always get. The utf8::decode() isn't necessary because it's already decoded:

> perl -MURI::Escape -MEncode -E 'say utf8::is_utf8(uri_unescape(Encode::decode_utf8("“hi”")))'
1

Best,

David
