From: | "David E(dot) Wheeler" <david(at)kineticode(dot)com> |
---|---|
To: | Alex Hunsaker <badalex(at)gmail(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Careful PL/Perl Release Not Required |
Date: | 2011-02-11 18:04:57 |
Message-ID: | 0DA44369-C0F1-4C9D-A158-48688D37A6CC@kineticode.com |
Lists: | pgsql-hackers |
On Feb 11, 2011, at 9:44 AM, Alex Hunsaker wrote:
> It is decoded... the input string "%C3%A9" actually is the _same_
> string in utf-8, latin1 and SQL_ASCII, decoded or not. Those are all ascii
> characters. Calling utf8::decode("%C3%A9") is essentially a no-op.
No, it's not decoded. It doesn't matter because they're ASCII bytes. But if the utf8 flag isn't set, it's not decoded. It's just byte soup as far as Perl is concerned. Unless I grossly misunderstand something, which is entirely possible.
> OK, I think I figured out why we seem to be talking past each other. We have:
> CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar AS $$
> use strict;
> use URI::Escape;
> utf8::decode($_[0]);
> return uri_unescape($_[0]);
> $$ LANGUAGE plperlu;
>
> That *looks* like it is decoding the input string, which it is, but
> actually that will double utf8 encode your string. It does not seem to
> in this case because we are dealing with all ascii input. The trick
> here is it's also telling perl to decode/treat the *output* string as
> utf8.
>
> uri_unescape() returns the same string you passed in, which thanks to
> the utf8::decode() above has the utf8 flag set. Meaning we end up
> treating it as 1 character instead of two. Or basically that it has
> the same effect as calling utf8::decode() on the return value.
>
> The correct way to write that function pre 9.1 and post 9.1 would be
> (in a utf8 database):
> CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar AS $$
> use strict;
> use URI::Escape;
> my $str = uri_unescape($_[0]);
> utf8::decode($str);
> return $str;
> $$ LANGUAGE plperlu;
>
> The last utf8::decode is optional (as we said, it might not be
> utf8), but it gives the behavior the OP was after.
No. If the argument to PL/Perl has the utf8 flag set, then that's what you always get. The utf8::decode() isn't necessary because it's already decoded:
> perl -MURI::Escape -MEncode -E 'say utf8::is_utf8(uri_unescape(Encode::decode_utf8("“hi”")))'
1
Best,
David