Re: Careful PL/Perl Release Not Required

From: Alex Hunsaker <badalex(at)gmail(dot)com>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Careful PL/Perl Release Not Required
Date: 2011-02-11 17:44:56
Message-ID: AANLkTimkORLgN6ib63rkZ9OjZv5jDpBpe8E+OEk-oXL-@mail.gmail.com
Lists: pgsql-hackers

On Fri, Feb 11, 2011 at 10:16, David E. Wheeler <david(at)kineticode(dot)com> wrote:
> On Feb 10, 2011, at 11:43 PM, Alex Hunsaker wrote:

> Like I said, the terminology is awful.

Yeah, I frequently use encode and decode to mean the same thing :-(.

>> In the cited case he was passing "%C3%A9" to uri_unescape() and
>> expecting it to return 1 character. The additional utf8::decode() will
>> tell perl the string is in utf8 so it will then return 1 char. The
>> point being, decode is needed and with it, the function will work pre
>> and post 9.1.
>
> Why wouldn't the string be decoded already when it's passed to the function, as it would be in 9.0 if the database was utf-8, and should be in 9.1 if the database isn't sql_ascii?

It is decoded... the input string "%C3%A9" is actually the _same_
string in utf-8, latin1 and SQL_ASCII, decoded or not. Those are all ASCII
characters. Calling utf8::decode("%C3%A9") is essentially a no-op.
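
A minimal sketch of that point in plain Perl, run outside PL/Perl (that
setting and the variable names are assumptions for illustration only):

```perl
use strict;
use warnings;

my $input = '%C3%A9';   # six ASCII characters, not the two bytes they name
my $copy  = $input;

utf8::decode($copy);    # in-place decode; nothing changes for pure ASCII

my $same   = ($copy eq $input) ? 1 : 0;
my $length = length($copy);

print "unchanged: $same, length: $length\n";   # unchanged: 1, length: 6
```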

>> In fact, on a latin-1 database it sure as heck better return two
>> characters, it would be a bug if it only returned 1 as that would mean
>> it would be treating a series of latin1 bytes as a series of utf8
>> bytes!
>
> If it's a latin-1 database, in 9.1, the argument should be passed decoded. That's not a utf-8 string or bytes. It's Perl's internal representation.

> If I understand the patch correctly, the decode() will no longer be needed. The string will *already* be decoded.

Ok, I think I figured out why we seem to be talking past each other. We have:
CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar AS $$
use strict;
use URI::Escape;
utf8::decode($_[0]);
return uri_unescape($_[0]);
$$ LANGUAGE plperlu;

That *looks* like it is decoding the input string, which it is, but
actually that will double utf8 encode your string. It does not seem to
in this case because we are dealing with all-ASCII input. The trick
here is that it's also telling perl to decode/treat the *output* string as
utf8.

uri_unescape() returns the same string you passed in, which thanks to
the utf8::decode() above has the utf8 flag set, meaning we end up
treating it as 1 character instead of two. In other words, it has
the same effect as calling utf8::decode() on the return value.

The correct way to write that function pre 9.1 and post 9.1 would be
(in a utf8 database):
CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar AS $$
use strict;
use URI::Escape;
my $str = uri_unescape($_[0]);
utf8::decode($str);
return $str;
$$ LANGUAGE plperlu;

The final utf8::decode() is optional (as we said, the result might not
be utf8), but it gives the behavior the OP was after.
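
As a rough standalone illustration of why the decode belongs on the
unescaped result, here is a sketch in plain Perl. percent_decode() is a
hypothetical stand-in for uri_unescape() so the snippet needs no extra
modules; it is not URI::Escape's actual implementation.

```perl
use strict;
use warnings;

# Hypothetical stand-in for uri_unescape(): turn each %XX into the byte it names.
sub percent_decode {
    my ($s) = @_;
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $s;
}

my $str = percent_decode('%C3%A9');   # two bytes: 0xC3 0xA9
my $len_before = length($str);        # perl counts two characters here

utf8::decode($str);                   # interpret those bytes as UTF-8, in place
my $len_after = length($str);         # one character, U+00E9

print "before decode: $len_before\n"; # before decode: 2
print "after decode: $len_after\n";   # after decode: 1
```

Without the utf8::decode() call the result stays at two characters, which
is what you would want if the bytes are not meant to be utf8.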
