Re: plperlu problem with utf8

From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Alex Hunsaker <badalex(at)gmail(dot)com>
Cc: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plperlu problem with utf8
Date: 2010-12-18 01:04:47
Message-ID: ACAB08D1-A956-44F5-8200-36EB0404DD6A@kineticode.com
Lists: pgsql-hackers

On Dec 16, 2010, at 8:39 PM, Alex Hunsaker wrote:

>> No, URI::Escape is fine. The issue is that if you don't decode text to Perl's internal form, it assumes that it's Latin-1.
>
> So... you are saying "\xc3\xa9" eq "\xe9" or chr(233) ?

Not knowing what those mean, I'm not saying either one, to my knowledge. What I understand, however, is that Perl, given a scalar with bytes in it, will treat it as latin-1 unless the utf8 flag is turned on.
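
A minimal sketch of that behavior in plain Perl, assuming only the core Encode module (the variable names are illustrative). As far as I can tell, "\xc3\xa9" is the two-octet UTF-8 encoding of é, and it only compares equal to chr(233) once it has been decoded:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Two raw octets: the UTF-8 encoding of U+00E9 (é).
my $bytes = "\xc3\xa9";

# Undecoded, Perl sees two characters, read as Latin-1 ("Ã©").
my $raw_length = length $bytes;
my $raw_is_e9  = $bytes eq chr(233) ? 1 : 0;

# Decoded, it is a single character, U+00E9.
my $chars          = decode('UTF-8', $bytes);
my $decoded_length = length $chars;
my $decoded_is_e9  = $chars eq chr(233) ? 1 : 0;

printf "raw:     length=%d eq_chr233=%d\n", $raw_length,     $raw_is_e9;
printf "decoded: length=%d eq_chr233=%d\n", $decoded_length, $decoded_is_e9;
```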

> Im saying they are not, and if you want \xc3\xa9 to be treated as
> chr(233) you need to tell perl what encoding the string is in (err
> well actually decode it so its in "perl space" as unicode characters
> correctly).

PostgreSQL should do everything it can to decode to Perl's internal format before passing arguments, and to encode from Perl's internal format to the server encoding on output.

>> Maybe I'm misunderstanding, but it seems to me that:
>>
>> * String arguments passed to PL/Perl functions should be decoded from the server encoding to Perl's internal representation before the function actually gets them.
>
> Currently postgres has 2 behaviors:
> 1) If the database is utf8, turn on the utf8 flag. According to the
> perldoc snippet I quoted this should mean its a sequence of utf8 bytes
> and should interpret it as such.

Well that works for me. I always use UTF8. Oleg, what was the encoding of your database where you saw the issue?

> 2) its not utf8, so we just leave it as octets.

Which means Perl will assume that it's Latin-1, IIUC.

> So in "perl space" length($_[0]) returns the number of characters when
> you pass in a multibyte char *not* the number of bytes. Which is
> correct, so um check we do that. Right?

Yeah. So I just wrote and tested this function on 9.0 with Perl 5.12.2:

CREATE OR REPLACE FUNCTION perlgets(
    TEXT
) RETURNS TABLE(length INT, is_utf8 BOOL) LANGUAGE plperl AS $$
    my $text = shift;
    return_next {
        length  => length $text,
        is_utf8 => utf8::is_utf8($text) ? 1 : 0
    };
$$;

In a utf-8 database:

utf8=# select * from perlgets('foo');
 length │ is_utf8
────────┼─────────
      8 │ t
(1 row)

In a latin-1 database:

latin=# select * from perlgets('foo');
 length │ is_utf8
────────┼─────────
      8 │ f
(1 row)

I would argue that in the latter case, is_utf8 should be true, too. That is, PL/Perl should decode from Latin-1 to Perl's internal form.
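
Roughly what I have in mind, as a plain-Perl sketch rather than anything PL/Perl actually does today. It assumes the core Encode module; $server_encoding and the sample string are made up for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical: the database's server encoding, as Encode names it.
my $server_encoding = 'ISO-8859-1';

# What a Latin-1 database would hand over: raw octets, utf8 flag off.
my $octets = "r\xe9veillon";

# On the way in: decode from the server encoding to Perl's internal form.
my $text = decode($server_encoding, $octets);

# On the way out: encode from the internal form back to the server encoding.
my $out = encode($server_encoding, $text);

printf "in:  length=%d is_utf8=%d\n", length $octets, utf8::is_utf8($octets) ? 1 : 0;
printf "perl: length=%d is_utf8=%d\n", length $text,   utf8::is_utf8($text)   ? 1 : 0;
printf "out: same octets as in? %s\n", $out eq $octets ? 'yes' : 'no';
```

The character count is the same either way for a single-byte encoding; the difference is that the decoded string is unambiguously characters, so modules that care about the flag have something to go on.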

Interestingly, when I created a function that takes a bytea argument, utf8 was *still* enabled in the utf-8 database. That doesn't seem right to me.

> In the URI::Escape example we have:
>
> # CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar AS $$
> use URI::Escape;
> warn(length($_[0]));
> return uri_unescape($_[0]); $$ LANGUAGE plperlu;
>
> # select url_decode('comment%20passer%20le%20r%C3%A9veillon');
> WARNING: 38 at line 2

What's the output? And what's the encoding of the database?

> Ok that length looks right, just for grins lets try add one multibyte char:
>
> # SELECT url_decode('comment%20passer%20le%20r%C3%A9veillon☺');
> WARNING: 39 CONTEXT: PL/Perl function "url_decode" at line 2.
> url_decode
> -------------------------------
> comment passer le réveillon☺
> (1 row)
>
> Still right,

The length is right, but the é is wrong. It looks like Perl thinks it's latin-1. Or, rather, uri_unescape() doesn't know that it should be returning utf-8 characters. That *might* be a bug in URI::Escape.

> now lets try the utf8::decode version that "works". Only
> lets look at the length of the string we are returning instead of the
> one we are passing in:
>
> # CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar AS $$
> use URI::Escape;
> utf8::decode($_[0]);
> my $str = uri_unescape($_[0]);
> warn(length($str));
> return $str;
> $$ LANGUAGE plperlu;
>
> # SELECT url_decode('comment%20passer%20le%20r%C3%A9veillon');
> WARNING: 28 at line 5.
> CONTEXT: PL/Perl function "url_decode"
> url_decode
> -----------------------------
> comment passer le réveillon
> (1 row)
>
> Looks harmless enough...

Looks far better, in fact. Interesting that URI::Escape does the right thing only if the utf8 flag has been turned on in the string passed to it. But in Perl it usually won't be, because the encoded string should generally have only ASCII characters.

> # SELECT length(url_decode('comment%20passer%20le%20r%C3%A9veillon'));
> WARNING: 28 at line 5.
> CONTEXT: PL/Perl function "url_decode"
> length
> --------
> 27
> (1 row)
>
> Wait a minute... those lengths should match.
>
> Post patch they do:
> # SELECT length(url_decode('comment%20passer%20le%20r%C3%A9veillon'));
> WARNING: 28 at line 5.
> CONTEXT: PL/Perl function "url_decode"
> length
> --------
> 28
> (1 row)
>
> Still confused? Yeah me too.

Yeah…

> Maybe this will help:
>
> #!/usr/bin/perl
> use URI::Escape;
> my $str = uri_unescape("%c3%a9");
> die "first match" if($str =~ m/\xe9/);
> utf8::decode($str);
> die "2nd match" if($str =~ m/\xe9/);
>
> gives:
> $ perl t.pl
> 2nd match at t.pl line 6.
>
> see? Either uri_unescape() should be decoding that utf8() or you need
> to do it *after* you call uri_unescape(). Hence the maybe it could be
> considered a bug in uri_unescape().

Agreed.
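
For the record, a sketch of the "decode *after* unescaping" order. To keep it self-contained it uses a hypothetical stand-in, unescape_octets(), in place of URI::Escape's uri_unescape(); I'm assuming the real function likewise yields one octet per %XX escape:

```perl
use strict;
use warnings;

# Hypothetical stand-in for uri_unescape(): each %XX becomes one octet.
sub unescape_octets {
    my ($escaped) = @_;
    (my $octets = $escaped) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $octets;
}

# Unescape first, then decode the resulting octets as UTF-8.
sub url_decode_utf8 {
    my ($escaped) = @_;
    my $str = unescape_octets($escaped);
    utf8::decode($str);    # in place; returns false if the octets are not valid UTF-8
    return $str;
}

my $octets = unescape_octets('r%C3%A9veillon');
my $text   = url_decode_utf8('r%C3%A9veillon');

printf "octets: %d, characters: %d\n", length $octets, length $text;
print "decoded text matches \\xe9\n" if $text =~ m/\xe9/;
```

Before decoding, the é is still two octets (\xc3\xa9), so m/\xe9/ cannot match; after decoding it is the single character U+00E9 and the match succeeds.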

>> * Values returned from PL/Perl functions that are in Perl's internal representation should be encoded into the server encoding before they're returned.
>> I didn't really follow all of the above; are you aiming for the same thing?
>
> Yeah, the patch address this part. Right now we just spit out
> whatever the internal format happens to be.

Ah, excellent.

> Anyway its all probably clear as mud, this part of perl is one of the
> hardest IMO.

No question.

Best,

David
