Re: plperlu problem with utf8

From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: David Christensen <david(at)endpoint(dot)com>
Cc: Alex Hunsaker <badalex(at)gmail(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plperlu problem with utf8
Date: 2010-12-19 03:29:34
Message-ID: BBD0B93C-7B24-4C36-A44E-4863EC98CE6A@kineticode.com
Lists: pgsql-hackers

On Dec 17, 2010, at 9:32 PM, David Christensen wrote:

> +1 on the original sentiment, but only for the case that we're dealing with data that is passed in/out as arguments. In the case that the server_encoding is UTF-8, this is as trivial as a few macros on the underlying SVs for text-like types. If the server_encoding is SQL_ASCII (= byte soup), this is a trivial case of doing nothing with the conversion regardless of data type. For any other server_encoding, the data would need to be converted from the server_encoding to UTF-8, presumably using the built-in conversions before passing it off to the first code path. A similar handling would need to be done for the return values, again datatype-dependent.

+1
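The three cases above can be sketched in plain Perl. This is only an illustration under stated assumptions, not what PL/Perl does internally: `to_perl` and `from_perl` are hypothetical helper names, the `$is_bytea` argument stands in for real type introspection, and the encoding name is given in the form Perl's Encode module expects rather than PostgreSQL's spelling.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: turn octets received from the server into a Perl
# character string, following the cases described in the quoted text.
sub to_perl {
    my ($server_encoding, $octets, $is_bytea) = @_;
    return $octets if $is_bytea;                        # binary data: never decode
    return $octets if $server_encoding eq 'SQL_ASCII';  # byte soup: leave alone
    return decode($server_encoding, $octets);           # otherwise: decode to characters
}

# Hypothetical inverse for return values.
sub from_perl {
    my ($server_encoding, $string, $is_bytea) = @_;
    return $string if $is_bytea || $server_encoding eq 'SQL_ASCII';
    return encode($server_encoding, $string);
}

my $latin1 = "caf\xE9";                          # "café" as ISO-8859-1 octets
my $chars  = to_perl('iso-8859-1', $latin1);     # 4 characters
my $soup   = to_perl('SQL_ASCII',  $latin1);     # passed through untouched
my $back   = from_perl('iso-8859-1', $chars);    # octets again

printf "characters: %d\n", length $chars;
print "round trip ok\n" if $back eq $latin1;
```

For a UTF-8 server the decode step could be replaced by simply flagging the existing bytes as UTF-8, which is the "few macros on the underlying SVs" shortcut mentioned in the quote.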

> Recent upgrades of the Encode module included with perl 5.10+ have caused issues wherein circular dependencies between Encode and Encode::Alias have made it impossible to load in a Safe container without major pain. (There may be some better options than I'd had on a previous project, given that we're embedding our own interpreters and accessing more through the XS guts, so I'm not ruling out this possibility completely).

Fortunately, thanks to Tim Bunce, PL/Perl no longer relies on Safe.pm.

>> Well that works for me. I always use UTF8. Oleg, what was the encoding of your database where you saw the issue?
>
> I'm not sure what the current plperl runtime does as far as marshaling this, but it would be fairly easy to ensure the parameters came in in perl's internal format given a server_encoding of UTF8 and some type introspection to identify the string-like types/text data. (Perhaps any type which had a binary cast to text would be a sufficient definition here. Do domains automatically inherit binary casts from their originating types?)

Their labels are TEXT. I believe that the only type that should not be treated as text is bytea.

>>> 2) it's not utf8, so we just leave it as octets.
>>
>> Which means Perl will assume that it's Latin-1, IIUC.
>
> This is sub-optimal for non-UTF-8-encoded databases, for reasons I pointed out earlier. This would produce bogus results for any non-UTF-8, non-ASCII, non-Latin-1 encoding, even if it did not bite most people in general usage.

Agreed.
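A small standalone sketch of why the Latin-1 assumption goes wrong. KOI8-R is chosen here purely as an example of a non-UTF-8, non-Latin-1 encoding (an assumption for illustration, not something from the thread):

```perl
use strict;
use warnings;
use Encode qw(decode encode encode_utf8);

# "привет" as KOI8-R octets, standing in for text arriving from a
# KOI8-R database without any decoding.
my $source = "\x{43f}\x{440}\x{438}\x{432}\x{435}\x{442}";
my $octets = encode('koi8-r', $source);

# Left undecoded, Perl treats each octet as a Latin-1 code point, so
# re-encoding to UTF-8 yields the wrong characters.
my $wrong = encode_utf8($octets);

# Decoded from the real encoding first, the same octets become the
# intended Cyrillic characters.
my $right = encode_utf8(decode('koi8-r', $octets));

print $wrong eq $right ? "same\n" : "different\n";   # prints "different"
```

The octet count happens to match in both cases, which is part of why this kind of bug can go unnoticed until the data is displayed or compared.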

> This example seems bogus; wouldn't length be 3 if this is the example text this was run with? Additionally, since all ASCII is trivially UTF-8, I think a better example would be using a string with hi-bit characters so if this was improperly handled the lengths wouldn't match; length($all_ascii) == length(encode_utf8($all_ascii)) vs length($hi_bit) < length(encode_utf8($hi_bit)). I don't see that this test shows us much with the test case as given. The is_utf8() function merely returns the state of the SV_utf8 flag, which doesn't speak to UTF-8 validity (i.e., this need not be set on ascii-only strings, which are still valid in the UTF-8 encoding), nor does it indicate that there are no hi-bit characters in the string (i.e., with encode_utf8($hi_bit_string), the source string $hi_bit_string (in perl's internal format) with hi-bit characters will have the utf8 flag set, but the return value of encode_utf8 will not, even though the underlying data, as represented in perl, will be identical).

Sorry, I probably had a pasto there. How about this?

CREATE OR REPLACE FUNCTION perlgets(
    TEXT
) RETURNS TABLE(length INT, is_utf8 BOOL) LANGUAGE plperl AS $$
    my $text = shift;
    return_next {
        length  => length $text,
        is_utf8 => utf8::is_utf8($text) ? 1 : 0
    };
$$;

utf8=# SELECT * FROM perlgets('“hello”');
 length │ is_utf8
────────┼─────────
      7 │ t

latin=# SELECT * FROM perlgets('“hello”');
 length │ is_utf8
────────┼─────────
     11 │ f

(Yes, I used Latin-1 curly quotes in that last example.) I would argue that it should output the same as the first example. That is, PL/Perl should have decoded the Latin-1 before passing the text to the Perl function.
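The 7-versus-11 arithmetic can be reproduced outside the database. The sketch below assumes the function received each curly quote as its three-byte UTF-8 sequence, left as undecoded octets; the thread does not show the actual bytes, so treat that as an assumption.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 octets for U+201C, "hello", U+201D, as an undecoded byte string.
my $octets = "\xE2\x80\x9C" . "hello" . "\xE2\x80\x9D";

# Decoding turns the octets into a character string in Perl's internal form.
my $chars = decode('UTF-8', $octets);

printf "octets: length=%d is_utf8=%d\n",
    length $octets, utf8::is_utf8($octets) ? 1 : 0;   # length=11 is_utf8=0
printf "chars:  length=%d is_utf8=%d\n",
    length $chars, utf8::is_utf8($chars) ? 1 : 0;     # length=7 is_utf8=1
```

The first line corresponds to what the latin database reported, the second to what the utf8 database reported.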

>
>> In a latin-1 database:
>>
>> latin=# select * from perlgets('foo');
>>  length │ is_utf8
>> ────────┼─────────
>>       8 │ f
>> (1 row)
>>
>> I would argue that in the latter case, is_utf8 should be true, too. That is, PL/Perl should decode from Latin-1 to Perl's internal form.
>
> See above for discussion of the is_utf8 flag; if we're dealing with latin-1 data or (more precisely in this case) data that has not been decoded from the server_encoding to perl's internal format, this would exactly be the expectation for the state of that flag.

Right. I think that it *should* be decoded.
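For reference, the flag behaviour described in the quoted text can be checked with a few lines of plain Perl (a sketch; the variable names are illustrative):

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

my $ascii  = "foo";                    # ASCII only: flag off, yet valid UTF-8
my $hi_bit = "\x{201C}hi\x{201D}";     # character string with code points > 255
my $bytes  = encode_utf8($hi_bit);     # the same text as UTF-8 octets

# The flag says how Perl stores the string, not whether it is valid UTF-8.
printf "ascii  flag=%d\n", utf8::is_utf8($ascii)  ? 1 : 0;   # 0
printf "hi_bit flag=%d\n", utf8::is_utf8($hi_bit) ? 1 : 0;   # 1
printf "bytes  flag=%d\n", utf8::is_utf8($bytes)  ? 1 : 0;   # 0

# Length comparisons only reveal a difference when hi-bit characters are present.
printf "ascii:  %d vs %d\n", length $ascii,  length encode_utf8($ascii);   # 3 vs 3
printf "hi_bit: %d vs %d\n", length $hi_bit, length $bytes;                # 4 vs 8
```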

>> Interestingly, when I created a function that takes a bytea argument, utf8 was *still* enabled in the utf-8 database. That doesn't seem right to me.
>
> I'm not sure what you mean here, but I do think that if bytea is identifiable as one of the input types, we should do no encoding on the data itself, which would indicate that the utf8 flag for that variable would be unset.

Right.

> If this is not currently handled this way, I'd be a bit surprised, as bytea should just be an array of bytes with no character semantics attached to it.

It looks as though it is not handled that way. The utf8 flag *is* set on a bytea string passed to a PL/Perl function in a UTF-8 database.

> As shown above, the character length for the example should be 27, while the octet length for the UTF-8 encoded version is 28. I've reviewed the source of URI::Escape, and can say definitively that: a) regular uri_escape does not handle > 255 code points in the encoding, but there exists a uri_escape_utf8 which will convert the source string to UTF8 first and then escape the encoded value, and b) uri_unescape has *no* logic in it to automatically decode from UTF8 into perl's internal format (at least as far as the version that I'm looking at, which came with 5.10.1).

Right.

> -1; if you need to decode from an octets-only encoding, it's your responsibility to do so after you've unescaped it. Perhaps later versions of the URI::Escape module contain a uri_unescape_utf8() function, but it's trivially: sub uri_unescape_utf8 { Encode::decode_utf8(uri_unescape(shift))}. This is definitely not a bug in uri_escape, as it is only defined to return octets.

Right, I think we're agreed on that count. I wouldn't mind seeing a uri_unescape_utf8() though, as it might prevent some confusion.
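A sketch of what such a wrapper would do. To keep it runnable with core modules only, a hypothetical `percent_decode` helper stands in for URI::Escape's `uri_unescape`; it is not that module's implementation.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Hypothetical stand-in for uri_unescape: replace each %XX with the
# octet it names. The result is a byte string, not decoded text.
sub percent_decode {
    my ($str) = @_;
    $str =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $str;
}

# The wrapper suggested in the quoted text: unescape to octets first,
# then decode those octets as UTF-8.
sub uri_unescape_utf8 { decode_utf8(percent_decode(shift)) }

my $escaped = "%E2%80%9Chello%E2%80%9D";
my $octets  = percent_decode($escaped);
my $chars   = uri_unescape_utf8($escaped);

printf "%d octets, %d characters\n", length $octets, length $chars;
# prints "11 octets, 7 characters"
```

Keeping the decode step in a separate, explicitly named function leaves the plain unescape free to return octets, as it is defined to do.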

>>> Yeah, the patch addresses this part. Right now we just spit out
>>> whatever the internal format happens to be.
>>
>> Ah, excellent.
>
> I agree with the sentiments that: data (server_encoding) -> function parameters (-> perl internal) -> function return (-> server_encoding). This should be for any character-type data insofar as it is feasible, but ISTR there is already datatype-specific marshaling occurring.

Dunno about that.

> There is definitely a lot of confusion surrounding perl's handling of character data; I hope this was able to clear a few things up.

Yes, it helped, thanks!

David
