Re: plperlu problem with utf8

From: David Christensen <david(at)endpoint(dot)com>
To: Alex Hunsaker <badalex(at)gmail(dot)com>
Cc: "David E(dot) Wheeler" <david(at)kineticode(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plperlu problem with utf8
Date: 2010-12-20 04:00:02
Message-ID: A52064D4-3467-49B3-AE9E-14FF22B7A57D@endpoint.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


On Dec 19, 2010, at 2:20 AM, Alex Hunsaker wrote:

> On Sat, Dec 18, 2010 at 20:29, David E. Wheeler <david(at)kineticode(dot)com> wrote:
>> ...
>> I would argue that it should output the same as the first example. That is, PL/Perl should have decoded the latin-1 before passing the text to the Perl function.
>
> Yeah, I don't think you will find anyone who disagrees :) PL/TCL and
> PL/Python get this right FWIW. Anyway find attached a patch that does
> just this.

Cool, thanks for taking this on.

> With the attached we:
> - for function arguments, convert (using pg_do_encoding_conversion) to
> utf8 from the current database encoding. We also turn on the utf8
> flag so perl knows it was given utf8. Pre patch things only really
> worked for SQL_ASCII or PG_UTF8 databases. In practice everything
> worked fine for single byte charsets. However things like uc() or
> lc() on bytes with high bits set were probably broken.

How does this deal with input records of composite type?

> - for output from perl convert from perls internal format to utf8
> (using SvPVutf8()), and then convert that to the current database
> encoding. This sounds unoptimized, but in the common case SvPVutf8()
> should be a noop. Pre patch this was "random" (dependent on how perl
> decided to represent the string internally) but it worked 99% of the
> time (at least in the single byte charset or UTF8 cases).
>
> - fix errors so they properly encode their message to the current
> database encoding (pre patch we were doing no conversion at all,
> similar to the output case were it worked most of the time)

This sounds good; I imagine in practice most errors contain just 7-bit ascii which should be acceptable in any server_encoding, but in the case where something is returned that is unable to be represented in the server_encoding (thinking values defined/used in the function itself), does it degrade to the current behavior, as opposed to fail or eat the error message without outputting?

> - always do the utf8 hack so utf8.pm is loaded (fixes utf8 regexs in
> plperl). Pre patch this only happened on a UTF8 database. That meant
> multi-byte character regexs were broken on non utf8 databases.

This sounds good in general, but I think should be skipped if GetDatabaseEncoding() == SQL_ASCII.

> -remove some old perl version checks for 5.6 and 5.8. We require
> 5.8.1 so these were nothing but clutter.

+1. Can't complain about removing clutter :-).

> Something interesting to note is when we are SQL_ASCII,
> pg_do_encoding_conversion() does nothing, yet we turn on the utf8
> flag. This means if you pass in valid utf8 perl will treat it as
> such. It also means on output it will hand utf8 back. Both PL/Tcl
> and PL/Python do the same thing so I suppose its sensible to match
> their behavior (and it was the lazy thing to do). The difference
> being with PL/Python if you return said string you get "ERROR:
> PL/Python: could not convert Python Unicode object to PostgreSQL
> server encoding". While PL/Tcl and now Pl/perl give you back a utf8
> version. For example:
>
> (SQL_ASCII database)
> =# select length('☺');
> length
> --------
> 3
>
> =# CREATE FUNCTION tcl_len(text) returns text as $$ return [string
> length $1] $$ language pltcl;
> CREATE FUNCTION
> postgres=# SELECT tcl_len('☺');
> tcl_len
> ------------
> 1
> (1 row)
>
> =# CREATE FUNCTION py_len(str text) returns text as $$ return
> len(str) $$ language plpython3;
> =# SELECT py_len('☺');
> py_len
> --------
> 1
> (1 row)
>
> I wouldn't really say thats right, but its at least consistent...

I think this could/should be adequately handled by not calling the function when the DatabaseEncoding is SQL_ASCII. Since SQL_ASCII basically makes no assumptions about any representation other than arbitrary 8-bit encoding, this demonstrated behavior is more-or-less attaching incorrect semantics to values that are being returned, and is completely bunko IMHO. (Not that many people are hopefully running SQL_ASCII at this point, but you never know...) Also, I'd argue that pltcl and plpython should be fixed as well for the same reasons.

> This does *not* address the bytea issue where currently if you have
> bytea input or output we try to encode that the same as any string. I
> think thats going to be a bit more invasive and this patch should
> stands on its own.
> <plperl_fix_enc.patch.gz>

Yeah, I'm not sure how invasive that will end up being, or if there are other datatypes which should skip the text processing.

I noticed you moved the declaration of perl_sys_init_done; was that an independent bug fix, or did something in the patch require that?

Cheers,

David
--
David Christensen
End Point Corporation
david(at)endpoint(dot)com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2010-12-20 04:42:24 Re: Extensions and custom_variable_classes
Previous Message Bruce Momjian 2010-12-20 02:47:12 Re: pg_ctl and port number detection