Re: plperlu problem with utf8

From: Alex Hunsaker <badalex(at)gmail(dot)com>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: David Christensen <david(at)endpoint(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plperlu problem with utf8
Date: 2010-12-19 08:20:27
Message-ID: AANLkTi=ODW=o2R7i2NOfj0tv6dqGCAsBcKesAiWYWMw4@mail.gmail.com
Lists: pgsql-hackers

On Sat, Dec 18, 2010 at 20:29, David E. Wheeler <david(at)kineticode(dot)com> wrote:
> ...
> I would argue that it should output the same as the first example. That is, PL/Perl should have decoded the latin-1 before passing the text to the Perl function.

Yeah, I don't think you will find anyone who disagrees :) PL/Tcl and
PL/Python get this right, FWIW. Anyway, find attached a patch that does
just this.

With the attached patch we:
- for function arguments, convert (using pg_do_encoding_conversion) from
the current database encoding to utf8. We also turn on the utf8
flag so Perl knows it was given utf8. Pre-patch, things only really
worked for SQL_ASCII or PG_UTF8 databases. In practice everything
worked fine for single-byte charsets, although things like uc() or
lc() on bytes with the high bit set were probably broken.

- for output from Perl, convert from Perl's internal format to utf8
(using SvPVutf8()), and then convert that to the current database
encoding. This sounds unoptimized, but in the common case SvPVutf8()
should be a no-op. Pre-patch the result was "random" (dependent on how
Perl decided to represent the string internally), but it worked 99% of
the time (at least in the single-byte charset or UTF8 cases).

- fix errors so they properly encode their message to the current
database encoding (pre-patch we were doing no conversion at all;
similar to the output case, where it worked most of the time).

- always do the utf8 hack so utf8.pm is loaded (fixes utf8 regexes in
plperl). Pre-patch this only happened on a UTF8 database. That meant
multi-byte character regexes were broken on non-utf8 databases.

- remove some old Perl version checks for 5.6 and 5.8. We require
5.8.1, so these were nothing but clutter.
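
To make the argument/result handling above concrete, here is a rough model in Python rather than the actual C code in the patch. The function names arg_to_perl and result_to_db are made up for illustration; they only mimic "decode from the database encoding on the way in, encode back on the way out". Python's bytes.upper() only touches ASCII letters, which is a fair analogy for uc() on an undecoded byte with the high bit set.

```python
# Hypothetical model of the conversions described above; these helpers are
# illustrative and are not functions from the patch or from PL/Perl.

def arg_to_perl(raw: bytes, db_encoding: str) -> str:
    """Decode an argument from the database encoding into a character string."""
    return raw.decode(db_encoding)


def result_to_db(value: str, db_encoding: str) -> bytes:
    """Encode a result string back into the database encoding."""
    return value.encode(db_encoding)


# A LATIN1 database hands over 'e with acute' as the single byte 0xE9.
latin1_arg = "\u00e9".encode("latin-1")

# Without decoding, case mapping sees an opaque high byte and leaves it alone.
untouched = latin1_arg.upper()

# With decoding, the same operation works on characters.
decoded = arg_to_perl(latin1_arg, "latin-1")
upper = decoded.upper()

# The result goes back out in the database encoding, not as utf8.
round_trip = result_to_db(upper, "latin-1")
```

The point of the sketch is only the ordering: decode first, operate on characters, then encode to whatever the database uses.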

Something interesting to note is that when the database is SQL_ASCII,
pg_do_encoding_conversion() does nothing, yet we still turn on the utf8
flag. This means if you pass in valid utf8, Perl will treat it as
such. It also means on output it will hand utf8 back. Both PL/Tcl
and PL/Python do the same thing, so I suppose it's sensible to match
their behavior (and it was the lazy thing to do). The difference is
that with PL/Python, if you return said string you get "ERROR:
PL/Python: could not convert Python Unicode object to PostgreSQL
server encoding", while PL/Tcl and now PL/Perl give you back a utf8
version. For example:

(SQL_ASCII database)
=# select length('☺');
length
--------
3

=# CREATE FUNCTION tcl_len(text) returns text as $$ return [string length $1] $$ language pltcl;
CREATE FUNCTION
=# SELECT tcl_len('☺');
tcl_len
------------
1
(1 row)

=# CREATE FUNCTION py_len(str text) returns text as $$ return len(str) $$ language plpython3u;
=# SELECT py_len('☺');
py_len
--------
1
(1 row)

I wouldn't really say that's right, but it's at least consistent...
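
The 3 versus 1 difference above is just bytes versus characters, which can be reproduced outside the database. The sketch below is plain Python, not PL/Python. The last step, encoding to "ascii", is only an assumed analogy for the PL/Python error quoted earlier; I have not traced it through the PL/Python source here.

```python
# U+263A encoded as utf8 is three bytes.
smiley_bytes = "\u263a".encode("utf-8")

# What length() reports in a SQL_ASCII database: a count of bytes.
byte_length = len(smiley_bytes)

# What a PL sees once the bytes are flagged or decoded as utf8: one character.
char_length = len(smiley_bytes.decode("utf-8"))

# Assumed analogy for the return path: the character has no ASCII encoding.
try:
    "\u263a".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```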

This does *not* address the bytea issue, where currently if you have
bytea input or output we try to encode that the same as any string. I
think that's going to be a bit more invasive, and this patch should
stand on its own.
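
As a rough illustration of why treating bytea like text is a problem, here is a Python sketch (not PL/Perl code; treat_as_text is a made-up helper, and the payload is arbitrary). Binary data is generally not valid in the text encoding, and even when a single-byte encoding accepts every byte, converting onward to utf8 changes the bytes.

```python
# Arbitrary binary payload; 0x89 and 0xFF are not valid utf8 sequences here.
payload = bytes([0x89, 0x50, 0x4E, 0x47, 0xFF])


def treat_as_text(raw: bytes, db_encoding: str) -> str:
    """Hypothetical helper: apply the text conversion to binary data."""
    return raw.decode(db_encoding)


# In a utf8 database the conversion simply fails.
try:
    treat_as_text(payload, "utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# In a latin-1 database every byte decodes, but the utf8 form seen by the
# function is no longer the original byte sequence.
mangled = treat_as_text(payload, "latin-1").encode("utf-8")
```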

Attachment Content-Type Size
plperl_fix_enc.patch.gz application/x-gzip 3.7 KB
