Re: invalid UTF-8 via pl/perl

Lists:	pgsql-hackers

From:	Hannu Krosing <hannu(at)2ndquadrant(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	invalid UTF-8 via pl/perl
Date:	2010-01-02 22:21:51
Message-ID:	1262470911.1813.16.camel@huvostro

It is possible to get an invalid byte sequence into a text field via a PL,
in this case PL/Perl:

---8<------8<------8<------8<------8<------8<---
CREATE TABLE utf_test
(
id serial PRIMARY KEY,
data character varying
);

CREATE OR REPLACE FUNCTION invalid_utf_seq()
RETURNS character varying AS
$BODY$
return "\xd0";
$BODY$
LANGUAGE 'plperlu' VOLATILE STRICT;

insert into utf_test(data) values(invalid_utf_seq());
---8<------8<------8<------8<------8<------8<---

This results in a table that contains an invalid UTF-8 sequence and
consequently does not pass dump/load.
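
For readers wondering why this single byte is rejected: in UTF-8, 0xD0 is the lead byte of a two-byte sequence, so on its own it is incomplete. The following is only an illustrative Python sketch, not part of the original report or of PostgreSQL; the helper name is_valid_utf8 is made up for this example.

```python
# 0xD0 has the bit pattern 110xxxxx, which marks the start of a two-byte
# UTF-8 sequence. It must be followed by one continuation byte (10xxxxxx).
lone_lead = b"\xd0"
complete = b"\xd0\x96"  # U+0416, CYRILLIC CAPITAL LETTER ZHE


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8(lone_lead))  # False
print(is_valid_utf8(complete))   # True
```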

What would be the best place to fix this ?

Should there be checks in all text types ?
(probably too expensive)

Or should pl/perl check its return values for compliance with
server_encoding ?

Or should postgresql itself check that pl-s return what they promise to
return ?

--
Hannu Krosing http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
Services, Consulting and Training

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	Hannu Krosing <hannu(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: invalid UTF-8 via pl/perl
Date:	2010-01-02 22:50:40
Message-ID:	4B3FCDC0.1000902@dunslane.net

Hannu Krosing wrote:

[plperl can return data that is not valid in the server encoding and it
is not caught]

> This results in a table that contains an invalid UTF-8 sequence and
> consequently does not pass dump/load.
>
> What would be the best place to fix this ?
>
> Should there be checks in all text types ?
> (probably too expensive)
>

The plperl code has no type-specific checks, and in any case limiting it
to "text" types would defeat third party and contrib types of which it
knows nothing (think citext). We should check all strings returned by
plperl.

> Or should pl/perl check its return values for compliance with
> server_encoding ?
>

I think the plperl glue code should check returned strings using
pg_verifymbstr().

> Or should postgresql itself check that pl-s return what they promise to
> return ?
>
>

There is no central place for it to check. The pl glue code is the right
place, I think.
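
As a rough illustration of what such a glue-level check amounts to, here is a minimal Python sketch. It is not the patch discussed in this thread and not how pg_verifymbstr() is implemented; verify_returned_string and EncodingError are hypothetical names, and Python's strict decoder merely stands in for the backend's verification.

```python
class EncodingError(ValueError):
    """Hypothetical error type standing in for a backend-level error."""


def verify_returned_string(data: bytes, server_encoding: str = "utf-8") -> bytes:
    """Check a value returned by a PL function before it is handed on.

    Returns the data unchanged if it is valid in server_encoding,
    otherwise raises EncodingError naming the offending bytes.
    """
    try:
        data.decode(server_encoding)
    except UnicodeDecodeError as exc:
        bad = data[exc.start:exc.end]
        raise EncodingError(
            f'invalid byte sequence for encoding "{server_encoding}": 0x{bad.hex()}'
        ) from exc
    return data


# The check does not care about the declared SQL type of the result, which is
# the point made above: every returned string goes through the same gate.
print(verify_returned_string(b"plain ascii"))
try:
    verify_returned_string(b"\xd0")
except EncodingError as err:
    print(err)
```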

cheers

andrew

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	Hannu Krosing <hannu(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: invalid UTF-8 via pl/perl
Date:	2010-01-03 01:51:22
Message-ID:	4B3FF81A.9000408@dunslane.net

Andrew Dunstan wrote:
>
> I think the plperl glue code should check returned strings using
> pg_verifymbstr().
>
>

Please test this patch. I think we'd probably want to trap the encoding
error and issue a customised error message, but this plugs all the holes
I can see with the possible exception of values inserted via SPI calls.
I'll check that out.

cheers

andrew

Attachment	Content-Type	Size
perl-utf8.patch	text/x-patch	2.2 KB

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	Hannu Krosing <hannu(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: invalid UTF-8 via pl/perl
Date:	2010-01-03 14:28:29
Message-ID:	4B40A98D.6000307@dunslane.net

Andrew Dunstan wrote:
>
>
> Andrew Dunstan wrote:
>>
>> I think the plperl glue code should check returned strings using
>> pg_verifymbstr().
>>
>>
>
> Please test this patch. I think we'd probably want to trap the
> encoding error and issue a customised error message, but this plugs
> all the holes I can see with the possible exception of values inserted
> via SPI calls. I'll check that out.
>
>

I think the attached patch plugs the direct SPI holes as well.

One thing that I am pondering is: how does SPI handle things if the
client encoding and server encoding are not the same? Won't the strings
it passes the parser be interpreted in the client encoding? If so, that
doesn't seem right at all, since these strings come from a server side
call and not from the client at all. It looks to me like the call to
pg_parse_query() in spi.c should possibly be surrounded by code to
temporarily set the client encoding to the server encoding and then
restore it afterwards.

cheers

andrew

Attachment	Content-Type	Size
perl-utf8-2.patch	text/x-patch	3.1 KB

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	Hannu Krosing <hannu(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: invalid UTF-8 via pl/perl
Date:	2010-01-03 17:17:38
Message-ID:	5388.1262539058@sss.pgh.pa.us

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> One thing that I am pondering is: how does SPI handle things if the
> client encoding and server encoding are not the same?

What? client_encoding is not used anywhere within the backend.
Everything should be server_encoding.

regards, tom lane

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Hannu Krosing <hannu(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: invalid UTF-8 via pl/perl
Date:	2010-01-03 17:44:32
Message-ID:	4B40D780.4090302@dunslane.net

Tom Lane wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>
>> One thing that I am pondering is: how does SPI handle things if the
>> client encoding and server encoding are not the same?
>>
>
> What? client_encoding is not used anywhere within the backend.
> Everything should be server_encoding.
>
>
>

Oh, for some reason I thought the translation was done in the scanner.
Sorry for the noise.

cheers

andrew

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	Hannu Krosing <hannu(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: invalid UTF-8 via pl/perl
Date:	2010-01-03 19:54:53
Message-ID:	4B40F60D.5050200@dunslane.net

I wrote:
>
> I think the attached patch plugs the direct SPI holes as well.

There are two issues with this patch. First, how far if at all should it
be backpatched? All the way, or 8.3, where we tightened the encoding
rules, or not at all?

Second, it produces errors like this:

andrew=# select 'a' || invalid_utf_seq() || 'b';
ERROR: invalid byte sequence for encoding "UTF8": 0xd0
HINT: This error can also happen if the byte sequence does not
match the encoding expected by the server, which is controlled by
"client_encoding".
CONTEXT: PL/Perl function "invalid_utf_seq"
andrew=#

That hint seems rather misleading. I'm not sure what we can do about it
though. If we set the noError param on pg_verifymbstr() we would miss
the error message that actually identified the bad data, so that doesn't
seem like a good plan.

cheers

andrew

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	Hannu Krosing <hannu(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: invalid UTF-8 via pl/perl
Date:	2010-01-03 20:09:39
Message-ID:	D3E7DCCB-9ACE-4D02-B684-594BF0B61529@kineticode.com

On Jan 3, 2010, at 11:54 AM, Andrew Dunstan wrote:

> There are two issues with this patch. First, how far if at all should it be backpatched? All the way, or 8.3, where we tightened the encoding rules, or not at all?

8.3 seems reasonable.

> Second, it produces errors like this:
>
> andrew=# select 'a' || invalid_utf_seq() || 'b';
> ERROR: invalid byte sequence for encoding "UTF8": 0xd0
> HINT: This error can also happen if the byte sequence does not
> match the encoding expected by the server, which is controlled by
> "client_encoding".
> CONTEXT: PL/Perl function "invalid_utf_seq"
> andrew=#
>
>
> That hint seems rather misleading. I'm not sure what we can do about it though. If we set the noError param on pg_verifymbstr() we would miss the error message that actually identified the bad data, so that doesn't seem like a good plan.

I'm sure I'm just revealing my ignorance here, but how is the hint misleading?

Best,

David

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	Hannu Krosing <hannu(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: invalid UTF-8 via pl/perl
Date:	2010-01-03 20:14:05
Message-ID:	4B40FA8D.2030109@dunslane.net

David E. Wheeler wrote:
>> Second, it produces errors like this:
>>
>> andrew=# select 'a' || invalid_utf_seq() || 'b';
>> ERROR: invalid byte sequence for encoding "UTF8": 0xd0
>> HINT: This error can also happen if the byte sequence does not
>> match the encoding expected by the server, which is controlled by
>> "client_encoding".
>> CONTEXT: PL/Perl function "invalid_utf_seq"
>> andrew=#
>>
>>
>> That hint seems rather misleading. I'm not sure what we can do about it though. If we set the noError param on pg_verifymbstr() we would miss the error message that actually identified the bad data, so that doesn't seem like a good plan.
>>
>
> I'm sure I'm just revealing my ignorance here, but how is the hint misleading?
>
>
>

The string that causes the trouble does not come from the client and has
nothing to do with client_encoding.

cheers

andrew

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	Hannu Krosing <hannu(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: invalid UTF-8 via pl/perl
Date:	2010-01-03 20:15:11
Message-ID:	7573.1262549711@sss.pgh.pa.us

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> andrew=# select 'a' || invalid_utf_seq() || 'b';
> ERROR: invalid byte sequence for encoding "UTF8": 0xd0
> HINT: This error can also happen if the byte sequence does not
> match the encoding expected by the server, which is controlled by
> "client_encoding".
> CONTEXT: PL/Perl function "invalid_utf_seq"

> That hint seems rather misleading. I'm not sure what we can do about it
> though. If we set the noError param on pg_verifymbstr() we would miss
> the error message that actually identified the bad data, so that doesn't
> seem like a good plan.

Yeah, we want the detailed error info. The problem is that the hint is
targeted to the case where we are checking data coming from the client.
We could add another parameter to pg_verifymbstr to indicate the
context, perhaps. I'm not sure how to do it exactly --- just a bool
that suppresses the hint, or do we want to make a provision for some
other hint or detail message?

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	Hannu Krosing <hannu(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: invalid UTF-8 via pl/perl
Date:	2010-01-03 20:17:11
Message-ID:	7619.1262549831@sss.pgh.pa.us

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> There are two issues with this patch. First, how far if at all should it
> be backpatched? All the way, or 8.3, where we tightened the encoding
> rules, or not at all?

Forgot to mention --- I'm not in favor of backpatching. First because
tightening encoding verification has been a process over multiple
releases; it's not a bug fix in the normal sense of the word, and might
break things that people had been doing without trouble. Second because
I think we'll have to change pg_verifymbstr's API, and that's not
something to back-patch if we can avoid it.

regards, tom lane

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Hannu Krosing <hannu(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: invalid UTF-8 via pl/perl
Date:	2010-01-03 23:40:40
Message-ID:	4B412AF8.8070700@dunslane.net

Tom Lane wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>
>> andrew=# select 'a' || invalid_utf_seq() || 'b';
>> ERROR: invalid byte sequence for encoding "UTF8": 0xd0
>> HINT: This error can also happen if the byte sequence does not
>> match the encoding expected by the server, which is controlled by
>> "client_encoding".
>> CONTEXT: PL/Perl function "invalid_utf_seq"
>>
>
>
>> That hint seems rather misleading. I'm not sure what we can do about it
>> though. If we set the noError param on pg_verifymbstr() we would miss
>> the error message that actually identified the bad data, so that doesn't
>> seem like a good plan.
>>
>
> Yeah, we want the detailed error info. The problem is that the hint is
> targeted to the case where we are checking data coming from the client.
> We could add another parameter to pg_verifymbstr to indicate the
> context, perhaps. I'm not sure how to do it exactly --- just a bool
> that suppresses the hint, or do we want to make a provision for some
> other hint or detail message?
>
>
>

Or instead of another param we could change the third param to be one of
(NO_ERROR, CLIENT_ERROR, SERVER_ERROR) or some such.
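
A three-way mode like that could look roughly like the following Python sketch. It is only an illustration of the idea: VerifyMode, InvalidByteSequence and verify_mbstr are hypothetical names, and the real pg_verifymbstr() is a C function whose signature is not reproduced here.

```python
from enum import Enum
from typing import Optional


class VerifyMode(Enum):
    NO_ERROR = "no_error"          # report validity via the return value only
    CLIENT_ERROR = "client_error"  # raise, with the client_encoding hint
    SERVER_ERROR = "server_error"  # raise, without the client-oriented hint


CLIENT_HINT = (
    "This error can also happen if the byte sequence does not match the "
    'encoding expected by the server, which is controlled by "client_encoding".'
)


class InvalidByteSequence(ValueError):
    """Hypothetical error carrying an optional hint alongside the message."""

    def __init__(self, message: str, hint: Optional[str] = None):
        super().__init__(message)
        self.hint = hint


def verify_mbstr(data: bytes, encoding: str, mode: VerifyMode) -> bool:
    """Validate data; how a failure is reported depends on mode."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError as exc:
        if mode is VerifyMode.NO_ERROR:
            return False
        bad = data[exc.start:exc.end].hex()
        hint = CLIENT_HINT if mode is VerifyMode.CLIENT_ERROR else None
        raise InvalidByteSequence(
            f'invalid byte sequence for encoding "{encoding}": 0x{bad}', hint
        ) from exc
```

With this shape, a PL caller would pass VerifyMode.SERVER_ERROR and still get the message identifying the bad data, just without the hint about client_encoding.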

Or we could just add another verify func. I don't have terribly strong
opinions about it.

Incidentally, I guess we need to look at plpython and pltcl for similar
issues.

cheers

andrew

From:	Peter Eisentraut <peter_e(at)gmx(dot)net>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: invalid UTF-8 via pl/perl
Date:	2010-01-04 10:46:52
Message-ID:	1262602012.31658.1.camel@fsopti579.F-Secure.com

On Sun, 2010-01-03 at 18:40 -0500, Andrew Dunstan wrote:
> Incidentally, I guess we need to look at plpython and pltcl for
> similar issues.

I confirm that the same issue exists in plpython.

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Hannu Krosing <hannu(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: invalid UTF-8 via pl/perl
Date:	2010-01-04 14:44:53
Message-ID:	4B41FEE5.6060909@dunslane.net

This is a mess. It affects four or five levels of visible functions that
are called in about 18 files.

How about we just change the hint so it also refers to the possibility
that the data comes from a PL? That would save lots of trouble.

cheers

andrew

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	Hannu Krosing <hannu(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: invalid UTF-8 via pl/perl
Date:	2010-01-04 15:11:24
Message-ID:	10301.1262617884@sss.pgh.pa.us

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> How about we just change the hint so it also refers to the possibility
> that the data comes from a PL? That would save lots of trouble.

Maybe just lose the hint altogether. It's not adding that much,
and I seem to recall that there have already been complaints about
other cases where it's misleading.

regards, tom lane

From:	Hannu Krosing <hannu(at)2ndquadrant(dot)com>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: invalid UTF-8 via pl/perl
Date:	2010-03-08 22:06:21
Message-ID:	1268085981.2855.13.camel@hvost

On Sat, 2010-01-02 at 20:51 -0500, Andrew Dunstan wrote:
>
> Andrew Dunstan wrote:
> >
> > I think the plperl glue code should check returned strings using
> > pg_verifymbstr().
> >
> >
>
> Please test this patch. I think we'd probably want to trap the encoding
> error and issue a customised error message, but this plugs all the holes
> I can see with the possible exception of values inserted via SPI calls.
> I'll check that out.

I got a report that the patch fixes one case but leaves another open:

CREATE TABLE utf_test
(
id serial PRIMARY KEY,
data character varying
);

CREATE OR REPLACE FUNCTION utf_test()
RETURNS character varying AS
$BODY$
return "\xd0";
$BODY$
LANGUAGE 'plperlu' VOLATILE STRICT;

CREATE OR REPLACE FUNCTION utf_test2()
RETURNS character varying AS
$BODY$
spi_exec_query("insert into utf_test (data) values('\xd0');");
return "VIGA";
$BODY$
LANGUAGE 'plperlu' VOLATILE STRICT;

The report said that the patch fixes the case

insert into utf_test (data) values(utf_test());

so that it returns an error, but the second function

select utf_test2();

still enters invalid data into the table.

So SPI interface should also be fixed, either from perl side, or maybe
from inside SPI ?

--
Hannu Krosing http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
Services, Consulting and Training

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Hannu Krosing <hannu(at)2ndquadrant(dot)com>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: invalid UTF-8 via pl/perl
Date:	2010-03-09 00:11:21
Message-ID:	25308.1268093481@sss.pgh.pa.us

Hannu Krosing <hannu(at)2ndquadrant(dot)com> writes:
> So SPI interface should also be fixed, either from perl side, or maybe
> from inside SPI ?

SPI has every right to assume that data it's given is already in the
database encoding.

regards, tom lane

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Hannu Krosing <hannu(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: invalid UTF-8 via pl/perl
Date:	2010-03-09 02:52:57
Message-ID:	4B95B809.7000407@dunslane.net

Tom Lane wrote:
> Hannu Krosing <hannu(at)2ndquadrant(dot)com> writes:
>
>> So SPI interface should also be fixed, either from perl side, or maybe
>> from inside SPI ?
>>
>
> SPI has every right to assume that data it's given is already in the
> database encoding.
>
>
>

Yeah, looks like we missed a few spots. I have added three more checks
that I think plug the remaining holes in plperl.

Hannu, please test again against CVS HEAD.

cheers

andrew

From:	Hannu Krosing <hannu(at)2ndquadrant(dot)com>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: invalid UTF-8 via pl/perl
Date:	2010-03-12 14:32:19
Message-ID:	1268404339.32436.41.camel@hvost

On Mon, 2010-03-08 at 21:52 -0500, Andrew Dunstan wrote:
>
> Tom Lane wrote:
> > Hannu Krosing <hannu(at)2ndquadrant(dot)com> writes:
> >
> >> So SPI interface should also be fixed, either from perl side, or maybe
> >> from inside SPI ?
> >>
> >
> > SPI has every right to assume that data it's given is already in the
> > database encoding.
> >
> >
> >
>
> Yeah, looks like we missed a few spots. I have added three more checks
> that I think plug the remaining holes in plperl.
>
> Hannu, please test again against CVS HEAD.

Seems to work now

Do you plan to back-port this ?

--
Hannu Krosing http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
Services, Consulting and Training

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	Hannu Krosing <hannu(at)2ndquadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: invalid UTF-8 via pl/perl
Date:	2010-03-12 15:25:36
Message-ID:	4B9A5CF0.7000508@dunslane.net

Hannu Krosing wrote:
> On Mon, 2010-03-08 at 21:52 -0500, Andrew Dunstan wrote:
>
>> Tom Lane wrote:
>>
>>> Hannu Krosing <hannu(at)2ndquadrant(dot)com> writes:
>>>
>>>
>>>> So SPI interface should also be fixed, either from perl side, or maybe
>>>> from inside SPI ?
>>>>
>>>>
>>> SPI has every right to assume that data it's given is already in the
>>> database encoding.
>>>
>>>
>>>
>>>
>> Yeah, looks like we missed a few spots. I have added three more checks
>> that I think plug the remaining holes in plperl.
>>
>> Hannu, please test again against CVS HEAD.
>>
>
> Seems to work now
>
> Do you plan to back-port this ?
>

I wasn't going to. The previous fixes weren't backpatched either, and in
general when we have plugged encoding holes the changes have not been
backpatched, on the grounds that it would be a behaviour change, e.g.
when we tightened things a lot for 8.3.

I think there is an outstanding TODO to plug the other PLs, however. It's a
pity it has to be done over and over for each PL. Maybe we need some new
versions of some of the SPI calls that would do the checking so it could
be centralized.
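
One way to picture such a centralized check is a wrapper that validates the query text before passing it on, sketched below in Python. The names checked_spi_exec, execute and fake_execute are assumptions made for illustration; this is not the SPI API, and Python's strict decoder only stands in for the backend's verification.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def checked_spi_exec(
    query: bytes,
    execute: Callable[[bytes], T],
    server_encoding: str = "utf-8",
) -> T:
    """Verify the query text once, centrally, then run it via execute."""
    try:
        query.decode(server_encoding)
    except UnicodeDecodeError as exc:
        bad = query[exc.start:exc.end].hex()
        raise ValueError(
            f'invalid byte sequence for encoding "{server_encoding}": 0x{bad}'
        ) from exc
    return execute(query)


# A stand-in executor that just records what it was asked to run.
executed: List[bytes] = []


def fake_execute(query: bytes) -> int:
    executed.append(query)
    return len(executed)


checked_spi_exec(b"insert into utf_test (data) values('ok')", fake_execute)
try:
    checked_spi_exec(b"insert into utf_test (data) values('\xd0')", fake_execute)
except ValueError as err:
    print(err)

print(len(executed))  # 1, the invalid query never reached the executor
```

Every PL that routed its queries through a wrapper of this kind would get the check for free, instead of each PL's glue code repeating it.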

cheers

andrew