Re: Bad Data back Door

Lists: pgsql-hackers
From: "David E(dot) Wheeler" <david(at)justatheory(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Bad Data back Door
Date: 2012-10-06 00:54:55
Message-ID: A9E454DD-DB0A-4D1F-A6E6-888ED4FCFE4F@justatheory.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hackers,

I’ve discovered something a bit disturbing at $work. We’re migrating (slowly) from Oracle to PostgreSQL, and in some cases are using oracle_fdw to copy data over. Alas, there are a fair number of text values in the Oracle database that, although the database is UTF-8, are actually something else (CP1252 or Latin1). When we copy from an oracle_fdw foreign table into a PostgreSQL table, PostgreSQL does not complain, but ends up storing the mis-encoded strings, even though the database is UTF-8.

I assume that this is because the foreign table, as a table, is assumed by the system to have valid data, and therefor additional character encoding validation is skipped, yes?

If so, I’m wondering if it might be possible to add some sort of option to the CREATE FOREIGN TABLE statement to the effect that certain values should not be trusted to be in the encoding they say they are.

At any rate, I’m spending some quality time re-encoding bogus values I never expected to see in our systems. :-(

Thanks,

David


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "David E(dot) Wheeler" <david(at)justatheory(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bad Data back Door
Date: 2012-10-06 01:12:01
Message-ID: 20303.1349485921@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"David E. Wheeler" <david(at)justatheory(dot)com> writes:
> Ive discovered something a bit disturbing at $work. Were migrating (slowly) from Oracle to PostgreSQL, and in some cases are using oracle_fdw to copy data over. Alas, there are a fair number of text values in the Oracle database that, although the database is UTF-8, are actually something else (CP1252 or Latin1). When we copy from an oracle_fdw foreign table into a PostgreSQL table, PostgreSQL does not complain, but ends up storing the mis-encoded strings, even though the database is UTF-8.

> I assume that this is because the foreign table, as a table, is assumed by the system to have valid data, and therefor additional character encoding validation is skipped, yes?

Probably not so much "assumed" as "nobody thought about it". In
e.g. plperl we expend the cycles to do encoding validity checking on
*every* string entering the system from Perl. I'm not sure why foreign
tables ought to get a pass on that, especially when you consider the
communication overhead that the encoding check would be amortized
against.

Now, having said that, I think it has to be the reponsibility of the FDW
to apply any required check ... which makes this a bug report against
oracle_fdw, not the core system. (FWIW, contrib/file_fdw depends on the
COPY code, which will check encoding.)

regards, tom lane


From: John R Pierce <pierce(at)hogranch(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bad Data back Door
Date: 2012-10-06 02:14:00
Message-ID: 506F93E8.5010406@hogranch.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/05/12 6:12 PM, Tom Lane wrote:
> Now, having said that, I think it has to be the reponsibility of the FDW
> to apply any required check ... which makes this a bug report against
> oracle_fdw, not the core system. (FWIW, contrib/file_fdw depends on the
> COPY code, which will check encoding.)

I'm not sure of that. what if the FDW is used to connect to (say) a
postgres database that is in POSIX/C ? is that checked for?

I'd like to see some encoding validation and substitution functions in
postgres. for instance, one that can take any supported encoding and
convert it to the database encoding and generate an error on any invalid
character. this translation could be identity (eg, UTF8->UTF8)
whereupon it would just validate. a 2nd function would do the same,
but replace errors with the substitution character in the target charset
and not error.

--
john r pierce N 37, W 122
santa cruz ca mid-left coast


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: John R Pierce <pierce(at)hogranch(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bad Data back Door
Date: 2012-10-06 10:45:23
Message-ID: 50700BC3.90004@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 06.10.2012 05:14, John R Pierce wrote:
> I'd like to see some encoding validation and substitution functions in
> postgres. for instance, one that can take any supported encoding and
> convert it to the database encoding and generate an error on any invalid
> character. this translation could be identity (eg, UTF8->UTF8) whereupon
> it would just validate.

See pg_any_to_server() in mbutils.c. At the SQL level, there's the
convert(bytea, name, name) function.

> a 2nd function would do the same, but replace
> errors with the substitution character in the target charset and not error.

Hmm, I don't think we have that.

- Heikki


From: "David E(dot) Wheeler" <david(at)justatheory(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bad Data back Door
Date: 2012-10-06 17:34:36
Message-ID: A1176DA3-B32B-4396-A820-78E7CF9C9C0F@justatheory.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Oct 5, 2012, at 6:12 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Probably not so much "assumed" as "nobody thought about it". In
> e.g. plperl we expend the cycles to do encoding validity checking on
> *every* string entering the system from Perl. I'm not sure why foreign
> tables ought to get a pass on that, especially when you consider the
> communication overhead that the encoding check would be amortized
> against.

Yes, that’s what I was thinking.

> Now, having said that, I think it has to be the reponsibility of the FDW
> to apply any required check ... which makes this a bug report against
> oracle_fdw, not the core system. (FWIW, contrib/file_fdw depends on the
> COPY code, which will check encoding.)

I agree that this is a bug in oracle_fdw (well, potentially; ultimately, it’s Oracle that’s lying about the encoding of those text values). But I think that it would be much more useful overall -- not to mention more database-like -- for PostgreSQL to provide a way to enforce it. That is, to consider foreign tables to be an input like COPY or SQL, and to validate values before displaying them.

Best,

David


From: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To: "David E(dot) Wheeler" <david(at)justatheory(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bad Data back Door
Date: 2012-10-06 17:37:02
Message-ID: CAOeZVie_XK=p3f2NamX91fBQwzAzu4Gk_Wubcn8aybeWwXxO9g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Oct 6, 2012 at 1:34 PM, David E. Wheeler <david(at)justatheory(dot)com> wrote:
> On Oct 5, 2012, at 6:12 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
>> Probably not so much "assumed" as "nobody thought about it". In
>> e.g. plperl we expend the cycles to do encoding validity checking on
>> *every* string entering the system from Perl. I'm not sure why foreign
>> tables ought to get a pass on that, especially when you consider the
>> communication overhead that the encoding check would be amortized
>> against.
>
> Yes, that’s what I was thinking.
>
>> Now, having said that, I think it has to be the reponsibility of the FDW
>> to apply any required check ... which makes this a bug report against
>> oracle_fdw, not the core system. (FWIW, contrib/file_fdw depends on the
>> COPY code, which will check encoding.)
>
> I agree that this is a bug in oracle_fdw (well, potentially; ultimately, it’s Oracle that’s lying about the encoding of those text values). But I think that it would be much more useful overall -- not to mention more database-like -- for PostgreSQL to provide a way to enforce it. That is, to consider foreign tables to be an input like COPY or SQL, and to validate values before displaying them.
>
> Best,
>
> David
>
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

+1
--
Regards,

Atri
l'apprenant


From: John R Pierce <pierce(at)hogranch(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bad Data back Door
Date: 2012-10-06 19:13:02
Message-ID: 507082BE.90307@hogranch.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/06/12 3:45 AM, Heikki Linnakangas wrote:
> At the SQL level, there's the convert(bytea, name, name) function.

ahhh, right. (forehead slap)

>
>> a 2nd function would do the same, but replace
>> errors with the substitution character in the target charset and not
>> error.
>
> Hmm, I don't think we have that.

me thinks this would be extremely useful for importing 'dirty' data.
that or a per-connection flag (or option on the COPY command?) that
said "substitute-on-error" for the likes of UTF-8 imports from CSV.

--
john r pierce N 37, W 122
santa cruz ca mid-left coast


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "David E(dot) Wheeler" <david(at)justatheory(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bad Data back Door
Date: 2012-10-06 19:35:19
Message-ID: 7437.1349552119@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"David E. Wheeler" <david(at)justatheory(dot)com> writes:
> On Oct 5, 2012, at 6:12 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Now, having said that, I think it has to be the reponsibility of the FDW
>> to apply any required check ... which makes this a bug report against
>> oracle_fdw, not the core system. (FWIW, contrib/file_fdw depends on the
>> COPY code, which will check encoding.)

> I agree that this is a bug in oracle_fdw (well, potentially; ultimately, its Oracle thats lying about the encoding of those text values). But I think that it would be much more useful overall -- not to mention more database-like -- for PostgreSQL to provide a way to enforce it. That is, to consider foreign tables to be an input like COPY or SQL, and to validate values before displaying them.

It is the FDW's responsibility to deal with this. We expect it to hand
back valid tuples; it is not reasonable to disassemble them looking for
mistakes (and we couldn't catch most mistakes, anyway). If the
interface were defined in terms of text, we could do checking above the
FDW level ... but it isn't.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "David E(dot) Wheeler" <david(at)justatheory(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bad Data back Door
Date: 2012-10-06 20:03:19
Message-ID: 50708E87.6080300@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On 10/06/2012 03:35 PM, Tom Lane wrote:
> "David E. Wheeler" <david(at)justatheory(dot)com> writes:
>> On Oct 5, 2012, at 6:12 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Now, having said that, I think it has to be the reponsibility of the FDW
>>> to apply any required check ... which makes this a bug report against
>>> oracle_fdw, not the core system. (FWIW, contrib/file_fdw depends on the
>>> COPY code, which will check encoding.)
>> I agree that this is a bug in oracle_fdw (well, potentially; ultimately, it’s Oracle that’s lying about the encoding of those text values). But I think that it would be much more useful overall -- not to mention more database-like -- for PostgreSQL to provide a way to enforce it. That is, to consider foreign tables to be an input like COPY or SQL, and to validate values before displaying them.
> It is the FDW's responsibility to deal with this. We expect it to hand
> back valid tuples; it is not reasonable to disassemble them looking for
> mistakes (and we couldn't catch most mistakes, anyway). If the
> interface were defined in terms of text, we could do checking above the
> FDW level ... but it isn't.
>
>

Exactly.

We've done quite a lot of work tightening the ways that badly encoded
data can enter the database over the years. It's a never ending game of
whack-a-mole. There aren't any easy answers.

cheers

andrew


From: "Albe Laurenz" <laurenz(dot)albe(at)wien(dot)gv(dot)at>
To: "Tom Lane *EXTERN*" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "David E(dot) Wheeler" <david(at)justatheory(dot)com>
Cc: "Pg Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bad Data back Door
Date: 2012-10-08 07:25:52
Message-ID: D960CB61B694CF459DCFB4B0128514C20886A6F4@exadv11.host.magwien.gv.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> "David E. Wheeler" <david(at)justatheory(dot)com> writes:
>> On Oct 5, 2012, at 6:12 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Now, having said that, I think it has to be the reponsibility of the
FDW
>>> to apply any required check ... which makes this a bug report
against
>>> oracle_fdw, not the core system. (FWIW, contrib/file_fdw depends on
the
>>> COPY code, which will check encoding.)

>> I agree that this is a bug in oracle_fdw (well, potentially;
ultimately,
>> it's Oracle that's lying about the encoding of those text values).
>> But I think that it would be much more useful overall -- not
>> to mention more database-like -- for PostgreSQL to provide a way to
>> enforce it. That is, to consider foreign tables to be an input like
>> COPY or SQL, and to validate values before displaying them.

> It is the FDW's responsibility to deal with this. We expect it to
hand
> back valid tuples; it is not reasonable to disassemble them looking
for
> mistakes (and we couldn't catch most mistakes, anyway). If the
> interface were defined in terms of text, we could do checking above
the
> FDW level ... but it isn't.

As the author I agree that this is a bug in oracle_fdw.

This was caused by ignorance on my part: I had assumed that the
type input functions would perform the necessary checks, but it
seems like that is not the case. I'll look into it.

Oracle does not care much about correct encoding.
If client character set and database character set are the same,
Oracle does not bother to check the data. This is probably how
WINDOWS-1252 characters slipped into the UTF-8 database in question.
I consider this a bug in Oracle, but never reported it, because
I don't have much hope that Oracle would see it as a problem
given their habitually sloppy handling of encoding issues.

Yours,
Laurenz Albe


From: "David E(dot) Wheeler" <david(at)justatheory(dot)com>
To: "Albe Laurenz" <laurenz(dot)albe(at)wien(dot)gv(dot)at>
Cc: "Tom Lane *EXTERN*" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Pg Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bad Data back Door
Date: 2012-10-08 07:30:49
Message-ID: 43FA5AD0-F823-4906-974E-C37022E6B240@justatheory.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Oct 8, 2012, at 12:25 AM, "Albe Laurenz" <laurenz(dot)albe(at)wien(dot)gv(dot)at> wrote:

> As the author I agree that this is a bug in oracle_fdw.

Thanks. Should I file a report somewhere?

> This was caused by ignorance on my part: I had assumed that the
> type input functions would perform the necessary checks, but it
> seems like that is not the case. I'll look into it.

Thank you!

> Oracle does not care much about correct encoding.
> If client character set and database character set are the same,
> Oracle does not bother to check the data. This is probably how
> WINDOWS-1252 characters slipped into the UTF-8 database in question.
> I consider this a bug in Oracle, but never reported it, because
> I don't have much hope that Oracle would see it as a problem
> given their habitually sloppy handling of encoding issues.

Yeah, same here. I've been looking into write a function to try to fix poorly-encoded data, though, but haven't got far, because CONVERT() does not indicate failure. If you have any insight on this, I'd appreciate your thoughts on this Stack Overflow question:

http://stackoverflow.com/q/12717363/79202

Thanks,

David


From: "David E(dot) Wheeler" <david(at)justatheory(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bad Data back Door
Date: 2012-10-08 17:56:58
Message-ID: 76401FB0-45A6-4E7A-A141-7B35A740833A@justatheory.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Oct 5, 2012, at 6:12 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Now, having said that, I think it has to be the reponsibility of the FDW
> to apply any required check ... which makes this a bug report against
> oracle_fdw, not the core system. (FWIW, contrib/file_fdw depends on the
> COPY code, which will check encoding.)

FWIW, I believe that dblink does not check encoding.

Best,

David


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "David E(dot) Wheeler" <david(at)justatheory(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bad Data back Door
Date: 2012-10-08 18:13:52
Message-ID: 13492.1349720032@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"David E. Wheeler" <david(at)justatheory(dot)com> writes:
> On Oct 5, 2012, at 6:12 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Now, having said that, I think it has to be the reponsibility of the FDW
>> to apply any required check ... which makes this a bug report against
>> oracle_fdw, not the core system. (FWIW, contrib/file_fdw depends on the
>> COPY code, which will check encoding.)

> FWIW, I believe that dblink does not check encoding.

In dblink's case, that boils down to trusting a remote instance of
Postgres to get this right, which doesn't seem totally unreasonable.
But I wouldn't object to adding checks there if someone wanted to submit
a patch.

regards, tom lane


From: "David E(dot) Wheeler" <david(at)justatheory(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bad Data back Door
Date: 2012-10-08 18:23:05
Message-ID: CDFDCFD7-4EFA-437A-9424-B04AD0B2C021@justatheory.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Oct 8, 2012, at 11:13 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

>> FWIW, I believe that dblink does not check encoding.
>
> In dblink's case, that boils down to trusting a remote instance of
> Postgres to get this right, which doesn't seem totally unreasonable.
> But I wouldn't object to adding checks there if someone wanted to submit
> a patch.

Yeah, I found this because we had a dblink to another PostgreSQL server's table with data populated from oracle_fdw. I guess trusting is reasonable, though.

I wonder about dbi-link, though…

David


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "David E(dot) Wheeler" <david(at)justatheory(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bad Data back Door
Date: 2012-10-08 18:29:07
Message-ID: 50731B73.1030408@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On 10/08/2012 02:13 PM, Tom Lane wrote:
> "David E. Wheeler" <david(at)justatheory(dot)com> writes:
>> On Oct 5, 2012, at 6:12 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Now, having said that, I think it has to be the reponsibility of the FDW
>>> to apply any required check ... which makes this a bug report against
>>> oracle_fdw, not the core system. (FWIW, contrib/file_fdw depends on the
>>> COPY code, which will check encoding.)
>> FWIW, I believe that dblink does not check encoding.
> In dblink's case, that boils down to trusting a remote instance of
> Postgres to get this right, which doesn't seem totally unreasonable.
> But I wouldn't object to adding checks there if someone wanted to submit
> a patch.

It does do:

PQsetClientEncoding(conn, GetDatabaseEncodingName());

I'd be mildly reluctant to do anything more except possibly as an
option, unless it could be shown to have minimal performance impact.

cheers

andrew


From: "Albe Laurenz" <laurenz(dot)albe(at)wien(dot)gv(dot)at>
To: "David E(dot) Wheeler *EXTERN*" <david(at)justatheory(dot)com>
Cc: "Tom Lane *EXTERN*" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Pg Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bad Data back Door
Date: 2012-10-09 07:16:11
Message-ID: D960CB61B694CF459DCFB4B0128514C20886AA11@exadv11.host.magwien.gv.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

David E. Wheeler wrote:
>> As the author I agree that this is a bug in oracle_fdw.

> Thanks. Should I file a report somewhere?

That's not necessary. Thanks for reporting the problem.
It may be a few days until I get around to fix that.

>> Oracle does not care much about correct encoding.

> Yeah, same here. I've been looking into write a function to try
> to fix poorly-encoded data, though, but haven't got far,
> because CONVERT() does not indicate failure. If you have
> any insight on this, I'd appreciate your thoughts on this
> Stack Overflow question:
>
> http://stackoverflow.com/q/12717363/79202

The only thing I can think of is a stored procedure in Java.
You could use java.nio.charset.CharsetEncoder and
java.nio.charset.CharsetDecoder which should throw exceptions
if they encounter illegal bytes.

Yours,
Laurenz Albe


From: "Albe Laurenz" <laurenz(dot)albe(at)wien(dot)gv(dot)at>
To: "Tom Lane *EXTERN*" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "David E(dot) Wheeler" <david(at)justatheory(dot)com>
Cc: "Pg Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bad Data back Door
Date: 2012-10-09 12:48:09
Message-ID: D960CB61B694CF459DCFB4B0128514C20886AB63@exadv11.host.magwien.gv.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> Tom Lane wrote:
>>>> Now, having said that, I think it has to be the reponsibility of
the FDW
>>>> to apply any required check ... which makes this a bug report
against
>>>> oracle_fdw, not the core system.

> As the author I agree that this is a bug in oracle_fdw.

Ok, fixed.
David, could you try the latest version from CVS to see
if it works for you?

Yours,
Laurenz Albe