Re: Support UTF-8 files with BOM in COPY FROM

Lists: pgsql-hackers
From: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Support UTF-8 files with BOM in COPY FROM
Date: 2011-09-26 04:58:42
Message-ID: CAJW2+qdYg1+xLaHDqnJs3AcKmCSVCDkv_LCAPWUtwmxL9dzVhQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

I'd like to support UTF-8 text or csv files that has BOM (byte order mark)
in COPY FROM command. BOM will be automatically detected and ignored
if the file encoding is UTF-8. WIP patch attached.

I'm thinking about only COPY FROM for reads, but if someone wants to add
BOM in COPY TO, we might also support COPY TO WITH BOM for writes.

Comments welcome.

--
Itagaki Takahiro

Attachment Content-Type Size
copy_from_bom.patch application/octet-stream 747 bytes

From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Support UTF-8 files with BOM in COPY FROM
Date: 2011-09-26 06:14:03
Message-ID: E7BE1BF0-CAB4-413A-9DB8-21D5927905C5@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sep 25, 2011, at 9:58 PM, Itagaki Takahiro wrote:

> I'd like to support UTF-8 text or csv files that has BOM (byte order mark)
> in COPY FROM command. BOM will be automatically detected and ignored
> if the file encoding is UTF-8. WIP patch attached.

By my reading of http://unicode.org/faq/utf_bom.html#bom5, I'd say +1

So I think what you propose makes sense.

> I'm thinking about only COPY FROM for reads, but if someone wants to add
> BOM in COPY TO, we might also support COPY TO WITH BOM for writes.

I think it would have to be optional, since "some recipients of UTF-8 encoded data do not expect a BOM."

Best,

David


From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Support UTF-8 files with BOM in COPY FROM
Date: 2011-09-26 11:12:42
Message-ID: CABUevExwxVivjbyxdv4=R_JQP=POs=BoMgzsrL1OJ3CTMjaVDw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Sep 26, 2011 at 06:58, Itagaki Takahiro
<itagaki(dot)takahiro(at)gmail(dot)com> wrote:
> Hi,
>
> I'd like to support UTF-8 text or csv files that has BOM (byte order mark)
> in COPY FROM command. BOM will be automatically detected and ignored
> if the file encoding is UTF-8. WIP patch attached.
>
> I'm thinking about only COPY FROM for reads, but if someone wants to add
> BOM in COPY TO, we might also support COPY TO WITH BOM for writes.
>
> Comments welcome.

I like it in general. But if we're looking at the BOM, shouldn't we
also look and *reject* the file if it's a BOM for a non-UTF8 file? Say
if the BOM claims it's UTF16?

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


From: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Support UTF-8 files with BOM in COPY FROM
Date: 2011-09-26 11:36:11
Message-ID: CAJW2+qdsfRcLY1q62ikH0dJdb8ujhPb6bu6j-rtSm+AAVrqtWA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Sep 26, 2011 at 20:12, Magnus Hagander <magnus(at)hagander(dot)net> wrote:
> I like it in general. But if we're looking at the BOM, shouldn't we
> also look and *reject* the file if it's a BOM for a non-UTF8 file? Say
> if the BOM claims it's UTF16?

-1 because we're depending on manual configuration for now.
It would be reasonable if we had used automatic detection of
character encoding, but we don't. In addition, some crazy
encoding might use BOM codes as a valid character.

--
Itagaki Takahiro


From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Support UTF-8 files with BOM in COPY FROM
Date: 2011-09-26 11:47:54
Message-ID: CABUevEwNSAT28h8wN76A3q2edBKoBYU=ms-zi+1bzKGRS4aO0g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Sep 26, 2011 at 13:36, Itagaki Takahiro
<itagaki(dot)takahiro(at)gmail(dot)com> wrote:
> On Mon, Sep 26, 2011 at 20:12, Magnus Hagander <magnus(at)hagander(dot)net> wrote:
>> I like it in general. But if we're looking at the BOM, shouldn't we
>> also look and *reject* the file if it's a BOM for a non-UTF8 file? Say
>> if the BOM claims it's UTF16?
>
> -1 because we're depending on manual configuration for now.
> It would be reasonable if we had used automatic detection of
> character encoding, but we don't. In addition, some crazy
> encoding might use BOM codes as a valid character.

Does such an encoding really exist? And the code only executes when
the user thinks he's in UTF8, right? So it would still only happen if
the incorrect encoding was specified..

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Support UTF-8 files with BOM in COPY FROM
Date: 2011-09-26 12:06:10
Message-ID: 4E806AB2.5090200@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 09/26/2011 07:12 AM, Magnus Hagander wrote:
> On Mon, Sep 26, 2011 at 06:58, Itagaki Takahiro
> <itagaki(dot)takahiro(at)gmail(dot)com> wrote:
>> Hi,
>>
>> I'd like to support UTF-8 text or csv files that has BOM (byte order mark)
>> in COPY FROM command. BOM will be automatically detected and ignored
>> if the file encoding is UTF-8. WIP patch attached.
>>
>> I'm thinking about only COPY FROM for reads, but if someone wants to add
>> BOM in COPY TO, we might also support COPY TO WITH BOM for writes.
>>
>> Comments welcome.
> I like it in general. But if we're looking at the BOM, shouldn't we
> also look and *reject* the file if it's a BOM for a non-UTF8 file? Say
> if the BOM claims it's UTF16?
>

It should be rejected as invalidly encoded anyway, as a non-utf8 BOM is
not valid utf-8. We shouldn't check in non-unicode cases where the
sequence might be valid in those encodings (e.g. ISO-8859-1).

cheers

andrew


From: Tatsuo Ishii <ishii(at)postgresql(dot)org>
To: itagaki(dot)takahiro(at)gmail(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support UTF-8 files with BOM in COPY FROM
Date: 2011-09-26 14:33:50
Message-ID: 20110926.233350.224883171232526681.t-ishii@sraoss.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> I'd like to support UTF-8 text or csv files that has BOM (byte order mark)
> in COPY FROM command. BOM will be automatically detected and ignored
> if the file encoding is UTF-8. WIP patch attached.

>From RFC3629(http://tools.ietf.org/html/rfc3629#section-6):

o A protocol SHOULD forbid use of U+FEFF as a signature for those
textual protocol elements that the protocol mandates to be always
UTF-8, the signature function being totally useless in those cases.

COPY explicitly specifies the encoding (to be UTF-8 in this case). So
I think we should not regard U+FEFF as "BOM" in COPY, rather we should
regard U+FEFF as "ZERO WIDTH NO-BREAK SPACE".
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Support UTF-8 files with BOM in COPY FROM
Date: 2011-09-26 14:44:38
Message-ID: 6146.1317048278@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"David E. Wheeler" <david(at)kineticode(dot)com> <CAJW2+qdYg1+xLaHDqnJs3AcKmCSVCDkv_LCAPWUtwmxL9dzVhQ(at)mail(dot)gmail(dot)com> writes:
> On Sep 25, 2011, at 9:58 PM, Itagaki Takahiro wrote:
>> I'm thinking about only COPY FROM for reads, but if someone wants to add
>> BOM in COPY TO, we might also support COPY TO WITH BOM for writes.

> I think it would have to be optional, since "some recipients of UTF-8 encoded data do not expect a BOM."

Putting a BOM into UTF8 data is flat out invalid per spec --- the fact
that Microsloth does it does not make it standards-conformant.

I think that accepting it on input can be sensible, on the principle of
"be liberal in what you accept", but the other side of that is "be
conservative in what you send". No BOMs in output, please.

regards, tom lane


From: Tatsuo Ishii <ishii(at)postgresql(dot)org>
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: david(at)kineticode(dot)com, itagaki(dot)takahiro(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support UTF-8 files with BOM in COPY FROM
Date: 2011-09-26 15:09:09
Message-ID: 20110927.000909.594224957113812106.t-ishii@sraoss.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> "David E. Wheeler" <david(at)kineticode(dot)com> <CAJW2+qdYg1+xLaHDqnJs3AcKmCSVCDkv_LCAPWUtwmxL9dzVhQ(at)mail(dot)gmail(dot)com> writes:
>> On Sep 25, 2011, at 9:58 PM, Itagaki Takahiro wrote:
>>> I'm thinking about only COPY FROM for reads, but if someone wants to add
>>> BOM in COPY TO, we might also support COPY TO WITH BOM for writes.
>
>> I think it would have to be optional, since "some recipients of UTF-8 encoded data do not expect a BOM."
>
> Putting a BOM into UTF8 data is flat out invalid per spec --- the fact
> that Microsloth does it does not make it standards-conformant.
>
> I think that accepting it on input can be sensible, on the principle of
> "be liberal in what you accept", but the other side of that is "be
> conservative in what you send". No BOMs in output, please.

Suppose a user uses brain-dead editor, which does not accept UTF-8
without BOM. He decides to save his editor data into PostgreSQL using
COPY FROM. He extracts the data using COPY TO. Now he finds that his
stupid editor does not accept his data any more.

So I think if we decide to accept UTF-8 with BOM, we should keep BOM
when importing the data and output the data with BOM. If we don't want
to output UTF-8 with BOM, we should not accept UTF-8 with BOM. It
seems we don't have much choice...
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tatsuo Ishii <ishii(at)postgresql(dot)org>
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, david(at)kineticode(dot)com, itagaki(dot)takahiro(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support UTF-8 files with BOM in COPY FROM
Date: 2011-09-26 16:07:26
Message-ID: CA+Tgmoa7SzcuViKfdbmWWeRmzZnjo93AmbhiOHaO9E=330PFow@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Sep 26, 2011 at 11:09 AM, Tatsuo Ishii <ishii(at)postgresql(dot)org> wrote:
>> "David E. Wheeler" <david(at)kineticode(dot)com> <CAJW2+qdYg1+xLaHDqnJs3AcKmCSVCDkv_LCAPWUtwmxL9dzVhQ(at)mail(dot)gmail(dot)com> writes:
>>> On Sep 25, 2011, at 9:58 PM, Itagaki Takahiro wrote:
>>>> I'm thinking about only COPY FROM for reads, but if someone wants to add
>>>> BOM in COPY TO, we might also support COPY TO WITH BOM for writes.
>>
>>> I think it would have to be optional, since "some recipients of UTF-8 encoded data do not expect a BOM."
>>
>> Putting a BOM into UTF8 data is flat out invalid per spec --- the fact
>> that Microsloth does it does not make it standards-conformant.
>>
>> I think that accepting it on input can be sensible, on the principle of
>> "be liberal in what you accept", but the other side of that is "be
>> conservative in what you send".  No BOMs in output, please.
>
> Suppose a user uses brain-dead editor, which does not accept UTF-8
> without BOM.  He decides to save his editor data into PostgreSQL using
> COPY FROM. He extracts the data using COPY TO. Now he finds that his
> stupid editor does not accept his data any more.
>
> So I think if we decide to accept UTF-8 with BOM, we should keep BOM
> when importing the data and output the data with BOM. If we don't want
> to output UTF-8 with BOM, we should not accept UTF-8 with BOM. It
> seems we don't have much choice...

Maybe this needs to be an optional behavior, controlled by some COPY option.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tatsuo Ishii <ishii(at)postgresql(dot)org>, david(at)kineticode(dot)com, itagaki(dot)takahiro(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support UTF-8 files with BOM in COPY FROM
Date: 2011-09-26 17:15:29
Message-ID: 9739.1317057329@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Mon, Sep 26, 2011 at 11:09 AM, Tatsuo Ishii <ishii(at)postgresql(dot)org> wrote:
>> Suppose a user uses brain-dead editor, which does not accept UTF-8
>> without BOM.

> Maybe this needs to be an optional behavior, controlled by some COPY option.

I'm not excited about emitting non-standards-conformant output on the
strength of a hypothetical argument about users and editors that may or
may not exist. I believe that there's a use-case for reading BOMs, but
I have seen no field complaints demonstrating that we need to write
them. Even if we had a couple, "use a less brain dead editor" might be
the best response. We cannot promise to be compatible with arbitrarily
broken software.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Tatsuo Ishii <ishii(at)postgresql(dot)org>, david(at)kineticode(dot)com, itagaki(dot)takahiro(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support UTF-8 files with BOM in COPY FROM
Date: 2011-09-26 17:19:26
Message-ID: CA+TgmoZNw=F-+fvpH8xpeiph6kiAK1Vk1Ch4ONu6d+N-UG++5A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Sep 26, 2011 at 1:15 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Mon, Sep 26, 2011 at 11:09 AM, Tatsuo Ishii <ishii(at)postgresql(dot)org> wrote:
>>> Suppose a user uses brain-dead editor, which does not accept UTF-8
>>> without BOM.
>
>> Maybe this needs to be an optional behavior, controlled by some COPY option.
>
> I'm not excited about emitting non-standards-conformant output on the
> strength of a hypothetical argument about users and editors that may or
> may not exist.  I believe that there's a use-case for reading BOMs, but
> I have seen no field complaints demonstrating that we need to write
> them.  Even if we had a couple, "use a less brain dead editor" might be
> the best response.  We cannot promise to be compatible with arbitrarily
> broken software.

The thing that makes me doubt that is this comment from Tatsuo Ishii:

TI> COPY explicitly specifies the encoding (to be UTF-8 in this case). So
TI> I think we should not regard U+FEFF as "BOM" in COPY, rather we should
TI> regard U+FEFF as "ZERO WIDTH NO-BREAK SPACE".

If a BOM is confusable with valid data, then I think recognizing it
and discarding it unconditionally is no good - you could end up where
COPY OUT, TRUNCATE, COPY IN changes the table contents.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tatsuo Ishii <ishii(at)postgresql(dot)org>, david(at)kineticode(dot)com, itagaki(dot)takahiro(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support UTF-8 files with BOM in COPY FROM
Date: 2011-09-26 17:28:15
Message-ID: 9978.1317058095@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> The thing that makes me doubt that is this comment from Tatsuo Ishii:
> TI> COPY explicitly specifies the encoding (to be UTF-8 in this case). So
> TI> I think we should not regard U+FEFF as "BOM" in COPY, rather we should
> TI> regard U+FEFF as "ZERO WIDTH NO-BREAK SPACE".

Yeah, that's a reasonable argument for rejecting the patch altogether.
I'm not qualified to decide whether it outweighs the "we need to be able
to read Notepad output" argument. I do observe that
http://en.wikipedia.org/wiki/Byte_order_mark
says Unicode 3.2 has deprecated the no-break-space interpretation,
but on the other hand you're right that we can't really assume that
the character is not present in people's data.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Tatsuo Ishii <ishii(at)postgresql(dot)org>, david(at)kineticode(dot)com, itagaki(dot)takahiro(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support UTF-8 files with BOM in COPY FROM
Date: 2011-09-26 17:34:23
Message-ID: CA+TgmobGLpx0gEzk_qDUddzpMxYxbG-xGWJQ79iZwXSHfcZcLA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Sep 26, 2011 at 1:28 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> The thing that makes me doubt that is this comment from Tatsuo Ishii:
>> TI> COPY explicitly specifies the encoding (to be UTF-8 in this case).  So
>> TI> I think we should not regard U+FEFF as "BOM" in COPY, rather we should
>> TI> regard U+FEFF as "ZERO WIDTH NO-BREAK SPACE".
>
> Yeah, that's a reasonable argument for rejecting the patch altogether.

Yeah, or for making the behavior optional.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tatsuo Ishii <ishii(at)postgresql(dot)org>, david(at)kineticode(dot)com, itagaki(dot)takahiro(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support UTF-8 files with BOM in COPY FROM
Date: 2011-09-26 17:52:27
Message-ID: 10446.1317059547@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Mon, Sep 26, 2011 at 1:28 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>>> The thing that makes me doubt that is this comment from Tatsuo Ishii:
>>> TI> COPY explicitly specifies the encoding (to be UTF-8 in this case). So
>>> TI> I think we should not regard U+FEFF as "BOM" in COPY, rather we should
>>> TI> regard U+FEFF as "ZERO WIDTH NO-BREAK SPACE".

>> Yeah, that's a reasonable argument for rejecting the patch altogether.

> Yeah, or for making the behavior optional.

Sorry, I should have been clearer: it's an argument for rejecting *this*
patch. A patch that introduced a "BOM" option for COPY (which logically
could apply just as well to input or output) would be a different patch.

BTW, another issue with the patch-as-proposed is that it assumes,
without even checking, that fseek() will work (for that matter, it would
also fail pretty miserably on a file shorter than 3 bytes). We could
dodge that problem with an option since it would be reasonable to define
the option as meaning that there MUST be a BOM there. I would envision
it as acting much like the CSV HEADER option.

regards, tom lane


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Tatsuo Ishii <ishii(at)postgresql(dot)org>
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, david(at)kineticode(dot)com, itagaki(dot)takahiro(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support UTF-8 files with BOM in COPY FROM
Date: 2011-09-26 18:35:53
Message-ID: 1317062154.29925.5.camel@vanquo.pezone.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On tis, 2011-09-27 at 00:09 +0900, Tatsuo Ishii wrote:
> Suppose a user uses brain-dead editor, which does not accept UTF-8
> without BOM.

I would first like to see evidence that such an editor exists.


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, david(at)kineticode(dot)com, itagaki(dot)takahiro(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support UTF-8 files with BOM in COPY FROM
Date: 2011-09-26 18:38:32
Message-ID: 1317062314.29925.7.camel@vanquo.pezone.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On mån, 2011-09-26 at 13:19 -0400, Robert Haas wrote:
> The thing that makes me doubt that is this comment from Tatsuo Ishii:
>
> TI> COPY explicitly specifies the encoding (to be UTF-8 in this case).
> So
> TI> I think we should not regard U+FEFF as "BOM" in COPY, rather we
> should
> TI> regard U+FEFF as "ZERO WIDTH NO-BREAK SPACE".
>
> If a BOM is confusable with valid data, then I think recognizing it
> and discarding it unconditionally is no good - you could end up where
> COPY OUT, TRUNCATE, COPY IN changes the table contents.

We did recently accept a patch for psql -f to skip over a UTF-8
byte-order mark. We had a lot of this same discussion there.


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, david(at)kineticode(dot)com, itagaki(dot)takahiro(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support UTF-8 files with BOM in COPY FROM
Date: 2011-09-26 18:44:12
Message-ID: CA+Tgmoa0UjF8qebP--e6PgKzL9uJ4jfAxhH-ihmUPyBwK8ANqw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Sep 26, 2011 at 2:38 PM, Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
> On mån, 2011-09-26 at 13:19 -0400, Robert Haas wrote:
>> The thing that makes me doubt that is this comment from Tatsuo Ishii:
>>
>> TI> COPY explicitly specifies the encoding (to be UTF-8 in this case).
>> So
>> TI> I think we should not regard U+FEFF as "BOM" in COPY, rather we
>> should
>> TI> regard U+FEFF as "ZERO WIDTH NO-BREAK SPACE".
>>
>> If a BOM is confusable with valid data, then I think recognizing it
>> and discarding it unconditionally is no good - you could end up where
>> COPY OUT, TRUNCATE, COPY IN changes the table contents.
>
> We did recently accept a patch for psql -f to skip over a UTF-8
> byte-order mark.  We had a lot of this same discussion there.

But that case is different, because zero-width, non-breaking space has
no particular meaning in an SQL script - it's either going to be
ignored as a BOM, ignored as whitespace, or an error. But inside a
file being subjected to COPY it might be confusable with data that the
user wanted to end up in some table.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, david(at)kineticode(dot)com, itagaki(dot)takahiro(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support UTF-8 files with BOM in COPY FROM
Date: 2011-09-26 18:47:06
Message-ID: 4E80C8AA.8060307@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 09/26/2011 02:38 PM, Peter Eisentraut wrote:
> On mån, 2011-09-26 at 13:19 -0400, Robert Haas wrote:
>> The thing that makes me doubt that is this comment from Tatsuo Ishii:
>>
>> TI> COPY explicitly specifies the encoding (to be UTF-8 in this case).
>> So
>> TI> I think we should not regard U+FEFF as "BOM" in COPY, rather we
>> should
>> TI> regard U+FEFF as "ZERO WIDTH NO-BREAK SPACE".
>>
>> If a BOM is confusable with valid data, then I think recognizing it
>> and discarding it unconditionally is no good - you could end up where
>> COPY OUT, TRUNCATE, COPY IN changes the table contents.
> We did recently accept a patch for psql -f to skip over a UTF-8
> byte-order mark. We had a lot of this same discussion there.
>
>

Yes, but wasn't part of the rationale that this was safe because a
leading BOM could not possibly be mistaken for anything else legitimate
in an SQL source file? That's quite different from a data file. ISTM.

cheers

andrew


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, david(at)kineticode(dot)com, itagaki(dot)takahiro(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support UTF-8 files with BOM in COPY FROM
Date: 2011-09-26 18:49:16
Message-ID: 1317062957.29925.11.camel@vanquo.pezone.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On mån, 2011-09-26 at 14:44 -0400, Robert Haas wrote:
> > We did recently accept a patch for psql -f to skip over a UTF-8
> > byte-order mark. We had a lot of this same discussion there.
>
> But that case is different, because zero-width, non-breaking space has
> no particular meaning in an SQL script - it's either going to be
> ignored as a BOM, ignored as whitespace, or an error. But inside a
> file being subjected to COPY it might be confusable with data that the
> user wanted to end up in some table.

Yes, my point was more directed toward the discussion about whether BOM
in UTF-8 are valid at all. But your point pretty much kills this
altogether. If I store a BOM in row 1, column 1 of my table, because,
well, maybe it's an XML document or something, then it needs to be able
to survive a copy out and in. The only way we could proceed with this
would be if we prohibited BOMs in all user-data.


From: Brar Piening <brar(at)gmx(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "David E(dot) Wheeler" <david(at)kineticode(dot)com>, Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Support UTF-8 files with BOM in COPY FROM
Date: 2011-09-26 18:57:25
Message-ID: 4E80CB15.10706@gmx.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Putting a BOM into UTF8 data is flat out invalid per spec --- the fact
> that Microsloth does it does not make it standards-conformant.

Could you share a pointer to the spec?
All I've ever heard is that a BOM is optional for UTF-8 but not forbidden.

The Unicode FAQ (http://unicode.org/faq/utf_bom.html#BOM) states "that
some recipients of UTF-8 encoded data do not expect a BOM".
Postgres obviously belongs to those recipients.
That's why all my psql-scripts transferring data from MSSQL to Postgres
need a '\! perl -CD -pi.orig -e "tr/\x{feff}//d" "C:/datafile.txt"'
before feeding data into COPY TO.

Reading it tolerantly and writing it on user request is probably the way
that would help most users.

Regards,

Brar


From: Brar Piening <brar(at)gmx(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, david(at)kineticode(dot)com, itagaki(dot)takahiro(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support UTF-8 files with BOM in COPY FROM
Date: 2011-09-26 19:11:53
Message-ID: 4E80CE79.2090405@gmx.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Yeah, that's a reasonable argument for rejecting the patch altogether.
> I'm not qualified to decide whether it outweighs the "we need to be able
> to read Notepad output" argument.

Actually it's not only notepad.

I quite often find myself doing something like the following when moving
data from MSSQL to PostgreSQL.

\echo Fetching data for table "patient"
\! sqlcmd -S DBSERVER -d DATABASE -E -f 65001 -o "C:/datafile.txt" -h -1
-W -s "|" -Q "SET NOCOUNT ON; SELECT * FROM my_table;"
\! perl -CD -pi.orig -e "tr/\x{feff}//d" "C:/datafile.txt"

\echo Importing data into table "patient"
\copy my_table FROM 'C:/datafile.txt' WITH DELIMITER '|' NULL 'NULL'

Regards,

Brar


From: Brar Piening <brar(at)gmx(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, david(at)kineticode(dot)com, itagaki(dot)takahiro(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support UTF-8 files with BOM in COPY FROM
Date: 2011-09-26 19:21:14
Message-ID: 4E80D0AA.4080906@gmx.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas wrote:
> The thing that makes me doubt that is this comment from Tatsuo Ishii:
>
> TI> COPY explicitly specifies the encoding (to be UTF-8 in this case). So
> TI> I think we should not regard U+FEFF as "BOM" in COPY, rather we should
> TI> regard U+FEFF as "ZERO WIDTH NO-BREAK SPACE".
>
> If a BOM is confusable with valid data, then I think recognizing it
> and discarding it unconditionally is no good - you could end up where
> COPY OUT, TRUNCATE, COPY IN changes the table contents.

Citing from the Unicode FAQ again:

Q: Where is a BOM useful?
A: A BOM is useful at the beginning of files that are typed as text, but
for which it is not known whether they are in big or little endian
format—it can also serve as a hint indicating that the file is in
Unicode, as opposed to in a legacy encoding and furthermore, it act as a
signature for the specific encoding form used.

I think that the major hint in the answer is "beginning of files".

To correctly handle a BOM you need to be sure to be in the context of
files that have defined bounds (especially a *beginning*) you can't
properly handle a BOM in arbitrary streams.

Regards,

Brar


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Brar Piening <brar(at)gmx(dot)de>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, david(at)kineticode(dot)com, itagaki(dot)takahiro(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support UTF-8 files with BOM in COPY FROM
Date: 2011-09-26 19:48:53
Message-ID: 29877.1317066533@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Brar Piening <brar(at)gmx(dot)de> writes:
> Citing from the Unicode FAQ again:

> Q: Where is a BOM useful?
> A: A BOM is useful at the beginning of files that are typed as text, but
> for which it is not known whether they are in big or little endian
> formatit can also serve as a hint indicating that the file is in
> Unicode, as opposed to in a legacy encoding and furthermore, it act as a
> signature for the specific encoding form used.

Note that the reference to byte order betrays the implicit context
assumption: that we're talking about UTF16 or UTF32 representation.
A BOM in UTF8 data is useless for its intended purpose of disambiguating
byte order. It could possibly be useful for telling UTF8 data apart
from non-UTF8 data, except for the inconvenient fact that that byte
sequence is not invalid data in non-UTF8 encodings.

BOM is useless in UTF8, no matter what Microsoft thinks. Any tool that
relies on it to detect UTF8 data has to have a workaround for overriding
that detection, or it's broken to the point of uselessness.

regards, tom lane


From: Brar Piening <brar(at)gmx(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, david(at)kineticode(dot)com, itagaki(dot)takahiro(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support UTF-8 files with BOM in COPY FROM
Date: 2011-09-27 05:49:58
Message-ID: 4E816406.1050001@gmx.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Note that the reference to byte order betrays the implicit context
> assumption: that we're talking about UTF16 or UTF32 representation.
Note that there is no implicit context assumption in the Unicode FAQ.
It's equally covering UTF-8, UTF-16 and UTF-32.
Another quote:
Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If
yes, then can I still assume the remaining UTF-8 bytes are in big-endian
order?
A: Yes, UTF-8 can contain a BOM. However, it makes /no/ difference as to
the endianness of the byte stream. UTF-8 always has the same byte order.
An initial BOM is /only/ used as a signature --- an indication that an
otherwise unmarked text file is in UTF-8. Note that some recipients of
UTF-8 encoded data do not expect a BOM. Where UTF-8 is
used/transparently/ in 8-bit environments, the use of a BOM will
interfere with any protocol or file format that expects specific ASCII
characters at the beginning, such as the use of "#!" of at the beginning
of Unix shell scripts.
>
> BOM is useless in UTF8, no matter what Microsoft thinks. Any tool that
> relies on it to detect UTF8 data has to have a workaround for overriding
> that detection, or it's broken to the point of uselessness.
This kind of brokenness is currently existing the other way around (see
my reference to the perl script I' using to work aound it).

Note also that I'm not citing a Microsoft FAQ but the Unicode FAQ.
I'm also not trying to convert Postgres into a Microsoft tool (I'm
pretty happy it isn't) but I'm pointing to existing compatibility issues
on a Platform that others have decided to support.
Belonging to the huge group of users who have little or no choice in
what OS they are using and being from a country where plain ASCII isn't
enough to cover all existing characters this is probably fair.

It's a pity that the Unicode standard actually allows something that can
cause problems but blaming the non-platform again doesn't solve the
existing issues.

Regards,

Brar


From: Brar Piening <brar(at)gmx(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, david(at)kineticode(dot)com, itagaki(dot)takahiro(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support UTF-8 files with BOM in COPY FROM
Date: 2011-09-27 05:58:16
Message-ID: 4E8165F8.1040801@gmx.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Brar Piening wrote:
>
> It's a pity that the Unicode standard actually allows something that
> can cause problems but blaming the non-platform again doesn't solve
> the existing issues.

To put in a more humoruos but actually correct way:

M$ has found a standard conforming way of preventing users to migrate
data from MSSQL to PostgreSQL.
Do you want to work arond it?

Regards,

Brar


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, david(at)kineticode(dot)com, itagaki(dot)takahiro(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support UTF-8 files with BOM in COPY FROM
Date: 2011-09-27 14:21:37
Message-ID: 1317133297.10184.1.camel@fsopti579.F-Secure.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On mån, 2011-09-26 at 21:49 +0300, Peter Eisentraut wrote:
> If I store a BOM in row 1, column 1 of my table, because,
> well, maybe it's an XML document or something, then it needs to be
> able to survive a copy out and in. The only way we could proceed with
> this would be if we prohibited BOMs in all user-data.

Alternative consideration: We could allow this in CSV format if we made
users quote the first value if it starts with a BOM. This might be a
reasonable way to get MS compatibility.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, david(at)kineticode(dot)com, itagaki(dot)takahiro(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support UTF-8 files with BOM in COPY FROM
Date: 2011-09-27 14:32:03
Message-ID: 22884.1317133923@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Peter Eisentraut <peter_e(at)gmx(dot)net> writes:
> Alternative consideration: We could allow this in CSV format if we made
> users quote the first value if it starts with a BOM. This might be a
> reasonable way to get MS compatibility.

I don't think we can get away with a retroactive restriction on the
contents of data files.

If we're going to do this at all, I still think an explicit BOM option
for COPY, to either eat (and require) a BOM on input or emit a BOM on
output, would be the sanest way. None of the "automatic" approaches
seem safe to me.

regards, tom lane