Re: UTF8 with BOM support in psql

Lists:	pgsql-hackers

From:	Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	UTF8 with BOM support in psql
Date:	2009-10-20 05:41:11
Message-ID:	20091020143811.379B.52131E4D@oss.ntt.co.jp

UTF8-encoded text files with a BOM (Byte Order Mark) are commonly
used on Windows, though the BOM was originally designed for UTF16 text.
However, psql cannot read such files even if we set the client encoding
to UTF8. Is it worth supporting this format in psql?

When psql opens a file with -f or \i, it would check the first 3 bytes of
the file. If they are a BOM, it would discard those 3 bytes and change the
client encoding to UTF8 automatically.
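
A minimal sketch of the detection step described above, written in Python rather than psql's C and not taken from any patch in this thread. The function name split_utf8_bom is hypothetical; only the check of the first 3 bytes comes from the proposal.

```python
import codecs
from typing import Tuple


def split_utf8_bom(data: bytes) -> Tuple[bool, bytes]:
    """Return (had_bom, payload) for the raw bytes of a script file.

    Hypothetical helper: it only looks at the first 3 bytes, as proposed.
    """
    if data.startswith(codecs.BOM_UTF8):  # codecs.BOM_UTF8 == b"\xef\xbb\xbf"
        return True, data[len(codecs.BOM_UTF8):]
    return False, data


had_bom, body = split_utf8_bom(codecs.BOM_UTF8 + b"SELECT 1;\n")
print(had_bom, body)
```

Whether the caller should also switch the client encoding when had_bom is true is a separate decision.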

Is this change reasonable? Comments welcome.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

From:	Bruce Momjian <bruce(at)momjian(dot)us>
To:	Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-10-20 05:58:58
Message-ID:	200910200558.n9K5wwM11713@momjian.us

Itagaki Takahiro wrote:
> UTF8 encoding text files with BOM (Byte Order Mark) are commonly
> used in Windows, though BOM was designed for UTF16 text originally.
> However, psql cannot read such format even if we set client encoding
> to UTF8. Is it worth supporting those format in psql?
>
> When psql opens a file with -f or \i, it checks first 3 bytes of the
> file. If they are BOM, discard the 3 bytes and change client encoding
> to UTF8 automatically.
>
> Is this change reasonable? Comments welcome.

Seems there is community support for accepting BOM:

http://archives.postgresql.org/pgsql-hackers/2009-09/msg01625.php

Should I add this as a TODO item?

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

From:	Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To:	Bruce Momjian <bruce(at)momjian(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-10-20 06:18:49
Message-ID:	20091020151042.379E.52131E4D@oss.ntt.co.jp

Bruce Momjian <bruce(at)momjian(dot)us> wrote:

> Itagaki Takahiro wrote:
> > When psql opens a file with -f or \i, it checks first 3 bytes of the
> > file. If they are BOM, discard the 3 bytes and change client encoding
> > to UTF8 automatically.
>
> Seems there is community support for accepting BOM:
> http://archives.postgresql.org/pgsql-hackers/2009-09/msg01625.php

Thank you for the information.
I read the thread; it discussed BOM handling in *data types*.
I agree with the decision in that thread that we should not skip BOM
characters there, but we can handle a BOM differently at the head of
*files* for psql and COPY input.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

From:	Peter Eisentraut <peter_e(at)gmx(dot)net>
To:	Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-10-20 09:54:41
Message-ID:	1256032481.9382.19.camel@fsopti579.F-Secure.com

On Tue, 2009-10-20 at 14:41 +0900, Itagaki Takahiro wrote:
> UTF8 encoding text files with BOM (Byte Order Mark) are commonly
> used in Windows, though BOM was designed for UTF16 text originally.
> However, psql cannot read such format even if we set client encoding
> to UTF8. Is it worth supporting those format in psql?

psql doesn't have a problem, but the backend's lexer doesn't parse the
BOM as whitespace. Since the lexer is byte-based, it will presumably
have problems with anything outside of ASCII that Unicode considers
whitespace.

> When psql opens a file with -f or \i, it checks first 3 bytes of the
> file. If they are BOM, discard the 3 bytes and change client encoding
> to UTF8 automatically.

While I see that the Unicode standard supports using a UTF-8 encoded BOM
as UTF-8 signature, I wonder if those bytes can usefully appear in a
leading position in other encodings.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Bruce Momjian <bruce(at)momjian(dot)us>
Cc:	Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-10-20 14:36:46
Message-ID:	26930.1256049406@sss.pgh.pa.us

Bruce Momjian <bruce(at)momjian(dot)us> writes:
> Seems there is community support for accepting BOM:
> http://archives.postgresql.org/pgsql-hackers/2009-09/msg01625.php

That discussion has approximately nothing to do with the
much-more-invasive change that Itagaki-san is suggesting.

In particular I think an automatic change of client_encoding isn't
particularly a good idea --- wouldn't you have to change it back later,
and is there any possibility of creating a security issue from such
behavior? Remember that client_encoding *IS* tied to security issues
such as backslash escape handling.

regards, tom lane

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Bruce Momjian <bruce(at)momjian(dot)us>, Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-10-20 15:13:19
Message-ID:	4ADDD38F.50804@dunslane.net

Tom Lane wrote:
> Bruce Momjian <bruce(at)momjian(dot)us> writes:
>
>> Seems there is community support for accepting BOM:
>> http://archives.postgresql.org/pgsql-hackers/2009-09/msg01625.php
>>
>
> That discussion has approximately nothing to do with the
> much-more-invasive change that Itagaki-san is suggesting.
>
> In particular I think an automatic change of client_encoding isn't
> particularly a good idea --- wouldn't you have to change it back later,
> and is there any possibility of creating a security issue from such
> behavior? Remember that client_encoding *IS* tied to security issues
> such as backslash escape handling.
>
>
>

Yeah, I don't think we should be second-guessing the user on the encoding.

What I think we might sensibly do is to eat the leading BOM of an SQL
file iff the client encoding is UTF8, and otherwise treat it as just
bytes in whatever the encoding is.
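
A minimal Python sketch of the rule suggested above: drop a leading BOM only when the client encoding is already UTF8. The function name and the name normalization are assumptions for illustration, not psql code.

```python
import codecs


def eat_leading_bom(data: bytes, client_encoding: str) -> bytes:
    """Drop a leading UTF-8 BOM only if the client encoding is UTF8.

    Hypothetical helper; in any other encoding the bytes are left alone.
    """
    # Accept spellings such as "utf8", "UTF-8" or "utf_8" (an assumption).
    normalized = client_encoding.replace("-", "").replace("_", "").upper()
    if normalized == "UTF8" and data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


script = codecs.BOM_UTF8 + b"SELECT 1;\n"
print(eat_leading_bom(script, "UTF8"))    # BOM removed
print(eat_leading_bom(script, "LATIN1"))  # bytes unchanged
```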

Should we also do the same for files passed via \copy? What about
streams on stdin? What about files read from the backend via COPY?

cheers

andrew

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	Bruce Momjian <bruce(at)momjian(dot)us>, Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-10-20 15:51:31
Message-ID:	28242.1256053891@sss.pgh.pa.us

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> What I think we might sensibly do is to eat the leading BOM of an SQL
> file iff the client encoding is UTF8, and otherwise treat it as just
> bytes in whatever the encoding is.

That seems relatively non-risky.

> Should we also do the same for files passed via \copy? What about
> streams on stdin? What about files read from the backend via COPY?

Not thrilled about doing this on stdin --- you have no good
justification for assuming that start of stdin corresponds to a file
boundary somewhere. COPY files, maybe.

regards, tom lane

From:	"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To:	"Andrew Dunstan" <andrew(at)dunslane(dot)net>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	"Bruce Momjian" <bruce(at)momjian(dot)us>, "Itagaki Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-10-20 15:54:49
Message-ID:	4ADD96F9020000250002BB9B@gw.wicourts.gov

Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:

> What I think we might sensibly do is to eat the leading BOM of an
> SQL file iff the client encoding is UTF8, and otherwise treat it as
> just bytes in whatever the encoding is.

Only at the beginning of the file or stream? What happens when people
concatenate files? Would it make sense to treat BOM as whitespace in
UTF-8, or maybe ignore it entirely?

-Kevin

From:	Magnus Hagander <magnus(at)hagander(dot)net>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, Bruce Momjian <bruce(at)momjian(dot)us>, Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-10-20 16:02:01
Message-ID:	9837222c0910200902y4e0ad560o1c188e0f495f8b15@mail.gmail.com

2009/10/20 Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>> What I think we might sensibly do is to eat the leading BOM of an SQL
>> file iff the client encoding is UTF8, and otherwise treat it as just
>> bytes in whatever the encoding is.
>
> That seems relatively non-risky.

+1.

>> Should we also do the same for files passed via \copy? What about
>> streams on stdin? What about files read from the backend via COPY?
>
> Not thrilled about doing this on stdin --- you have no good
> justification for assuming that start of stdin corresponds to a file
> boundary somewhere. COPY files, maybe.

Yeah, that seems a lot more error-prone.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

From:	David Christensen <david(at)endpoint(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, Bruce Momjian <bruce(at)momjian(dot)us>, Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-10-20 16:02:02
Message-ID:	94DDAFD4-9A1F-4F8D-8F82-9E632EE5964F@endpoint.com

On Oct 20, 2009, at 10:51 AM, Tom Lane wrote:

> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>> What I think we might sensibly do is to eat the leading BOM of an SQL
>> file iff the client encoding is UTF8, and otherwise treat it as just
>> bytes in whatever the encoding is.
>
> That seems relatively non-risky.

Is that only when the default client encoding is set to UTF8
(PGCLIENTENCODING, whatever), or will it be coded to work with the
following:

$ PGCLIENTENCODING=...nonutf8...
$ psql -f <file>

Where <file> is:
<BOM>
...

SET client_encoding = 'utf8';

...
EOF

Regards,

David
--
David Christensen
End Point Corporation
david(at)endpoint(dot)com

From:	Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-10-21 04:11:59
Message-ID:	20091021114142.9561.52131E4D@oss.ntt.co.jp

David Christensen <david(at)endpoint(dot)com> wrote:

> Is that only when the default client encoding is set to UTF8
> (PGCLIENTENCODING, whatever), or will it be coded to work with the
> following:
>
> $ psql -f <file>
> Where <file> is:
> <BOM>
> SET client_encoding = 'utf8';

Sure. The client encoding is declared in the body of a file, but the BOM
is at the head of the file. So, we should always ignore a BOM sequence
at the file head no matter what client encoding is used.

The attached patch replaces the BOM with white spaces, but it does not
change the client encoding automatically. I think we can always ignore
the client encoding for the replacement because an SQL command cannot
start with a BOM sequence. If we don't ignore the sequence, execution of
the script must fail with a syntax error.

This patch does nothing about the COPY and \copy commands. It might be
possible to add BOM handling code around AllocateFile() in CopyFrom()
to support "COPY FROM 'utf8file-with-bom.tsv'", but we need another
approach for "COPY FROM STDIN". For example, the input of
$ cat utf8bom-1.tsv utf8bom-2.tsv | psql -c "COPY FROM STDIN"
might contain a BOM sequence in the middle of the input stream.
Anyway, those changes would come in another patch in the future.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Attachment	Content-Type	Size
psql-utf8bom_20091021.patch	application/octet-stream	737 bytes

From:	Peter Eisentraut <peter_e(at)gmx(dot)net>
To:	Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-10-21 10:00:08
Message-ID:	1256119208.12996.1.camel@fsopti579.F-Secure.com

On Wed, 2009-10-21 at 13:11 +0900, Itagaki Takahiro wrote:
> The attached patch replaces the BOM with white spaces, but it does not
> change client encoding automatically. I think we can always ignore
> client encoding at the replacement because SQL command cannot start
> with BOM sequence. If we don't ignore the sequence, execution of
> the script must fail with syntax error.

I feel that psql is the wrong place to fix this. BOMs in UTF-8 should
be ignored everywhere, all the time.

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc:	Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-10-21 13:08:23
Message-ID:	4ADF07C7.3000608@dunslane.net

Peter Eisentraut wrote:
> On Wed, 2009-10-21 at 13:11 +0900, Itagaki Takahiro wrote:
>
>> The attached patch replaces the BOM with white spaces, but it does not
>> change client encoding automatically. I think we can always ignore
>> client encoding at the replacement because SQL command cannot start
>> with BOM sequence. If we don't ignore the sequence, execution of
>> the script must fail with syntax error.
>>
>
> I feel that psql is the wrong place to fix this. BOMs in UTF-8 should
> be ignored everywhere, all the time.
>
>

I suggest you re-read the Unicode FAQ on the subject. That is not the
conclusion I came to after I read it. Quite the reverse in fact.

cheers

andrew

From:	Peter Eisentraut <peter_e(at)gmx(dot)net>
To:	Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-10-24 21:33:06
Message-ID:	1256419986.15589.12.camel@vanquo.pezone.net

On Wed, 2009-10-21 at 13:11 +0900, Itagaki Takahiro wrote:
> So, we should always ignore BOM sequence
> at the file head no matter what client encoding is used.

I think we can't do that. That byte sequence might be valid user data
in other encodings.
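
To make the concern concrete, here is a small Python illustration (not from the thread): the three bytes of a UTF-8 encoded BOM decode without error as ordinary characters in ISO-8859-1, so they can be legitimate data there. LATIN1 is used only as one example of such an encoding.

```python
bom = b"\xef\xbb\xbf"

# Interpreted as UTF-8, the three bytes are the single character U+FEFF.
as_utf8 = bom.decode("utf-8")
print(as_utf8 == "\ufeff")   # True

# Interpreted as ISO-8859-1 (LATIN1), the same bytes are three
# printable characters: i-diaeresis, right guillemet, inverted question mark.
as_latin1 = bom.decode("latin-1")
print(len(as_latin1))        # 3
print(as_latin1 == "\u00ef\u00bb\u00bf")  # True
```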

From:	Peter Eisentraut <peter_e(at)gmx(dot)net>
To:	Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-11-14 10:46:47
Message-ID:	1258195607.14314.20.camel@vanquo.pezone.net

On Wed, 2009-10-21 at 13:11 +0900, Itagaki Takahiro wrote:
> Client encoding is declared in body of a file, but BOM is
> in head of the file. So, we should always ignore BOM sequence
> at the file head no matter what client encoding is used.
>
> The attached patch replaces the BOM with white spaces, but it does not
> change client encoding automatically. I think we can always ignore
> client encoding at the replacement because SQL command cannot start
> with BOM sequence. If we don't ignore the sequence, execution of
> the script must fail with syntax error.

I don't know what the best solution is here. The BOM encoded as UTF-8
is valid data in other encodings. Of course, there is your point that
such data cannot be at the start of an SQL command.

There is also the notion of how files are handled on Unix. Normally,
you'd assume that all of

psql -f file.sql
psql < file.sql
cat file.sql | psql
cat file1.sql file2.sql | psql

behave consistently. That would require that the BOM is ignored in the
middle of the data stream (which is legal and required per Unicode
standard) and that this only happens if the character set is actually
Unicode.

Any ideas?

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc:	Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-11-14 13:06:01
Message-ID:	4AFEAB39.3000009@dunslane.net

Peter Eisentraut wrote:
> On Wed, 2009-10-21 at 13:11 +0900, Itagaki Takahiro wrote:
>
>> Client encoding is declared in body of a file, but BOM is
>> in head of the file. So, we should always ignore BOM sequence
>> at the file head no matter what client encoding is used.
>>
>> The attached patch replaces the BOM with white spaces, but it does not
>> change client encoding automatically. I think we can always ignore
>> client encoding at the replacement because SQL command cannot start
>> with BOM sequence. If we don't ignore the sequence, execution of
>> the script must fail with syntax error.
>>
>
> I don't know what the best solution is here. The BOM encoded as UTF-8
> is valid data in other encodings. Of course, there is your point that
> such data cannot be at the start of an SQL command.
>
> There is also the notion of how files are handled on Unix. Normally,
> you'd assume that all of
>
> psql -f file.sql
> psql < file.sql
> cat file.sql | psql
> cat file1.sql file2.sql | psql
>
> behave consistently. That would require that the BOM is ignored in the
> middle of the data stream (which is legal and required per Unicode
> standard) and that this only happens if the character set is actually
> Unicode.
>
>
>

Cases 2 and 3 should be indistinguishable from psql's POV, although case
3 wins a "Useless Use of cat" award.

If we are only eating a BOM at the start of a file, which was the
consensus IIRC, and we treat STDIN as a file for this purpose, then we
would eat the leading BOM on file.sql and file1.sql in all the cases
above but not on file2.sql since we would not have any idea where the
file boundary was. That last case strikes me as a not very likely usage
(I'm pretty sure I've never used it, at least). A file containing:

\i file1.sql
\i file2.sql

would be the workaround if needed.

As for handling the fact that client encoding can't be set in a script
until after the leading BOM, there is always

PGOPTIONS="-c client_encoding=UTF8"

or similar.

cheers

andrew

From:	Peter Eisentraut <peter_e(at)gmx(dot)net>
To:	Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-11-16 20:37:07
Message-ID:	1258403827.21773.9.camel@vanquo.pezone.net

On Wed, 2009-10-21 at 13:11 +0900, Itagaki Takahiro wrote:
> Sure. Client encoding is declared in body of a file, but BOM is
> in head of the file. So, we should always ignore BOM sequence
> at the file head no matter what client encoding is used.
>
> The attached patch replaces the BOM with white spaces, but it does not
> change client encoding automatically. I think we can always ignore
> client encoding at the replacement because SQL command cannot start
> with BOM sequence. If we don't ignore the sequence, execution of
> the script must fail with syntax error.

OK, I think the consensus here is:

- Eat BOM at beginning of file (as you implemented)

- Only when client encoding is UTF-8 --> please fix that

I'm not sure if replacing a BOM by three spaces is a good way to
implement "eating", because it might throw off a column indicator
somewhere, say, but I couldn't reproduce a problem. Note that the
U+FEFF character is defined as *zero-width* non-breaking space.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc:	Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-11-16 21:01:53
Message-ID:	7995.1258405313@sss.pgh.pa.us

Peter Eisentraut <peter_e(at)gmx(dot)net> writes:
> I'm not sure if replacing a BOM by three spaces is a good way to
> implement "eating", because it might throw off a column indicator
> somewhere, say, but I couldn't reproduce a problem. Note that the
> U+FEFF character is defined as *zero-width* non-breaking space.

So wouldn't it be better to remove the three bytes, rather than
replace with spaces? The latter will certainly confuse clients that
think that "column 1" means what they think is the first character.
A syntax error in the first line of the file should be sufficient
to demonstrate the issue.
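
The column shift can be illustrated with a small Python sketch. This is an illustration under assumed inputs, not psql's actual error reporting: a script whose first token contains a typo, processed once with the BOM replaced by three spaces and once with it removed.

```python
import codecs

# A script with a BOM and a deliberate typo in the first token.
raw = codecs.BOM_UTF8 + b"SELEC 1;\n"

padded = raw.replace(codecs.BOM_UTF8, b"   ", 1)  # BOM -> three spaces
stripped = raw[len(codecs.BOM_UTF8):]             # BOM removed

# 1-based column of the offending token in each variant.
col_padded = padded.index(b"SELEC") + 1
col_stripped = stripped.index(b"SELEC") + 1
print(col_padded, col_stripped)  # 4 1
```

With padding, a tool that reports byte positions would place the error at column 4, although an editor that hides the BOM shows the token at column 1.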

regards, tom lane

From:	Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To:	Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-11-17 00:31:51
Message-ID:	20091117093151.14F2.52131E4D@oss.ntt.co.jp

Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:

> OK, I think the consensus here is:
> - Eat BOM at beginning of file (as you implemented)
> - Only when client encoding is UTF-8 --> please fix that

Are those two conditions ANDed? If so, this patch will be useless.
Please remember that \encoding or SET client_encoding appears
*after* the BOM at the beginning of the file. I would agree if the
condition were
"Eat BOM at beginning of file and <<set client encoding to UTF-8>>",
as in:
Defining Python Source Code Encodings:
http://www.python.org/dev/peps/pep-0263/

> I'm not sure if replacing a BOM by three spaces is a good way to
> implement "eating", because it might throw off a column indicator
> somewhere, say, but I couldn't reproduce a problem. Note that the
> U+FEFF character is defined as *zero-width* non-breaking space.

I assumed psql discards whitespace automatically, but I see it is
more robust to remove the BOM bytes explicitly. I'll fix it.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc:	Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-11-17 00:37:35
Message-ID:	15613.1258418255@sss.pgh.pa.us

Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp> writes:
> Please remember \encoding or SET client_encoding appear
> *after* BOM at beginning of file. I'll agree if the condition is
> "Eat BOM at beginning of file and <<set client encoding to UTF-8>>",

As has been stated multiple times, that will not get accepted,
because it will *break* files in other encodings that chance to
match the BOM pattern.

regards, tom lane

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc:	Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-11-17 00:51:26
Message-ID:	4B01F38E.1030402@dunslane.net

Itagaki Takahiro wrote:
> Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
>
>
>> OK, I think the consensus here is:
>> - Eat BOM at beginning of file (as you implemented)
>> - Only when client encoding is UTF-8 --> please fix that
>>
>
> Are they AND condition? If so, this patch will be useless.
> Please remember \encoding or SET client_encoding appear
> *after* BOM at beginning of file. I'll agree if the condition is
> "Eat BOM at beginning of file and <<set client encoding to UTF-8>>",
> like:
> Defining Python Source Code Encodings:
> http://www.python.org/dev/peps/pep-0263/
>

As previously discussed we should not be automagically setting the
client encoding, nor inferring it from the presence of a BOM.

As for when it can be set, unless I'm mistaken you should be able to set
it before any file is opened, if you need to, using PGOPTIONS or psql
"dbname=mydb options='-c client_encoding=utf8'". Of course, if the
server encoding is utf8 then, in the absence of it being set using those
methods, the client encoding will start as utf8 also.

cheers

andrew

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-11-17 01:03:02
Message-ID:	16034.1258419782@sss.pgh.pa.us

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> As for when it can be set, unless I'm mistaken you should be able to set
> it before any file is opened, if you need to, using PGOPTIONS or psql
> "dbname=mydb options='-c client_encoding=utf8'". Of course, if the
> server encoding is utf8 then, in the absence of it being set using those
> methods, the client encoding will start as utf8 also.

It could also be set in ~/.psqlrc, which would probably be the most
convenient method for regular users of UTF8 files who need to talk
to non-UTF8 databases.

regards, tom lane

From:	Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-11-17 02:30:46
Message-ID:	20091117113046.14F9.52131E4D@oss.ntt.co.jp

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> > if you need to, using PGOPTIONS or psql
> > "dbname=mydb options='-c client_encoding=utf8'".
>
> It could also be set in ~/.psqlrc, which would probably be the most
> convenient method for regular users of UTF8 files who need to talk
> to non-UTF8 databases.

It's nonsense. Users often use scripts written in different encodings
at the same time. Encoding information should be packed in the script file
itself. We should not force users to open script files and check their
encoding before they execute them.

BTW, I have an idea to improve the handling of per-file encoding.
Currently we continue to use the encoding setting specified in a file
included with the \i command. But should the setting be reverted at the
end of the file? i.e.:

=# \encoding SJIS
=# \i script-in-utf8.sql
=# -- encoding should be SJIS here.

If encoding setting is reverted,
> "Eat BOM at beginning of file and <<set client encoding to UTF-8>>"
will be much safer.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-11-17 03:44:28
Message-ID:	19199.1258429468@sss.pgh.pa.us

Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp> writes:
> If encoding setting is reverted,
>> "Eat BOM at beginning of file and <<set client encoding to UTF-8>>"
> will be much safer.

This isn't going to happen, so please stop wasting our time arguing
about it.

regards, tom lane

From:	Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-11-17 05:19:58
Message-ID:	20091117141958.150B.52131E4D@oss.ntt.co.jp

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp> writes:
> > If encoding setting is reverted,
> >> "Eat BOM at beginning of file and <<set client encoding to UTF-8>>"
> > will be much safer.
>
> This isn't going to happen, so please stop wasting our time arguing
> about it.

Ok, sorry. But I still cannot accept this restriction.
>> - Only when client encoding is UTF-8 --> please fix that

The attached patch is a new proposal for the feature.
When we find a BOM at the beginning of a file, we set "expected_encoding" to UTF8.
Before every execution of a query, if pset.encoding is not UTF8, we check that
the query string does not contain any non-ASCII characters and throw an error
if any are found. Encoding declarations are typically written only in ASCII
characters, so we can postpone encoding checking until non-ASCII characters appear.

Since the default value of expected_encoding is SQL_ASCII, which passes
all characters through, the patch does nothing to scripts without a BOM.
(There is no code that sets expected_encoding other than the BOM check.)
If the client encoding is UTF8, it skips the BOM and has no effect on the script body.
BOMs are skipped even if the client encoding is not set to UTF8, but psql can throw
an error if there is no explicit encoding declaration.
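
The check described above can be sketched as follows. This is a Python approximation, not the attached C patch: expected_encoding and the client encoding (pset.encoding) are named in the message, while the function name check_query and the use of ValueError are assumptions.

```python
def check_query(query: bytes, client_encoding: str, expected_encoding: str) -> None:
    """Raise if a BOM-marked script sends non-ASCII bytes while the
    client encoding is not UTF8. Hypothetical stand-in for the patch logic.
    """
    if expected_encoding != "UTF8":
        return  # no BOM was seen; the SQL_ASCII default lets everything pass
    if client_encoding == "UTF8":
        return  # encodings agree, nothing to verify
    if any(byte >= 0x80 for byte in query):
        raise ValueError(
            "non-ASCII data in a script with a UTF8 BOM, "
            "but client encoding is " + client_encoding
        )


# An all-ASCII encoding declaration passes even before the encoding is set.
check_query(b"SET client_encoding = 'UTF8';", "SJIS", "UTF8")

# Non-ASCII text sent under a non-UTF8 client encoding is rejected.
try:
    check_query("SELECT '\u00e9';".encode("utf-8"), "SJIS", "UTF8")
except ValueError as err:
    print("rejected:", err)
```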

AFAICS, the patch can solve almost all of the problems raised in the
discussion. Comments welcome.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Attachment	Content-Type	Size
psql-utf8bom_20091117.patch	application/octet-stream	2.9 KB

From:	Peter Eisentraut <peter_e(at)gmx(dot)net>
To:	Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-11-17 07:02:17
Message-ID:	1258441337.10724.13.camel@fsopti579.F-Secure.com

On Tue, 2009-11-17 at 14:19 +0900, Itagaki Takahiro wrote:
> The attached patch is a new proposal of the feature.
> When we found BOM at beginning of file, set "expected_encoding" to UTF8.
> Before every execution of query, if pset.encoding is not UTF8, we check the
> query string not to contain any non-ASCII characters and throw an error if
> found. Encoding declarations are typically written only in ascii characters,
> so we can postpone encoding checking until non-ascii characters appear.
>
> Since the default value of expected_encoding is SQL_ASCII, that pass
> through all characters, so the patch does nothing to scripts without BOM.
> (There are no codes to set expected_encoding except BOM.)
> If client encoding is UTF8, it skips BOM and no effect to the script body.
> BOMs are skipped even if client encoding is not set to UTF8, but can throw
> an error if there are no explicit encoding declaration.

I think I could support using the presence of the BOM as a fall-back
indicator of encoding in absence of any other declaration. It seems to
me, however, that the description above ignores the existence of
encodings other than SQL_ASCII and UTF8.

Also, when the proposed patch to set the encoding from the locale
appears, we need to make this logic more precise. Something like:

1. set client_encoding or \encoding, otherwise
2. if BOM found, then UTF8, otherwise
3. by locale environment, otherwise
4. SQL_ASCII (= server encoding, effectively)
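
The precedence above can be sketched as a small resolver. This is only an illustration: the function name and parameters are hypothetical, and the locale-derived encoding refers to a patch that had only been proposed at this point in the thread.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical resolver for the four-step fall-back order.
 * explicit_encoding: value from SET client_encoding or \encoding, or NULL
 * bom_found:         a UTF-8 BOM was seen at the head of the file
 * locale_encoding:   encoding derived from the locale environment, or NULL
 */
const char *
resolve_client_encoding(const char *explicit_encoding,
                        bool bom_found,
                        const char *locale_encoding)
{
    if (explicit_encoding != NULL)
        return explicit_encoding;   /* 1. explicit declaration wins */
    if (bom_found)
        return "UTF8";              /* 2. BOM as fall-back indicator */
    if (locale_encoding != NULL)
        return locale_encoding;     /* 3. locale environment */
    return "SQL_ASCII";             /* 4. effectively the server encoding */
}
```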

Also, I'm not sure if we need this logic only when we send a query. It
might be better to do this in the lexer when we find a non-ASCII
character and we don't have a client encoding != SQL_ASCII set yet.

From:	Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To:	Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-11-17 07:40:23
Message-ID:	20091117164023.1513.52131E4D@oss.ntt.co.jp
Lists:	pgsql-hackers

Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:

> I think I could support using the presence of the BOM as a fall-back
> indicator of encoding in absence of any other declaration.

What is the difference between the fall-back and <<set client encoding to UTF-8
if BOM found>>? As I read this discussion, we cannot accept any automatic
encoding detection (properly speaking, detection is OK, but automatic
assignment is not). We should not have any fall-back mechanism, no?

> Also, when the proposed patch to set the encoding from the locale
> appears, we need to make this logic more precise.

An encoding-from-locale feature will be useful, but the patch does *not*
set any encodings. The reason is the same as above.

> Also, I'm not sure if we need this logic only when we send a query. It
> might be better to do this in the lexer when we find a non-ASCII
> character and we don't have a client encoding != SQL_ASCII set yet.

Absolutely, but isn't it an issue independent of BOM? Multi-byte scripts
without encoding are always dangerous whether BOM is present or not.
I'd say we can always throw an error when we find queries that contain
multi-byte characters if no prior encoding declaration.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

From:	Chuck McDevitt <cmcdevitt(at)greenplum(dot)com>
To:	Peter Eisentraut <peter_e(at)gmx(dot)net>, Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-11-17 08:59:25
Message-ID:	2106D8DC89010842BABA5CD03FEA4061012E8BE3B9@EXVMBX018-10.exch018.msoutlookonline.net
Lists:	pgsql-hackers

>
> I don't know what the best solution is here. The BOM encoded as UTF-8
> is valid data in other encodings. Of course, there is your point that
> such data cannot be at the start of an SQL command.
>

Is the UTF-8 BOM ( EF BB BF ) actually valid data in any other multi-byte encoding (other than its intended use in UTF-8)?

I realize that for single-byte encoding, such as latin-1, it would be legal as data, since any bytes other than 00 are legal, although never legal outside a quoted string in a SQL command or psql command.

Certainly, no psql command input file can start with those bytes, or you would get an error (unless it is changed so the BOM is ignored).

As to zero-width non-breaking space: the BOM is supposed to be treated as such if in the middle of a string, but if it is the start, it is just the BOM, and isn't considered part of the data, if I'm reading the spec right. Perhaps the lexers should allow for it as white space (along with other Unicode space characters, such as U+2060).
It's not really important, since allowing the BOM sequence in the middle of a file is "deprecated" according to the Unicode standard.

And what if you see a real BOM, FF FE or FE FF or FF FE 00 00 or 00 00 FE FF? Give an error saying UTF-16 and UTF-32 are not supported?

Or is there a plan to read and convert the UTF-16 or UTF-32 to UTF-8, so psql and PostgreSQL understand it?
(BTW, that would actually be nice on Windows, where UTF-16 is common).

If we accept UTF-8 BOM, we should at least detect the other BOM sequences and give an error or warning.
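
A minimal sketch of detecting all five byte sequences mentioned above, so that the non-UTF-8 ones could be reported instead of producing a lexer error. The enum and function names are hypothetical. Note that FF FE 00 00 is ambiguous (a UTF-16LE BOM followed by U+0000, or a UTF-32LE BOM); this sketch assumes UTF-32LE and therefore tests the four-byte forms first.

```c
#include <stddef.h>

typedef enum
{
    BOM_NONE,
    BOM_UTF8,
    BOM_UTF16_LE,
    BOM_UTF16_BE,
    BOM_UTF32_LE,
    BOM_UTF32_BE
} BomKind;

/* Classify the byte order mark, if any, at the start of a buffer. */
BomKind
detect_bom(const unsigned char *b, size_t n)
{
    if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        return BOM_UTF8;
    /* Four-byte marks first: the UTF-32LE mark begins with FF FE too. */
    if (n >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00)
        return BOM_UTF32_LE;
    if (n >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF)
        return BOM_UTF32_BE;
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE)
        return BOM_UTF16_LE;
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF)
        return BOM_UTF16_BE;
    return BOM_NONE;
}
```

A caller could skip three bytes for BOM_UTF8 and print an "encoding not supported" message for the four UTF-16/UTF-32 results.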

Overall, from my user point of view, having psql deal with the BOM (at least the utf-8 one) would be more friendly than current behavior, as some editors (notepad for example) automatically add the BOM to the beginning of Unicode files, and it's not obvious without dumping the file in hex.

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc:	Peter Eisentraut <peter_e(at)gmx(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-11-17 14:08:12
Message-ID:	4B02AE4C.5060904@dunslane.net
Lists:	pgsql-hackers

Itagaki Takahiro wrote:
> Multi-byte scripts
> without encoding are always dangerous whether BOM is present or not.
> I'd say we can always throw an error when we find queries that contain
> multi-byte characters if no prior encoding declaration.
>
>
>

You will break a gazillion scripts that today work quite happily if you do.

I think you have really not thought out these proposals well.

Maybe there is a case for an extra command line switch to set the initial
client encoding for psql, which would make that a little easier and less
obscure to do. Would that make things simpler for you?

cheers

andrew

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc:	Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-11-17 15:50:02
Message-ID:	29075.1258473002@sss.pgh.pa.us
Lists:	pgsql-hackers

Peter Eisentraut <peter_e(at)gmx(dot)net> writes:
> I think I could support using the presence of the BOM as a fall-back
> indicator of encoding in absence of any other declaration. It seems to
> me, however, that the description above ignores the existence of
> encodings other than SQL_ASCII and UTF8.

Yeah. This entire proposal rests on the assumption that UTF8 is the
only encoding that really matters, and introducing a possibility of
breaking things for users of other encodings is acceptable damage.
I do not think that supporting a deprecated-by-standards behavior
is worth that.

Even assuming that we had consensus on a behavior that involved
silently changing client_encoding, I do not believe that it's practical
to implement it in an acceptable fashion. Just issuing a SET behind the
user's back will not work in a number of scenarios:

* We are inside a transaction when \i is called, and the file contains
a ROLLBACK.

* We are inside a failed transaction when \i is called --- the SET won't
even work at all.

* Same two cases inside a savepoint.

* The file contains a \c command.

If you expect that the previous client_encoding should be restored at
the end of the \i inclusion (as I certainly would) then you have the
first three hazards at file end as well, except that now the odds of
being inside a failed transaction are significantly higher. Also,
what if the file contained a SET CLIENT_ENCODING command itself?
How should that interact with this?

Lastly, a silent change of client_encoding would also affect the
encoding of notice and error messages that come out while the \i
file is running. I fail to find that non-astonishing, either.

I think that the only way this sort of behavior could be implemented
without a bunch of broken corner cases would be if we put the
responsibility of encoding conversion inside psql, so that switching its
idea of the encoding was just a local change rather than something it
had to ask the backend to do, and it could be careful to apply the
encoding only to the data coming from the \i file. Which is possible,
perhaps, but it hardly seems that slightly-more-convenient BOM handling
is worth it.

regards, tom lane

From:	Peter Eisentraut <peter_e(at)gmx(dot)net>
To:	Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-11-17 17:03:10
Message-ID:	1258477390.10724.25.camel@fsopti579.F-Secure.com
Lists:	pgsql-hackers

On tis, 2009-11-17 at 09:31 +0900, Itagaki Takahiro wrote:
> Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
>
> > OK, I think the consensus here is:
> > - Eat BOM at beginning of file (as you implemented)
> > - Only when client encoding is UTF-8 --> please fix that
>
> Are they AND condition? If so, this patch will be useless.
> Please remember \encoding or SET client_encoding appear
> *after* BOM at beginning of file.

Presumably, if you have editors throwing in BOM marks without asking,
you have an environment where either

a) You can set the client encoding to UTF8 in the environment, so it
applies by default, or

b) The server encoding is UTF8, so the client encoding will default to
that.

Together, that should cover a lot of cases. Not perfect, but far from
useless.

From:	Peter Eisentraut <peter_e(at)gmx(dot)net>
To:	Chuck McDevitt <cmcdevitt(at)greenplum(dot)com>
Cc:	Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-11-17 17:05:24
Message-ID:	1258477524.10724.26.camel@fsopti579.F-Secure.com
Lists:	pgsql-hackers

On tis, 2009-11-17 at 00:59 -0800, Chuck McDevitt wrote:
> Or is there a plan to read and convert the UTF-16 or UTF-32 to UTF-8,
> so psql and PostgreSQL understand it?
> (BTW, that would actually be nice on Windows, where UTF-16 is common).

Well, someone could implement UTF-16 or UTF-whatever as client encoding.
But I have not heard of any concrete proposals about that.

From:	Chuck McDevitt <cmcdevitt(at)greenplum(dot)com>
To:	Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc:	Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-11-17 17:10:07
Message-ID:	2106D8DC89010842BABA5CD03FEA4061012E8BE3FD@EXVMBX018-10.exch018.msoutlookonline.net
Lists:	pgsql-hackers

> -----Original Message-----
> From: Peter Eisentraut [mailto:peter_e(at)gmx(dot)net]
> Sent: Tuesday, November 17, 2009 9:05 AM
> To: Chuck McDevitt
> Cc: Itagaki Takahiro; pgsql-hackers(at)postgresql(dot)org
> Subject: Re: [HACKERS] UTF8 with BOM support in psql
>
> On tis, 2009-11-17 at 00:59 -0800, Chuck McDevitt wrote:
> > Or is there a plan to read and convert the UTF-16 or UTF-32 to UTF-8,
> > so psql and PostgreSQL understand it?
> > (BTW, that would actually be nice on Windows, where UTF-16 is common).
>
> Well, someone could implement UTF-16 or UTF-whatever as client encoding.
> But I have not heard of any concrete proposals about that.

Certainly that would be nice, given that UTF-16 is the "native" encoding for Java, C#, Visual Basic.net, JDBC, ODBC drivers >= ver 3.5, Microsoft Windows (all system calls use UTF-16, with a compatibility layer for old apps), and apps that Postgres users might switch from, such as MS SQLServer.

But for the short term, a warning or error saying we don't support it is better than a confusing lexer error or syntax error.

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc:	Chuck McDevitt <cmcdevitt(at)greenplum(dot)com>, Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-11-17 17:14:41
Message-ID:	4B02DA01.5060208@dunslane.net
Lists:	pgsql-hackers

Peter Eisentraut wrote:
> On tis, 2009-11-17 at 00:59 -0800, Chuck McDevitt wrote:
>
>> Or is there a plan to read and convert the UTF-16 or UTF-32 to UTF-8,
>> so psql and PostgreSQL understand it?
>> (BTW, that would actually be nice on Windows, where UTF-16 is common).
>>
>
> Well, someone could implement UTF-16 or UTF-whatever as client encoding.
> But I have not heard of any concrete proposals about that.
>
>

Doesn't the nul byte problem make that seriously hard?

cheers

andrew

From:	Chuck McDevitt <cmcdevitt(at)greenplum(dot)com>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>, Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc:	Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-11-17 17:49:44
Message-ID:	2106D8DC89010842BABA5CD03FEA4061012E8BE408@EXVMBX018-10.exch018.msoutlookonline.net
Lists:	pgsql-hackers

> -----Original Message-----
> From: Andrew Dunstan [mailto:andrew(at)dunslane(dot)net]
> Sent: Tuesday, November 17, 2009 9:15 AM
> To: Peter Eisentraut
> Cc: Chuck McDevitt; Itagaki Takahiro; pgsql-hackers(at)postgresql(dot)org
> Subject: Re: [HACKERS] UTF8 with BOM support in psql
>
>
>
> Peter Eisentraut wrote:
> > On tis, 2009-11-17 at 00:59 -0800, Chuck McDevitt wrote:
> >
> >> Or is there a plan to read and convert the UTF-16 or UTF-32 to UTF-8,
> >> so psql and PostgreSQL understand it?
> >> (BTW, that would actually be nice on Windows, where UTF-16 is common).
> >>
> >
> > Well, someone could implement UTF-16 or UTF-whatever as client encoding.
> > But I have not heard of any concrete proposals about that.
> >
> >
>
> Doesn't the nul byte problem make that seriously hard?
>

Not really... You can't treat UTF-16 the same way you do UTF-8, but we are talking about it being a client_encoding, not a server_encoding. So, it's only the routines that look at the strings pre-conversion, and the conversion routines themselves, that need to understand UTF-16 strings are 16-bits at a time, and end with a 16 bit 0x0000.
Obviously, it's more work than handling another 8-bit client_encoding, but doesn't seem insurmountable.
And given the 1:1 mapping from UTF-16 to UTF-8, you don't have any new issues due to characters that can't be converted.
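
To make the nul byte problem concrete, here is a small illustration (the array and helper names are made up for this example). Any ASCII character in UTF-16 carries a zero byte, so C string functions stop after the first code unit, while a routine that understands 16-bit units has to look for a 16-bit 0x0000 terminator instead.

```c
#include <stddef.h>
#include <string.h>

/* "SELECT" encoded as UTF-16LE, followed by a 16-bit 0x0000 terminator. */
const char select_utf16le[] = {
    'S', 0, 'E', 0, 'L', 0, 'E', 0, 'C', 0, 'T', 0, 0, 0
};

/* Length in 16-bit code units, scanning for a 16-bit zero terminator. */
size_t
utf16_len(const char *s)
{
    size_t n = 0;

    while (s[2 * n] != 0 || s[2 * n + 1] != 0)
        n++;
    return n;
}
```

Here strlen(select_utf16le) stops at the zero byte that follows 'S', whereas utf16_len(select_utf16le) counts all six code units.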

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	Peter Eisentraut <peter_e(at)gmx(dot)net>, Chuck McDevitt <cmcdevitt(at)greenplum(dot)com>, Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-11-17 18:01:52
Message-ID:	1442.1258480912@sss.pgh.pa.us
Lists:	pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Peter Eisentraut wrote:
>> Well, someone could implement UTF-16 or UTF-whatever as client encoding.
>> But I have not heard of any concrete proposals about that.

> Doesn't the nul byte problem make that seriously hard?

Just about impossible. It would require a protocol bump, and removal of
C-style string usage *everywhere* on the client side.

Again, this is something that might be more feasible with encoding
conversion inside psql --- translating UTF16 to UTF8 immediately upon
reading it from any external file would confine the problem to possibly
manageable bounds.
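
A hedged sketch of what "translating UTF16 to UTF8 immediately upon reading" could look like. Nothing like this exists in psql as far as this thread shows; the function name is hypothetical, and the input is assumed to be already split into host-order 16-bit code units with any BOM removed by the caller.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Convert n_units UTF-16 code units to UTF-8.
 * Returns the number of bytes written to out, or (size_t) -1 on an
 * unpaired surrogate or when out_cap is too small.
 */
size_t
utf16_to_utf8(const uint16_t *in, size_t n_units,
              unsigned char *out, size_t out_cap)
{
    size_t i = 0;
    size_t o = 0;

    while (i < n_units)
    {
        uint32_t cp = in[i++];

        if (cp >= 0xD800 && cp <= 0xDBFF)
        {
            /* High surrogate: a low surrogate must follow. */
            if (i >= n_units || in[i] < 0xDC00 || in[i] > 0xDFFF)
                return (size_t) -1;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t) (in[i++] - 0xDC00);
        }
        else if (cp >= 0xDC00 && cp <= 0xDFFF)
            return (size_t) -1;     /* stray low surrogate */

        if (cp < 0x80)
        {
            if (o + 1 > out_cap)
                return (size_t) -1;
            out[o++] = (unsigned char) cp;
        }
        else if (cp < 0x800)
        {
            if (o + 2 > out_cap)
                return (size_t) -1;
            out[o++] = (unsigned char) (0xC0 | (cp >> 6));
            out[o++] = (unsigned char) (0x80 | (cp & 0x3F));
        }
        else if (cp < 0x10000)
        {
            if (o + 3 > out_cap)
                return (size_t) -1;
            out[o++] = (unsigned char) (0xE0 | (cp >> 12));
            out[o++] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char) (0x80 | (cp & 0x3F));
        }
        else
        {
            if (o + 4 > out_cap)
                return (size_t) -1;
            out[o++] = (unsigned char) (0xF0 | (cp >> 18));
            out[o++] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char) (0x80 | (cp & 0x3F));
        }
    }
    return o;
}
```

Because the output contains no embedded zero bytes unless the input held U+0000, everything after this step could keep using ordinary C strings.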

regards, tom lane

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Peter Eisentraut <peter_e(at)gmx(dot)net>, Chuck McDevitt <cmcdevitt(at)greenplum(dot)com>, Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-11-17 18:22:24
Message-ID:	4B02E9E0.1010906@dunslane.net
Lists:	pgsql-hackers

Tom Lane wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>
>> Peter Eisentraut wrote:
>>
>>> Well, someone could implement UTF-16 or UTF-whatever as client encoding.
>>> But I have not heard of any concrete proposals about that.
>>>
>
>
>> Doesn't the nul byte problem make that seriously hard?
>>
>
> Just about impossible. It would require a protocol bump, and removal of
> C-style string usage *everywhere* on the client side.
>
> Again, this is something that might be more feasible with encoding
> conversion inside psql --- translating UTF16 to UTF8 immediately upon
> reading it from any external file would confine the problem to possibly
> manageable bounds.
>
>
>

Well, it might be a good idea to provide at least some support in libpq.
Making each client do it from scratch seems a bit inefficient.

cheers

andrew

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	Peter Eisentraut <peter_e(at)gmx(dot)net>, Chuck McDevitt <cmcdevitt(at)greenplum(dot)com>, Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-11-17 18:51:26
Message-ID:	2214.1258483886@sss.pgh.pa.us
Lists:	pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Well, it might be a good idea to provide at least some support in libpq.
> Making each client do it from scratch seems a bit inefficient.

Encoding conversion seems far outside libpq's charter, and as for
"from scratch" there are other libraries for that.

regards, tom lane

From:	Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To:	Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-11-18 03:52:53
Message-ID:	20091118125253.A48F.52131E4D@oss.ntt.co.jp
Lists:	pgsql-hackers

Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:

> Together, that should cover a lot of cases. Not perfect, but far from
> useless.

For Japanese users on Windows, the client encoding is always set to SJIS
because of a restriction of cmd.exe. But the script file can be written
in UTF8 with BOM. I don't think we should depend on client encoding.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

From:	Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	Peter Eisentraut <peter_e(at)gmx(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-11-18 04:03:39
Message-ID:	20091118130339.A493.52131E4D@oss.ntt.co.jp
Lists:	pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:

> Itagaki Takahiro wrote:
> > Multi-byte scripts
> > without encoding are always dangerous whether BOM is present or not.
> > I'd say we can always throw an error when we find queries that contain
> > multi-byte characters if no prior encoding declaration.
>
> You will break a gazillion scripts that today work quite happily if you do.

Sure. That's why I didn't send a patch for it :)
If by any chance we do so, we'll have a boolean option to disable the check.

> Maybe there is a case for an extra command line switch to set the initial
> client encoding for psql, which would make that a little easier and less
> obscure to do. Would that make things simpler for you?

No. There are complex reasons on Windows in Japan. The client encoding is
always SJIS because of a Windows restriction, but the database is initialized
with UTF8. Simple interactive work with psql is done under SJIS encoding,
but some scripts are written in UTF8 because it matches the server encoding.
(Of course the script is executed as "psql -f utf8.sql > out.txt")

I don't want user to check the encoding of scripts before executing --
it is far from fail-safe.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc:	Peter Eisentraut <peter_e(at)gmx(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-11-18 04:22:34
Message-ID:	4B03768A.4050500@dunslane.net
Lists:	pgsql-hackers

Itagaki Takahiro wrote:
> I don't want user to check the encoding of scripts before executing --
> it is far from fail-safe.
>
>
>

That's what we require in all other cases. Why should UTF8 be special?
If I have a script in Latin1 and Postgres thinks it's UTF8 it will
probably explode. Same for the reverse situation. Second-guessing the
user strikes me as being quite as dangerous as what you're trying to
cure, for all the reasons Tom outlined earlier today. What is more, you
will teach Windows users to rely on the client encoding being set in
UTF8 scripts without their doing anything, and then when they get on
another platform they will not understand why it doesn't work because
the BOMs will be missing.

cheers

andrew

From:	Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	Peter Eisentraut <peter_e(at)gmx(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-11-18 04:35:18
Message-ID:	20091118133518.A4A8.52131E4D@oss.ntt.co.jp
Lists:	pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:

> Itagaki Takahiro wrote:
> > I don't want user to check the encoding of scripts before executing --
> > it is far from fail-safe.
>
> That's what we require in all other cases. Why should UTF8 be special?

No. I wasn't thinking about UTF-8 or BOM at that point.
I assumed we are discussing the following line:

> > I'd say we can always throw an error when we find queries that contain
> > multi-byte characters if no prior encoding declaration.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

From:	Peter Eisentraut <peter_e(at)gmx(dot)net>
To:	Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-11-18 09:11:59
Message-ID:	1258535519.3497.0.camel@fsopti579.F-Secure.com
Lists:	pgsql-hackers

On ons, 2009-11-18 at 12:52 +0900, Itagaki Takahiro wrote:
> Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
>
> > Together, that should cover a lot of cases. Not perfect, but far from
> > useless.
>
> For Japanese users on Windows, the client encoding is always set to SJIS
> because of a restriction of cmd.exe. But the script file can be written
> in UTF8 with BOM. I don't think we should depend on client encoding.

Set by whom, how, and because of what restriction?

From:	Peter Eisentraut <peter_e(at)gmx(dot)net>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-11-18 09:18:22
Message-ID:	1258535902.3497.6.camel@fsopti579.F-Secure.com
Lists:	pgsql-hackers

On tis, 2009-11-17 at 23:22 -0500, Andrew Dunstan wrote:
> Itagaki Takahiro wrote:
> > I don't want user to check the encoding of scripts before executing --
> > it is far from fail-safe.
> >
> >
> >
>
> That's what we require in all other cases. Why should UTF8 be special?

But now we're back to the original problem. Certain editors insert BOMs
at the beginning of the file. And that is by any definition before the
embedded client encoding declaration. I think the only ways to solve
this are:

1) Ignore the BOM if a client encoding declaration of UTF8 appears in a
narrowly defined location near the beginning of the file (XML and
PEP-0263 style). For *example*, we could ignore the BOM if the file
starts with exactly "<BOM>\encoding UTF8\n". Would probably not work
well in practice.

2) Parse two alternative versions of the file, one with the BOM ignored
and one with the BOM not ignored, until you need to make a decision.
Hilariously complicated, but would perhaps solve the problem.

3) Give up, do nothing.
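
For illustration only, option 1 in the exact literal form given as an example above could be checked like this. The function name is hypothetical, and a real rule would presumably need to tolerate variations (spacing, CRLF line endings, SET client_encoding), which is part of why it "would probably not work well in practice".

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * True if the buffer starts with exactly: BOM, "\encoding UTF8", newline.
 * Only in that case would the BOM be ignored under option 1.
 */
bool
has_bom_encoding_header(const char *buf, size_t len)
{
    static const char header[] = "\xEF\xBB\xBF\\encoding UTF8\n";
    const size_t hlen = sizeof(header) - 1;   /* exclude trailing NUL */

    return len >= hlen && memcmp(buf, header, hlen) == 0;
}
```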

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc:	Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-11-18 13:52:20
Message-ID:	4B03FC14.4080703@dunslane.net
Lists:	pgsql-hackers

Peter Eisentraut wrote:
> But now we're back to the original problem. Certain editors insert BOMs
> at the beginning of the file. And that is by any definition before the
> embedded client encoding declaration. I think the only ways to solve
> this are:
>
> 1) Ignore the BOM if a client encoding declaration of UTF8 appears in a
> narrowly defined location near the beginning of the file (XML and
> PEP-0263 style). For *example*, we could ignore the BOM if the file
> starts with exactly "<BOM>\encoding UTF8\n". Would probably not work
> well in practice.
>
> 2) Parse two alternative versions of the file, one with the BOM ignored
> and one with the BOM not ignored, until you need to make a decision.
> Hilariously complicated, but would perhaps solve the problem.
>
> 3) Give up, do nothing.
>
>

4) set the client encoding before the file is read in any of the ways
that have already been discussed and then allow psql to eat the BOM.

cheers

andrew

From:	Peter Eisentraut <peter_e(at)gmx(dot)net>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-11-18 14:06:03
Message-ID:	1258553163.3497.35.camel@fsopti579.F-Secure.com
Lists:	pgsql-hackers

On ons, 2009-11-18 at 08:52 -0500, Andrew Dunstan wrote:
> 4) set the client encoding before the file is read in any of the ways
> that have already been discussed and then allow psql to eat the BOM.

This is certainly a workaround, just like piping the file through a
suitable sed expression would be, but conceptually, the client encoding
is a property of the file and should therefore be marked in the file.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-11-18 15:18:34
Message-ID:	19630.1258557514@sss.pgh.pa.us
Lists:	pgsql-hackers

Peter Eisentraut <peter_e(at)gmx(dot)net> writes:
> This is certainly a workaround, just like piping the file through a
> suitable sed expression would be, but conceptually, the client encoding
> is a property of the file and should therefore be marked in the file.

In a perfect world things would be like that, but the world is
imperfect. When only one of the available encodings even pretends
to have a marking convention, and even that one convention is broken,
imagining that you can fix it is just a recipe for making things worse.

regards, tom lane

From:	Peter Eisentraut <peter_e(at)gmx(dot)net>
To:	Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: UTF8 with BOM support in psql
Date:	2009-11-21 23:59:18
Message-ID:	1258847958.30675.9.camel@vanquo.pezone.net
Lists:	pgsql-hackers

On mån, 2009-11-16 at 22:37 +0200, Peter Eisentraut wrote:
> On ons, 2009-10-21 at 13:11 +0900, Itagaki Takahiro wrote:
> > Sure. Client encoding is declared in body of a file, but BOM is
> > in head of the file. So, we should always ignore BOM sequence
> > at the file head no matter what client encoding is used.
> >
> > The attached patch replaces the BOM with white spaces, but it does not
> > change client encoding automatically. I think we can always ignore
> > client encoding at the replacement because SQL command cannot start
> > with BOM sequence. If we don't ignore the sequence, execution of
> > the script must fail with syntax error.
>
> OK, I think the consensus here is:
>
> - Eat BOM at beginning of file (as you implemented)
>
> - Only when client encoding is UTF-8 --> please fix that
>
> I'm not sure if replacing a BOM by three spaces is a good way to
> implement "eating", because it might throw off a column indicator
> somewhere, say, but I couldn't reproduce a problem. Note that the
> U+FEFF character is defined as *zero-width* non-breaking space.

I have committed a change that implements the above.
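
The committed code is not shown in this thread. As a minimal sketch of the two agreed points (eat the BOM at the beginning of the file, and only when the client encoding is UTF-8), with a hypothetical function name and flags, one could advance past the three bytes rather than overwrite them with spaces:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return a pointer past a leading UTF-8 BOM when this is the first line
 * of the file and the client encoding is UTF-8; otherwise return the
 * line unchanged. The line must be a NUL-terminated string.
 */
const char *
skip_bom_if_utf8(const char *line, bool first_line, bool client_is_utf8)
{
    if (first_line && client_is_utf8 &&
        strncmp(line, "\xEF\xBB\xBF", 3) == 0)
        return line + 3;
    return line;
}
```

Skipping the bytes instead of replacing them keeps the zero-width character from occupying three columns.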