Re: utf8 COPY DELIMITER?

Lists: pgsql-hackers
From: Mark Dilger <pgsql(at)markdilger(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: utf8 COPY DELIMITER?
Date: 2007-04-17 16:01:58
Message-ID: f02ulk$2r3u$1@news.hub.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

The \COPY command rejects multibyte delimiters. Is this intentional
behavior?

Here is an example of the behavior:

pgsql(at)compaq ~ $ touch foo
pgsql(at)compaq ~ $ psql -p 5555
Welcome to psql 8.3devel, the PostgreSQL interactive terminal.

Type: \copyright for distribution terms
\h for help with SQL commands
\? for help with psql commands
\g or terminate with semicolon to execute query
\q to quit

pgsql=# create table foo (a integer);
CREATE TABLE
pgsql=# \copy foo from foo delimiter '標'
ERROR: COPY delimiter must be a single character
\copy: ERROR: COPY delimiter must be a single character

If your email/news reader doesn't render that properly, I'm using a
pictogram character for the delimiter.

I checked out a new copy of the sources from cvs this morning. It
behaves the same way on 8.2.3.

mark


From: Mark Dilger <pgsql(at)markdilger(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: utf8 COPY DELIMITER?
Date: 2007-04-17 16:28:18
Message-ID: 4624F5A2.10100@markdilger.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Dunstan wrote:
> Mark Dilger wrote:
>> The \COPY command rejects multibyte delimiters. Is this intentional
>> behavior?
>
> It is certainly a known limitation, and I suspect removing it could add
> non-trivial overhead to the input processing.
>
> What is the use case for using such a delimiter?

I'm working on fixing bugs relating to multibyte character encodings. I
wasn't sure whether this was a bug or not. I don't think we should
use the phrasing "COPY delimiter must be a single character" when, in
utf8 land, I did in fact use a single character. We might say "a single
byte", or we might extend the functionality to handle multibyte characters.

mark

> cheers
>
> andrew
>
>>
>


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Mark Dilger <pgsql(at)markdilger(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: utf8 COPY DELIMITER?
Date: 2007-04-17 17:13:49
Message-ID: 4625004D.7030503@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Mark Dilger wrote:
> The \COPY command rejects multibyte delimiters. Is this intentional
> behavior?

It is certainly a known limitation, and I suspect removing it could add
non-trivial overhead to the input processing.

What is the use case for using such a delimiter?

cheers

andrew

>


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Mark Dilger <pgsql(at)markdilger(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: utf8 COPY DELIMITER?
Date: 2007-04-17 17:37:58
Message-ID: 462505F6.3040405@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Mark Dilger wrote:
> Andrew Dunstan wrote:
>> Mark Dilger wrote:
>>> The \COPY command rejects multibyte delimiters. Is this intentional
>>> behavior?
>>
>> It is certainly a known limitation, and I suspect removing it could
>> add non-trivial overhead to the input processing.
>>
>> What is the use case for using such a delimiter?
>
> I'm working on fixing bugs relating to multibyte character encodings.
> I wasn't sure whether this was a bug or not. I don't think we should
> use the phrasing "COPY delimiter must be a single character" when, in
> utf8 land, I did in fact use a single character. We might say "a
> single byte", or we might extend the functionality to handle multibyte
> characters.
>

Doing the latter would be a feature, and so is of course right off the
table for this release. Changing the error messages to be clearer should
be fine.

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Mark Dilger <pgsql(at)markdilger(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: utf8 COPY DELIMITER?
Date: 2007-04-17 18:28:18
Message-ID: 4129.1176834498@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Mark Dilger wrote:
>> I'm working on fixing bugs relating to multibyte character encodings.
>> I wasn't sure whether this was a bug or not. I don't think we should
>> use the phrasing "COPY delimiter must be a single character" when, in
>> utf8 land, I did in fact use a single character. We might say "a
>> single byte", or we might extend the functionality to handle multibyte
>> characters.

> Doing the latter would be a feature, and so is of course right off the
> table for this release. Changing the error messages to be clearer should
> be fine.

+1 on changing the message: "character" is clearly less correct than "byte"
here.

I doubt that supporting a single multibyte character would be an
interesting extension --- if we wanted to do anything at all there, we'd
just generalize the delimiter to be an arbitrary string. But it would
certainly slow down COPY by some amount, which is an area where you'll
get push-back for performance losses, so you'd need to make a convincing
use-case for it.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Mark Dilger <pgsql(at)markdilger(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: utf8 COPY DELIMITER?
Date: 2007-04-17 18:33:40
Message-ID: 4358.1176834820@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On looking at the code, there's another issue: the CSV escape and quote
characters are assumed to be the same in client and server encodings,
because they're checked for before we do transcoding. This pretty much
restricts them to be ASCII.

regards, tom lane


From: Tatsuo Ishii <ishii(at)postgresql(dot)org>
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: andrew(at)dunslane(dot)net, pgsql(at)markdilger(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: utf8 COPY DELIMITER?
Date: 2007-04-17 22:45:04
Message-ID: 20070418.074504.17591986.t-ishii@sraoss.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On looking at the code, there's another issue: the CSV escape and quote
> characters are assumed to be the same in client and server encodings,
> because they're checked for before we do transcoding. This pretty much
> restricts them to be ASCII.
>
> regards, tom lane

+1.

The message in question should be something like:

"COPY delimiter must be a single ASCII character"
--
Tatsuo Ishii
SRA OSS, Inc. Japan


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Tatsuo Ishii <ishii(at)postgresql(dot)org>
Cc: andrew(at)dunslane(dot)net, pgsql(at)markdilger(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: utf8 COPY DELIMITER?
Date: 2007-04-17 22:49:06
Message-ID: 15238.1176850146@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tatsuo Ishii <ishii(at)postgresql(dot)org> writes:
> The message in question should be something like:
> "COPY delimiter must be a single ASCII character"

If we phrase it like that we should enforce it like that --- ie, reject
high-bit-set characters.

But I'm a bit hesitant to do so, because it actually does work fine to
use a high-bit-set character as a delimiter as long as client and server
encodings are the same LATINx set. We'd be taking away functionality
for European users for no very good reason.

Is it worth going to the trouble of distinguish same-encoding and
different-encoding cases and applying a looser check for the former
case?

regards, tom lane


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Tatsuo Ishii <ishii(at)postgresql(dot)org>
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, andrew(at)dunslane(dot)net, pgsql(at)markdilger(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: utf8 COPY DELIMITER?
Date: 2007-04-18 00:19:26
Message-ID: 200704180019.l3I0JQa27956@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tatsuo Ishii wrote:
> > On looking at the code, there's another issue: the CSV escape and quote
> > characters are assumed to be the same in client and server encodings,
> > because they're checked for before we do transcoding. This pretty much
> > restricts them to be ASCII.
> >
> > regards, tom lane
>
> +1.
>
> The message in question should be something like:
>
> "COPY delimiter must be a single ASCII character"

New text is:

The single ASCII character that separates columns within each row

Backpatched to 8.2.X.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Tatsuo Ishii <ishii(at)postgresql(dot)org>
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: ishii(at)postgresql(dot)org, andrew(at)dunslane(dot)net, pgsql(at)markdilger(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: utf8 COPY DELIMITER?
Date: 2007-04-18 01:24:42
Message-ID: 20070418.102442.00486914.t-ishii@sraoss.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Tatsuo Ishii <ishii(at)postgresql(dot)org> writes:
> > The message in question should be something like:
> > "COPY delimiter must be a single ASCII character"
>
> If we phrase it like that we should enforce it like that --- ie, reject
> high-bit-set characters.
>
> But I'm a bit hesitant to do so, because it actually does work fine to
> use a high-bit-set character as a delimiter as long as client and server
> encodings are the same LATINx set. We'd be taking away functionality
> for European users for no very good reason.
>
> Is it worth going to the trouble of distinguish same-encoding and
> different-encoding cases and applying a looser check for the former
> case?

I think yes. Seems a good idea.

Even better, however, is fixing the CVS escaping and quoting I
think. Clearly it's a bug.
--
Tatsuo Ishii
SRA OSS, Inc. Japan


From: "Jim C(dot) Nasby" <jim(at)nasby(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Mark Dilger <pgsql(at)markdilger(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: utf8 COPY DELIMITER?
Date: 2007-04-18 16:38:06
Message-ID: 20070418163806.GS72669@nasby.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Apr 17, 2007 at 02:28:18PM -0400, Tom Lane wrote:
> I doubt that supporting a single multibyte character would be an
> interesting extension --- if we wanted to do anything at all there, we'd
> just generalize the delimiter to be an arbitrary string. But it would
> certainly slow down COPY by some amount, which is an area where you'll
> get push-back for performance losses, so you'd need to make a convincing
> use-case for it.

Couldn't we use a fast code path (what we have now) for the case when
the delimiter is a single byte? That would allow for multi-character
delimiters without penalizing those that don't use them.

As for use case, I worked on migrating some stuff out of a MySQL
database a while ago, and having arbitrary string delimiters would have
made life easier.
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: "Jim C(dot) Nasby" <jim(at)nasby(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Mark Dilger <pgsql(at)markdilger(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: utf8 COPY DELIMITER?
Date: 2007-04-18 17:09:20
Message-ID: 462650C0.1050601@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jim C. Nasby wrote:
> On Tue, Apr 17, 2007 at 02:28:18PM -0400, Tom Lane wrote:
>
>> I doubt that supporting a single multibyte character would be an
>> interesting extension --- if we wanted to do anything at all there, we'd
>> just generalize the delimiter to be an arbitrary string. But it would
>> certainly slow down COPY by some amount, which is an area where you'll
>> get push-back for performance losses, so you'd need to make a convincing
>> use-case for it.
>>
>
> Couldn't we use a fast code path (what we have now) for the case when
> the delimiter is a single byte? That would allow for multi-character
> delimiters without penalizing those that don't use them.
>
> As for use case, I worked on migrating some stuff out of a MySQL
> database a while ago, and having arbitrary string delimiters would have
> made life easier.
>

The first thing to note is that the COPY code is quite complex and
fragile. Personally, I'd want a heck of a lot of convincing to see it
changed, and your use case looks to me like it would be better handled
by preprocessing using a perl script.

Also, if we accept string delimiters on input, we should also allow them
on output.

cheers

andrew