multiline CSV fields

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: multiline CSV fields
Date: 2004-11-10 23:10:46
Message-ID: 41929FF6.6090407@dunslane.net
Lists: pgsql-hackers pgsql-patches


Darcy Buskermolen has drawn my attention to unfortunate behaviour of
COPY CSV with fields containing embedded line end chars if the embedded
sequence isn't the same as those of the file containing the CSV data. In
that case we error out when reading the data in. This means there are
cases where we can produce a CSV data file which we can't read in, which
is not at all pleasant.
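
The failure is easy to reproduce outside the backend. Here is a minimal
Python sketch of the mismatch (illustrative only, using Python's csv
module rather than COPY):

```python
import csv
import io

# Build a CSV stream whose records end in "\n" but whose quoted field
# embeds a "\r\n" sequence -- the mismatch described above.
buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
writer.writerow([1, "foo\r\nbar", 2])
data = buf.getvalue()
print(repr(data))  # -> '1,"foo\r\nbar",2\n'

# A reader that tracks quote state (Python's does) accepts the mixed
# line endings; a reader that insists the embedded sequence match the
# file's line endings rejects its own writer's output.
rows = list(csv.reader(io.StringIO(data)))
print(rows)  # -> [['1', 'foo\r\nbar', '2']]
```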

Possible approaches to the problem:
. make it a documented limitation
. have a "csv read" mode for backend/commands/copy.c:CopyReadLine() that
relaxes some of the restrictions on inconsistent line endings
. escape embedded line end chars

The last really isn't an option, because the whole point of CSVs is to
play with other programs, and my understanding is that those that
understand multiline fields (e.g. Excel) expect them not to be escaped,
and do not produce them escaped.

So right now I'm tossing up in my head between the first two options. Or
maybe there's another solution I haven't thought of.

Thoughts?

cheers

andrew


From: Patrick B Kelly <pbk(at)patrickbkelly(dot)org>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multiline CSV fields
Date: 2004-11-11 16:53:00
Message-ID: 25708AC0-3402-11D9-B14C-000A958A3956@patrickbkelly.org


On Nov 10, 2004, at 6:10 PM, Andrew Dunstan wrote:

> The last really isn't an option, because the whole point of CSVs is to
> play with other programs, and my understanding is that those that
> understand multiline fields (e.g. Excel) expect them not to be
> escaped, and do not produce them escaped.
>

Actually, when I try to export a sheet with multi-line cells from
excel, it tells me that this feature is incompatible with the CSV
format and will not include them in the CSV file.

Patrick B. Kelly
------------------------------------------------------
http://patrickbkelly.org


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Patrick B Kelly <pbk(at)patrickbkelly(dot)org>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multiline CSV fields
Date: 2004-11-11 17:43:48
Message-ID: 4193A4D4.7070902@dunslane.net

Patrick B Kelly wrote:

>
> On Nov 10, 2004, at 6:10 PM, Andrew Dunstan wrote:
>
>> The last really isn't an option, because the whole point of CSVs is
>> to play with other programs, and my understanding is that those that
>> understand multiline fields (e.g. Excel) expect them not to be
>> escaped, and do not produce them escaped.
>>
>
> Actually, when I try to export a sheet with multi-line cells from
> excel, it tells me that this feature is incompatible with the CSV
> format and will not include them in the CSV file.
>
>
>

It probably depends on the version. I have just tested with Excel 2000
on a WinXP machine and it both read and wrote these files.

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Patrick B Kelly <pbk(at)patrickbkelly(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multiline CSV fields
Date: 2004-11-11 19:20:19
Message-ID: 10498.1100200819@sss.pgh.pa.us

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Patrick B Kelly wrote:
>> Actually, when I try to export a sheet with multi-line cells from
>> excel, it tells me that this feature is incompatible with the CSV
>> format and will not include them in the CSV file.

> It probably depends on the version. I have just tested with Excel 2000
> on a WinXP machine and it both read and wrote these files.

I'd be inclined to define Excel 2000 as broken, honestly, if it's
writing unescaped newlines as data. To support this would mean throwing
away most of our ability to detect incorrectly formatted CSV files.
A simple error like a missing close quote would look to the machine like
the rest of the file is a single long data line where all the newlines
are embedded in data fields. How likely is it that you'll get a useful
error message out of that? Most likely the error message would point to
the end of the file, or at least someplace well removed from the actual
mistake.
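
Tom's scenario can be reproduced with any quote-tracking reader; a
hedged illustration in Python:

```python
import csv
import io

# One missing close quote on the first line: every later newline now
# looks like data embedded in a quoted field, so the remainder of the
# input collapses into a single record.
bad = 'a,"b\nc,d\ne,f\n'
rows = list(csv.reader(io.StringIO(bad)))
print(len(rows))  # -> 1
# A stricter parser could only complain at end-of-file, far from the
# actual mistake on line 1.
```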

I would vote in favor of removing the current code that attempts to
support unquoted newlines, and waiting to see if there are complaints.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Patrick B Kelly <pbk(at)patrickbkelly(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multiline CSV fields
Date: 2004-11-11 19:56:35
Message-ID: 4193C3F3.9090009@dunslane.net

Tom Lane wrote:

>Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>
>
>>Patrick B Kelly wrote:
>>
>>
>>>Actually, when I try to export a sheet with multi-line cells from
>>>excel, it tells me that this feature is incompatible with the CSV
>>>format and will not include them in the CSV file.
>>>
>>>
>
>
>
>>It probably depends on the version. I have just tested with Excel 2000
>>on a WinXP machine and it both read and wrote these files.
>>
>>
>
>I'd be inclined to define Excel 2000 as broken, honestly, if it's
>writing unescaped newlines as data. To support this would mean throwing
>away most of our ability to detect incorrectly formatted CSV files.
>A simple error like a missing close quote would look to the machine like
>the rest of the file is a single long data line where all the newlines
>are embedded in data fields. How likely is it that you'll get a useful
>error message out of that? Most likely the error message would point to
>the end of the file, or at least someplace well removed from the actual
>mistake.
>
>I would vote in favor of removing the current code that attempts to
>support unquoted newlines, and waiting to see if there are complaints.
>
>
>
>

This feature was specifically requested when we discussed what sort of
CSVs we would handle.

And it does in fact work as long as the newline style is the same.

I just had an idea. How about if we add a new CSV option MULTILINE. If
absent, then on output we would not output unescaped LF/CR characters
and on input we would not allow fields with embedded unescaped LF/CR
characters. In both cases we could error out for now, with perhaps an
8.1 TODO to provide some other behaviour.

Or we could drop the whole multiline "feature" for now and make the
whole thing an 8.1 item, although it would be a bit of a pity when it
does work in what will surely be the most common case.

cheers

andrew


From: Greg Stark <gsstark(at)mit(dot)edu>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: multiline CSV fields
Date: 2004-11-11 20:38:16
Message-ID: 878y98dsuf.fsf@stark.xeocode.com


Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> I would vote in favor of removing the current code that attempts to
> support unquoted newlines, and waiting to see if there are complaints.

Uhm. *raises hand*

I agree with your argument but one way or another I have to load these CSVs
I'm given. And like it or not virtually all the CSVs people get are going to
be coming from Excel. So far with 7.4 I've just opened them up in Emacs and
removed the newlines, but it's a royal pain in the arse.

--
greg


From: David Fetter <david(at)fetter(dot)org>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: multiline CSV fields
Date: 2004-11-11 22:26:38
Message-ID: 20041111222638.GC16397@fetter.org

On Thu, Nov 11, 2004 at 03:38:16PM -0500, Greg Stark wrote:
>
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>
> > I would vote in favor of removing the current code that attempts
> > to support unquoted newlines, and waiting to see if there are
> > complaints.
>
> Uhm. *raises hand*
>
> I agree with your argument but one way or another I have to load
> these CSVs I'm given. And like it or not virtually all the CSVs
> people get are going to be coming from Excel. So far with 7.4 I've
> just opened them up in Emacs and removed the newlines, but it's a
> royal pain in the arse.

Meanwhile, check out dbi-link. It lets you query against DBI data
sources including DBD::Excel :)

http://pgfoundry.org/projects/dbi-link/

Bug reports welcome.

Cheers,
D
--
David Fetter david(at)fetter(dot)org http://fetter.org/
phone: +1 510 893 6100 mobile: +1 415 235 3778

Remember to vote!


From: Patrick B Kelly <pbk(at)patrickbkelly(dot)org>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: multiline CSV fields
Date: 2004-11-11 23:03:58
Message-ID: F82E5F5D-3435-11D9-B14C-000A958A3956@patrickbkelly.org


On Nov 11, 2004, at 2:56 PM, Andrew Dunstan wrote:

>
>
> Tom Lane wrote:
>
>> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>>
>>> Patrick B Kelly wrote:
>>>
>>>> Actually, when I try to export a sheet with multi-line cells from
>>>> excel, it tells me that this feature is incompatible with the CSV
>>>> format and will not include them in the CSV file.
>>>>
>>
>>
>>> It probably depends on the version. I have just tested with Excel
>>> 2000 on a WinXP machine and it both read and wrote these files.
>>>
>>
>> I'd be inclined to define Excel 2000 as broken, honestly, if it's
>> writing unescaped newlines as data. To support this would mean
>> throwing
>> away most of our ability to detect incorrectly formatted CSV files.
>> A simple error like a missing close quote would look to the machine
>> like
>> the rest of the file is a single long data line where all the newlines
>> are embedded in data fields. How likely is it that you'll get a
>> useful
>> error message out of that? Most likely the error message would point
>> to
>> the end of the file, or at least someplace well removed from the
>> actual
>> mistake.
>>
>> I would vote in favor of removing the current code that attempts to
>> support unquoted newlines, and waiting to see if there are complaints.
>>
>>
>>
>
> This feature was specifically requested when we discussed what sort of
> CSVs we would handle.
>
> And it does in fact work as long as the newline style is the same.
>
> I just had an idea. How about if we add a new CSV option MULTILINE. If
> absent, then on output we would not output unescaped LF/CR characters
> and on input we would not allow fields with embedded unescaped LF/CR
> characters. In both cases we could error out for now, with perhaps an
> 8.1 TODO to provide some other behaviour.
>
> Or we could drop the whole multiline "feature" for now and make the
> whole thing an 8.1 item, although it would be a bit of a pity when it
> does work in what will surely be the most common case.
>

What about just coding a FSM into
backend/commands/copy.c:CopyReadLine() that does not process any flavor
of NL characters when it is inside of a data field?

Patrick B. Kelly
------------------------------------------------------
http://patrickbkelly.org


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Patrick B Kelly <pbk(at)patrickbkelly(dot)org>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: multiline CSV fields
Date: 2004-11-11 23:15:15
Message-ID: 4193F283.2020808@dunslane.net

Patrick B Kelly wrote:

>
>
> What about just coding a FSM into
> backend/commands/copy.c:CopyReadLine() that does not process any
> flavor of NL characters when it is inside of a data field?
>
>
>
It would be a major change - the routine doesn't read data a field at a
time, and has no idea if we are even in CSV mode at all. It would be
rather late in the dev cycle to be making such changes, I suspect.

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Patrick B Kelly <pbk(at)patrickbkelly(dot)org>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multiline CSV fields
Date: 2004-11-11 23:16:49
Message-ID: 12778.1100215009@sss.pgh.pa.us

Patrick B Kelly <pbk(at)patrickbkelly(dot)org> writes:
> What about just coding a FSM into
> backend/commands/copy.c:CopyReadLine() that does not process any flavor
> of NL characters when it is inside of a data field?

CopyReadLine has no business tracking that. One reason why not is that
it is dealing with data not yet converted out of the client's encoding,
which makes matching to user-specified quote/escape characters
difficult.

regards, tom lane


From: Patrick B Kelly <pbk(at)patrickbkelly(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multiline CSV fields
Date: 2004-11-12 02:40:09
Message-ID: 2B41297B-3454-11D9-B14C-000A958A3956@patrickbkelly.org


On Nov 11, 2004, at 6:16 PM, Tom Lane wrote:

> Patrick B Kelly <pbk(at)patrickbkelly(dot)org> writes:
>> What about just coding a FSM into
>> backend/commands/copy.c:CopyReadLine() that does not process any
>> flavor
>> of NL characters when it is inside of a data field?
>
> CopyReadLine has no business tracking that. One reason why not is that
> it is dealing with data not yet converted out of the client's encoding,
> which makes matching to user-specified quote/escape characters
> difficult.
>
> regards, tom lane
>
>
>

I appreciate what you are saying about the encoding and you are, of
course, right but CopyReadLine is already processing the NL characters
and it is doing it without considering the context in which they
appear. Unfortunately, the same character(s) are used for two different
purposes in the files in question. Without considering whether they
appear inside or outside of data fields, CopyReadline will mistake one
for the other and cannot correctly do what it is already trying to do
which is break the input file into lines.

My suggestion is to simply have CopyReadLine recognize these two states
(in-field and out-of-field) and execute the current logic only while in
the second state. It would not be too hard but as you mentioned it is
non-trivial.

Patrick B. Kelly
------------------------------------------------------
http://patrickbkelly.org


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Patrick B Kelly <pbk(at)patrickbkelly(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multiline CSV fields
Date: 2004-11-12 03:07:47
Message-ID: 41942903.4080208@dunslane.net

Patrick B Kelly wrote:

>
>
>
> My suggestion is to simply have CopyReadLine recognize these two
> states (in-field and out-of-field) and execute the current logic only
> while in the second state. It would not be too hard but as you
> mentioned it is non-trivial.
>
>
>

We don't know what state we expect the end of line to be in until after
we have actually read the line. To know how to treat the end of line on
your scheme we would have to parse as we go rather than after reading
the line as now. Changing this would not only be non-trivial but
significantly invasive to the code.

cheers

andrew


From: Patrick B Kelly <pbk(at)patrickbkelly(dot)org>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: multiline CSV fields
Date: 2004-11-12 03:35:07
Message-ID: D97EBB68-345B-11D9-B14C-000A958A3956@patrickbkelly.org


On Nov 11, 2004, at 10:07 PM, Andrew Dunstan wrote:

>
>
> Patrick B Kelly wrote:
>
>>
>>
>>
>> My suggestion is to simply have CopyReadLine recognize these two
>> states (in-field and out-of-field) and execute the current logic only
>> while in the second state. It would not be too hard but as you
>> mentioned it is non-trivial.
>>
>>
>>
>
> We don't know what state we expect the end of line to be in until
> after we have actually read the line. To know how to treat the end of
> line on your scheme we would have to parse as we go rather than after
> reading the line as now. Changing this would not only be
> non-trivial but significantly invasive to the code.
>
>

Perhaps I am misunderstanding the code. As I read it the code currently
goes through the input character by character looking for NL and EOF
characters. It appears to be very well structured for what I am
proposing. The section in question is a small and clearly defined loop
which reads the input one character at a time and decides when it has
reached the end of the line or file. Each call of CopyReadLine attempts
to get one more line. I would propose that each time it starts out in
the out-of-field state and the state is toggled by each un-escaped
quote that it encounters in the stream. When in the in-field state, it
would only look for the next un-escaped quote and while in the
out-of-field state, it would execute the existing logic as well as
looking for the next un-escaped quote.
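
The two-state scan described above can be sketched as follows
(hypothetical Python rather than the copy.c byte loop, and deliberately
ignoring the encoding problem Tom raised):

```python
def split_csv_records(data: str, quote: str = '"') -> list[str]:
    """Split CSV text into records: newlines inside quotes are data,
    newlines outside quotes end a record."""
    records, start, in_quotes, i = [], 0, False, 0
    while i < len(data):
        ch = data[i]
        if ch == quote:
            # A doubled quote inside a field is an escaped quote, not a
            # state change.
            if in_quotes and data[i + 1:i + 2] == quote:
                i += 2
                continue
            in_quotes = not in_quotes
        elif not in_quotes and ch in "\r\n":
            records.append(data[start:i])
            if ch == "\r" and data[i + 1:i + 2] == "\n":
                i += 1  # swallow the LF of a CRLF pair
            start = i + 1
        i += 1
    if start < len(data):
        records.append(data[start:])  # final record lacked a newline
    return records

print(split_csv_records('1,"foo\r\nbar",2\n3,baz\n'))
# -> ['1,"foo\r\nbar",2', '3,baz']
```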

I may not be explaining myself well or I may fundamentally
misunderstand how copy works. I would be happy to code the change and
send it to you for review, if you would be interested in looking it
over and it is felt to be a worthwhile capability.

Patrick B. Kelly
------------------------------------------------------
http://patrickbkelly.org


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multiline CSV fields
Date: 2004-11-12 04:47:06
Message-ID: 200411120447.iAC4l6m21290@candle.pha.pa.us


Can I see an example of such a failure line?

---------------------------------------------------------------------------

Andrew Dunstan wrote:
>
> Darcy Buskermolen has drawn my attention to unfortunate behaviour of
> COPY CSV with fields containing embedded line end chars if the embedded
> sequence isn't the same as those of the file containing the CSV data. In
> that case we error out when reading the data in. This means there are
> cases where we can produce a CSV data file which we can't read in, which
> is not at all pleasant.
>
> Possible approaches to the problem:
> . make it a documented limitation
> . have a "csv read" mode for backend/commands/copy.c:CopyReadLine() that
> relaxes some of the restrictions on inconsistent line endings
> . escape embedded line end chars
>
> The last really isn't an option, because the whole point of CSVs is to
> play with other programs, and my understanding is that those that
> understand multiline fields (e.g. Excel) expect them not to be escaped,
> and do not produce them escaped.
>
> So right now I'm tossing up in my head between the first two options. Or
> maybe there's another solution I haven't thought of.
>
> Thoughts?
>
> cheers
>
> andrew
>
>

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Patrick B Kelly <pbk(at)patrickbkelly(dot)org>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multiline CSV fields
Date: 2004-11-12 05:20:58
Message-ID: 15892.1100236858@sss.pgh.pa.us

Patrick B Kelly <pbk(at)patrickbkelly(dot)org> writes:
> I may not be explaining myself well or I may fundamentally
> misunderstand how copy works.

Well, you're definitely ignoring the character-set-conversion issue.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multiline CSV fields
Date: 2004-11-12 09:15:40
Message-ID: 41947F3C.2050906@dunslane.net


This example should fail on data line 2 or 3 on any platform,
regardless of the platform's line-end convention, although I haven't
tested on Windows.

cheers

andrew

[andrew(at)aloysius inst]$ bin/psql -e -f csverr.sql ; od -c
/tmp/csverrtest.csv
create table csverrtest (a int, b text, c int);
CREATE TABLE
insert into csverrtest values(1,'a',1);
INSERT 122471 1
insert into csverrtest values(2,'foo\r\nbar',2);
INSERT 122472 1
insert into csverrtest values(3,'baz\nblurfl',3);
INSERT 122473 1
insert into csverrtest values(4,'d',4);
INSERT 122474 1
insert into csverrtest values(5,'e',5);
INSERT 122475 1
copy csverrtest to '/tmp/csverrtest.csv' csv;
COPY
truncate csverrtest;
TRUNCATE TABLE
copy csverrtest from '/tmp/csverrtest.csv' csv;
psql:csverr.sql:9: ERROR: literal carriage return found in data
HINT: Use "\r" to represent carriage return.
CONTEXT: COPY csverrtest, line 2: "2,"foo"
drop table csverrtest;
DROP TABLE
0000000 1 , a , 1 \n 2 , " f o o \r \n b a
0000020 r " , 2 \n 3 , " b a z \n b l u r
0000040 f l " , 3 \n 4 , d , 4 \n 5 , e ,
0000060 5 \n
0000062
[andrew(at)aloysius inst]$

Bruce Momjian wrote:

>Can I see an example of such a failure line?
>
>---------------------------------------------------------------------------
>
>Andrew Dunstan wrote:
>
>
>>Darcy Buskermolen has drawn my attention to unfortunate behaviour of
>>COPY CSV with fields containing embedded line end chars if the embedded
>>sequence isn't the same as those of the file containing the CSV data. In
>>that case we error out when reading the data in. This means there are
>>cases where we can produce a CSV data file which we can't read in, which
>>is not at all pleasant.
>>
>>Possible approaches to the problem:
>>. make it a documented limitation
>>. have a "csv read" mode for backend/commands/copy.c:CopyReadLine() that
>>relaxes some of the restrictions on inconsistent line endings
>>. escape embedded line end chars
>>
>>The last really isn't an option, because the whole point of CSVs is to
>>play with other programs, and my understanding is that those that
>>understand multiline fields (e.g. Excel) expect them not to be escaped,
>>and do not produce them escaped.
>>
>>So right now I'm tossing up in my head between the first two options. Or
>>maybe there's another solution I haven't thought of.
>>
>>Thoughts?
>>
>>cheers
>>
>>andrew
>>
>>
>>
>>
>
>
>


From: Patrick B Kelly <pbk(at)patrickbkelly(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multiline CSV fields
Date: 2004-11-12 15:45:50
Message-ID: ED80A94A-34C1-11D9-B14C-000A958A3956@patrickbkelly.org


On Nov 12, 2004, at 12:20 AM, Tom Lane wrote:

> Patrick B Kelly <pbk(at)patrickbkelly(dot)org> writes:
>> I may not be explaining myself well or I may fundamentally
>> misunderstand how copy works.
>
> Well, you're definitely ignoring the character-set-conversion issue.
>

I was not trying to ignore the character set and encoding issues, but
perhaps my assumptions are naive or overly optimistic. I realized that
quotes are not as consistent as the NL characters, but I was assuming
that some encodings would escape to ASCII, or to a similar encoding like
JIS Roman, in a way that would simplify recognition of the quote
character. Unicode files make recognizing other punctuation like the
quote fairly straightforward, and to the naive observer the code in
CopyReadLine, as currently written, appears to handle multi-byte
encodings such as SJIS that may present bytes below 127 in trailing
positions.
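
The encoding hazard Tom alludes to is concrete; Shift-JIS is the classic
example (a hedged illustration, not the copy.c code path): some
multi-byte characters end in ASCII-range trailing bytes, so scanning raw
bytes for a single-byte quote or escape character can match inside a
character.

```python
# Katakana "so" (U+30BD) encodes in Shift-JIS as 0x83 0x5C; the
# trailing byte 0x5C is ASCII backslash.
text = "ソ"
raw = text.encode("shift_jis")
print(raw)           # -> b'\x83\\'
# A naive byte scan for an escape character hits a false positive:
print(b"\\" in raw)  # -> True, yet the decoded string has no backslash
print("\\" in text)  # -> False
```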

As I said, perhaps I was oversimplifying. Is there a regression test
set of input files that I could review to see all of the supported
encodings?

Patrick B. Kelly
------------------------------------------------------
http://patrickbkelly.org


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multiline CSV fields
Date: 2004-11-29 03:32:25
Message-ID: 200411290332.iAT3WPX22551@candle.pha.pa.us


OK, what solutions do we have for this? Not being able to load dumped
data is a serious bug. I have added this to the open items list:

* fix COPY CSV with \r,\n in data

My feeling is that if we are in a quoted string we just process whatever
characters we find, even passing through an EOL. I realize it might not
mark missing quote errors well but that seems minor compared to not
loading valid data.

---------------------------------------------------------------------------

Andrew Dunstan wrote:
>
> This example should fail on data line 2 or 3 on any platform,
> regardless of the platform's line-end convention, although I haven't
> tested on Windows.
>
> cheers
>
> andrew
>
> [andrew(at)aloysius inst]$ bin/psql -e -f csverr.sql ; od -c
> /tmp/csverrtest.csv
> create table csverrtest (a int, b text, c int);
> CREATE TABLE
> insert into csverrtest values(1,'a',1);
> INSERT 122471 1
> insert into csverrtest values(2,'foo\r\nbar',2);
> INSERT 122472 1
> insert into csverrtest values(3,'baz\nblurfl',3);
> INSERT 122473 1
> insert into csverrtest values(4,'d',4);
> INSERT 122474 1
> insert into csverrtest values(5,'e',5);
> INSERT 122475 1
> copy csverrtest to '/tmp/csverrtest.csv' csv;
> COPY
> truncate csverrtest;
> TRUNCATE TABLE
> copy csverrtest from '/tmp/csverrtest.csv' csv;
> psql:csverr.sql:9: ERROR: literal carriage return found in data
> HINT: Use "\r" to represent carriage return.
> CONTEXT: COPY csverrtest, line 2: "2,"foo"
> drop table csverrtest;
> DROP TABLE
> 0000000 1 , a , 1 \n 2 , " f o o \r \n b a
> 0000020 r " , 2 \n 3 , " b a z \n b l u r
> 0000040 f l " , 3 \n 4 , d , 4 \n 5 , e ,
> 0000060 5 \n
> 0000062
> [andrew(at)aloysius inst]$
>
> Bruce Momjian wrote:
>
> >Can I see an example of such a failure line?
> >
> >---------------------------------------------------------------------------
> >
> >Andrew Dunstan wrote:
> >
> >
> >>Darcy Buskermolen has drawn my attention to unfortunate behaviour of
> >>COPY CSV with fields containing embedded line end chars if the embedded
> >>sequence isn't the same as those of the file containing the CSV data. In
> >>that case we error out when reading the data in. This means there are
> >>cases where we can produce a CSV data file which we can't read in, which
> >>is not at all pleasant.
> >>
> >>Possible approaches to the problem:
> >>. make it a documented limitation
> >>. have a "csv read" mode for backend/commands/copy.c:CopyReadLine() that
> >>relaxes some of the restrictions on inconsistent line endings
> >>. escape embedded line end chars
> >>
> >>The last really isn't an option, because the whole point of CSVs is to
> >>play with other programs, and my understanding is that those that
> >>understand multiline fields (e.g. Excel) expect them not to be escaped,
> >>and do not produce them escaped.
> >>
> >>So right now I'm tossing up in my head between the first two options. Or
> >>maybe there's another solution I haven't thought of.
> >>
> >>Thoughts?
> >>
> >>cheers
> >>
> >>andrew
> >>
> >>
> >>
> >>
> >
> >
> >
>

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multiline CSV fields
Date: 2004-11-29 03:58:43
Message-ID: 8238.1101700723@sss.pgh.pa.us

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> OK, what solutions do we have for this? Not being able to load dumped
> data is a serious bug.

Which we do not have, because pg_dump doesn't use CSV. I do not think
this is a must-fix, especially not if the proposed fix introduces
inconsistencies elsewhere.

regards, tom lane


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multiline CSV fields
Date: 2004-11-29 04:59:39
Message-ID: 200411290459.iAT4xd703639@candle.pha.pa.us

Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > OK, what solutions do we have for this? Not being able to load dumped
> > data is a serious bug.
>
> Which we do not have, because pg_dump doesn't use CSV. I do not think
> this is a must-fix, especially not if the proposed fix introduces
> inconsistencies elsewhere.

Sure, pg_dump doesn't use it, but COPY should be able to load anything
it outputs.

Can this be fixed if we ignore the problem with reporting errors?

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: "Andrew Dunstan" <andrew(at)dunslane(dot)net>
To: <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multiline CSV fields
Date: 2004-11-29 07:34:26
Message-ID: 3141.24.211.141.25.1101713666.squirrel@www.dunslane.net
Lists: pgsql-hackers pgsql-patches

Bruce Momjian said:
> Tom Lane wrote:
>> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
>> > OK, what solutions do we have for this? Not being able to load
>> > dumped data is a serious bug.
>>
>> Which we do not have, because pg_dump doesn't use CSV. I do not think
>> this is a must-fix, especially not if the proposed fix introduces
>> inconsistencies elsewhere.
>
> Sure, pg_dump doesn't use it but COPY should be able to load anything
> it output.
>
> Can this be fixed if we ignore the problem with reporting errors?
>

When I looked at it I could not see any simple fix that was not worse than
the symptom. If the asymmetry offends you, then we could do as Tom suggested
and rip out the multiline processing completely for now. Personally I would
regard that as a pity, as it would disallow many cases that will work quite
happily as we are, and because this is a feature that was requested when we
did this work. The limitation has been documented - my inclination would be
to revisit this during the 8.1 dev cycle.

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multiline CSV fields
Date: 2004-11-29 08:04:38
Message-ID: 10081.1101715478@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> Tom Lane wrote:
>> Which we do not have, because pg_dump doesn't use CSV. I do not think
>> this is a must-fix, especially not if the proposed fix introduces
>> inconsistencies elsewhere.

> Sure, pg_dump doesn't use it but COPY should be able to load anything it
> output.

I'd buy into that proposition if CSV showed any evidence of being a
sanely defined format, but it shows every indication of being neither
well-defined, nor self-consistent, nor even particularly portable.
I suggest adjusting your expectations. All I expect from that code is
being able to load the majority of data from the more popular Microsloth
applications. Trying to achieve 100% consistency for corner cases is
just going to interfere with the real use-case for the feature, which is
coping with output from applications that aren't very consistent in the
first place.

regards, tom lane


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multiline CSV fields
Date: 2004-11-29 13:00:20
Message-ID: 200411291300.iATD0Ke27741@candle.pha.pa.us
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > Tom Lane wrote:
> >> Which we do not have, because pg_dump doesn't use CSV. I do not think
> >> this is a must-fix, especially not if the proposed fix introduces
> >> inconsistencies elsewhere.
>
> > Sure, pg_dump doesn't use it but COPY should be able to load anything it
> > output.
>
> I'd buy into that proposition if CSV showed any evidence of being a
> sanely defined format, but it shows every indication of being neither
> well-defined, nor self-consistent, nor even particularly portable.
> I suggest adjusting your expectations. All I expect from that code is
> being able to load the majority of data from the more popular Microsloth
> applications. Trying to achieve 100% consistency for corner cases is
> just going to interfere with the real use-case for the feature, which is
> coping with output from applications that aren't very consistent in the
> first place.

OK, then should we disallow dumping out data in CSV format that we can't
load? Seems like the least we should do for 8.0.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multiline CSV fields
Date: 2004-11-29 13:27:48
Message-ID: 41AB23D4.6040509@dunslane.net
Lists: pgsql-hackers pgsql-patches

Bruce Momjian wrote:

>Tom Lane wrote:
>
>
>>Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
>>
>>
>>>Tom Lane wrote:
>>>
>>>
>>>>Which we do not have, because pg_dump doesn't use CSV. I do not think
>>>>this is a must-fix, especially not if the proposed fix introduces
>>>>inconsistencies elsewhere.
>>>>
>>>>
>>>Sure, pg_dump doesn't use it but COPY should be able to load anything it
>>>output.
>>>
>>>
>>I'd buy into that proposition if CSV showed any evidence of being a
>>sanely defined format, but it shows every indication of being neither
>>well-defined, nor self-consistent, nor even particularly portable.
>>I suggest adjusting your expectations. All I expect from that code is
>>being able to load the majority of data from the more popular Microsloth
>>applications. Trying to achieve 100% consistency for corner cases is
>>just going to interfere with the real use-case for the feature, which is
>>coping with output from applications that aren't very consistent in the
>>first place.
>>
>>
>
>OK, then should we disallow dumping out data in CSV format that we can't
>load? Seems like the least we should do for 8.0.
>
>
>

As Tom rightly points out, having data make the round trip was not the
goal of the exercise. Excel, for example, has no trouble reading such
data (or at least my installation of it).

Personally I consider CSVs with line end chars embedded in fields to be
broken anyway, but this was something that was specifically mentioned
when we were discussing requirements, which is why I coded for it.

cheers

andrew


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multiline CSV fields
Date: 2004-11-29 13:34:27
Message-ID: 200411291334.iATDYR603642@candle.pha.pa.us
Lists: pgsql-hackers pgsql-patches

Andrew Dunstan wrote:
> >OK, then should we disallow dumping out data in CSV format that we can't
> >load? Seems like the least we should do for 8.0.
> >
> >
> >
>
> As Tom rightly points out, having data make the round trip was not the
> goal of the exercise. Excel, for example, has no trouble reading such
> data (or at least my installation of it).
>
> Personally I consider CSVs with line end chars embedded in fields to be
> broken anyway, but this was something that was specifically mentioned
> when we were discussing requirements, which is why I coded for it.

OK, I am pretty uncomfortable with this but you know this usage better
than I do. Should we issue a warning message stating it will not be
able to be reloaded?

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multiline CSV fields
Date: 2004-11-29 13:41:26
Message-ID: 200411291341.iATDfQk04439@candle.pha.pa.us
Lists: pgsql-hackers pgsql-patches

Bruce Momjian wrote:
> Andrew Dunstan wrote:
> > >OK, then should we disallow dumping out data in CSV format that we can't
> > >load? Seems like the least we should do for 8.0.
> > >
> > >
> > >
> >
> > As Tom rightly points out, having data make the round trip was not the
> > goal of the exercise. Excel, for example, has no trouble reading such
> > data (or at least my installation of it).
> >
> > Personally I consider CSVs with line end chars embedded in fields to be
> > broken anyway, but this was something that was specifically mentioned
> > when we were discussing requirements, which is why I coded for it.
>
> OK, I am pretty uncomfortable with this but you know this usage better
> than I do. Should we issue a warning message stating it will not be
> able to be reloaded?

Also, can you explain why we can't read across a newline to the next
quote? Is it a problem with the way our code is structured or is it a
logical problem? Someone mentioned multibyte encodings but I don't
understand how that applies here.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multiline CSV fields
Date: 2004-11-29 14:00:28
Message-ID: 41AB2B7C.6060901@dunslane.net
Lists: pgsql-hackers pgsql-patches

Bruce Momjian wrote:

>Andrew Dunstan wrote:
>
>
>>>OK, then should we disallow dumping out data in CSV format that we can't
>>>load? Seems like the least we should do for 8.0.
>>>
>>>
>>>
>>>
>>>
>>As Tom rightly points out, having data make the round trip was not the
>>goal of the exercise. Excel, for example, has no trouble reading such
>>data (or at least my installation of it).
>>
>>Personally I consider CSVs with line end chars embedded in fields to be
>>broken anyway, but this was something that was specifically mentioned
>>when we were discussing requirements, which is why I coded for it.
>>
>>
>
>OK, I am pretty uncomfortable with this but you know this usage better
>than I do. Should we issue a warning message stating it will not be
>able to be reloaded?
>
>

If it bothers you that much. I'd make a flag, cleared at the start of
each COPY, and then where we test for CR or LF in CopyAttributeOutCSV,
if the flag is not set then set it and issue the warning.
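The warn-once logic being described might look roughly like the following. This is a purely illustrative Python sketch of the idea - the real change would be a C flag in CopyAttributeOutCSV in backend/commands/copy.c, and the function and message here are hypothetical:

```python
import csv, io

def copy_to_csv(rows, warn):
    # Hypothetical sketch: a flag cleared at the start of each COPY,
    # set the first time a field contains a CR or LF, so the warning
    # is issued at most once per COPY rather than once per row.
    warned = False
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        if not warned and any("\r" in f or "\n" in f for f in row):
            warn("CSV fields with embedded line-end characters "
                 "may not be reloadable by COPY")
            warned = True
        writer.writerow(row)
    return buf.getvalue()

warnings = []
copy_to_csv([["a", "x\ny"], ["b", "p\rq"]], warnings.append)
```

Both rows contain embedded line ends, but only one warning is collected.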

Longer term I'd like to be able to have a command parameter that
specifies certain fields as multiline and for those relax the line end
matching restriction (and for others forbid multiline altogether). That
would be a TODO for 8.1 though, along with optional special handling for
first line column headings.

cheers

andrew


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multiline CSV fields
Date: 2004-11-29 14:23:57
Message-ID: 41AB30FD.306@dunslane.net
Lists: pgsql-hackers pgsql-patches

Bruce Momjian wrote:

>
>Also, can you explain why we can't read across a newline to the next
>quote? Is it a problem with the way our code is structured or is it a
>logical problem? Someone mentioned multibyte encodings but I don't
>understand how that applies here.
>
>
>

In a CSV file, each line is a record. Reading across a newline for the
next quote (assuming the next field is quoted) would mean stealing
fields from the next record.

I did see one complaint about missing or extra fields at the end of a
record - I think it is reasonable for us to expect the data to be
rectangular, and not ragged.
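The row-stealing hazard is easy to demonstrate outside the backend. In this illustrative Python sketch (using the standard csv module, not copy.c), a naive "one line = one record" splitter sees two ragged records, while a parser that tokenizes quotes before splitting rows recovers the single two-field record:

```python
import csv, io

# One record whose second field contains an embedded line break.
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerow(["a", "line one\nline two"])
data = buf.getvalue()

# Splitting into lines first yields two ragged "records"...
naive_rows = [line.split(",") for line in data.splitlines()]

# ...while tokenizing quotes before splitting rows recovers one record.
parsed_rows = list(csv.reader(io.StringIO(data)))
```

Here `naive_rows` has two entries and `parsed_rows` has one, which is exactly the tension between splitting input into rows and splitting it into columns.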

(I hope this answers your question - I am not 100% certain I understood
it.)

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multiline CSV fields
Date: 2004-11-29 15:42:29
Message-ID: 18519.1101742949@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> Also, can you explain why we can't read across a newline to the next
> quote? Is it a problem with the way our code is structured or is it a
> logical problem?

It's a structural issue in the sense that we separate the act of
dividing the input into rows from the act of dividing it into columns.
I do not think that separation is wrong however.

regards, tom lane


From: Kris Jurka <books(at)ejurka(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multiline CSV fields
Date: 2004-11-30 01:36:42
Message-ID: Pine.BSO.4.56.0411292031090.15008@leary.csoft.net
Lists: pgsql-hackers pgsql-patches

On Mon, 29 Nov 2004, Andrew Dunstan wrote:

> Longer term I'd like to be able to have a command parameter that
> specifies certain fields as multiline and for those relax the line end
> matching restriction (and for others forbid multiline altogether). That
> would be a TODO for 8.1 though, along with optional special handling for
> first line column headings.
>

Endlessly extending the COPY command doesn't seem like a winning
proposition to me and I think if we aren't comfortable telling every user
to write a script to pre/post-process the data we should instead provide a
bulk loader/unloader that transforms things to our limited COPY
functionality. There are all kinds of feature requests I've seen
along these lines that would make COPY a million option mess if we try to
support all of it directly.

- skipping header rows
- skipping certain data file columns
- specifying date formats
- ignoring duplicates
- outputting an arbitrary SELECT statement

Kris Jurka


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Kris Jurka <books(at)ejurka(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multiline CSV fields
Date: 2004-11-30 03:25:08
Message-ID: 200411300325.iAU3P8h14433@candle.pha.pa.us
Lists: pgsql-hackers pgsql-patches

Kris Jurka wrote:
>
>
> On Mon, 29 Nov 2004, Andrew Dunstan wrote:
>
> > Longer term I'd like to be able to have a command parameter that
> > specifies certain fields as multiline and for those relax the line end
> > matching restriction (and for others forbid multiline altogether). That
> > would be a TODO for 8.1 though, along with optional special handling for
> > first line column headings.
> >
>
> Endlessly extending the COPY command doesn't seem like a winning
> proposition to me and I think if we aren't comfortable telling every user
> to write a script to pre/post-process the data we should instead provide a
> bulk loader/unloader that transforms things to our limited COPY
> functionality. There are all kinds of feature requests I've seen
> along these lines that would make COPY a million option mess if we try to
> support all of it directly.
>
> - skipping header rows
> - skipping certain data file columns
> - specifying date formats
> - ignoring duplicates
> - outputting an arbitrary SELECT statement

Agreed. There are lots of wishes for COPY and it will become bloated if
we do them all.

I am concerned someone will say, "Oh, I know the CSV format and I might
load the data into another database someday so I will always use CSV"
not knowing it isn't a 100% consistent format. I think we need to
issue a warning if a \r or \n is output by COPY CSV just so people
understand the limitation. We can then reevaluate where we need to go
for 8.1.

Open item updated:

* warn on COPY TO ... CSV with \r,\n in data

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Kris Jurka <books(at)ejurka(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multiline CSV fields
Date: 2004-11-30 03:35:10
Message-ID: 5687.1101785710@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Kris Jurka <books(at)ejurka(dot)com> writes:
> Endlessly extending the COPY command doesn't seem like a winning
> proposition to me and I think if we aren't comfortable telling every user
> to write a script to pre/post-process the data we should instead provide a
> bulk loader/unloader that transforms things to our limited COPY
> functionality. There are all kinds of feature requests I've seen
> along these lines that would make COPY a million option mess if we try to
> support all of it directly.

I agree completely --- personally I'd not have put CSV into the backend
either.

IIRC we already have a TODO item for a separate bulk loader, but no
one's stepped up to the plate yet :-(

regards, tom lane


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kris Jurka <books(at)ejurka(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multiline CSV fields
Date: 2004-11-30 03:53:35
Message-ID: 200411300353.iAU3rZr17548@candle.pha.pa.us
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> Kris Jurka <books(at)ejurka(dot)com> writes:
> > Endlessly extending the COPY command doesn't seem like a winning
> > proposition to me and I think if we aren't comfortable telling every user
> > to write a script to pre/post-process the data we should instead provide a
> > bulk loader/unloader that transforms things to our limited COPY
> > functionality. There are all kinds of feature requests I've seen
> > along these lines that would make COPY a million option mess if we try to
> > support all of it directly.
>
> I agree completely --- personally I'd not have put CSV into the backend
> either.

What pushed us was the large number of requests for it.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kris Jurka <books(at)ejurka(dot)com>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multiline CSV fields
Date: 2004-11-30 04:34:13
Message-ID: 41ABF845.2060200@dunslane.net
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:

>Kris Jurka <books(at)ejurka(dot)com> writes:
>
>
>>Endlessly extending the COPY command doesn't seem like a winning
>>proposition to me and I think if we aren't comfortable telling every user
>>to write a script to pre/post-process the data we should instead provide a
>>bulk loader/unloader that transforms things to our limited COPY
>>functionality. There are all kinds of feature requests I've seen
>>along these lines that would make COPY a million option mess if we try to
>>support all of it directly.
>>
>>
>
>I agree completely --- personally I'd not have put CSV into the backend
>either.
>
>IIRC we already have a TODO item for a separate bulk loader, but no
>one's stepped up to the plate yet :-(
>

IIRC, the way it happened was that a proposal was made to do CSV
import/export in a fairly radical way. I countered with a much more modest
approach, which was generally accepted and which Bruce and I then
implemented, not without some angst (as well as a little sturm und drang).

The advantage of having it in COPY is that it can be done serverside
direct from the file system. For massive bulk loads that might be a
plus, although I don't know what the protocol+socket overhead is. Maybe
it would just be lost in the noise. Certainly I can see some sense in
having COPY deal with straightforward cases and a bulk-load-unload
program in bin to handle the hairier cases. Multiline fields would come
into that category. The bulk-load-unload facility could possibly handle
things other than CSV format too (XML anyone?). The nice thing about an
external program is that it would not have to handle data embedded in an
SQL stream, so the dangers from shifts in newline style, missing quotes,
and the like would be far lower.

We do need to keep things in perspective a bit. The small wrinkle that
has spawned this whole thread will not affect most users of the facility
- and many, many users will thank us for having provided it.

cheers

andrew


From: Greg Stark <gsstark(at)mit(dot)edu>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: multiline CSV fields
Date: 2004-11-30 06:48:35
Message-ID: 87oehf96ik.fsf@stark.xeocode.com
Lists: pgsql-hackers pgsql-patches


Andrew Dunstan <andrew(at)dunslane(dot)net> writes:

> The advantage of having it in COPY is that it can be done serverside direct
> from the file system. For massive bulk loads that might be a plus, although I
> don't know what the protocol+socket overhead is.

Actually even if you use client-side COPY it's *still* more efficient than any
more general client-side alternative.

As Tom pointed out to me a while back, neither the protocol nor libpq allow
for having multiple queries in flight simultaneously. That makes it impossible
to stream large quantities of data to the server efficiently. Each record
requires a round-trip and context switch overhead.

In an ideal world the client should be able to queue up enough records to fill
the socket buffers and allow the kernel to switch to a more batch oriented
context switch mode where the server can process large numbers of records
before switching back to the client. Ideally this would apply to any kind of
query execution.

But as a kind of short cut towards this for bulk loading I'm curious whether
it would be possible to adopt a sort of batch execution mode where a statement
is prepared, then parameters to the statement are streamed to the server in a
kind of COPY mode. It would have to be some format that allowed for embedded
newlines though; there's just no point in an interface that can't handle
arbitrary data.

Personally I find the current CSV support inadequate. It seems pointless to
support CSV if it can't load data exported from Excel, which seems like the
main use case. But I always thought bulk loading should be from some external
application anyways. The problem is that there isn't any interface suitable
for an external application to use.

--
greg


From: Kris Jurka <books(at)ejurka(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: multiline CSV fields
Date: 2004-11-30 12:27:40
Message-ID: Pine.BSO.4.56.0411300718230.29983@leary.csoft.net
Lists: pgsql-hackers pgsql-patches

On Tue, 30 Nov 2004, Greg Stark wrote:

>
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>
> > The advantage of having it in COPY is that it can be done serverside
> > direct from the file system. For massive bulk loads that might be a
> > plus, although I don't know what the protocol+socket overhead is.
>
> Actually even if you use client-side COPY it's *still* more efficient than any
> more general client-side alternative.

The idea would be to still use COPY just from a program that did
additional processing, not as direct SQL.

> As Tom pointed out to me a while back, neither the protocol nor libpq allow
> for having multiple queries in flight simultaneously. That makes it impossible
> to stream large quantities of data to the server efficiently. Each record
> requires a round-trip and context switch overhead.

Multiplexing queries is different than having multiple queries in flight.
You can have multiple queries on the wire now, they are just processed
sequentially.

> In an ideal world the client should be able to queue up enough records to fill
> the socket buffers and allow the kernel to switch to a more batch oriented
> context switch mode where the server can process large numbers of records
> before switching back to the client. Ideally this would apply to any kind of
> query execution.

This is possible now with the V3 protocol (and used by the JDBC driver).
For an executeBatch() on a PreparedStatement, the driver sends a parse
message, then any number of bind/execute pairs, but with a buffered stream
nothing happens until a Sync message is sent and the stream is flushed.
Then it collects the results of all of the executes.

Kris Jurka


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: multiline CSV fields
Date: 2004-11-30 13:03:48
Message-ID: 41AC6FB4.7090200@dunslane.net
Lists: pgsql-hackers pgsql-patches

Greg Stark wrote:

>Personally I find the current CSV support inadequate. It seems pointless to
>support CSV if it can't load data exported from Excel, which seems like the
>main use case.
>
>

OK, I'm starting to get mildly annoyed now. We have identified one
failure case connected with multiline fields. Even in the multiline
case, I expect that the embedded newlines will usually match those of
the CSV file, in which case there will be no failure. It's a very big
step from there to the far more general "can't load data exported from
Excel". Or did you have some other limitation in mind?

FWIW, I don't make a habit of using multiline fields in my spreadsheets
- and some users I have spoken to aren't even aware that you can have
them at all.

cheers

andrew


From: Ben(dot)Young(at)risk(dot)sungard(dot)com
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: multiline CSV fields
Date: 2004-11-30 13:26:38
Message-ID: OF32B80860.F00446F7-ON80256F5C.0049621F-80256F5C.00496265@risk.sungard.com
Lists: pgsql-hackers pgsql-patches

>
>
> Greg Stark wrote:
>
> >Personally I find the current CSV support inadequate. It seems pointless to
> >support CSV if it can't load data exported from Excel, which seems like the
> >main use case.
> >
> >
>
> OK, I'm starting to get mildly annoyed now. We have identified one
> failure case connected with multiline fields. Even in the multiline
> case, I expect that the embedded newlines will usually match those of
> the CSV file, in which case there will be no failure. It's a very big
> step from there to the far more general "can't load data exported from
> Excel". Or did you have some other limitation in mind?
>
> FWIW, I don't make a habit of using multiline fields in my spreadsheets
> - and some users I have spoken to aren't even aware that you can have
> them at all.
>

I am normally more of a lurker on these lists, but I thought you had better
know that when we developed CSV import/export for an application at my last
company, we discovered that Excel can't always even read the CSV that _it_
has output! (With embedded newlines a particular problem.)

It is far more reliable if you output your data as an HTML table, in which
case it practically always gets it right. Perhaps Postgres could support
this as an import/export mechanism, as I have found it to be far less error
prone than CSV?

Cheers,
Ben Young

> cheers
>
> andrew
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/docs/faqs/FAQ.html


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Ben(dot)Young(at)risk(dot)sungard(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: multiline CSV fields
Date: 2004-11-30 16:05:56
Message-ID: 41AC9A64.4090102@dunslane.net
Lists: pgsql-hackers pgsql-patches

Ben(dot)Young(at)risk(dot)sungard(dot)com wrote:

>
>I am normally more of a lurker on these lists, but I thought you had
>better know that when we developed CSV import/export for an application at
>my last company we discovered that Excel can't always even read the CSV
>that _it_ has output! (With embedded newlines a particular problem)
>
>

Quite so. This is not about being perfect, and there will be failures.
But, absent this problem, the feature should work reasonably well.

Note that Excel is not the only kid on the block - if it were, there would
be a good case for avoiding CSV anyway and instead reading/writing the
Excel files directly. We included support for CSVs because, notwithstanding
how braindead the format is, it is still widely used in data exchange.

>It is far more reliable if you output your data as an HTML table, in which
>case it practically always gets it right. Perhaps Postgres could support
>this as an import/export mechanism as I have found it to be far less error
>prone than CSV?
>
>

That's probably possible, but rather ugly. A well-defined XML format
would be far better (don't we have something like that in contrib?).
Things for a bulk-export facility, I think.

cheers

andrew


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: multiline CSV fields
Date: 2004-11-30 16:09:15
Message-ID: 87d5xv8gk4.fsf@stark.xeocode.com
Lists: pgsql-hackers pgsql-patches


Andrew Dunstan <andrew(at)dunslane(dot)net> writes:

> FWIW, I don't make a habit of using multiline fields in my spreadsheets - and
> some users I have spoken to aren't even aware that you can have them at all.

Unfortunately I don't get a choice. I offer a field on the web site where
users can upload an excel sheet. Some of the fields of my database are
expected to have multi-line text in them. So I expect virtually all the
uploads to have multi-line fields in them.

So as far as I'm concerned, this import system is simply unusable. I have to
write a program to do the import. Since I was always planning to write such a
program I'm not too disappointed though.

--
greg


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: multiline CSV fields
Date: 2004-11-30 17:57:58
Message-ID: 200411301757.iAUHvwU22351@candle.pha.pa.us
Lists: pgsql-hackers pgsql-patches

Andrew Dunstan wrote:
>
>
> Greg Stark wrote:
>
> >Personally I find the current CSV support inadequate. It seems pointless to
> >support CSV if it can't load data exported from Excel, which seems like the
> >main use case.
> >
> >
>
> OK, I'm starting to get mildly annoyed now. We have identified one
> failure case connected with multiline fields. Even in the multiline
> case, I expect that the embedded newlines will usually match those of
> the CSV file, in which case there will be no failure. It's a very big
> step from there to the far more general "can't load data exported from
> Excel". Or did you have some other limitation in mind?
>
> FWIW, I don't make a habit of using multiline fields in my spreadsheets
> - and some users I have spoken to aren't even aware that you can have
> them at all.

I am wondering if one good solution would be to pre-process the input
stream in copy.c to convert newline to \n and carriage return to \r and
double data backslashes and tell copy.c to interpret those like it does
for normal text COPY files. That way, the changes to copy.c might be
minimal; basically, place a filter in front of the CSV file that cleans
up the input so it can be more easily processed.
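To make the filter idea concrete, here is an illustrative sketch (in Python, not the C of copy.c) of what such a pre-processing pass might look like. It is a hypothetical helper, not PostgreSQL code: it walks the CSV text tracking quote state, and inside quoted fields rewrites embedded line ends as the two-character sequences \n and \r while doubling backslashes, so a line-based reader downstream sees exactly one record per physical line.

```python
def prefilter(csv_text: str) -> str:
    """Escape embedded line ends inside quoted CSV fields so that a
    line-oriented reader sees one record per physical line.

    Hypothetical sketch: assumes '"' quoting with a doubled quote as
    the in-field escape, per common CSV practice.
    """
    out = []
    in_quotes = False
    i = 0
    n = len(csv_text)
    while i < n:
        c = csv_text[i]
        if c == '"':
            if in_quotes and i + 1 < n and csv_text[i + 1] == '"':
                out.append('""')   # doubled quote: a literal " inside the field
                i += 2
                continue
            in_quotes = not in_quotes
            out.append(c)
        elif in_quotes and c == '\n':
            out.append('\\n')      # escape embedded linefeed
        elif in_quotes and c == '\r':
            out.append('\\r')      # escape embedded carriage return
        elif c == '\\':
            out.append('\\\\')     # double backslashes, as text COPY does
        else:
            out.append(c)          # record-ending newlines pass through
        i += 1
    return ''.join(out)
```

Note that even this "simple" filter already has to track quote state to tell data line ends from record-ending ones, which is essentially the point raised in the reply.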

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: multiline CSV fields
Date: 2004-11-30 19:11:28
Message-ID: 41ACC5E0.2040507@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Bruce Momjian wrote:

>I am wondering if one good solution would be to pre-process the input
>stream in copy.c to convert newline to \n and carriage return to \r and
>double data backslashes and tell copy.c to interpret those like it does
>for normal text COPY files. That way, the changes to copy.c might be
>minimal; basically, place a filter in front of the CSV file that cleans
>up the input so it can be more easily processed.
>
>

This would have to parse the input stream, because you would need to
know which CRs and LFs were part of the data stream and so should be
escaped, and which really ended data lines and so should be left alone.
However, while the idea is basically sound, parsing the stream twice
seems crazy. My argument has been that at this stage in the dev cycle we
should document the limitation, maybe issue a warning as you want, and
make the more invasive code changes to fix it properly in 8.1. If you
don't want to wait, then following your train of thought a bit, ISTM
that the correct solution is a routine for CSV mode that combines the
functions of CopyReadAttributeCSV() and CopyReadLine(). Then we'd have a
genuine and fast fix for Greg's and Darcy's problem.
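A field-aware combined routine of the kind described might behave roughly like the following sketch (Python, purely illustrative of the logic rather than of copy.c): line-end characters terminate a record only when the parser is outside a quoted field, so embedded LF, CR, or CRLF inside quotes survive as data regardless of the file's own line-ending convention.

```python
import io


def read_csv_records(stream):
    """Yield complete logical CSV records from a character stream.

    Field-aware sketch: LF, CR, or CRLF ends a record only outside a
    quoted field; inside quotes those characters are ordinary data.
    Assumes '"' quoting (a doubled quote toggles state twice, which
    nets out to staying inside the field).
    """
    record = []
    in_quotes = False
    pending_crlf = False                 # saw a record-ending CR; eat a following LF
    for c in iter(lambda: stream.read(1), ''):
        if pending_crlf and c == '\n':
            pending_crlf = False
            continue                     # LF half of CRLF: record already ended
        pending_crlf = False
        if c == '"':
            in_quotes = not in_quotes
            record.append(c)
        elif c in ('\n', '\r') and not in_quotes:
            yield ''.join(record)        # genuine end of record
            record = []
            if c == '\r':
                pending_crlf = True
        else:
            record.append(c)             # includes line-end chars inside quotes
    if record:
        yield ''.join(record)            # final record without trailing newline
```

For example, a CRLF-terminated file whose first record carries an embedded CRLF inside a quoted field splits into two records, with the embedded sequence preserved.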

cheers

andrew


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: multiline CSV fields
Date: 2004-11-30 19:34:06
Message-ID: 200411301934.iAUJY6u04191@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Andrew Dunstan wrote:
>
>
> Bruce Momjian wrote:
>
> >I am wondering if one good solution would be to pre-process the input
> >stream in copy.c to convert newline to \n and carriage return to \r and
> >double data backslashes and tell copy.c to interpret those like it does
> >for normal text COPY files. That way, the changes to copy.c might be
> >minimal; basically, place a filter in front of the CSV file that cleans
> >up the input so it can be more easily processed.
> >
> >
>
> This would have to parse the input stream, because you would need to
> know which CRs and LFs were part of the data stream and so should be
> escaped, and which really ended data lines and so should be left alone.
> However, while the idea is basically sound, parsing the stream twice
> seems crazy. My argument has been that at this stage in the dev cycle we
> should document the limitation, maybe issue a warning as you want, and
> make the more invasive code changes to fix it properly in 8.1. If you

OK, right.

> don't want to wait, then following your train of thought a bit, ISTM
> that the correct solution is a routine for CSV mode that combines the
> functions of CopyReadAttributeCSV() and CopyReadLine(). Then we'd have a
> genuine and fast fix for Greg's and Darcy's problem.

So we are fine for 8.0, except for adding the warning, and you think we
can fix it properly in 8.1. Good.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: "Patches (PostgreSQL)" <pgsql-patches(at)postgresql(dot)org>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] multiline CSV fields
Date: 2004-12-02 23:58:29
Message-ID: 41AFAC25.3080405@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

I wrote:

>
> If it bothers you that much, I'd make a flag, cleared at the start of
> each COPY, and then where we test for CR or LF in CopyAttributeOutCSV,
> if the flag is not set then set it and issue the warning.

I didn't realise until Bruce told me just now that I was on the hook for
this. I guess I should keep my big mouth shut. (Yeah, that's gonna
happen ...)

Anyway, here's a tiny patch that does what I had in mind.

cheers

andrew

Attachment Content-Type Size
copy.patch text/x-patch 1.8 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: "Patches (PostgreSQL)" <pgsql-patches(at)postgresql(dot)org>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] multiline CSV fields
Date: 2004-12-03 00:10:51
Message-ID: 9988.1102032651@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> + if (!embedded_line_warning && (c == '\n' || c == '\r') )
> + {
> + embedded_line_warning = true;
> + elog(WARNING,
> + "CSV fields with embedded linefeed or carriage return "
> + "characters might not be able to be reimported");
> + }

What about forcibly translating them to the two-character sequences \n
or \r? Or is that not considered a CSV-compatible representation?

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Patches (PostgreSQL)" <pgsql-patches(at)postgresql(dot)org>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] multiline CSV fields
Date: 2004-12-03 00:27:03
Message-ID: 41AFB2D7.9020000@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:

>Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>
>
>>+ if (!embedded_line_warning && (c == '\n' || c == '\r') )
>>+ {
>>+ embedded_line_warning = true;
>>+ elog(WARNING,
>>+ "CSV fields with embedded linefeed or carriage return "
>>+ "characters might not be able to be reimported");
>>+ }
>>
>>
>
>What about forcibly translating them to the two-character sequences \n
>or \r? Or is that not considered a CSV-compatible representation?
>
>
>
>

Not compatible AFAIK. Certainly not portably. And the warning would
still be true, because we don't do this unescaping on the way back in. I
think the way the comment in the patch suggests and previous emails have
discussed is the right way to go with this - I will attend to it after
we branch ;-)
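One data point for how other CSV consumers behave (illustrative, not something from this thread's patch): Python's csv module, like Excel as described earlier in the thread, writes an embedded newline literally inside a quoted field rather than backslash-escaping it, and reads it back the same way.

```python
import csv
import io

# Round-trip a field containing an embedded linefeed with Python's
# csv module: the newline is kept literally inside the quotes.
buf = io.StringIO()
csv.writer(buf).writerow(['a', 'line1\nline2'])
print(repr(buf.getvalue()))   # -> 'a,"line1\nline2"\r\n'

buf.seek(0)
row = next(csv.reader(buf))
print(row)                    # -> ['a', 'line1\nline2']
```

So an exporter that emitted \n instead would not interoperate with readers like this one, which supports leaving the representation literal.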

cheers

andrew


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: "Patches (PostgreSQL)" <pgsql-patches(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] multiline CSV fields
Date: 2004-12-03 17:13:26
Message-ID: 200412031713.iB3HDQE04288@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches


Patch applied. Thanks.

---------------------------------------------------------------------------

Andrew Dunstan wrote:
>
>
> I wrote:
>
> >
> > If it bothers you that much. I'd make a flag, cleared at the start of
> > each COPY, and then where we test for CR or LF in CopyAttributeOutCSV,
> > if the flag is not set then set it and issue the warning.
>
>
>
> I didn't realise until Bruce told me just now that I was on the hook for
> this. I guess I should keep my big mouth shut. (Yeah, that's gonna
> happen ...)
>
> Anyway, here's a tiny patch that does what I had in mind.
>
> cheers
>
> andrew

[ text/x-patch is unsupported, treating like TEXT/PLAIN ]

> Index: copy.c
> ===================================================================
> RCS file: /home/cvsmirror/pgsql/src/backend/commands/copy.c,v
> retrieving revision 1.234
> diff -c -r1.234 copy.c
> *** copy.c 6 Nov 2004 17:46:27 -0000 1.234
> --- copy.c 2 Dec 2004 23:34:20 -0000
> ***************
> *** 98,103 ****
> --- 98,104 ----
> static EolType eol_type; /* EOL type of input */
> static int client_encoding; /* remote side's character encoding */
> static int server_encoding; /* local encoding */
> + static bool embedded_line_warning;
>
> /* these are just for error messages, see copy_in_error_callback */
> static bool copy_binary; /* is it a binary copy? */
> ***************
> *** 1190,1195 ****
> --- 1191,1197 ----
> attr = tupDesc->attrs;
> num_phys_attrs = tupDesc->natts;
> attr_count = list_length(attnumlist);
> + embedded_line_warning = false;
>
> /*
> * Get info about the columns we need to process.
> ***************
> *** 2627,2632 ****
> --- 2629,2653 ----
> !use_quote && (c = *test_string) != '\0';
> test_string += mblen)
> {
> + /*
> + * We don't know here what the surrounding line end characters
> + * might be. It might not even be under postgres' control. So
> + * we simply warn on ANY embedded line ending character.
> + *
> + * This warning will disappear when we make line parsing field-aware,
> + * so that we can reliably read in embedded line ending characters
> + * regardless of the file's line-end context.
> + *
> + */
> +
> + if (!embedded_line_warning && (c == '\n' || c == '\r') )
> + {
> + embedded_line_warning = true;
> + elog(WARNING,
> + "CSV fields with embedded linefeed or carriage return "
> + "characters might not be able to be reimported");
> + }
> +
> if (c == delimc || c == quotec || c == '\n' || c == '\r')
> use_quote = true;
> if (!same_encoding)

>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo(at)postgresql(dot)org

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073