Re: [BUGS] BUG #2114: (patch) COPY FROM ... end of copy marker corrupt

Lists: pgsql-bugspgsql-patches
From: "Ben Gould" <ben(dot)gould(at)free(dot)fr>
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #2114: (patch) COPY FROM ... end of copy marker corrupt
Date: 2005-12-14 16:45:22
Message-ID: 20051214164522.32DF0F0B12@svr2.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-patches


The following bug has been logged online:

Bug reference: 2114
Logged by: Ben Gould
Email address: ben(dot)gould(at)free(dot)fr
PostgreSQL version: 8.1.0
Operating system: Mac OS X 10.4.3
Description: (patch) COPY FROM ... end of copy marker corrupt
Details:

With a table like:

CREATE TABLE test_table (
foo text,
bar text,
baz text
);

Using this format for COPY FROM:

COPY test_table FROM STDIN WITH CSV HEADER DELIMITER AS ',' NULL AS 'NULL'
QUOTE AS '\"' ESCAPE AS '\"'

Where the file was generated via:

COPY test_table TO STDOUT WITH CSV HEADER DELIMITER AS ',' NULL AS 'NULL'
QUOTE AS '\"' ESCAPE AS '\"' FORCE QUOTE foo, bar, baz;

I needed this patch:

<<<
--- postgresql-8.1.0.original/src/backend/commands/copy.c 2005-12-13
13:18:16.000000000 +0100
+++ postgresql-8.1.0/src/backend/commands/copy.c 2005-12-13
13:28:28.000000000 +0100
@@ -2531,7 +2531,7 @@
/*
* In CSV mode, we only recognize \. at start of line
*/
- if (c == '\\' && cstate->line_buf.len == 0)
+ if (c == '\\' && !in_quote && cstate->line_buf.len == 0)
{
char c2;
>>>

Because of this error message:

pg_endcopy warning: ERROR: end-of-copy marker corrupt

(We have quoted strings containing things like ..\..\.. in the CSV file
which broke the copy from.)

I was using DBD::Pg as the client library.


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Ben Gould <ben(dot)gould(at)free(dot)fr>
Cc: PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: [BUGS] BUG #2114: (patch) COPY FROM ... end of copy marker corrupt
Date: 2005-12-27 03:24:40
Message-ID: 200512270324.jBR3OeK11101@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-patches


Sorry for the delay in responding. I have done research on your bug
report, and the problem seems even worse than you reported. First, a
little background. In non-CSV text mode, the backslash is the escape
character, so any character appearing after a backslash (all 255 of
them) is treated specially, e.g. \{delimiter}, \r, \n, and for our case
here '\.'. (A literal backslash is \\). In CSV mode, the quote is
special, but we don't have 255 special characters after a quote. Only
two double-quotes, "", are special, a literal double-quote.

This behavior gives us problems for specifying the end-of-copy marker,
which is \, in both modes. The big problem is that \. is also a valid
CSV data value (though not a valid non-CSV data value). So, the
solution we came up with was to require \. to appear alone on a line in
CSV mode for it to be treated as end-of-copy. Your idea of using quotes
worked, but it wasn't the right solution. We need to enforce the
alone-on-a-line restriction. Our code had:

if (c == '\\' && cstate->line_buf.len == 0)

The problem with that is the because of the input and _output_
buffering, cstate->line_buf.len could be zero even if we are not on the
first character of a line. In fact, for a typical line, it is zero for
all characters on the line. The proper solution is to introduce a
boolean, first_char_in_line, that we set as we enter the loop and clear
once we process a character.

Looking closer at the code, I see the reason for email comments like
"the copy code is nearing unmaintainability. The CSV/non-CSV code was
already complex, but the buffering additions in 8.1 pushed it over the
edge.

I have restructured the line-reading code in copy.c by:

o merging the CSV/non-CSV functions into a single function
o used macros to centralize and clarify the buffering code
o updated comments
o renamed client_encoding_only to encoding_embeds_ascii
o added a high-bit test to the encoding_embeds_ascii test for
performance
o in CSV mode, allow a backslash followed by a non-period to
continue being processed as a data value

There should be no performance impact from this patch because it is
functionally equivalent. If you apply the patch you will see copy.c is
much clearer in this area now and might suggest additional
optimizations.

I have also attached a 8.1-only patch to fix the CSV \. handling bug
with no code restructuring.

---------------------------------------------------------------------------

Ben Gould wrote:
>
> The following bug has been logged online:
>
> Bug reference: 2114
> Logged by: Ben Gould
> Email address: ben(dot)gould(at)free(dot)fr
> PostgreSQL version: 8.1.0
> Operating system: Mac OS X 10.4.3
> Description: (patch) COPY FROM ... end of copy marker corrupt
> Details:
>
> With a table like:
>
> CREATE TABLE test_table (
> foo text,
> bar text,
> baz text
> );
>
> Using this format for COPY FROM:
>
> COPY test_table FROM STDIN WITH CSV HEADER DELIMITER AS ',' NULL AS 'NULL'
> QUOTE AS '\"' ESCAPE AS '\"'
>
> Where the file was generated via:
>
> COPY test_table TO STDOUT WITH CSV HEADER DELIMITER AS ',' NULL AS 'NULL'
> QUOTE AS '\"' ESCAPE AS '\"' FORCE QUOTE foo, bar, baz;
>
> I needed this patch:
>
> <<<
> --- postgresql-8.1.0.original/src/backend/commands/copy.c 2005-12-13
> 13:18:16.000000000 +0100
> +++ postgresql-8.1.0/src/backend/commands/copy.c 2005-12-13
> 13:28:28.000000000 +0100
> @@ -2531,7 +2531,7 @@
> /*
> * In CSV mode, we only recognize \. at start of line
> */
> - if (c == '\\' && cstate->line_buf.len == 0)
> + if (c == '\\' && !in_quote && cstate->line_buf.len == 0)
> {
> char c2;
> >>>
>
> Because of this error message:
>
> pg_endcopy warning: ERROR: end-of-copy marker corrupt
>
> (We have quoted strings containing things like ..\..\.. in the CSV file
> which broke the copy from.)
>
> I was using DBD::Pg as the client library.
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/docs/faq
>

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073

Attachment Content-Type Size
unknown_filename text/plain 35.9 KB
unknown_filename text/plain 4.1 KB

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Ben Gould <ben(dot)gould(at)free(dot)fr>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: [BUGS] BUG #2114: (patch) COPY FROM ... end of copy
Date: 2005-12-27 14:51:08
Message-ID: 43B154DC.8090100@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-patches

Bruce Momjian wrote:

> The big problem is that \. is also a valid
>CSV data value (though not a valid non-CSV data value). So, the
>solution we came up with was to require \. to appear alone on a line in
>CSV mode for it to be treated as end-of-copy.
>

According to the docs, that's the way to specify EOD in both text and
CSV mode:

End of data can be represented by a single line containing just
backslash-period.

Your analysis regarding line_buf.len seems correct.

We probably should have a regression test with \. in a CSV field.

cheers

andrew


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Ben Gould <ben(dot)gould(at)free(dot)fr>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: [BUGS] BUG #2114: (patch) COPY FROM ... end of copy marker
Date: 2005-12-27 17:49:13
Message-ID: 200512271749.jBRHnDG07431@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-patches

Andrew Dunstan wrote:
>
>
> Bruce Momjian wrote:
>
> > The big problem is that \. is also a valid
> >CSV data value (though not a valid non-CSV data value). So, the
> >solution we came up with was to require \. to appear alone on a line in
> >CSV mode for it to be treated as end-of-copy.
> >
>
> According to the docs, that's the way to specify EOD in both text and
> CSV mode:
>
> End of data can be represented by a single line containing just
> backslash-period.

Right, but in non-CSV mode, we allow \. at the end of any line because
it is unique so I kept that behavior. That is not documented however.

> Your analysis regarding line_buf.len seems correct.
>
> We probably should have a regression test with \. in a CSV field.

Agreed. My test for CSV was simple, just try loading:

x\.
x\.b
\.c

all should load literally, but they fail.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Ben Gould <ben(dot)gould(at)free(dot)fr>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: [BUGS] BUG #2114: (patch) COPY FROM ... end of copy marker
Date: 2005-12-27 18:20:58
Message-ID: 200512271820.jBRIKwU22101@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-patches

Bruce Momjian wrote:
> Andrew Dunstan wrote:
> >
> >
> > Bruce Momjian wrote:
> >
> > > The big problem is that \. is also a valid
> > >CSV data value (though not a valid non-CSV data value). So, the
> > >solution we came up with was to require \. to appear alone on a line in
> > >CSV mode for it to be treated as end-of-copy.
> > >
> >
> > According to the docs, that's the way to specify EOD in both text and
> > CSV mode:
> >
> > End of data can be represented by a single line containing just
> > backslash-period.
>
> Right, but in non-CSV mode, we allow \. at the end of any line because
> it is unique so I kept that behavior. That is not documented however.
>
> > Your analysis regarding line_buf.len seems correct.
> >
> > We probably should have a regression test with \. in a CSV field.
>
> Agreed. My test for CSV was simple, just try loading:
>
> x\.
> x\.b
> \.c
>
> all should load literally, but they fail.

OK, original patch applied to HEAD and smaller version to 8.1.X, and
regression test added, now attached.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073

Attachment Content-Type Size
unknown_filename text/plain 1.3 KB

From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Bruce Momjian" <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "Ben Gould" <ben(dot)gould(at)free(dot)fr>, "PostgreSQL-patches" <pgsql-patches(at)postgresql(dot)org>
Subject: Re: [BUGS] BUG #2114: (patch) COPY FROM ... end of
Date: 2005-12-28 03:25:53
Message-ID: BFD745C1.19962%llonergan@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-patches

Bruce,

On 12/27/05 10:20 AM, "Bruce Momjian" <pgman(at)candle(dot)pha(dot)pa(dot)us> wrote:

> OK, original patch applied to HEAD and smaller version to 8.1.X, and
> regression test added, now attached.

Great, good catch.

Have you tested performance, before and after?

The only good way to test performance is using a fast enough I/O subsystem
that you are CPU-bound, which means >60MB/s of write speed.

I'd be happy to get you an account on one.

- Luke


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Luke Lonergan <llonergan(at)greenplum(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Ben Gould <ben(dot)gould(at)free(dot)fr>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: [BUGS] BUG #2114: (patch) COPY FROM ... end of
Date: 2005-12-28 03:27:57
Message-ID: 200512280327.jBS3RvC27705@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-patches

Luke Lonergan wrote:
> Bruce,
>
> On 12/27/05 10:20 AM, "Bruce Momjian" <pgman(at)candle(dot)pha(dot)pa(dot)us> wrote:
>
> > OK, original patch applied to HEAD and smaller version to 8.1.X, and
> > regression test added, now attached.
>
> Great, good catch.
>
> Have you tested performance, before and after?
>
> The only good way to test performance is using a fast enough I/O subsystem
> that you are CPU-bound, which means >60MB/s of write speed.
>
> I'd be happy to get you an account on one.

I don't need to test performance because it is the same code, just with
macros and the two functions merged. I do have an optimization for that
loop but I saw no improvement so I didn't apply it. It was basically to
advance the pointer in a tight look just checking for \r, \n, and \\,
but it seems the larger loop isn't much slower than a tight one.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073