Re: Nasty, propagating POLA violation in COPY CSV HEADER

Lists: pgsql-hackers
From: David Fetter <david(at)fetter(dot)org>
To: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Nasty, propagating POLA violation in COPY CSV HEADER
Date: 2012-06-20 15:02:54
Message-ID: 20120620150254.GA11635@fetter.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Folks,

A co-worker filed a bug against file_fdw where the columns in a
FOREIGN TABLE were scrambled on SELECT. It turned out that this comes
from the (yes, it's documented, but since it's documented in a place
not obviously linked to the bug, it's pretty useless) "feature" of
COPY CSV HEADER whereby the header line is totally ignored in COPY
OUT.

Rather than being totally ignored in the COPY OUT (CSV HEADER) case,
the header line in should be parsed to establish which columns are
where and rearranging the output if needed.

I'm proposing to make the code change here:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/commands/copy.c;h=98bcb2fcf3370c72b0f0a7c0df76ebe4512e9ab0;hb=refs/heads/master#l2436

and a suitable doc change that talks about reading the header only for
the purpose of matching column names to columns, and throwing away the
output as before.

What say?

Cheers,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: David Fetter <david(at)fetter(dot)org>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Nasty, propagating POLA violation in COPY CSV HEADER
Date: 2012-06-20 15:47:14
Message-ID: 27081.1340207234@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

David Fetter <david(at)fetter(dot)org> writes:
> Rather than being totally ignored in the COPY OUT (CSV HEADER) case,
> the header line in should be parsed to establish which columns are
> where and rearranging the output if needed.

This is not "fix a POLA violation". This is a non-backwards-compatible
new feature, which would have to have a switch to turn it on.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: David Fetter <david(at)fetter(dot)org>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Nasty, propagating POLA violation in COPY CSV HEADER
Date: 2012-06-20 15:48:21
Message-ID: 4FE1F0C5.70704@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 06/20/2012 11:02 AM, David Fetter wrote:
> Folks,
>
> A co-worker filed a bug against file_fdw where the columns in a
> FOREIGN TABLE were scrambled on SELECT. It turned out that this comes
> from the (yes, it's documented, but since it's documented in a place
> not obviously linked to the bug, it's pretty useless) "feature" of
> COPY CSV HEADER whereby the header line is totally ignored in COPY
> OUT.
>
> Rather than being totally ignored in the COPY OUT (CSV HEADER) case,
> the header line in should be parsed to establish which columns are
> where and rearranging the output if needed.
>
> I'm proposing to make the code change here:
>
> http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/commands/copy.c;h=98bcb2fcf3370c72b0f0a7c0df76ebe4512e9ab0;hb=refs/heads/master#l2436
>
> and a suitable doc change that talks about reading the header only for
> the purpose of matching column names to columns, and throwing away the
> output as before.
>
> What say?
>

First you are talking about COPY IN, not COPY OUT, surely.

This is not a bug, it is documented in exactly the place that all other
COPY options are documented. The file_fdw page refers the reader to the
COPY docs for details. Unless you want us to duplicate the entire COPY
docs in the file_fdw page this seems entirely reasonable.

The current behaviour was discussed at some length back when we
implemented the HEADER feature, IIRC, and is quite intentional. I don't
think we should alter the current behaviour, as plenty of people rely on
it, some to my certain knowledge. I do see a reasonable case for adding
a new behaviour which takes notice of the header line, although it's
likely to have plenty of wrinkles.

Reordering columns like you suggest might well have a significant impact
on COPY performance, BTW. Also note that I created the file_text_array
FDW precisely for people who want to be able to cherry pick and reorder
columns. See <https://github.com/adunstan/file_text_array_fdw>

cheers

andrew


From: David Fetter <david(at)fetter(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Nasty, propagating POLA violation in COPY CSV HEADER
Date: 2012-06-20 16:09:19
Message-ID: 20120620160919.GB11635@fetter.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Jun 20, 2012 at 11:47:14AM -0400, Tom Lane wrote:
> David Fetter <david(at)fetter(dot)org> writes:
> > Rather than being totally ignored in the COPY OUT (CSV HEADER)
> > case, the header line in should be parsed to establish which
> > columns are where and rearranging the output if needed.
>
> This is not "fix a POLA violation". This is a
> non-backwards-compatible new feature, which would have to have a
> switch to turn it on.

OK, new proposal:

COPY FROM (Thanks, Andrew! Must not post while asleep...) should have
an option which requires that HEADER be enabled and mandates that the
column names in the header match the columns coming in.

Has someone got a better name for this option than
KEEP_HEADER_COLUMN_NAMES?

Cheers,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: David Fetter <david(at)fetter(dot)org>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Nasty, propagating POLA violation in COPY CSV HEADER
Date: 2012-06-20 16:18:45
Message-ID: 27682.1340209125@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

David Fetter <david(at)fetter(dot)org> writes:
> OK, new proposal:

> COPY FROM (Thanks, Andrew! Must not post while asleep...) should have
> an option which requires that HEADER be enabled and mandates that the
> column names in the header match the columns coming in.

> Has someone got a better name for this option than
> KEEP_HEADER_COLUMN_NAMES?

Well, if it's just checking that the list matches, then maybe
CHECK_HEADER would do.

In your original proposal, I was rather wondering what should happen if
the incoming file didn't have the same set of columns called out in the
COPY command's column list. (That is, while *rearranging* the columns
might be thought non-astonishing, I'm less convinced about a copy
operation that inserts or defaults columns differently from what the
command said should happen.) If we're just checking for a match, that
question goes away.

regards, tom lane


From: David Fetter <david(at)fetter(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Nasty, propagating POLA violation in COPY CSV HEADER
Date: 2012-06-20 16:43:31
Message-ID: 20120620164331.GD11635@fetter.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Jun 20, 2012 at 12:18:45PM -0400, Tom Lane wrote:
> David Fetter <david(at)fetter(dot)org> writes:
> > OK, new proposal:
>
> > COPY FROM (Thanks, Andrew! Must not post while asleep...) should have
> > an option which requires that HEADER be enabled and mandates that the
> > column names in the header match the columns coming in.
>
> > Has someone got a better name for this option than
> > KEEP_HEADER_COLUMN_NAMES?
>
> Well, if it's just checking that the list matches, then maybe
> CHECK_HEADER would do.

OK

> In your original proposal, I was rather wondering what should happen
> if the incoming file didn't have the same set of columns called out
> in the COPY command's column list. (That is, while *rearranging*
> the columns might be thought non-astonishing, I'm less convinced
> about a copy operation that inserts or defaults columns differently
> from what the command said should happen.) If we're just checking
> for a match, that question goes away.

Let's imagine we have a table foo(id serial, t text, n numeric) and a
CSV file foo.csv with headers (n, t). Just to emphasize, the column
ordering is different in the places where they match.

Would

COPY foo (t,n) FROM '/path/to/foo.csv' WITH (CSV, HEADER, CHECK_HEADER)

now "just work" (possibly with some performance penalty) and

COPY foo (t,n) FROM '/path/to/foo.csv' WITH (CSV, HEADER)

only "work," i.e. import gobbledygook, in the case where every t entry
in foo.csv happened to be castable to NUMERIC?

Cheers,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: David Fetter <david(at)fetter(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Nasty, propagating POLA violation in COPY CSV HEADER
Date: 2012-06-20 16:50:46
Message-ID: 1340210812-sup-8293@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Excerpts from David Fetter's message of mié jun 20 12:43:31 -0400 2012:
> On Wed, Jun 20, 2012 at 12:18:45PM -0400, Tom Lane wrote:

> > In your original proposal, I was rather wondering what should happen
> > if the incoming file didn't have the same set of columns called out
> > in the COPY command's column list. (That is, while *rearranging*
> > the columns might be thought non-astonishing, I'm less convinced
> > about a copy operation that inserts or defaults columns differently
> > from what the command said should happen.) If we're just checking
> > for a match, that question goes away.
>
> Let's imagine we have a table foo(id serial, t text, n numeric) and a
> CSV file foo.csv with headers (n, t). Just to emphasize, the column
> ordering is different in the places where they match.

For the record, IIRC we had one person trying to do this in the spanish
list not long ago.

Another related case: you have a file with headers and columns (n, t, x, y, z)
but your table only has n and t. How would you tell COPY to discard the
junk columns? Currently it just complains that they are there.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: David Fetter <david(at)fetter(dot)org>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Nasty, propagating POLA violation in COPY CSV HEADER
Date: 2012-06-20 16:54:23
Message-ID: 4FE2003F.5040805@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 06/20/2012 12:18 PM, Tom Lane wrote:
> David Fetter<david(at)fetter(dot)org> writes:
>> OK, new proposal:
>> COPY FROM (Thanks, Andrew! Must not post while asleep...) should have
>> an option which requires that HEADER be enabled and mandates that the
>> column names in the header match the columns coming in.
>> Has someone got a better name for this option than
>> KEEP_HEADER_COLUMN_NAMES?
> Well, if it's just checking that the list matches, then maybe
> CHECK_HEADER would do.
>
> In your original proposal, I was rather wondering what should happen if
> the incoming file didn't have the same set of columns called out in the
> COPY command's column list. (That is, while *rearranging* the columns
> might be thought non-astonishing, I'm less convinced about a copy
> operation that inserts or defaults columns differently from what the
> command said should happen.) If we're just checking for a match, that
> question goes away.
>
>

In the past people have asked me to have copy use the CSV header line in
place of supplying a column list in the COPY command. I can certainly
see some utility in that, and I think it might achieve what David wants.
Using that scenario it would be an error to supply an explicit column
list and also use the header line. But then I don't think CHECK_HEADER
would be the right name for the option. In any case, specifying a name
before we settle on an exact behaviour seems wrong :-)

cheers

andrew


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: David Fetter <david(at)fetter(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Nasty, propagating POLA violation in COPY CSV HEADER
Date: 2012-06-20 16:56:52
Message-ID: 4FE200D4.9050706@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 06/20/2012 12:50 PM, Alvaro Herrera wrote:
> Another related case: you have a file with headers and columns (n, t,
> x, y, z) but your table only has n and t. How would you tell COPY to
> discard the junk columns? Currently it just complains that they are
> there.

That's one of the main use cases for file_text_array_fdw.

cheers

andrew


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Nasty, propagating POLA violation in COPY CSV HEADER
Date: 2012-06-20 17:06:10
Message-ID: 4FE20302.3080909@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


> In the past people have asked me to have copy use the CSV header line in
> place of supplying a column list in the COPY command. I can certainly
> see some utility in that, and I think it might achieve what David wants.
> Using that scenario it would be an error to supply an explicit column
> list and also use the header line. But then I don't think CHECK_HEADER
> would be the right name for the option. In any case, specifying a name
> before we settle on an exact behaviour seems wrong :-)

Actually, I can see three valid and valuable-to-users behaviors here:

1) current behavior, header line is ignored.

2) CHECK HEADER: a column list is supplied, but a "check header" flag is
set. If the column list and header list don't match *exactly*, you get
an error.

3) USE HEADER: no column list is supplied, but a "use header" flag is
set. A column list is created to match the columns from the CSV header.
Of necessity, this will consist of all or some of the columns in the
table. If columns are supplied which are not in the table, then you get
an error (as well as if columns are missing which are NOT NULL, as you
get at present).

(2) is more valuable to people who want to check data integrity
rigorously, and test for unexpected API changes. (3) is more valuable
for regular users who want CSV import to "just work". (1) is valuable
for backwards compatibility, and for cases where the CSV header was
generated by another program (Excel?) so the column names don't match.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: "Marc Mamin" <M(dot)Mamin(at)intershop(dot)de>
To: "Josh Berkus" <josh(at)agliodbs(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Nasty, propagating POLA violation in COPY CSV HEADER
Date: 2012-06-20 17:35:25
Message-ID: C4DAC901169B624F933534A26ED7DF310F96B014@JENMAIL01.ad.intershop.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> -----Ursprüngliche Nachricht-----
> Von: pgsql-hackers-owner(at)postgresql(dot)org im Auftrag von Josh Berkus
> Gesendet: Mi 6/20/2012 7:06
> An: pgsql-hackers(at)postgresql(dot)org
> Betreff: Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER

...

> (1) is valuable
> for backwards compatibility, and for cases where the CSV header was
> generated by another program (Excel?) so the column names don't match.

4) MAP_HEADER ('column 1'-> 'col_1', 'Date' -> 'input_date' ...)

to cover the case when column names do not match.

my 2 pences,

Marc Mamin


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Marc Mamin <M(dot)Mamin(at)intershop(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Nasty, propagating POLA violation in COPY CSV HEADER
Date: 2012-06-20 17:40:06
Message-ID: 4FE20AF6.1070702@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


>
> 4) MAP_HEADER ('column 1'-> 'col_1', 'Date' -> 'input_date' ...)
>
> to cover the case when column names do not match.

Personally, I think that's going way beyond what we want to do with
COPY. At that point, check out the CSV-array FDW.

Of course, if someone writes a WIP patch which implements the above, we
can evaluate it then.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: David Fetter <david(at)fetter(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Nasty, propagating POLA violation in COPY CSV HEADER
Date: 2012-06-20 18:20:21
Message-ID: 1340216405-sup-3419@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Excerpts from Andrew Dunstan's message of mié jun 20 12:56:52 -0400 2012:
>
> On 06/20/2012 12:50 PM, Alvaro Herrera wrote:
> > Another related case: you have a file with headers and columns (n, t,
> > x, y, z) but your table only has n and t. How would you tell COPY to
> > discard the junk columns? Currently it just complains that they are
> > there.
>
> That's one of the main use cases for file_text_array_fdw.

Ah, great, thanks.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Marc Mamin <M(dot)Mamin(at)intershop(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Nasty, propagating POLA violation in COPY CSV HEADER
Date: 2012-06-20 19:22:48
Message-ID: 3037.1340220168@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Josh Berkus <josh(at)agliodbs(dot)com> writes:
>>
>> 4) MAP_HEADER ('column 1'-> 'col_1', 'Date' -> 'input_date' ...)
>>
>> to cover the case when column names do not match.

> Personally, I think that's going way beyond what we want to do with
> COPY.

I agree --- if you know that much about the incoming data, you might as
well just specify the correct column list in the COPY command. This
would only have use if you know somebody used weird column names, but
not the order they put the columns in; which seems like a thin use-case.

The three options Josh enumerated seem reasonably sane to me.

regards, tom lane