WIP patch: add (PRE|POST)PROCESSOR options to COPY

Lists: pgsql-hackers
From: "Etsuro Fujita" <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-09-13 09:13:26
Message-ID: 002101cd9190$081c4140$1854c3c0$@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I'd like to add the following options to the SQL COPY command and the psql \copy
instruction:

* PREPROCESSOR: Specifies the user-supplied program for COPY IN. The data
from an input file is preprocessed by the program before the data is loaded into
a postgres table.
* POSTPROCESSOR: Specifies the user-supplied program for COPY OUT. The data
from a postgres table is postprocessed by the program before the data is stored
in an output file.

These options can be specified only when an input or output file is specified.

These options allow to move data between postgres tables and e.g., compressed
files or files on a distributed file system such as Hadoop HDFS. Former
examples are shown below:

$ echo '/bin/gunzip -c $1' > decompress.sh
$ chmod +x decompress.sh

postgres=# COPY foo FROM '/home/pgsql/foo.csv.gz' WITH (format 'csv',
preprocessor '/home/pgsql/decompress.sh');

$ echo '/bin/gzip > $1' > compress.sh
$ chmod +x compress.sh

postgres=# COPY bar TO '/home/pgsql/bar.csv.gz' WITH (format 'csv',
postprocessor '/home/pgsql/compress.sh');

Attached is a WIP patch. Comments and questions are welcome.

(By using these options, I think it's also possible to develop a variant of
file_fdw, for example a compressed file wrapper and Hadoop HDFS wrapper.)

Thanks,

Best regards,
Etsuro Fujita

Attachment Content-Type Size
copy_options.patch application/octet-stream 20.5 KB

From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-09-13 13:23:13
Message-ID: 1347542458-sup-8574@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Excerpts from Etsuro Fujita's message of jue sep 13 06:13:26 -0300 2012:
> I'd like to add the following options to the SQL COPY command and the psql \copy
> instruction:
>
> * PREPROCESSOR: Specifies the user-supplied program for COPY IN. The data
> from an input file is preprocessed by the program before the data is loaded into
> a postgres table.
> * POSTPROCESSOR: Specifies the user-supplied program for COPY OUT. The data
> from a postgres table is postprocessed by the program before the data is stored
> in an output file.

External programs? I don't like the sound of that; there all kinds of
open questions, security concerns, and process management problems.
What if the pre- and postprocessors were functions instead?

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-09-13 13:25:38
Message-ID: CAFj8pRCXfKX8z-8o6LWjOdzq8b=KWABEfD+fn7S95BEbadPw4w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2012/9/13 Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>:
> Excerpts from Etsuro Fujita's message of jue sep 13 06:13:26 -0300 2012:
>> I'd like to add the following options to the SQL COPY command and the psql \copy
>> instruction:
>>
>> * PREPROCESSOR: Specifies the user-supplied program for COPY IN. The data
>> from an input file is preprocessed by the program before the data is loaded into
>> a postgres table.
>> * POSTPROCESSOR: Specifies the user-supplied program for COPY OUT. The data
>> from a postgres table is postprocessed by the program before the data is stored
>> in an output file.
>
> External programs? I don't like the sound of that; there all kinds of
> open questions, security concerns, and process management problems.
> What if the pre- and postprocessors were functions instead?

+1

this can be solved via pipe and outer processes

Pavel

>
> --
> Álvaro Herrera http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Etsuro Fujita" <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-09-13 14:25:21
Message-ID: 26216.1347546321@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Etsuro Fujita" <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp> writes:
> I'd like to add the following options to the SQL COPY command and the psql \copy
> instruction:

> * PREPROCESSOR: Specifies the user-supplied program for COPY IN. The data
> from an input file is preprocessed by the program before the data is loaded into
> a postgres table.
> * POSTPROCESSOR: Specifies the user-supplied program for COPY OUT. The data
> from a postgres table is postprocessed by the program before the data is stored
> in an output file.

The proposed patch causes the external processor programs to execute
with the privileges of the database server, which seems like a pretty
horrid idea. At the very least this would imply limiting use of the
feature to superusers, which greatly restricts its use-case.

I think it would be a lot better if this were designed so that the
processor programs executed on client side. Which would probably make
it not a COPY patch at all, but something in psql.

regards, tom lane


From: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Etsuro Fujita" <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-09-13 17:20:40
Message-ID: m2pq5pu9ef.fsf@2ndQuadrant.fr
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> I think it would be a lot better if this were designed so that the
> processor programs executed on client side. Which would probably make
> it not a COPY patch at all, but something in psql.

And pgloader, which already has a part of that feature with the per
column reformating facility.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-09-13 17:26:49
Message-ID: 50521759.5070900@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On 09/13/2012 01:20 PM, Dimitri Fontaine wrote:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>> I think it would be a lot better if this were designed so that the
>> processor programs executed on client side. Which would probably make
>> it not a COPY patch at all, but something in psql.
> And pgloader, which already has a part of that feature with the per
> column reformating facility.
>

Yeah, I'd be inclined to say that pre/post processing of this kind is
really a job for specialized clients.

cheers

andrew


From: "Etsuro Fujita" <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>
To: "'Andrew Dunstan'" <andrew(at)dunslane(dot)net>, "'Dimitri Fontaine'" <dimitri(at)2ndQuadrant(dot)fr>
Cc: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <pgsql-hackers(at)postgresql(dot)org>, "'Alvaro Herrera'" <alvherre(at)2ndquadrant(dot)com>, "'Pavel Stehule'" <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-09-14 07:48:50
Message-ID: 001801cd924d$6115fb20$2341f160$@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

OK I will redesign the function.

Thanks everyone for the advice!

Best regards,
Etsuro Fujita

> -----Original Message-----
> From: Andrew Dunstan [mailto:andrew(at)dunslane(dot)net]
> Sent: Friday, September 14, 2012 2:27 AM
> To: Dimitri Fontaine
> Cc: Tom Lane; Etsuro Fujita; pgsql-hackers(at)postgresql(dot)org
> Subject: Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY
>
>
> On 09/13/2012 01:20 PM, Dimitri Fontaine wrote:
> > Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> >> I think it would be a lot better if this were designed so that the
> >> processor programs executed on client side. Which would probably make
> >> it not a COPY patch at all, but something in psql.
> > And pgloader, which already has a part of that feature with the per
> > column reformating facility.
> >
>
> Yeah, I'd be inclined to say that pre/post processing of this kind is
> really a job for specialized clients.
>
> cheers
>
> andrew
>


From: Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-09-18 07:07:05
Message-ID: 50581D99.3030101@ringerc.id.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 09/13/2012 10:25 PM, Tom Lane wrote:
> I think it would be a lot better if this were designed so that the
> processor programs executed on client side. Which would probably make
> it not a COPY patch at all, but something in psql.

Either that, or allow the pre- and post- processors to be programs
written in a (possibly trusted) PL.

I can't say I really see the point though, when it's easy to just filter
the csv through a pipeline.

--
Craig Ringer


From: "Etsuro Fujita" <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>
To: "'Craig Ringer'" <ringerc(at)ringerc(dot)id(dot)au>, "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-09-18 08:52:09
Message-ID: 009101cd957a$e37189b0$aa549d10$@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> From: Craig Ringer [mailto:ringerc(at)ringerc(dot)id(dot)au]

> On 09/13/2012 10:25 PM, Tom Lane wrote:
> > I think it would be a lot better if this were designed so that the
> > processor programs executed on client side. Which would probably make
> > it not a COPY patch at all, but something in psql.

Maybe my explanation was insufficient. Let me add one thing to my earlier
explanation. The submitted patch allows the psql \copy instruction to be
executed like:

$ echo '/bin/gunzip -c $1' > decompress.sh
$ chmod +x decompress.sh

postgres=# \copy foo FROM '/home/pgsql/foo.csv.gz' WITH (format 'csv',
preprocessor '/home/pgsql/decompress.sh')

In this example, command "/home/pgsql/decompress.sh /home/pgsql/foo.csv.gz" is
executed on client side, by using popen(), and command "COPY foo FROM STDIN WITH
(format 'csv')" is sent to backend. I apologize for not providing you with
enough explanation.

> Either that, or allow the pre- and post- processors to be programs
> written in a (possibly trusted) PL.

I would like to add the hooks not only for the psql \copy instrucntion, but also
for the SQL COPY command, because I think the hooks for the SQL COPY command
would allow to realize variants of contrib/file_fdw such as compressed file FDW
and Hadoop HDFS FDW, etc., which I think would be useful especially for DWH
environments.

Thanks,

Best regards,
Etsuro Fujita


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Etsuro Fujita" <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: "'Craig Ringer'" <ringerc(at)ringerc(dot)id(dot)au>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-09-18 13:59:37
Message-ID: 14768.1347976777@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Etsuro Fujita" <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp> writes:
> Maybe my explanation was insufficient. Let me add one thing to my earlier
> explanation. The submitted patch allows the psql \copy instruction to be
> executed like:

> $ echo '/bin/gunzip -c $1' > decompress.sh
> $ chmod +x decompress.sh

> postgres=# \copy foo FROM '/home/pgsql/foo.csv.gz' WITH (format 'csv',
> preprocessor '/home/pgsql/decompress.sh')

> In this example, command "/home/pgsql/decompress.sh /home/pgsql/foo.csv.gz" is
> executed on client side, by using popen(), and command "COPY foo FROM STDIN WITH
> (format 'csv')" is sent to backend. I apologize for not providing you with
> enough explanation.

Well, in that case, you've got not only an explanation problem but a
syntax problem, because that syntax is utterly misleading. Anybody
looking at it would think that the "format" option is one of the options
being sent to the backend. The code required to pull it out of there
has got to be grossly overcomplicated (and likely bugprone), too.

I think it would be better to present this as something like

\copy foo from '/home/pgsql/decompress.sh /home/pgsql/foo.csv.gz |' with format 'csv'

which would cue any reasonably Unix-savvy person that what's happening
is a popen on the client side. It'd probably be a whole lot less
complicated to implement, too.

regards, tom lane


From: "Etsuro Fujita" <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>
To: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "'Craig Ringer'" <ringerc(at)ringerc(dot)id(dot)au>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-09-19 02:41:00
Message-ID: 001001cd9610$343cb7d0$9cb62770$@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> From: Tom Lane [mailto:tgl(at)sss(dot)pgh(dot)pa(dot)us]

> "Etsuro Fujita" <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp> writes:
> > Maybe my explanation was insufficient. Let me add one thing to my earlier
> > explanation. The submitted patch allows the psql \copy instruction to be
> > executed like:
>
> > $ echo '/bin/gunzip -c $1' > decompress.sh
> > $ chmod +x decompress.sh
>
> > postgres=# \copy foo FROM '/home/pgsql/foo.csv.gz' WITH (format 'csv',
> > preprocessor '/home/pgsql/decompress.sh')

> Well, in that case, you've got not only an explanation problem but a
> syntax problem, because that syntax is utterly misleading. Anybody
> looking at it would think that the "format" option is one of the options
> being sent to the backend. The code required to pull it out of there
> has got to be grossly overcomplicated (and likely bugprone), too.
>
> I think it would be better to present this as something like
>
> \copy foo from '/home/pgsql/decompress.sh /home/pgsql/foo.csv.gz |' with
> format 'csv'
>
> which would cue any reasonably Unix-savvy person that what's happening
> is a popen on the client side. It'd probably be a whole lot less
> complicated to implement, too.

Great!

I have a question. I think it would be also better to extend the syntax for the
SQL COPY command in the same way, ie,

COPY foo from '/home/pgsql/decompress.sh /home/pgsql/foo.csv.gz |' with format
'csv'

Is this okay?

Thanks,

Best regards,
Etsuro Fujita


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Etsuro Fujita" <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: "'Craig Ringer'" <ringerc(at)ringerc(dot)id(dot)au>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-09-19 03:32:48
Message-ID: 12609.1348025568@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Etsuro Fujita" <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp> writes:
>> From: Tom Lane [mailto:tgl(at)sss(dot)pgh(dot)pa(dot)us]
>> I think it would be better to present this as something like
>> \copy foo from '/home/pgsql/decompress.sh /home/pgsql/foo.csv.gz |' with
>> format 'csv'

> I have a question. I think it would be also better to extend the syntax for the
> SQL COPY command in the same way, ie,
> COPY foo from '/home/pgsql/decompress.sh /home/pgsql/foo.csv.gz |' with format
> 'csv'

Yeah, sure --- that case is already superuser-only, so why not give it
the option of being a popen instead of just fopen. My objection was
only that it sounded like you were providing *only* the ability to run
the external processors on the server side. Providing popen on both
sides seems completely sensible.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Etsuro Fujita" <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: "'Craig Ringer'" <ringerc(at)ringerc(dot)id(dot)au>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-09-19 03:38:33
Message-ID: 12731.1348025913@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> "Etsuro Fujita" <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp> writes:
>> I have a question. I think it would be also better to extend the syntax for the
>> SQL COPY command in the same way, ie,
>> COPY foo from '/home/pgsql/decompress.sh /home/pgsql/foo.csv.gz |' with format
>> 'csv'

> Yeah, sure --- that case is already superuser-only, so why not give it
> the option of being a popen instead of just fopen.

BTW, one thought that comes to mind is that such an operation is
extremely likely to fail under environments such as SELinux. That's
not necessarily a reason not to do it, but we should be wary of
promising that it will work everywhere. Probably a documentation note
about this would be enough.

regards, tom lane


From: "Etsuro Fujita" <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>
To: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "'Craig Ringer'" <ringerc(at)ringerc(dot)id(dot)au>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-09-19 06:05:55
Message-ID: 001601cd962c$d4e81780$7eb84680$@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> From: Tom Lane [mailto:tgl(at)sss(dot)pgh(dot)pa(dot)us]

> I wrote:
> > "Etsuro Fujita" <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp> writes:
> >> I have a question. I think it would be also better to extend the syntax
> >> for the SQL COPY command in the same way, ie,
> >> COPY foo from '/home/pgsql/decompress.sh /home/pgsql/foo.csv.gz |' with
> >> format 'csv'
>
> > Yeah, sure --- that case is already superuser-only, so why not give it
> > the option of being a popen instead of just fopen.
>
> BTW, one thought that comes to mind is that such an operation is
> extremely likely to fail under environments such as SELinux. That's
> not necessarily a reason not to do it, but we should be wary of
> promising that it will work everywhere. Probably a documentation note
> about this would be enough.

OK I'll revise the patch.

Thank you for your advice!

Best regards,
Etsuro Fujita


From: "Etsuro Fujita" <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>
To: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "'Craig Ringer'" <ringerc(at)ringerc(dot)id(dot)au>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-11-14 11:30:49
Message-ID: 001d01cdc25b$7eee6ea0$7ccb4be0$@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> > From: Tom Lane [mailto:tgl(at)sss(dot)pgh(dot)pa(dot)us]
>
> > I wrote:
> > > "Etsuro Fujita" <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp> writes:
> > >> I have a question. I think it would be also better to extend the syntax
> > >> for the SQL COPY command in the same way, ie,
> > >> COPY foo from '/home/pgsql/decompress.sh /home/pgsql/foo.csv.gz |' with
> > >> format 'csv'
> >
> > > Yeah, sure --- that case is already superuser-only, so why not give it
> > > the option of being a popen instead of just fopen.
> >
> > BTW, one thought that comes to mind is that such an operation is
> > extremely likely to fail under environments such as SELinux. That's
> > not necessarily a reason not to do it, but we should be wary of
> > promising that it will work everywhere. Probably a documentation note
> > about this would be enough.
>
> OK I'll revise the patch.

I've revised the patch. In this version a user can specify hooks for pre- and
post-processor executables for COPY and \copy in the follwoing way:

$ echo '/bin/gunzip -c $1' > decompress.sh
$ chmod +x decompress.sh

In the case of the COPY command,

postgres=# COPY foo FROM '/home/pgsql/decompress.sh /home/pgsql/foo.csv.gz'
WITH (format 'csv');

Also, in the case of the \copy instruction,

postgres=# \copy foo from '/home/pgsql/decompress.sh /home/pgsql/foo.csv.gz'
with (format 'csv')

As shown in the example above, I've assumed that the syntax for this option for
e.g., the COPY command is:

COPY table_name FROM 'progname filename' WITH ...
COPY table_name TO 'progname filename' WITH ...

Here, progname for COPY IN is the user-supplied program that takes filename as
its argument and that writes on standard output. Also, prgoname for COPY OUT is
the user-supplied program that reads standard input and writes to filename taken
as its argument. This makes simple the identification and verification of
progname and filename.

Todo:
* Documentation including documentation note about the limitation for
environments such as SELinux mentioned by Tom.
* More test

Any comments and suggestions are welcomed.

Thanks,

Best regards,
Etsuro Fujita

Attachment Content-Type Size
copy-popen-20121114.patch application/octet-stream 14.5 KB

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-11-14 11:41:20
Message-ID: CA+U5nMLTWhyis-V8eBrZKbuhE3HW=RsQc4rshCcVa_JcU6PNUQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 13 September 2012 10:13, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> I'd like to add the following options to the SQL COPY command and the psql \copy
> instruction:
>
> * PREPROCESSOR: Specifies the user-supplied program for COPY IN. The data
> from an input file is preprocessed by the program before the data is loaded into
> a postgres table.
> * POSTPROCESSOR: Specifies the user-supplied program for COPY OUT. The data
> from a postgres table is postprocessed by the program before the data is stored
> in an output file.
>
> These options can be specified only when an input or output file is specified.
>
> These options allow to move data between postgres tables and e.g., compressed
> files or files on a distributed file system such as Hadoop HDFS.

These options look pretty strange to me and I'm not sure they are a good idea.

If we want to read other/complex data, we have Foreign Data Wrappers.

What I think we need is COPY FROM (SELECT....). COPY (query) TO
already exists, so this is just the same thing in the other direction.
Once we have a SELECT statement in both directions we can add any user
defined transforms we wish implemented as database functions.

At present we only support INSERT SELECT ... FROM FDW
which means all the optimisations we've put into COPY are useless with
FDWs. So we need a way to speed up loads from other data sources.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-11-14 15:09:56
Message-ID: CAHGQGwHer4czY8Jt0STCGJ5GBNGTTSNBwu-7+K=+U6uK8P8GPw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Nov 14, 2012 at 8:30 PM, Etsuro Fujita
<fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> I wrote:
>> > From: Tom Lane [mailto:tgl(at)sss(dot)pgh(dot)pa(dot)us]
>>
>> > I wrote:
>> > > "Etsuro Fujita" <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp> writes:
>> > >> I have a question. I think it would be also better to extend the syntax
>> > >> for the SQL COPY command in the same way, ie,
>> > >> COPY foo from '/home/pgsql/decompress.sh /home/pgsql/foo.csv.gz |' with
>> > >> format 'csv'
>> >
>> > > Yeah, sure --- that case is already superuser-only, so why not give it
>> > > the option of being a popen instead of just fopen.
>> >
>> > BTW, one thought that comes to mind is that such an operation is
>> > extremely likely to fail under environments such as SELinux. That's
>> > not necessarily a reason not to do it, but we should be wary of
>> > promising that it will work everywhere. Probably a documentation note
>> > about this would be enough.
>>
>> OK I'll revise the patch.
>
> I've revised the patch. In this version a user can specify hooks for pre- and
> post-processor executables for COPY and \copy in the follwoing way:
>
> $ echo '/bin/gunzip -c $1' > decompress.sh
> $ chmod +x decompress.sh
>
> In the case of the COPY command,
>
> postgres=# COPY foo FROM '/home/pgsql/decompress.sh /home/pgsql/foo.csv.gz'
> WITH (format 'csv');
>
> Also, in the case of the \copy instruction,
>
> postgres=# \copy foo from '/home/pgsql/decompress.sh /home/pgsql/foo.csv.gz'
> with (format 'csv')
>
> As shown in the example above, I've assumed that the syntax for this option for
> e.g., the COPY command is:
>
> COPY table_name FROM 'progname filename' WITH ...
> COPY table_name TO 'progname filename' WITH ...
>
> Here, progname for COPY IN is the user-supplied program that takes filename as
> its argument and that writes on standard output.

What about further extending the COPY IN syntax to the following?

COPY table_name FROM 'progname [ option, ... ]' WITH ...

I'd just like to execute

COPY vmstat_table FROM 'vmstat' WITH ...

> Also, prgoname for COPY OUT is
> the user-supplied program that reads standard input and writes to filename taken
> as its argument. This makes simple the identification and verification of
> progname and filename.
>
> Todo:
> * Documentation including documentation note about the limitation for
> environments such as SELinux mentioned by Tom.
> * More test
>
> Any comments and suggestions are welcomed.

Isn't it dangerous to allow a user to execute external program in
server side via SQL?

Regards,

--
Fujii Masao


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-11-14 15:31:25
Message-ID: CA+U5nML+qKX=Z=5WXNBs0mKdcNBNPWrmdDZB2hmMWA3oHFULbA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 14 November 2012 15:09, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:

>> Here, progname for COPY IN is the user-supplied program that takes filename as
>> its argument and that writes on standard output.
>
> What about further extending the COPY IN syntax to the following?
>
> COPY table_name FROM 'progname [ option, ... ]' WITH ...
>
> I'd just like to execute
>
> COPY vmstat_table FROM 'vmstat' WITH ...

I think we should be using FDWs/SRFs here, not inventing new
syntax/architectures for executing external code, so -1 from me.

We can already do
INSERT table SELECT * FROM fdw;
with any logic for generating data lives inside an FDW or SRF.

If we want it in COPY we can have syntax like this...
COPY table FROM (SELECT * FROM fdw)

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-11-14 15:52:59
Message-ID: CAHGQGwGx75+9+EMaQERXsdanrXFJq3JB6vCMkDsJxY=0aCQ8Lw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Nov 15, 2012 at 12:31 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On 14 November 2012 15:09, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>
>>> Here, progname for COPY IN is the user-supplied program that takes filename as
>>> its argument and that writes on standard output.
>>
>> What about further extending the COPY IN syntax to the following?
>>
>> COPY table_name FROM 'progname [ option, ... ]' WITH ...
>>
>> I'd just like to execute
>>
>> COPY vmstat_table FROM 'vmstat' WITH ...
>
> I think we should be using FDWs/SRFs here, not inventing new
> syntax/architectures for executing external code, so -1 from me.
>
> We can already do
> INSERT table SELECT * FROM fdw;
> with any logic for generating data lives inside an FDW or SRF.
>
> If we want it in COPY we can have syntax like this...
> COPY table FROM (SELECT * FROM fdw)

New syntax looks attractive to me because it's easy to use that.
It's not easy to implement the FDW for the external program which
a user wants to execute.

Of course if someone implements something like any_external_program_fdw,
I would change my mind..

Regards,

--
Fujii Masao


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-11-14 15:55:51
Message-ID: 20121114155551.GD5161@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs escribió:
> On 14 November 2012 15:09, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>
> >> Here, progname for COPY IN is the user-supplied program that takes filename as
> >> its argument and that writes on standard output.
> >
> > What about further extending the COPY IN syntax to the following?
> >
> > COPY table_name FROM 'progname [ option, ... ]' WITH ...
> >
> > I'd just like to execute
> >
> > COPY vmstat_table FROM 'vmstat' WITH ...
>
> I think we should be using FDWs/SRFs here, not inventing new
> syntax/architectures for executing external code, so -1 from me.

Hmm, but then you are forced to write C code, whereas the "external
program" proposal could have you writing a only shell script instead.
So there is some merit to this idea ... though we could have a
"pipe_fdw" that could let you specify an arbitrary program to run.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-11-14 16:20:59
Message-ID: 20295.1352910059@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> writes:
> Simon Riggs escribi:
>> On 14 November 2012 15:09, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>>> Here, progname for COPY IN is the user-supplied program that takes filename as
>>> its argument and that writes on standard output.

>> I think we should be using FDWs/SRFs here, not inventing new
>> syntax/architectures for executing external code, so -1 from me.

> Hmm, but then you are forced to write C code, whereas the "external
> program" proposal could have you writing a only shell script instead.

I disagree with Simon's objection also, because neither reading from
nor writing to an external program is likely to fit the model of
reading/updating a table very well. For instance, there's no good
reason to suppose that reading twice will give the same results. So
force-fitting this usage into the FDW model is not going to work well.

Nor do I really see the argument why a "pipe_fdw" module is cleaner
than a "COPY TO/FROM pipe" feature.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-11-14 16:32:58
Message-ID: 50A3C7BA.7030000@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On 11/14/2012 11:20 AM, Tom Lane wrote:
> I disagree with Simon's objection also, because neither reading from
> nor writing to an external program is likely to fit the model of
> reading/updating a table very well. For instance, there's no good
> reason to suppose that reading twice will give the same results. So
> force-fitting this usage into the FDW model is not going to work well.
>
> Nor do I really see the argument why a "pipe_fdw" module is cleaner
> than a "COPY TO/FROM pipe" feature.
>
>

Yeah, I agree, although the syntax looks a bit unclean.

Maybe something like

COPY foo FROM wherever WITH (FILTER '/path/to/program')

might work better. You'd hook up the source to the filter as its stdin
and read its stdout. Not sure what we'd do for \copy though.

cheers

andrew


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-11-14 16:33:33
Message-ID: CA+U5nMLnAgAndna9FmOSxcoKoPKfOuOwZ+M6S=4kzsJ-mr7WvQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 14 November 2012 16:20, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> writes:
>> Simon Riggs escribió:
>>> On 14 November 2012 15:09, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>>>> Here, progname for COPY IN is the user-supplied program that takes filename as
>>>> its argument and that writes on standard output.
>
>>> I think we should be using FDWs/SRFs here, not inventing new
>>> syntax/architectures for executing external code, so -1 from me.
>
>> Hmm, but then you are forced to write C code, whereas the "external
>> program" proposal could have you writing a only shell script instead.
>
> I disagree with Simon's objection also, because neither reading from
> nor writing to an external program is likely to fit the model of
> reading/updating a table very well. For instance, there's no good
> reason to suppose that reading twice will give the same results. So
> force-fitting this usage into the FDW model is not going to work well.
>
> Nor do I really see the argument why a "pipe_fdw" module is cleaner
> than a "COPY TO/FROM pipe" feature.

Perhaps not cleaner, but we do need

COPY table FROM (SELECT * FROM foo)

So we will then have both ways.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-11-14 16:36:02
Message-ID: CA+U5nM+KYH2UfZHdwiLoKw4p-vh_e-f5aO9x0URw0CZFUuJkcg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 14 November 2012 15:09, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:

> What about further extending the COPY IN syntax to the following?
>
> COPY table_name FROM 'progname [ option, ... ]' WITH ...
>
> I'd just like to execute
>
> COPY vmstat_table FROM 'vmstat' WITH ...

If we go ahead with this, I think it needs additional keyword to
indicate that we will execute the file rather than read from it. I
don't think we should rely on whether the file is executable or not to
determine how we should treat it.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-11-14 16:39:44
Message-ID: 20745.1352911184@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Yeah, I agree, although the syntax looks a bit unclean.

Oh, I had not looked at the syntax closely. I agree, that basically
sucks: it's overcomplicated and under-featured, because you can't
control the actual program command line very conveniently. Nor do I see
a reason to force this into the model of "program filtering a specific
file". What happened to the previous proposal of treating the COPY
target as a pipe specification, ie

COPY table FROM 'some command line |';
COPY table TO '| some command line';

> Not sure what we'd do for \copy though.

Adding a pipe symbol to the target works exactly the same for \copy.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-11-14 16:46:55
Message-ID: 20901.1352911615@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> On 14 November 2012 15:09, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> I'd just like to execute
>> COPY vmstat_table FROM 'vmstat' WITH ...

> If we go ahead with this, I think it needs additional keyword to
> indicate that we will execute the file rather than read from it. I
> don't think we should rely on whether the file is executable or not to
> determine how we should treat it.

Agreed, and there's also the question of passing switches etc to the
program, so the string can't be a bare file name anyway. I proposed
pipe symbols (|) in the string previously, but if you find that too
Unix-centric I suppose we could do

COPY TABLE FROM PROGRAM 'command line';
COPY TABLE TO PROGRAM 'command line';

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-11-14 16:50:56
Message-ID: 50A3CBF0.50205@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On 11/14/2012 11:39 AM, Tom Lane wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>> Yeah, I agree, although the syntax looks a bit unclean.
> Oh, I had not looked at the syntax closely. I agree, that basically
> sucks: it's overcomplicated and under-featured, because you can't
> control the actual program command line very conveniently. Nor do I see
> a reason to force this into the model of "program filtering a specific
> file". What happened to the previous proposal of treating the COPY
> target as a pipe specification, ie
>
> COPY table FROM 'some command line |';
> COPY table TO '| some command line';
>

I'd like to be able to filter STDIN if possible - with this syntax how
is COPY going to know to hook up STDIN to the program?

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-11-14 16:56:57
Message-ID: 21137.1352912217@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> On 11/14/2012 11:39 AM, Tom Lane wrote:
>> What happened to the previous proposal of treating the COPY
>> target as a pipe specification, ie

> I'd like to be able to filter STDIN if possible - with this syntax how
> is COPY going to know to hook up STDIN to the program?

Huh? That's fairly nonsensical for the backend-side case; there's no
way that stdin (or stdout) of a backend is going to connect anywhere
useful for this purpose. As for doing it on the psql side (\copy),
I think it would be more or less automatic. If you do say

foo | psql -c "\copy tab from 'bar |'" dbname

then bar is going to inherit psql's stdin, which is coming from foo.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-11-14 17:18:19
Message-ID: 50A3D25B.7070008@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On 11/14/2012 11:56 AM, Tom Lane wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>> On 11/14/2012 11:39 AM, Tom Lane wrote:
>>> What happened to the previous proposal of treating the COPY
>>> target as a pipe specification, ie
>> I'd like to be able to filter STDIN if possible - with this syntax how
>> is COPY going to know to hook up STDIN to the program?
> Huh? That's fairly nonsensical for the backend-side case; there's no
> way that stdin (or stdout) of a backend is going to connect anywhere
> useful for this purpose. As for doing it on the psql side (\copy),
> I think it would be more or less automatic. If you do say
>
> foo | psql -c "\copy tab from 'bar |'" dbname
>
> then bar is going to inherit psql's stdin, which is coming from foo.
>
>

Why does it make less sense on the backend than COPY foo FROM STDIN ?
Why shouldn't I want to be able to filter or transform the input? I have
a client with a pretty complex backend-driven ETL tool. One of the
annoying things about it is that we have to transfer the file to the
backend before we can process it. I can imagine this leading to a
similar annoyance.

cheers

andrew


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-11-14 19:05:47
Message-ID: 50A3EB8B.3040003@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11/14/12 11:50 AM, Andrew Dunstan wrote:
>> COPY table FROM 'some command line |';
>> COPY table TO '| some command line';
>>
>
>
> I'd like to be able to filter STDIN if possible - with this syntax how
> is COPY going to know to hook up STDIN to the program?

Why don't you filter the data before it gets to stdin? Some program is
feeding the data to "stdin" on the client side. Why doesn't that do the
filtering? I don't see a large advantage in having the data be sent
unfiltered to the server and having the server do the filtering.


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-11-14 19:22:11
Message-ID: 50A3EF63.3070405@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On 11/14/2012 02:05 PM, Peter Eisentraut wrote:
> On 11/14/12 11:50 AM, Andrew Dunstan wrote:
>>> COPY table FROM 'some command line |';
>>> COPY table TO '| some command line';
>>>
>>
>> I'd like to be able to filter STDIN if possible - with this syntax how
>> is COPY going to know to hook up STDIN to the program?
> Why don't you filter the data before it gets to stdin? Some program is
> feeding the data to "stdin" on the client side. Why doesn't that do the
> filtering? I don't see a large advantage in having the data be sent
> unfiltered to the server and having the server do the filtering.

Centralization of processing would be one obvious reason. I don't really
see why the same reasoning doesn't apply on the backend. You could just
preprocess the input before calling COPY (via a plperlu function for
example). If we're going to have filtering functionality then it should
be as general as possible, ISTM. But I seem to be alone in this, so I
won't push it.

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-11-14 19:37:52
Message-ID: 23999.1352921872@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> On 11/14/2012 02:05 PM, Peter Eisentraut wrote:
>> Why don't you filter the data before it gets to stdin? Some program is
>> feeding the data to "stdin" on the client side. Why doesn't that do the
>> filtering? I don't see a large advantage in having the data be sent
>> unfiltered to the server and having the server do the filtering.

> Centralization of processing would be one obvious reason.

If I understand correctly, what you're imagining is that the client
sources data to a COPY FROM STDIN type of command, then the backend
pipes that out to stdin of some filtering program, which it then reads
the stdout of to get the data it processes and stores.

We could in principle make that work, but there are some pretty serious
implementation problems: popen doesn't do this so we'd have to cons up
our own fork and pipe setup code, and we would have to write a bunch of
asynchronous processing logic to account for the possibility that the
filter program doesn't return data in similar-size chunk to what it
reads. (IOW, it will never be clear when to try to read data from the
filter and when to try to write data to it.)

I think it's way too complicated for the amount of functionality you'd
get. As Peter says, there's no strong reason not to do such processing
on the client side. In fact there are pretty strong reasons to prefer
to do it there, like not needing database superuser privilege to invoke
the filter program.

What I'm imagining is a very very simple addition to COPY that just
allows it to execute popen() instead of fopen() to read or write the
data source/sink. What you suggest would require hundreds of lines and
create many opportunities for new bugs.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-11-14 20:15:38
Message-ID: 50A3FBEA.7010506@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On 11/14/2012 02:37 PM, Tom Lane wrote:

> What I'm imagining is a very very simple addition to COPY that just
> allows it to execute popen() instead of fopen() to read or write the
> data source/sink. What you suggest would require hundreds of lines and
> create many opportunities for new bugs.
>
>

That's certainly a better answer than any I've had. I accept the reasoning.

cheers

andrew


From: Craig Ringer <craig(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-11-15 02:01:54
Message-ID: 50A44D12.1010508@2ndQuadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11/15/2012 12:46 AM, Tom Lane wrote:
>
> Agreed, and there's also the question of passing switches etc to the
> program, so the string can't be a bare file name anyway. I proposed
> pipe symbols (|) in the string previously, but if you find that too
> Unix-centric I suppose we could do
>
> COPY TABLE FROM PROGRAM 'command line';
> COPY TABLE TO PROGRAM 'command line';

I'd strongly prefer that from a security standpoint. I intensely dislike
the idea that COPY will change from a command that can at worst expose
data to a command that can execute programs. It means existing security
decisions in applications that use it must be re-evaluated ... and most
won't be. Also, it isn't too hard to check the command string for pipe
chars, but experience has taught over and over with SQL injection, shell
metacharacter exploits, XSS, etc that magic characters that you must
check for are a bad idea, and it's much better to have separate syntax
(like parameterized statements) rather than magic strings.

Additionally, the pipe design appears to presume the presence of a shell
and the desirability of using it. I don't think either assumption is
sensible.

Windows has a shell of sorts (cmd.exe) but its behaviour is different
with regards to quoting and it can be tricky to produce commands that
work under both a UNIX shell and cmd.exe .

More importantly, the shell provides fun opportunities for unexpected
side-effects via metacharacters, leading to undesired behaviour or even
exploits. It's IMO strongly preferable to use an argument vector and
direct execution, so the shell never gets involved.

How about:

COPY ... FROM PROGRAM '/bin/my_program', '$notavariable', '$(rm -rf
$HOME)';

or:

COPY ... FROM (PROGRAM '/bin/my_program', ARGUMENTS
('$notavariable', '$(rm -rf $HOME)') );

?

Something extensible would be good, as somebody is *inevitably* going to
ask "so how do I set environment variables before I call the command"
and "how do I control which return values are considered success".

--
Craig Ringer

>
> regards, tom lane
>
>

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Craig Ringer <craig(at)2ndQuadrant(dot)com>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-11-15 02:19:26
Message-ID: 2213.1352945966@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Craig Ringer <craig(at)2ndQuadrant(dot)com> writes:
> On 11/15/2012 12:46 AM, Tom Lane wrote:
>> Agreed, and there's also the question of passing switches etc to the
>> program, so the string can't be a bare file name anyway. I proposed
>> pipe symbols (|) in the string previously, but if you find that too
>> Unix-centric I suppose we could do
>>
>> COPY TABLE FROM PROGRAM 'command line';
>> COPY TABLE TO PROGRAM 'command line';

> I'd strongly prefer that from a security standpoint.

That's a reasonable concern.

> Additionally, the pipe design appears to presume the presence of a shell
> and the desirability of using it. I don't think either assumption is
> sensible.

I disagree very very strongly with that. If we prevent use of shell
syntax, we will lose a lot of functionality, for instance

copy ... from program 'foo <somefile'
copy ... from program 'foo | bar'

unless you're imagining that we will reimplement a whole lot of that
same shell syntax for ourselves. (And no, I don't care whether the
Windows syntax is exactly the same or not. The program name/path is
already likely to vary across systems, so it's pointless to suppose that
use of the feature would be 100% portable if only we lobotomized it.)

> More importantly, the shell provides fun opportunities for unexpected
> side-effects via metacharacters, leading to undesired behaviour or even
> exploits.

So? You're already handing the keys to the kingdom to anybody who can
control the contents of that command line, even if it's only to point at
the wrong program. And one man's "unexpected side-effect" is another
man's "essential feature", as in my examples above.

regards, tom lane


From: Craig Ringer <craig(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-11-15 03:14:13
Message-ID: 50A45E05.3030201@2ndQuadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11/15/2012 10:19 AM, Tom Lane wrote:
>
> I disagree very very strongly with that. If we prevent use of shell
> syntax, we will lose a lot of functionality, for instance
>
> copy ... from program 'foo <somefile'
> copy ... from program 'foo | bar'
>
> unless you're imagining that we will reimplement a whole lot of that
> same shell syntax for ourselves. (And no, I don't care whether the
> Windows syntax is exactly the same or not. The program name/path is
> already likely to vary across systems, so it's pointless to suppose that
> use of the feature would be 100% portable if only we lobotomized it.)

That's reasonable - and it isn't worth making people jump through hoops
with ('bash','-c','/some/command < infile') .

> So? You're already handing the keys to the kingdom to anybody who can
> control the contents of that command line, even if it's only to point at
> the wrong program. And one man's "unexpected side-effect" is another
> man's "essential feature", as in my examples above.

That's true if they're controlling the whole command, not so much if
they just provide a file name. I'm just worried that people will use it
without thinking deeply about the consequences, just like they do with
untrusted client input in SQL injection attacks.

I take you point about wanting more than just the execve()-style
invocation. I'd still like to see a way to invoke the command without
having the shell involved, though; APIs to invoke external programs seem
to start out with a version that launches via the shell then quickly
grow more controlled argument-vector versions.

There's certainly room for a quick'n'easy COPY ... FROM PROGRAM ('cmd1 |
cmd2 | tee /tmp/log') . At this point all I think is really vital is to
make copy-with-exec *syntactically different* to plain COPY, and to
leave room for extending the syntax for environment, separate args
vector, etc when they're called for. Like VACUUM, where VACUUM VERBOSE
ANALYZE became VACUUM (VERBOSE, ANALYZE) to make room for (BUFFERS), etc.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-11-15 19:13:13
Message-ID: CA+TgmobTvTvphRzDWeqA+SsGucrPnaOg9=Fj93JZLi-0an4ahQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Nov 14, 2012 at 10:14 PM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
>> So? You're already handing the keys to the kingdom to anybody who can
>> control the contents of that command line, even if it's only to point at
>> the wrong program. And one man's "unexpected side-effect" is another
>> man's "essential feature", as in my examples above.
>
> That's true if they're controlling the whole command, not so much if
> they just provide a file name. I'm just worried that people will use it
> without thinking deeply about the consequences, just like they do with
> untrusted client input in SQL injection attacks.

Yeah. If we're going to do this at all, and I'm not convinced it's
worth the work, I think it's definitely good to support a variant
where we specify exactly the things that will be passed to exec().
There's just too many ways to accidentally shoot yourself in the foot
otherwise. If we want to have an option that lets people shoot
themselves in the foot, that's fine. But I think we'd be smart not to
make that the only option.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Craig Ringer <craig(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-11-15 19:35:53
Message-ID: 20767.1353008153@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> Yeah. If we're going to do this at all, and I'm not convinced it's
> worth the work, I think it's definitely good to support a variant
> where we specify exactly the things that will be passed to exec().
> There's just too many ways to accidentally shoot yourself in the foot
> otherwise. If we want to have an option that lets people shoot
> themselves in the foot, that's fine. But I think we'd be smart not to
> make that the only option.

[ shrug... ] Once again, that will turn this from a ten-line patch
into hundreds of lines (and some more, different, hundreds of lines
for Windows I bet), with a corresponding growth in the opportunities
for bugs, for a benefit that's at best debatable.

The biggest problem this patch has had from the very beginning is
overdesign, and this is more of the same. Let's please just define the
feature as "popen, not fopen, the given string" and have done. You can
put all the warning verbiage you want in the documentation. (But note
that the server-side version would be superuser-only in any flavor of
the feature.)

regards, tom lane


From: Craig Ringer <craig(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-11-16 04:38:18
Message-ID: 50A5C33A.2030107@2ndQuadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11/16/2012 03:35 AM, Tom Lane wrote:

> The biggest problem this patch has had from the very beginning is
> overdesign, and this is more of the same. Let's please just define the
> feature as "popen, not fopen, the given string" and have done. You can
> put all the warning verbiage you want in the documentation. (But note
> that the server-side version would be superuser-only in any flavor of
> the feature.)

I concede that as server-side COPY is superuser-only already it doesn't
offer the same potential for attack that it otherwise would. If
applications take unchecked file system paths from users and feed them
into a superuser command they already have security problems.

I'd still be much happier to have COPY ... FROM PROGRAM - or something -
to clearly make the two different, for clarity as much as security.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Craig Ringer <craig(at)2ndQuadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-11-16 04:55:36
Message-ID: 4313.1353041736@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Craig Ringer <craig(at)2ndQuadrant(dot)com> writes:
> I'd still be much happier to have COPY ... FROM PROGRAM - or something -
> to clearly make the two different, for clarity as much as security.

I don't object to using a PROGRAM keyword rather than something inside
the string to select this behavior.

regards, tom lane


From: "Etsuro Fujita" <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>
To: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "'Robert Haas'" <robertmhaas(at)gmail(dot)com>
Cc: "'Craig Ringer'" <craig(at)2ndquadrant(dot)com>, "'Simon Riggs'" <simon(at)2ndquadrant(dot)com>, "'Fujii Masao'" <masao(dot)fujii(at)gmail(dot)com>, "'Craig Ringer'" <ringerc(at)ringerc(dot)id(dot)au>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-11-16 09:41:44
Message-ID: 001101cdc3de$97348510$c59d8f30$@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> From: Tom Lane [mailto:tgl(at)sss(dot)pgh(dot)pa(dot)us]

> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> > Yeah. If we're going to do this at all, and I'm not convinced it's
> > worth the work, I think it's definitely good to support a variant
> > where we specify exactly the things that will be passed to exec().
> > There's just too many ways to accidentally shoot yourself in the foot
> > otherwise. If we want to have an option that lets people shoot
> > themselves in the foot, that's fine. But I think we'd be smart not to
> > make that the only option.
>
> [ shrug... ] Once again, that will turn this from a ten-line patch
> into hundreds of lines (and some more, different, hundreds of lines
> for Windows I bet), with a corresponding growth in the opportunities
> for bugs, for a benefit that's at best debatable.
>
> The biggest problem this patch has had from the very beginning is
> overdesign, and this is more of the same. Let's please just define the
> feature as "popen, not fopen, the given string" and have done. You can
> put all the warning verbiage you want in the documentation. (But note
> that the server-side version would be superuser-only in any flavor of
> the feature.)

Agreed. I'll reimplement the feature using the PROGRAM keyword:

> COPY TABLE FROM PROGRAM 'command line';
> COPY TABLE TO PROGRAM 'command line';

Sorry for the late response.

Best regards,
Etsuro Fujita


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Craig Ringer <craig(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-11-19 17:31:00
Message-ID: CA+TgmoZ3BBGrgv+rGeLbx623v-1DaxpfR9_hoo40m+D4bHBSGQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Nov 15, 2012 at 2:35 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> Yeah. If we're going to do this at all, and I'm not convinced it's
>> worth the work, I think it's definitely good to support a variant
>> where we specify exactly the things that will be passed to exec().
>> There's just too many ways to accidentally shoot yourself in the foot
>> otherwise. If we want to have an option that lets people shoot
>> themselves in the foot, that's fine. But I think we'd be smart not to
>> make that the only option.
>
> [ shrug... ] Once again, that will turn this from a ten-line patch
> into hundreds of lines (and some more, different, hundreds of lines
> for Windows I bet), with a corresponding growth in the opportunities
> for bugs, for a benefit that's at best debatable.
>
> The biggest problem this patch has had from the very beginning is
> overdesign, and this is more of the same. Let's please just define the
> feature as "popen, not fopen, the given string" and have done.

I just don't agree with that. popen() is to security holes as cars
are to alcohol-related fatalities. In each case, the first one
doesn't directly cause the second one; but it's a pretty darn powerful
enabler. Your proposed solution won't force people to write insecure
applications; it'll just make it much more likely that they will do so
... after which, presumably, you'll tell them it's their own darn
fault for using the attractive nuisance. The list of security
vulnerabilities that are the result of insufficiently careful
validation of strings passed to popen() is extremely long. If we give
people a feature that can only be leveraged via popen(), the chances
that someone will thereby open a security hole are indistinguishable
from 1.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Craig Ringer <craig(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-11-19 17:54:55
Message-ID: 6648.1353347695@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Thu, Nov 15, 2012 at 2:35 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> The biggest problem this patch has had from the very beginning is
>> overdesign, and this is more of the same. Let's please just define the
>> feature as "popen, not fopen, the given string" and have done.

> ... If we give
> people a feature that can only be leveraged via popen(), the chances
> that someone will thereby open a security hole are indistinguishable
> from 1.

You are absolutely right that this feature is a security risk, but it
will be one whether it exposes popen() or only exec(). I do not believe
that the incremental gain in security from disallowing shell notation
is worth either the loss of functionality or the amount of added effort
(and added bugs, some of which will be security issues in themselves)
we'd need to write it that way.

The correct response to the security risks is to (a) make it
superuser-only and (b) document that it's a seriously bad idea to allow
the argument string to come from any untrusted sources. Please note
that we'd have to do these same things with an exec-based patch.

regards, tom lane


From: "Etsuro Fujita" <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>
To: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "'Robert Haas'" <robertmhaas(at)gmail(dot)com>
Cc: "'Craig Ringer'" <craig(at)2ndquadrant(dot)com>, "'Simon Riggs'" <simon(at)2ndquadrant(dot)com>, "'Fujii Masao'" <masao(dot)fujii(at)gmail(dot)com>, "'Craig Ringer'" <ringerc(at)ringerc(dot)id(dot)au>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-11-22 12:10:08
Message-ID: 008801cdc8aa$5062cc20$f1286460$@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> > The biggest problem this patch has had from the very beginning is
> > overdesign, and this is more of the same. Let's please just define the
> > feature as "popen, not fopen, the given string" and have done. You can
> > put all the warning verbiage you want in the documentation. (But note
> > that the server-side version would be superuser-only in any flavor of
> > the feature.)
>
> Agreed. I'll reimplement the feature using the PROGRAM keyword:
>
> > COPY TABLE FROM PROGRAM 'command line';
> > COPY TABLE TO PROGRAM 'command line';

I've reimplemented the feature. Attached is an updated version of the patch.

Todo:
* More documents
* More tests

Any comments are welcomed.

Thanks,

Best regards,
Etsuro Fujita

Attachment Content-Type Size
copy-popen-20121122.patch application/octet-stream 33.3 KB

From: "Etsuro Fujita" <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>
To: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "'Robert Haas'" <robertmhaas(at)gmail(dot)com>
Cc: "'Craig Ringer'" <craig(at)2ndquadrant(dot)com>, "'Simon Riggs'" <simon(at)2ndquadrant(dot)com>, "'Fujii Masao'" <masao(dot)fujii(at)gmail(dot)com>, "'Craig Ringer'" <ringerc(at)ringerc(dot)id(dot)au>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY
Date: 2012-11-26 12:43:36
Message-ID: 009801cdcbd3$a6ee8580$f4cb9080$@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> > I'll reimplement the feature using the PROGRAM keyword:
> >
> > > COPY TABLE FROM PROGRAM 'command line';
> > > COPY TABLE TO PROGRAM 'command line';
>
> I've reimplemented the feature. Attached is an updated version of the patch.

I fixed bugs in the previous version of the patch. Please find attached an
updated version of the patch.

Thanks,

Best regards,
Etsuro Fujita

Attachment Content-Type Size
copy-popen-20121126.patch application/octet-stream 36.7 KB