From: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: COPY Transform support
Date: 2008-04-03 13:17:23
Message-ID: 200804031517.29357.dfontaine@hi-media.com
Lists: pgsql-hackers

Hi,

Here's a proposal for COPY to support the T part of an ETL, that is, adding the
capability for COPY FROM to Transform the data it gets.

The idea is quite simple: add to COPY FROM the option to run a function on the
data before calling the datatype input functions. This needs some syntax
addition to be worked out on the COPY side; the COPY code then has to run the
given function on the data it reads and hand its output to the current COPY
code (the datatype input function).

The function could either get the data as text or bytea, and would have to
return either text or bytea. bytea seems the more sensible choice, as long as
we don't lose encoding information there, which I'm not sure about.

The syntax could be something like:
COPY mytable FROM '/my/file.txt' WITH COLUMN x CONVERT USING myfunc;

I tried to only add keywords already present in [1], while getting something
meaningful... and x is intended to be the column number, counting from 1.
[1] http://www.postgresql.org/docs/8.3/static/sql-keywords-appendix.html
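
To make this concrete, here is a rough sketch of what such a transform function
could look like (myfunc and the column number are only placeholders; the COPY
line below uses the proposed, not-yet-existing syntax):

-- hypothetical transform: turn MySQL-style zero dates into NULLs
-- before the datatype input function ever sees them
CREATE FUNCTION myfunc(text) RETURNS text AS $$
    SELECT CASE WHEN $1 = '0000-00-00' THEN NULL ELSE $1 END;
$$ LANGUAGE sql IMMUTABLE;

-- proposed syntax, applied to the third column of the input file
COPY mytable FROM '/my/file.txt' WITH COLUMN 3 CONVERT USING myfunc;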

Comments?
--
dim


From: NikhilS <nikkhils(at)gmail(dot)com>
To: "Dimitri Fontaine" <dfontaine(at)hi-media(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY Transform support
Date: 2008-04-03 13:59:08
Message-ID: d3c4af540804030659h70e1ed5aw57305b0e69e056e@mail.gmail.com
Lists: pgsql-hackers

Hi,

On Thu, Apr 3, 2008 at 6:47 PM, Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
wrote:

> Here's a proposal for COPY to support the T part of an ETL, that is adding
> the
> capability for COPY FROM to Transform the data it gets.
>
> The idea is quite simple: adding to COPY FROM the option to run a function
> on
> the data before to call datatype_in functions. This needs some syntax
> addition to be worked out at the COPY side, then the COPY code will have
> to
> run the given function on the read data and consider giving the output of
> it
> to current COPY code (datatype input function).
>
> The function could either get the data as text or bytea, and would have to
> return either text or bytea. bytea seems the more sensible choice, as long
> as
> we don't lose encoding information there, which I'm not sure about.
>
> The syntax could be something like:
> COPY mytable FROM '/my/file.txt' WITH COLUMN x CONVERT USING myfunc;
>
> I tried to only add keywords already present in [1], while getting
> something
> meaningfull... and x is intended to be the column number, counting from 1.
> [1] http://www.postgresql.org/docs/8.3/static/sql-keywords-appendix.html
>
> Comments?
> --
> dim
>

+1

Data transformation while doing a data load is a requirement now and then.
The fact that users otherwise have to run mass updates *after* the load
completes, to mend the data to their liking, should be reason enough to do
this while the loading is happening. I think to go about it the right way we
should support the following:

* The ability to provide per-column transformation expressions

COPY mytable (col1 transform to "col1 + 10", col2 transform to "'Post' ||
'greSQL'", col3...) FROM ..

* The ability to use any kind of expression while doing the transformation

The transformation expression should be any expression (basically anything
ExecEvalExpr can evaluate) that yields a resulting value and, obviously, a
corresponding is_null value too. It could use system built-in functions
(e.g. UPPER, TRIM, TO_CHAR, TO_NUMBER, etc.) as well as user-defined
functions.

* The ability for the transformation expression to refer to other columns
involved in the load, so that when the current row is extracted from the input
file, the current values are used to generate the new resultant values before
doing a heap_form_tuple. E.g.

(col1 transform "col1 + 10", col2 transform "col1 * col2", col3 transform
"UPPER(col1 || col3)",...)

I have given some thought to how to do this and will be happy to share it if
the list is interested. Personally, I think data transformation using such
expressions is a pretty powerful and important capability to have during the
data load itself.
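
For contrast, the load-then-mend route that this would let users avoid looks
roughly like this (table and column names are purely illustrative):

COPY mytable FROM '/path/to/file.dat';
UPDATE mytable SET col1 = col1 + 10,
                   col2 = 'Post' || 'greSQL';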

Regards,
Nikhils
--
EnterpriseDB http://www.enterprisedb.com


From: PFC <lists(at)peufeu(dot)com>
To: NikhilS <nikkhils(at)gmail(dot)com>, "Dimitri Fontaine" <dfontaine(at)hi-media(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY Transform support
Date: 2008-04-03 14:44:56
Message-ID: op.t81hw6jmcigqcu@apollo13.peufeu.com
Lists: pgsql-hackers


> Data transformation while doing a data load is a requirement now and
> then.
> Considering that users will have to do mass updates *after* the load
> completes to mend the data to their liking should be reason enough to do
> this while the loading is happening. I think to go about it the right
> way we
> should support the following:

> * The ability to provide per-column transformation expressions
> * The ability to use any kind of expressions while doing the
> transformation
> The transformation expression should be any expression (basically
> ExecEvalExpr) that can be evaluated to give a resulting value and
> obviously
> a corresponding is_null value too. It should and could be system in-built
> functions (e.g. UPPER, TRIM, TO_CHAR, TO_NUMBER etc.) or user defined
> functions too
> * The transformation expression can refer to other columns involved in
> the
> load. So that when the current row is extracted from the input file, the
> current values should be used to generate the new resultant values before
> doing a heap_form_tuple. E.g.
> (col1 transform "col1 + 10", col2 transform "col1 * col2", col3 transform
> "UPPER(col1 || col3)",...)
> I have spent some thoughts on how to do this and will be happy to share
> the
> same if the list is interested. Personally, I think data transformation
> using such expressions is a pretty powerful and important activity while
> doing the data load itself.

Well, since COPY is about as fast as INSERT INTO ... SELECT plus the
parsing overhead, I suggest adding a special SELECT form that can read from
a file instead of a table, which returns tuples, and which can therefore
be used and abused to the user's liking. This is a much more powerful
feature because:

- there is almost no new syntax
- it is much simpler for the user
- lots of existing stuff can be leveraged

EXAMPLE:

Suppose I want to import a MySQL dump file (gasp!) which obviously
contains lots of crap like 0000-00-00 dates, '' instead of NULL, borken
foreign keys, etc.

Let's have a new command:

CREATE FLATFILE READER mydump (
    id      INTEGER,
    date    TEXT,
    ...
) FROM file 'dump.txt'
(followed by delimiter specification syntax identical to COPY, etc)
;

This command would create a set-returning function which is basically a
wrapper around the existing parser in COPY.
The column definition gives a name and type to the fields in the text file,
and tells the parser what to expect and what to return.
It looks like a table definition, and this is actually pretty normal: it
is, after all, very close to a table.

INSERT INTO mytable (id, date, ...) SELECT id, NULLIF( date, '0000-00-00'
), ... FROM mydump WHERE (FKs check and drop the borken records);

Now I can import data and transform it at will using a simple SELECT. The
advantage is that everybody will know what to do without learning a new
command, no awkward syntax (transform...), you can combine columns in
expressions, JOIN to check FKs, use ORDER BY to get a clustered table,
anything you want, without any extension to the Postgres engine besides
the creation of this file-parsing set-returning function, which should be
pretty simple.

Or suppose I have a few gigabytes of logs that I am absolutely not interested
in inserting into a table; instead I want to compute some statistics, or
perhaps insert into my table some aggregate computed from this data. I would
just do:

CREATE FLATFILE READER accesses_dump (
    date TEXT,
    ip   INET,
    ...
) FROM file 'web_server_log.txt';

And I can do some stats without even loading the data:

SELECT ip, count(*) FROM accesses_dump
GROUP BY ip HAVING count(*) > 1000
ORDER BY count(*);

Much better than having to load those gigabytes just to make a query on
them...


From: Csaba Nagy <nagy(at)ecircle-ag(dot)com>
To: PFC <lists(at)peufeu(dot)com>
Cc: NikhilS <nikkhils(at)gmail(dot)com>, Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY Transform support
Date: 2008-04-03 14:57:53
Message-ID: 1207234673.8259.45.camel@PCD12478
Lists: pgsql-hackers

On Thu, 2008-04-03 at 16:44 +0200, PFC wrote:
> CREATE FLATFILE READER mydump (
> id INTEGER,
> date TEXT,
> ...
> ) FROM file 'dump.txt'
> (followed by delimiter specification syntax identical to COPY, etc)
> ;

Very cool idea, but why would you need to create a reader object
first? You should be able to use COPY directly with the target table
omitted, meaning the copy will not pump its result into the target
but be equivalent to a select... and use it in any place where a select
can be used. This would have absolutely no new syntax, just the rules
changed...

Now that I have had a second look, you actually need the field definitions to
meaningfully interpret the file, but then why not use a record
specification instead of the table in the normal COPY command? I'm not
sure if there's any existing syntax for that, but I would guess yes...

In any case, such a feature would help a lot in processing input files
based also on other existing data in the DB.

Cheers,
Csaba.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY Transform support
Date: 2008-04-03 15:00:26
Message-ID: 29848.1207234826@sss.pgh.pa.us
Lists: pgsql-hackers

Dimitri Fontaine <dfontaine(at)hi-media(dot)com> writes:
> Here's a proposal for COPY to support the T part of an ETL, that is adding the
> capability for COPY FROM to Transform the data it gets.

> The idea is quite simple: adding to COPY FROM the option to run a function on
> the data before to call datatype_in functions.

The major concern I have about this is to ensure that no detectable
overhead is added to COPY when the feature isn't being used.

I am not actually convinced that the column-by-column design you seem to
have in mind is worth anything. The examples that I remember seeing
often involve removing columns, generating one column from multiple ones
or vice versa, dealing with nonstandard column delimiters, etc. What
would make sense in my mind is a single function taking and returning
text, which is invoked once on each complete input line before it is
broken into fields.

This is, of course, just a substitute for running a sed or perl or
similar script over the data before feeding it to COPY --- and probably
not an amazingly good substitute at that. For instance, assuming you
like perl for text-wrangling, I'd fully expect the function approach
to be slower than an external script because of the large overhead of
getting into and out of libperl for each line.

In situations where it's actually useful to apply SQL functions rather
than text-mangling operations to the data, you always have the option to
COPY into a temp table and then do INSERT/SELECT from there.
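
A rough sketch of that route, with purely illustrative names:

CREATE TEMP TABLE loadtable (id text, ts text);
COPY loadtable FROM '/tmp/dump.txt';
INSERT INTO mytable (id, created_at)
    SELECT id::integer, to_timestamp(ts, 'YYYYMMDDHH24MISS')
    FROM loadtable;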

So the whole thing seems just marginally attractive to me.

regards, tom lane


From: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
To: PFC <lists(at)peufeu(dot)com>
Cc: NikhilS <nikkhils(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY Transform support
Date: 2008-04-03 15:06:14
Message-ID: 200804031706.17334.dfontaine@hi-media.com
Lists: pgsql-hackers

On Thursday 03 April 2008, PFC wrote:
> CREATE FLATFILE READER mydump (
>         id      INTEGER,
>         date    TEXT,
>         ...
> ) FROM file 'dump.txt'
> (followed by delimiter specification syntax identical to COPY, etc)
> ;
[...]
> INSERT INTO mytable (id, date, ...) SELECT id, NULLIF( date,
> '0000-00-00' ), ... FROM mydump WHERE (FKs check and drop the borken
> records);

What do we gain against current way of doing it, which is:
COPY loadtable FROM 'dump.txt' WITH ...
INSERT INTO destination_table(...) SELECT ... FROM loadtable;

--
dim


From: PFC <lists(at)peufeu(dot)com>
To: "Csaba Nagy" <nagy(at)ecircle-ag(dot)com>
Cc: NikhilS <nikkhils(at)gmail(dot)com>, "Dimitri Fontaine" <dfontaine(at)hi-media(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY Transform support
Date: 2008-04-03 15:15:35
Message-ID: op.t81jb9ydcigqcu@apollo13.peufeu.com
Lists: pgsql-hackers

On Thu, 03 Apr 2008 16:57:53 +0200, Csaba Nagy <nagy(at)ecircle-ag(dot)com> wrote:

> On Thu, 2008-04-03 at 16:44 +0200, PFC wrote:
>> CREATE FLATFILE READER mydump (
>> id INTEGER,
>> date TEXT,
>> ...
>> ) FROM file 'dump.txt'
>> (followed by delimiter specification syntax identical to COPY, etc)
>> ;
>
> Very cool idea, but why would you need to create a reader object
> first ? You should be able to use COPY directly with the target table
> being omitted, meaning the copy will not pump it's result in the target
> but be equivalent to a select... and use it in any place where a select
> can be used. This would have absolutely no new syntax, just the rules
> changed...
>
> Now that I had a second look you actually need the field definitions to
> meaningfully interpret the file,

Yeah, you need to tell Postgres the field names, types, and NULLness
before it can parse them... or else it's just a plain flat text file which
makes no sense...
> but then why not use a record
> specification instead of the table in the normal COPY command ? I'm not
> sure if there's any existing syntax for that but I would guess yes...

Hm, yeah, that's even simpler, just create a type for the row (or just
use table%ROWTYPE if you have a table that fits the description), and tell
COPY to parse according to the row type definition... smart...

Like :

CREATE TYPE import_rowtype AS (id INTEGER, date TEXT);
INSERT INTO mytable (id, date, ...)
SELECT id, NULLIF( date, '0000-00-00' )::DATE
FROM (COPY AS import_rowtype FROM 'mysql_trash.txt') AS foo
WHERE (FKs check and drop the borken records);

Looks clean...

Obviously, in this case (and also in my proposal's case) you must use
COPY and not \copy since it is the database server which will be reading
the file.
This could probably be hacked so the client sends the file via the \copy
interface, too...

> In any case, such a feature would help a lot in processing input files
> based also on other existing data in the DB.

Yeah, it would be cool.
Also, since COPY TO can use a SELECT as a data source, you could use
postgres to read from a file/pipe, process data, and write to a file/pipe
(kinda better than sed, lol)


From: PFC <lists(at)peufeu(dot)com>
To: "Dimitri Fontaine" <dfontaine(at)hi-media(dot)com>
Cc: NikhilS <nikkhils(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY Transform support
Date: 2008-04-03 15:22:28
Message-ID: op.t81jnqu7cigqcu@apollo13.peufeu.com
Lists: pgsql-hackers

>> INSERT INTO mytable (id, date, ...) SELECT id, NULLIF( date,
>> '0000-00-00' ), ... FROM mydump WHERE (FKs check and drop the borken
>> records);
>
> What do we gain against current way of doing it, which is:
> COPY loadtable FROM 'dump.txt' WITH ...
> INSERT INTO destination_table(...) SELECT ... FROM loadtable;

You read and write the data only once instead of twice (faster) if you
want to import all of it.
If you just want to compute some aggregates and store the results in a
table, you just read the data once and don't write it at all.

The advantages are the same as with your proposed transformations to COPY,
except I feel this way of doing it opens more options (like, you can
combine columns, check FKs at load, do queries on data without loading it,
don't necessarily have to insert the data in a table, don't have to invent
a new syntax to express the transformations, etc).


From: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: COPY Transform support
Date: 2008-04-03 15:46:05
Message-ID: 200804031746.07936.dfontaine@hi-media.com
Lists: pgsql-hackers

On Thursday 03 April 2008, Tom Lane wrote:
> The major concern I have about this is to ensure that no detectable
> overhead is added to COPY when the feature isn't being used.

Well, when COLUMN x CONVERT USING or whatever syntax we choose is not used, we
default to the current code path, that is we do not mess with the data content
at all before considering it valid input syntax for the target table datatypes.

And the syntax check is done only once, before we begin reading the data
lines from the file.

> I am not actually convinced that the column-by-column design you seem to
> have in mind is worth anything. The examples that I remember seeing
> often involve removing columns, generating one column from multiple ones
> or vice versa, dealing with nonstandard column delimiters, etc.

Yes, this is another need, but one actually better solved, in my opinion, by
loading data into a (temp) loadtable and then processing it with SQL:
INSERT INTO destination_table SELECT whatever FROM loadtable;

The problem I'm trying to solve is not this one; I'm trying to have COPY able
to load data into a table when the representation we have in the file does not
match what the datatype input function expects.

An example might help us talk about the same thing. mysqldump's CSV output
sometimes (depending on server version) renders a timestamp as '20041002152952'
where PostgreSQL expects '2004-10-02 15:29:52'. I'd like COPY to be able to
cope with this situation.
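
For illustration only, the kind of conversion function I have in mind (mysql2ts
is just a made-up name, and the COPY line reuses the syntax proposed at the
start of this thread):

CREATE FUNCTION mysql2ts(text) RETURNS text AS $$
    SELECT to_char(to_timestamp($1, 'YYYYMMDDHH24MISS'),
                   'YYYY-MM-DD HH24:MI:SS');
$$ LANGUAGE sql;

COPY mytable FROM 'dump.csv' WITH COLUMN 2 CONVERT USING mysql2ts;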

Now, another syntax proposal could solve both needs. We basically need to be
able to transform input fields and process them into input columns, in such a
way that N input fields (found in the data file) give us M input columns:

COPY destination_table(col1, col2, col3, col4)
USING (field1, field2 || field3, myfunc(field4, field5))
FROM 'file.txt'
WITH ...

This could turn out better than preprocessing, then COPY, then INSERT INTO ...
SELECT, because we don't need a temp table (no need to care about its name
being unique, nor to mess with temp_buffers), etc.
You're better placed than I am to say why it'd be better to have one COPY
command instead of a two-step load through a table, I'm just guessing ;)

And if it's better for the user to preprocess in perl and then COPY, he still
has the option.
--
dim


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Dimitri Fontaine" <dfontaine(at)hi-media(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY Transform support
Date: 2008-04-03 17:23:09
Message-ID: 87sky2g8ea.fsf@oxford.xeocode.com
Lists: pgsql-hackers

"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Dimitri Fontaine <dfontaine(at)hi-media(dot)com> writes:
>> Here's a proposal for COPY to support the T part of an ETL, that is adding the
>> capability for COPY FROM to Transform the data it gets.
>
>> The idea is quite simple: adding to COPY FROM the option to run a function on
>> the data before to call datatype_in functions.
>
> The major concern I have about this is to ensure that no detectable
> overhead is added to COPY when the feature isn't being used.
>
> I am not actually convinced that the column-by-column design you seem to
> have in mind is worth anything. The examples that I remember seeing
> often involve removing columns, generating one column from multiple ones
> or vice versa, dealing with nonstandard column delimiters, etc. What
> would makes sense in my mind is a single function taking and returning
> text, which is invoked once on each complete input line before it is
> broken into fields.

I think not having to deal with separating fields is actually one of the few
reasons to do this within COPY. If you can separate the fields out yourself, or
need to do something more clever than COPY is capable of to split the columns,
then you're better off preprocessing with perl or something anyway.

To that end, all the other use cases you describe could be handled with his
plan. There's nothing stopping you from doing

CREATE READER foo (a integer, b integer);
INSERT INTO b (SELECT a+b FROM foo);

or

INSERT INTO b (SELECT 1, a, b, greatest(a,b) FROM foo);

However I'm not sure we even need new syntax for CREATE READER. I would think
something like this would make more sense:

CREATE FUNCTION transform(integer, integer) RETURNS SETOF b;

COPY b FROM 'foo' USING transform(integer,integer);
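
Sketching what such a transform function could look like, assuming for the sake
of the example that b has a single integer column (the COPY ... USING form
itself is of course hypothetical):

CREATE TABLE b (total integer);

CREATE FUNCTION transform(integer, integer) RETURNS SETOF b AS $$
    SELECT $1 + $2;   -- output columns must match b's row type
$$ LANGUAGE sql;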

> So the whole thing seems just marginally attractive to me.

Everything about ETL is only marginally attractive, but it's something people
spend a lot of time doing. Nobody's come up with any particularly clean
solutions I think.

AFAIK the state of the art is actually to load the data into a table which
closely matches the source material, sometimes just columns of text. Then copy
it all to another table doing transformations. Not impressed.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: "Dimitri Fontaine" <dfontaine(at)hi-media(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY Transform support
Date: 2008-04-03 19:57:38
Message-ID: 15798.1207252658@sss.pgh.pa.us
Lists: pgsql-hackers

Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> AFAIK the state of the art is actually to load the data into a table which
> closely matches the source material, sometimes just columns of text. Then copy
> it all to another table doing transformations. Not impressed.

I liked the idea of allowing COPY FROM to act as a table source in a
larger SELECT or INSERT...SELECT. Not at all sure what would be
involved to implement that, but it seems a lot more flexible than
any other approach.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY Transform support
Date: 2008-04-03 21:51:28
Message-ID: 47F55160.402@dunslane.net
Lists: pgsql-hackers

Tom Lane wrote:
> Gregory Stark <stark(at)enterprisedb(dot)com> writes:
>
>> AFAIK the state of the art is actually to load the data into a table which
>> closely matches the source material, sometimes just columns of text. Then copy
>> it all to another table doing transformations. Not impressed.
>>
>
> I liked the idea of allowing COPY FROM to act as a table source in a
> larger SELECT or INSERT...SELECT. Not at all sure what would be
> involved to implement that, but it seems a lot more flexible than
> any other approach.
>
>
>

Several years ago Bruce and I discussed the then theoretical use of a
SELECT query as the source for COPY TO, and we agreed that the sane
analog would be to have an INSERT query as the target of COPY FROM.

This idea seems to take that rather further. If doable I think it would
be cool, as long as people don't try using it as an alternative storage
engine. I can just imagine people creating views over such SELECT
statements ...

cheers

andrew


From: Sam Mason <sam(at)samason(dot)me(dot)uk>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: COPY Transform support
Date: 2008-04-04 01:20:08
Message-ID: 20080404012008.GO6870@frubble.xen.chris-lamb.co.uk
Lists: pgsql-hackers

On Thu, Apr 03, 2008 at 03:57:38PM -0400, Tom Lane wrote:
> Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> > AFAIK the state of the art is actually to load the data into a table which
> > closely matches the source material, sometimes just columns of text. Then copy
> > it all to another table doing transformations. Not impressed.
>
> I liked the idea of allowing COPY FROM to act as a table source in a
> larger SELECT or INSERT...SELECT. Not at all sure what would be
> involved to implement that, but it seems a lot more flexible than
> any other approach.

I'm not sure why new syntax is needed; what's wrong with having a simple
set of procedures like:

readtsv(filename TEXT) RETURNS SETOF RECORD

You'd then be free to do whatever "transformations" you wanted:

INSERT INTO mytable (i, j)
SELECT i, MIN(j::INTEGER)
FROM readtsv('file.dat') AS x(i INTEGER, j TEXT)
WHERE j ~ '^[0-9]+$'
GROUP BY i;

You could even have a readlines(filename) procedure that just gives you
back a SETOF TEXT and you can do the parsing yourself. An associated
regexp split to RECORD would be nice then.
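
Something along these lines, say (readlines is the hypothetical procedure;
regexp_split_to_array is the new 8.3 function):

INSERT INTO mytable (i, j)
SELECT parts[1]::integer, parts[2]
FROM (
    SELECT regexp_split_to_array(r.line, E'\t') AS parts
    FROM readlines('file.dat') AS r(line)
) AS s;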

Sam


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Sam Mason <sam(at)samason(dot)me(dot)uk>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: COPY Transform support
Date: 2008-04-04 01:38:42
Message-ID: 188.1207273122@sss.pgh.pa.us
Lists: pgsql-hackers

Sam Mason <sam(at)samason(dot)me(dot)uk> writes:
> On Thu, Apr 03, 2008 at 03:57:38PM -0400, Tom Lane wrote:
>> I liked the idea of allowing COPY FROM to act as a table source in a
>> larger SELECT or INSERT...SELECT. Not at all sure what would be
>> involved to implement that, but it seems a lot more flexible than
>> any other approach.

> I'm not sure why new syntax is needed, what's wrong with having a simple
> set of procedures like:
> readtsv(filename TEXT) AS SETOF RECORD

Yeah, I was thinking about that too. The main stumbling block is that
you need to somehow expose all of COPY's options for parsing an input
line (CSV vs default mode, quote and delimiter characters, etc).
It's surely doable but it might be pretty ugly compared to bespoke
syntax.

Another thing is that nodeFunctionScan.c is not really designed for
enormous function result sets --- it dumps the results into a tuplestore
whether that's needed or not. This is a performance bug that we ought
to address anyway, but we'd really have to fix it if we want to approach
the COPY problem this way. Just sayin'.

regards, tom lane


From: Sam Mason <sam(at)samason(dot)me(dot)uk>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: COPY Transform support
Date: 2008-04-04 11:41:16
Message-ID: 20080404114116.GQ6870@frubble.xen.chris-lamb.co.uk
Lists: pgsql-hackers

On Thu, Apr 03, 2008 at 09:38:42PM -0400, Tom Lane wrote:
> Sam Mason <sam(at)samason(dot)me(dot)uk> writes:
> > On Thu, Apr 03, 2008 at 03:57:38PM -0400, Tom Lane wrote:
> >> I liked the idea of allowing COPY FROM to act as a table source in a
> >> larger SELECT or INSERT...SELECT. Not at all sure what would be
> >> involved to implement that, but it seems a lot more flexible than
> >> any other approach.
>
> > I'm not sure why new syntax is needed, what's wrong with having a simple
> > set of procedures like:
> > readtsv(filename TEXT) AS SETOF RECORD
>
> Yeah, I was thinking about that too. The main stumbling block is that
> you need to somehow expose all of COPY's options for parsing an input
> line (CSV vs default mode, quote and delimiter characters, etc).

Guess why I chose a nice simple example!

> It's surely doable but it might be pretty ugly compared to bespoke
> syntax.

Yes, that's an easy way to get it looking pretty.

As an alternative solution, how about having some datatype that stores
these parameters? E.g.:

CREATE TYPE copyoptions (
    delimiter TEXT CHECK (delimiter <> ''),
    nullstr   TEXT,
    hasheader BOOLEAN,
    quote     TEXT,
    escape    TEXT
);

And have the input_function understand the current PG syntax for COPY
options. You'd then be able to do:

copyfrom('dummy.csv',$$ DELIMITER ';' CSV HEADER $$)

And the procedure would be able to pull out what it wanted from the
options.
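
So a complete call might look something like this (copyfrom being the
hypothetical procedure, and the column definition list supplied by the caller):

SELECT x.id, x.name
FROM copyfrom('dummy.csv', $$ DELIMITER ';' CSV HEADER $$)
     AS x(id integer, name text);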

> Another thing is that nodeFunctionScan.c is not really designed for
> enormous function result sets --- it dumps the results into a tuplestore
> whether that's needed or not. This is a performance bug that we ought
> to address anyway, but we'd really have to fix it if we want to approach
> the COPY problem this way. Just sayin'.

So you'd end up with something resembling a coroutine? When would it
be good to actually dump everything into a tuplestore as it does at the
moment?

It'll be fun to see how much code breaks because it relies on the
current behaviour of an SRF running to completion without other activity
happening in between!

Sam


From: Decibel! <decibel(at)decibel(dot)org>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gregory Stark <stark(at)enterprisedb(dot)com>, Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY Transform support
Date: 2008-04-07 18:44:41
Message-ID: 812E5C0C-8353-446C-91EF-E990E2909444@decibel.org
Lists: pgsql-hackers

On Apr 3, 2008, at 4:51 PM, Andrew Dunstan wrote:
> Several years ago Bruce and I discussed the then theoretical use of
> a SELECT query as the source for COPY TO, and we agreed that the
> sane analog would be to have an INSERT query as the target of COPY
> FROM.
>
> This idea seems to take that rather further. If doable I think it
> would be cool, as long as people don't try using it as an
> alternative storage engine. I can just imagine people creating
> views over such SELECT statements ...

Why not? There are certainly cases where doing just that could be very
valuable. Storing older information that you're less likely to query
comes to mind... in those cases you're going to be seqscanning
anyway, so being able to read off a compact on-disk form is likely to
be a win performance-wise. It could certainly be a win storage-wise.

If someone wants to look at syntax options, I'm pretty certain that
Oracle supports this. IIRC you actually create what appears to the
database to be a real table, except for restrictions on what you can
actually do with it (for example, IIRC it's read-only).
--
Decibel!, aka Jim C. Nasby, Database Architect decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: "Decibel!" <decibel(at)decibel(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gregory Stark <stark(at)enterprisedb(dot)com>, Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY Transform support
Date: 2008-04-07 19:04:26
Message-ID: 47FA703A.5040808@dunslane.net
Lists: pgsql-hackers

Decibel! wrote:
> On Apr 3, 2008, at 4:51 PM, Andrew Dunstan wrote:
>> Several years ago Bruce and I discussed the then theoretical use of a
>> SELECT query as the source for COPY TO, and we agreed that the sane
>> analog would be to have an INSERT query as the target of COPY FROM.
>>
>> This idea seems to take that rather further. If doable I think it
>> would be cool, as long as people don't try using it as an alternative
>> storage engine. I can just imagine people creating views over such
>> SELECT statements ...
>
>
> Why not? There's certainly cases where doing just that could be very
> valuable. Storing older information that you're less likely to query
> comes to mind... in those cases you're going to be seqscanning anyway,
> so being able to read off a compact on-disk form is likely to be a win
> performance-wise. It could certainly be a win storage-wise.
>
> If someone wants to look at syntax options, I'm pretty certain that
> Oracle supports this. IIRC you actually create what appears to the
> database to be a real table, except for restrictions on what you can
> actually do with it (for example, IIRC it's read-only).
>

You're serious, aren't you?

Quite apart from any other reason why not, this would be a horrid hack
and is just the sort of "feature" we rightly eschew, IMNSHO. COPY is
designed as a bulk load/unload facility. It's fragile enough in that role.

If we really want to support an alternative storage engine then we
should tackle that front on and not via a back door like this.

cheers

andrew


From: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, "Decibel!" <decibel(at)decibel(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gregory Stark <stark(at)enterprisedb(dot)com>
Subject: Re: COPY Transform support
Date: 2008-04-07 19:36:40
Message-ID: 200804072136.43679.dfontaine@hi-media.com
Lists: pgsql-hackers

On Monday 07 April 2008 21:04:26, Andrew Dunstan wrote:
> Quite apart from any other reason why not, this would be a horrid hack
> and is just the sort of "feature" we rightly eschew, IMNSHO. COPY is
> designed as a bulk load/unload facility. It's fragile enough in that role.

And my main concern would still be left as-is: COPY wouldn't have any facility
to cope with a data representation that doesn't match what the datatype input
functions want to read. More often than not, you get this kind of data from
another system, and thus in a deterministic form, and you want a chance to
transform its representation just before PostgreSQL parses it.

May I try to insist on my second syntax proposal:

  COPY destination_table(col1, col2, col3)
 USING (field1, field2 || field3, myfunc(field4, field5))
  FROM 'file.txt'
  WITH ...

This allows the case I'm concerned with to be taken care of, AND the other
case pointed out by several posters on this thread too: N input fields, M
stored columns, any line-to-row transformation (after the same column splitting
as now), and any column pre-parsing through SQL-callable functions --- myfunc,
for example, is called before its output is fed to col3's datatype input
function.
And there is no support (that I see) for an optional second storage system
back door.
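
To tie this back to the mysqldump example, the whole thing could read something
like the following (the USING clause is of course the proposed syntax;
to_timestamp is an existing function):

COPY destination_table(id, created_at, label)
USING (field1, to_timestamp(field2, 'YYYYMMDDHH24MISS'), field3 || field4)
FROM 'mysql_dump.txt'
WITH CSV;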

Comments?
--
dim


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Andrew Dunstan <andrew(at)dunslane(dot)net>, "Decibel!" <decibel(at)decibel(dot)org>, Gregory Stark <stark(at)enterprisedb(dot)com>
Subject: Re: COPY Transform support
Date: 2008-04-08 03:08:01
Message-ID: 14193.1207624081@sss.pgh.pa.us
Lists: pgsql-hackers

Dimitri Fontaine <dfontaine(at)hi-media(dot)com> writes:
> And my main concern would still be left as-is, COPY wouldn't have any facility
> to cope with data representation not matching what datatype input functions
> want to read.

That's sufficiently covered by the proposal to allow a COPY FROM as a
table source within SELECT, no?

regards, tom lane


From: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org, Andrew Dunstan <andrew(at)dunslane(dot)net>, "Decibel!" <decibel(at)decibel(dot)org>, Gregory Stark <stark(at)enterprisedb(dot)com>
Subject: Re: COPY Transform support
Date: 2008-04-08 07:59:11
Message-ID: 200804080959.13877.dfontaine@hi-media.com
Lists: pgsql-hackers

On Tuesday 08 April 2008, Tom Lane wrote:
> Dimitri Fontaine <dfontaine(at)hi-media(dot)com> writes:
> > And my main concern would still be left as-is, COPY wouldn't have any
> > facility to cope with data representation not matching what datatype
> > input functions want to read.
>
> That's sufficiently covered by the proposal to allow a COPY FROM as a
> table source within SELECT, no?

Well, yes, the table source has text as its datatypes and the select expression
on each column will call whatever function/cast etc. is needed to do the work.
But that opens the door to a second-class-citizen storage solution for
PostgreSQL, by letting the user CREATE VIEW atop it:

CREATE VIEW csv_storage AS
SELECT a, myfunc(b)::timestamp, c::int+3
FROM (COPY ... FROM '/tmp/file.csv' ...) AS x(a, b, c)
WHERE c ~ '^[0-9]+$';

What happens to the view when /tmp/file.csv is changed (new lines appended, or
a complete rewrite by another application, etc.)?

Andrew's comment is clear about it: he does not want PostgreSQL to offer this
kind of support. I suppose it would be possible to stop CREATE VIEW from
accepting any such form of SELECT, but I was hoping for my idea to regain some
attractiveness at this point :)

At least I tried ;)
--
dim


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Andrew Dunstan <andrew(at)dunslane(dot)net>, "Decibel!" <decibel(at)decibel(dot)org>, Gregory Stark <stark(at)enterprisedb(dot)com>
Subject: Re: COPY Transform support
Date: 2008-04-08 14:14:09
Message-ID: 635.1207664049@sss.pgh.pa.us
Lists: pgsql-hackers

Dimitri Fontaine <dfontaine(at)hi-media(dot)com> writes:
On Tuesday 08 April 2008, Tom Lane wrote:
>> That's sufficiently covered by the proposal to allow a COPY FROM as a
>> table source within SELECT, no?

> Well, yes, the table source has text as datatypes and the select expression on
> the column will call whatever function/cast etc to do the work. But that
> opens the door to second class citizen storage solution for PostgreSQL, by
> letting the user CREATE VIEW atop of it:

[ shrug... ] I don't actually have a problem with that. If we did want
to prohibit that, we'd have to somehow prohibit SRFs from reading files,
because you can do it today with any untrusted PL.

I note also that presumably COPY FROM 'file' would still be restricted
to superusers, and only COPY FROM STDIN would be available to those
without a license to shoot themselves in the foot. So the opportunity
to do anything view-like would be restricted to adults(?) anyhow.

(One of the issues that'd have to be addressed to allow a table source
syntax is whether it's sane to allow multiple COPY FROM STDIN in a
single query. If so, how does it work; if not, how do we prevent it?)
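
For concreteness, the problematic case would be something like the following,
written with the hypothetical table-source form; which dataset should the
client send first?

SELECT a.x, b.y
FROM (COPY AS rowtype_a FROM STDIN) AS a,
     (COPY AS rowtype_b FROM STDIN) AS b
WHERE a.x = b.y;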

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, pgsql-hackers(at)postgresql(dot)org, "Decibel!" <decibel(at)decibel(dot)org>, Gregory Stark <stark(at)enterprisedb(dot)com>
Subject: Re: COPY Transform support
Date: 2008-04-08 15:32:48
Message-ID: 47FB9020.60704@dunslane.net
Lists: pgsql-hackers

Tom Lane wrote:
> Dimitri Fontaine <dfontaine(at)hi-media(dot)com> writes:
>
>> On Tuesday 08 April 2008, Tom Lane wrote:
>>
>>> That's sufficiently covered by the proposal to allow a COPY FROM as a
>>> table source within SELECT, no?
>>>
>
>
>> Well, yes, the table source has text as datatypes and the select expression on
>> the column will call whatever function/cast etc to do the work. But that
>> opens the door to second class citizen storage solution for PostgreSQL, by
>> letting the user CREATE VIEW atop of it:
>>
>
> [ shrug... ] I don't actually have a problem with that. If we did want
> to prohibit that, we'd have to somehow prohibit SRFs from reading files,
> because you can do it today with any untrusted PL.
>
> I note also that presumably COPY FROM 'file' would still be restricted
> to superusers, and only COPY FROM STDIN would be available to those
> without a license to shoot themselves in the foot. So the opportunity
> to do anything view-like would be restricted to adults(?) anyhow.
>

Yeah, maybe. I will suspend my doubts for now.

> (One of the issues that'd have to be addressed to allow a table source
> syntax is whether it's sane to allow multiple COPY FROM STDIN in a
> single query. If so, how does it work; if not, how do we prevent it?)
>
>
>

I don't see why it shouldn't work. I see that copy.c now looks like it's
reentrant, unlike the bad days of old. Could we make each COPY target
behave like an SRF, stashing its data in a tuplestore?

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, pgsql-hackers(at)postgresql(dot)org, "Decibel!" <decibel(at)decibel(dot)org>, Gregory Stark <stark(at)enterprisedb(dot)com>
Subject: Re: COPY Transform support
Date: 2008-04-08 16:11:24
Message-ID: 2227.1207671084@sss.pgh.pa.us
Lists: pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Tom Lane wrote:
>> (One of the issues that'd have to be addressed to allow a table source
>> syntax is whether it's sane to allow multiple COPY FROM STDIN in a
>> single query. If so, how does it work; if not, how do we prevent it?)

> I don't see why it shouldn't work. I see that copy.c now looks like it's
> reentrant, unlike the bad days of old. Could we make each COPY target
> behave like an SRF, stashing its data in a tuplestore?

The first question is what is the wire-protocol definition. In
particular, how would the client know what order to send the COPY
datasets in, if a single query might include multiple COPY FROM STDIN
segments?

Another point is that we surely don't want the implementation to force
use of a tuplestore all the time, so I'm not sure I buy that we can
prevent interleaving of multiple datasets on the wire that way.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, pgsql-hackers(at)postgresql(dot)org, "Decibel!" <decibel(at)decibel(dot)org>, Gregory Stark <stark(at)enterprisedb(dot)com>
Subject: Re: COPY Transform support
Date: 2008-04-08 16:24:59
Message-ID: 47FB9C5B.6010004@dunslane.net
Lists: pgsql-hackers

Tom Lane wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>
>> Could we make each COPY target
>> behave like an SRF, stashing its data in a tuplestore?
>>
>
> The first question is what is the wire-protocol definition. In
> particular, how would the client know what order to send the COPY
> datasets in, if a single query might include multiple COPY FROM STDIN
> segments?
>
> Another point is that we surely don't want the implementation to force
> use of a tuplestore all the time, so I'm not sure I buy that we can
> prevent interleaving of multiple datasets on the wire that way.
>
>
>

Is there a big demand for multiple datasets on the wire in a situation
like this? How about if we allow multiple COPY targets but at most one
from STDIN, at least for one go round?

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, pgsql-hackers(at)postgresql(dot)org, "Decibel!" <decibel(at)decibel(dot)org>, Gregory Stark <stark(at)enterprisedb(dot)com>
Subject: Re: COPY Transform support
Date: 2008-04-08 17:07:44
Message-ID: 3069.1207674464@sss.pgh.pa.us
Lists: pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Is there a big demand for multiple datasets on the wire in a situation
> like this? How about if we allow multiple COPY targets but at most one
> from STDIN, at least for one go round?

That's exactly what I was saying (or at least trying to imply) as the
fallback position. But you still need a way to enforce that.

regards, tom lane