Re: COPY enhancements

Lists: pgsql-hackers
From: Emmanuel Cecchet <manu(at)asterdata(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: COPY enhancements
Date: 2009-09-10 14:33:48
Message-ID: 4AA90E4C.7090706@asterdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi all,

Finally the error logging and autopartitioning patches for COPY that I
presented at PGCon are here!
Error logging is described here:
http://wiki.postgresql.org/wiki/Error_logging_in_COPY
Autopartitioning is described here:
http://wiki.postgresql.org/wiki/Auto-partitioning_in_COPY

The attached patch contains both contribs as well as unit tests. I will
submit shortly the new patch at commitfest.postgresql.org.

Thanks in advance for your feedback,
Emmanuel

--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com

Attachment Content-Type Size
aster-copy-improvements-patch-8.5v3.txt.gz application/x-gzip 25.3 KB

From: David Fetter <david(at)fetter(dot)org>
To: Emmanuel Cecchet <manu(at)asterdata(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-10 15:29:07
Message-ID: 20090910152907.GG20190@fetter.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Sep 10, 2009 at 10:33:48AM -0400, Emmanuel Cecchet wrote:
> Hi all,
>
> Finally the error logging and autopartitioning patches for COPY that
> I presented at PGCon are here!

Yay!

> Error logging is described here:
> http://wiki.postgresql.org/wiki/Error_logging_in_COPY
> Autopartitioning is described here:
> http://wiki.postgresql.org/wiki/Auto-partitioning_in_COPY
>
> The attached patch contains both contribs as well as unit tests. I
> will submit shortly the new patch at commitfest.postgresql.org.

My C isn't all that great, to put it mildly, but I noticed C++-style
//comments like this, which we don't use.

Are these dependent on each other, or could they be reviewed, etc.
separately?

Cheers,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


From: Emmanuel Cecchet <manu(at)asterdata(dot)com>
To: David Fetter <david(at)fetter(dot)org>
Cc: Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-10 16:06:36
Message-ID: 4AA9240C.3000301@asterdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi David,

> My C isn't all that great, to put it mildly, but I noticed C++-style
> //comments like this, which we don't use.
>
Ok, I will fix that.
> Are these dependent on each other, or could they be reviewed, etc.
> separately?
>
The code implementing the functionality are independent (separate
methods or files) but error logging also catches potential routing
problems and log faulty tuples (the call to routing is just basically in
the try/catch of error logging).
So, yes they can be reviewed separately, but if both are integrated it
makes sense to review them together.

Cheers
Emmanuel


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Emmanuel Cecchet <manu(at)asterdata(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-10 21:05:28
Message-ID: 4AA96A18.9030403@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Emmanuel, Hackers,

Thank you for tackling this very long-time TODO.

> Error logging is described here:
> http://wiki.postgresql.org/wiki/Error_logging_in_COPY

Questions & Comments:

A) Why would someone want to turn error_logging on, but leave
error_logging_skip_tuples off? The pg_log already logs errors which
copy throws by default.

B) As I mentioned earlier, we'll want to provide the option of logging
to a file instead of to a table. That's not a reason to reject this
patch, but probably a TODO for 8.5.

C) Are we sure we want to handle this via GUCs rather than extensions to
COPY syntax? It seems like fairly often users would want to log
different COPY sources to different tables/files.

D) These GUCs are userset, I hope? (haven't dug into the code far
enough to tell yet).

E) What is error_logging_tuple_label for? You don't explain/give
examples. And how is error_logging_tuple_partition_key used?

F) Rawdata for rejected tuples is presumably BYTEA?

G) We should probably have a default for error_logging_table_name, such
as pg_copy_errors. Does that table get automatically created if it
doesn't exist?

H) Finally, one request of the TODO is some way to halt import after a
specified number of bad tuples because it probably means you have the
wrong file or wrong table. Do we still want that?

> Autopartitioning is described here:
> http://wiki.postgresql.org/wiki/Auto-partitioning_in_COPY

M) tuple_routing_in_copy should take "on" or "off", not 0 or 1.

N) Have you measured the overhead & speed of this kind of COPY as
opposed to COPY into a single table? Have you checked the overhead if
tuple_routing_in_copy is on, but you are not loading into a partitioned
table?

O) Is this capable of dealing with partitioning by more than one column,
or by an expression?

Finally, I'm going to suggest different names for the GUCs, as the names
you've chosen don't group well and would likely cause confusion. Here
are my suggestions, which all begin with "copy_" for prefix matching:

error_logging --> probaby not needed, see able
error_logging_skip_tuples --> copy_skip_bad_rows
error_logging_schema_name --> copy_logging_schema_name
error_logging_relation_name --> copy_logging_table_name
error_logging_tuple_label --> don't know what this is for, see above
error_logging_tuple_partition_key --> don't know what this is for, see above

tuple_routing_in_copy --> copy_partitioning
tuple_routing_cache_size --> copy_partitioning_cache_size

--
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com


From: Emmanuel Cecchet <manu(at)asterdata(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-10 21:32:12
Message-ID: 4AA9705C.5020108@asterdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Josh,

See the answers inlined.
> Thank you for tackling this very long-time TODO.
>
>> Error logging is described here:
>> http://wiki.postgresql.org/wiki/Error_logging_in_COPY
>>
>
> Questions & Comments:
>
> A) Why would someone want to turn error_logging on, but leave
> error_logging_skip_tuples off? The pg_log already logs errors which copy throws by default.
>
When error_logging is on and skip_tuples is off, errors are logged in
the error table. If skip_tuples is on, tuples are not logged in the
error table.
> B) As I mentioned earlier, we'll want to provide the option of logging
> to a file instead of to a table. That's not a reason to reject this
> patch, but probably a TODO for 8.5.
>
Ok but what should be the format of that file?
> C) Are we sure we want to handle this via GUCs rather than extensions to
> COPY syntax? It seems like fairly often users would want to log
> different COPY sources to different tables/files.
>
I agree that new COPY options could be easier to use, the implementation
is just more complex. However, the labels allows you to select the
tuples related to specific COPY commands.
> D) These GUCs are userset, I hope? (haven't dug into the code far
> enough to tell yet).
>
Yes.
> E) What is error_logging_tuple_label for? You don't explain/give
> examples. And how is error_logging_tuple_partition_key used?
>
We use the label and partition key in Aster products to easily retrieve
which COPY command on which partition did generate the bad tuples. By
default, the tuple_label contains the COPY command that was executed
(see example on Wiki) and the key contains the index of the tuple in the
source file (see example on Wiki).
> F) Rawdata for rejected tuples is presumably BYTEA?
>
Yes. I forgot to put back the table description that can be seen in the
unit tests. I have updated the Wiki with the table definition.
> G) We should probably have a default for error_logging_table_name, such
> as pg_copy_errors. Does that table get automatically created if it
> doesn't exist?
>
Yes, as indicated on the wiki the table is created automatically (see
config variable section).
> H) Finally, one request of the TODO is some way to halt import after a
> specified number of bad tuples because it probably means you have the
> wrong file or wrong table. Do we still want that?
>
We can still do that. It can be another GUC variable or an option to
COPY. If the COPY command fails, everything gets rolled back (data in
the destination table and error table). That would be harder to
implement with a file (the rollback part).
>> Autopartitioning is described here:
>> http://wiki.postgresql.org/wiki/Auto-partitioning_in_COPY
>>
>
> M) tuple_routing_in_copy should take "on" or "off", not 0 or 1.
>
Ok.
> N) Have you measured the overhead & speed of this kind of COPY as
> opposed to COPY into a single table? Have you checked the overhead if
> tuple_routing_in_copy is on, but you are not loading into a partitioned
> table?
>
Yes. There is no noticeable overhead if there is no routing to do (but
routing is on).
If routing is involved, the overhead depends on how sorted your input
data is. If it all goes to the same partition, the caching effect works
well and there is no noticeable overhead. The cost is in the constraint
check and it depends on the complexity of the constraint. The more
constraints you have to check and the more complex they are, the more
overhead on each tuple routing.
> O) Is this capable of dealing with partitioning by more than one column,
> or by an expression?
>
Yes, we just use a brute force technique where we try all child tables
1-by-1 and rely on the existing Postgres constraint checking mechanism
(no new or duplicated code there).
> Finally, I'm going to suggest different names for the GUCs, as the names
> you've chosen don't group well and would likely cause confusion. Here
> are my suggestions, which all begin with "copy_" for prefix matching:
>
> error_logging --> probaby not needed, see able
> error_logging_skip_tuples --> copy_skip_bad_rows
> error_logging_schema_name --> copy_logging_schema_name
> error_logging_relation_name --> copy_logging_table_name
> error_logging_tuple_label --> don't know what this is for, see above
> error_logging_tuple_partition_key --> don't know what this is for, see above
>
> tuple_routing_in_copy --> copy_partitioning
> tuple_routing_cache_size --> copy_partitioning_cache_size
>
This makes sense. I'll add that on my todo list.

Emmanuel

--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-10 22:11:03
Message-ID: 4AA97977.4020507@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Emmanuel,

BTW, some of the questions were for -hackers in general to give
feedback. Don't take just my responses as final "what you have to do";
other contributors will have opinions, some of which will be more
informed than mine.

>> A) Why would someone want to turn error_logging on, but leave
>> error_logging_skip_tuples off? The pg_log already logs errors which
>> copy throws by default.
>>
> When error_logging is on and skip_tuples is off, errors are logged in
> the error table. If skip_tuples is on, tuples are not logged in the
> error table.

Although if you're not skipping tuples, presumably only the first error
is logged, yes? At that point, COPY would stop.

>> B) As I mentioned earlier, we'll want to provide the option of logging
>> to a file instead of to a table. That's not a reason to reject this
>> patch, but probably a TODO for 8.5.
>>
> Ok but what should be the format of that file?

I'd say a CSV version of the table you have is the simplest way to go.

>> C) Are we sure we want to handle this via GUCs rather than extensions to
>> COPY syntax? It seems like fairly often users would want to log
>> different COPY sources to different tables/files.
>>
> I agree that new COPY options could be easier to use, the implementation
> is just more complex. However, the labels allows you to select the
> tuples related to specific COPY commands.

Yes, and GUCs allow users to retrofit this approach onto existing
infrastructure without changing their COPY commands. So there's
advantages and disadvantages. My question was really for the -hackers
at large: is this the design we want? Or, more directly, is the GUC
approach anathema to anyone?

>> E) What is error_logging_tuple_label for? You don't explain/give
>> examples. And how is error_logging_tuple_partition_key used?
>>
> We use the label and partition key in Aster products to easily retrieve
> which COPY command on which partition did generate the bad tuples. By
> default, the tuple_label contains the COPY command that was executed
> (see example on Wiki) and the key contains the index of the tuple in the
> source file (see example on Wiki).

Ah, ok, let me suggest a modified table format then (btw, the table
format you give doesn't match your examples):

CREATE TABLE pg_copy_errors(
session_id TEXT -- session id from PostgreSQL
tupletimestamp TIMESTAMP WITH TIME ZONE,
targettable TEXT, -- table being copied to
errmessage TEXT, -- full error message encountered
sqlerrcode CHAR(5), -- sql error code
statement TEXT, -- the full copy statement which failed
label TEXT, -- optional user-supplied label
rawdata BYTEA, -- the failed data
constraint pg_copy_errors_pk primary key (session_id, tupletimestamp,
targettable)
);

.. the label would be supplied as copy_logging_label.

This would require you to collect the session_id, of course. Or the
pid, or something else. But, the table as you laid it out has no
natural key, which is a problem for debugging. Also, see discussion below.

I still don't understand how error_logging_tuple_partition_key is used.
Please give more explicit examples.

>> G) We should probably have a default for error_logging_table_name, such
>> as pg_copy_errors. Does that table get automatically created if it
>> doesn't exist?
>>
> Yes, as indicated on the wiki the table is created automatically (see
> config variable section).

As long as you use CREATE IF NOT EXISTS, that should work ok.

>> H) Finally, one request of the TODO is some way to halt import after a
>> specified number of bad tuples because it probably means you have the
>> wrong file or wrong table. Do we still want that?
>>
> We can still do that. It can be another GUC variable or an option to
> COPY. If the COPY command fails, everything gets rolled back (data in
> the destination table and error table). That would be harder to
> implement with a file (the rollback part).

With a logging file, you wouldn't worry about rollback. You'd just log
a statement to the file that it was rolled back.

I) Aster's current implementation has the advantage of being able to log
to any user-defined table, giving users the flexibility to log different
COPYs to different tables, or even put them on various tablespaces. Is
that what we want, though? Clearly it would make things simpler for
most (but not all) users to have just a canonical pg_copy_errors table
which was in pg_catalog. It would also presumably remove some code from
the patch (usually a good thing) What do people think?

J) One option I think we need if we're going to support logging to "any
defined logging table" is the option to make that table a temporary table.

>> O) Is this capable of dealing with partitioning by more than one column,
>> or by an expression?
>>
> Yes, we just use a brute force technique where we try all child tables
> 1-by-1 and rely on the existing Postgres constraint checking mechanism
> (no new or duplicated code there).

Sounds disastrous performance-wise. There's no easy way to test the
expression without attempting an actual insert?

--
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-10 22:34:36
Message-ID: 9615.1252622076@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Josh Berkus <josh(at)agliodbs(dot)com> writes:
> Yes, and GUCs allow users to retrofit this approach onto existing
> infrastructure without changing their COPY commands. So there's
> advantages and disadvantages. My question was really for the -hackers
> at large: is this the design we want? Or, more directly, is the GUC
> approach anathema to anyone?

Half a dozen interrelated GUCs to control a single command fairly
screams "bad design" to me; especially the ones that specifically bear
on the command semantics, rather than being performance settings that
you could reasonably have system-wide defaults for. Could we please
look at doing it via COPY options instead?

It might be time to switch COPY over to a more easily extensible
option syntax, such as we just adopted for EXPLAIN.

regards, tom lane


From: David Fetter <david(at)fetter(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-10 23:28:57
Message-ID: 20090910232857.GN20190@fetter.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Sep 10, 2009 at 06:34:36PM -0400, Tom Lane wrote:
> Josh Berkus <josh(at)agliodbs(dot)com> writes:
> > Yes, and GUCs allow users to retrofit this approach onto existing
> > infrastructure without changing their COPY commands. So there's
> > advantages and disadvantages. My question was really for the -hackers
> > at large: is this the design we want? Or, more directly, is the GUC
> > approach anathema to anyone?
>
> Half a dozen interrelated GUCs to control a single command fairly
> screams "bad design" to me; especially the ones that specifically bear
> on the command semantics, rather than being performance settings that
> you could reasonably have system-wide defaults for. Could we please
> look at doing it via COPY options instead?
>
> It might be time to switch COPY over to a more easily extensible
> option syntax, such as we just adopted for EXPLAIN.

+1 :)

Cheers,
David (still working on that windowing bug)
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-10 23:29:27
Message-ID: 4AA98BD7.3010805@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom,

> It might be time to switch COPY over to a more easily extensible
> option syntax, such as we just adopted for EXPLAIN.

+1. And we could maybe have a *single* GUC which was "copy_options" as
a session default.

--
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com


From: Emmanuel Cecchet <manu(at)asterdata(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-10 23:33:32
Message-ID: 4AA98CCC.2010606@asterdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Josh,
> BTW, some of the questions were for -hackers in general to give
> feedback. Don't take just my responses as final "what you have to do";
> other contributors will have opinions, some of which will be more
> informed than mine.
>
Understood.
>>> A) Why would someone want to turn error_logging on, but leave
>>> error_logging_skip_tuples off? The pg_log already logs errors which
>>> copy throws by default.
>>>
>>>
>> When error_logging is on and skip_tuples is off, errors are logged in
>> the error table. If skip_tuples is on, tuples are not logged in the
>> error table.
>>
>
> Although if you're not skipping tuples, presumably only the first error
> is logged, yes? At that point, COPY would stop.
>
No, when error_logging is on, COPY never stops. If skip_tuples is on,
the faulty tuples are just ignored and not logged, if it is off (it does
not skip) it logs every faulty tuple in the error table and continues
until the end. If error_logging is off, COPY stops on the first error
and aborts.
>>> B) As I mentioned earlier, we'll want to provide the option of logging
>>> to a file instead of to a table. That's not a reason to reject this
>>> patch, but probably a TODO for 8.5.
>>>
>>>
>> Ok but what should be the format of that file?
>>
>
> I'd say a CSV version of the table you have is the simplest way to go.
>
Ok.
>>> C) Are we sure we want to handle this via GUCs rather than extensions to
>>> COPY syntax? It seems like fairly often users would want to log
>>> different COPY sources to different tables/files.
>>>
>>>
>> I agree that new COPY options could be easier to use, the implementation
>> is just more complex. However, the labels allows you to select the
>> tuples related to specific COPY commands.
>>
>
> Yes, and GUCs allow users to retrofit this approach onto existing
> infrastructure without changing their COPY commands. So there's
> advantages and disadvantages. My question was really for the -hackers
> at large: is this the design we want? Or, more directly, is the GUC
> approach anathema to anyone?
>
As it is a new feature, users will either have to add the proper set
commands or modify their COPY commands. Both ways look good to me.
>>> E) What is error_logging_tuple_label for? You don't explain/give
>>> examples. And how is error_logging_tuple_partition_key used?
>>>
>>>
>> We use the label and partition key in Aster products to easily retrieve
>> which COPY command on which partition did generate the bad tuples. By
>> default, the tuple_label contains the COPY command that was executed
>> (see example on Wiki) and the key contains the index of the tuple in the
>> source file (see example on Wiki).
>>
>
> Ah, ok, let me suggest a modified table format then (btw, the table
> format you give doesn't match your examples):
>
The order of the columns does not matter, just the name.
> CREATE TABLE pg_copy_errors(
> session_id TEXT -- session id from PostgreSQL
> tupletimestamp TIMESTAMP WITH TIME ZONE,
> targettable TEXT, -- table being copied to
> errmessage TEXT, -- full error message encountered
> sqlerrcode CHAR(5), -- sql error code
> statement TEXT, -- the full copy statement which failed
> label TEXT, -- optional user-supplied label
> rawdata BYTEA, -- the failed data
> constraint pg_copy_errors_pk primary key (session_id, tupletimestamp,
> targettable)
> );
>
> .. the label would be supplied as copy_logging_label.
>
> This would require you to collect the session_id, of course. Or the
> pid, or something else. But, the table as you laid it out has no
> natural key, which is a problem for debugging. Also, see discussion below.
>
> I still don't understand how error_logging_tuple_partition_key is used.
> Please give more explicit examples.
>
I am not really sure why you need a natural key.
By default, the partition_key contains the index of the faulty entry and
label the copy command. This could be your key.
If you look at the example at
http://wiki.postgresql.org/wiki/Error_logging_in_COPY, input_file.txt
has 5 rows out of which only 1 and 5 are correct (2, 3 and 4 are bad).
The key indicates the row that caused the problem (2, 3, 4) and label
contains the full COPY statement. I am not sure to understand what the
pid or session_id would bring in that scenario.

>>> G) We should probably have a default for error_logging_table_name, such
>>> as pg_copy_errors. Does that table get automatically created if it
>>> doesn't exist?
>>>
>>>
>> Yes, as indicated on the wiki the table is created automatically (see
>> config variable section).
>>
>
> As long as you use CREATE IF NOT EXISTS, that should work ok.
>
Actually the code checks if the table exists with a try_heap_openrv and
creates the table if it fails.
>>> H) Finally, one request of the TODO is some way to halt import after a
>>> specified number of bad tuples because it probably means you have the
>>> wrong file or wrong table. Do we still want that?
>>>
>>>
>> We can still do that. It can be another GUC variable or an option to
>> COPY. If the COPY command fails, everything gets rolled back (data in
>> the destination table and error table). That would be harder to
>> implement with a file (the rollback part).
>>
>
> With a logging file, you wouldn't worry about rollback. You'd just log
> a statement to the file that it was rolled back.
>
Ok.
> I) Aster's current implementation has the advantage of being able to log
> to any user-defined table, giving users the flexibility to log different
> COPYs to different tables, or even put them on various tablespaces. Is
> that what we want, though? Clearly it would make things simpler for
> most (but not all) users to have just a canonical pg_copy_errors table
> which was in pg_catalog. It would also presumably remove some code from
> the patch (usually a good thing) What do people think?
>
The code that would be removed would just be the table creation and the
GUC variable for the table name. That would not remove much code.
> J) One option I think we need if we're going to support logging to "any
> defined logging table" is the option to make that table a temporary table.
>
Actually you can create the table beforehand (check unit tests for an
example) and so it is possible to make it a temporary table if you want
(I would just have to test it, I did not try it yet).
>>> O) Is this capable of dealing with partitioning by more than one column,
>>> or by an expression?
>>>
>>>
>> Yes, we just use a brute force technique where we try all child tables
>> 1-by-1 and rely on the existing Postgres constraint checking mechanism
>> (no new or duplicated code there).
>>
>
> Sounds disastrous performance-wise. There's no easy way to test the
> expression without attempting an actual insert?
>
An option that was proposed at PGCon by someone in the audience (sorry I
don't remember who), was to build a query plan and find out from there
which child table should get the tuple. It will be disastrous
performance-wise if you always have misses and need to check a lot of
constraints (we can always build a worst case scenario test). However,
in my tests, even with completely random data, routing is faster than a
script doing sorting in multiple files + manual inserts in each child
table. In practice, even a small cache size usually works very well as
soon as you have enough locality in your source data.
The problem is that faulty tuples will have to check all constraints to
figure out that there is no child table for them. Note that if you have
a deep tree hierarchy, you don't necessarily have to insert at the top
of the tree, you can start anywhere down the tree.
Once we have the full partitioning support (in 8.5?), we can probably
re-use some of their mechanism to route directly the tuple to the right
child.

Thanks for your great feedback,
Emmanuel

--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Emmanuel Cecchet <manu(at)asterdata(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-10 23:44:16
Message-ID: 4AA98F50.5060604@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


> I am not really sure why you need a natural key.

a) because we shouldn't be building any features which teach people bad
db design, and

b) because I will presumably want to purge records from this table
periodically and doing so without a key is likely to result in purging
the wrong records.

> By default, the partition_key contains the index of the faulty entry and
> label the copy command. This could be your key.

Well, you still haven't explained the partition_key to me, so I'm not
quite clear on that. Help?

The reason why I'd like to have a session_id or pid or similar is so
that I can link the copy errors to which backend is erroring in the
other system views or in the pg_log.

Imagine a system where you have multiple network clients doing COPYs; if
one of them starts bugging out and all I have is a tablename, filename
and time, I'm not going to be able to figure out which client is causing
the problems. The reason I mention this case is that I have a client
who has a production application like this right now.

--
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com


From: Emmanuel Cecchet <manu(at)asterdata(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-10 23:54:47
Message-ID: 4AA991C7.5060105@asterdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Josh Berkus wrote:
>> I am not really sure why you need a natural key.
>>
>
> a) because we shouldn't be building any features which teach people bad
> db design, and
>
> b) because I will presumably want to purge records from this table
> periodically and doing so without a key is likely to result in purging
> the wrong records.
>
Agreed, but I am not sure that imposing unilaterally a key is going to
suit everyone.
>> By default, the partition_key contains the index of the faulty entry and
>> label the copy command. This could be your key.
>>
>
> Well, you still haven't explained the partition_key to me, so I'm not
> quite clear on that. Help?
>
Please re-read my previous message for the default behavior. If you look
at the example at http://wiki.postgresql.org/wiki/Error_logging_in_COPY,
input_file.txt has 5 rows out of which only 1 and 5 are correct (2, 3
and 4 are bad). The partition key indicates the row that caused the
problem (2 for row 2, 3 for row 3, ...) and label contains the full COPY
statement.
If you want to know how it is used in the Aster product, we will have to
ask Alex who did implement the feature in the product.

> The reason why I'd like to have a session_id or pid or similar is so
> that I can link the copy errors to which backend is erroring in the
> other system views or in the pg_log.
>
> Imagine a system where you have multiple network clients doing COPYs; if
> one of them starts bugging out and all I have is a tablename, filename
> and time, I'm not going to be able to figure out which client is causing
> the problems. The reason I mention this case is that I have a client
> who has a production application like this right now
All the clients are copying the same file to the same table?
I would imagine that every client processes different files and from the
file names it would be easy to identify the faulty client. I am not sure
how you would use the pid to identify the faulty client more easily?

Emmanuel

--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com


From: Alexander Sennhauser <alex(at)asterdata(dot)com>
To: Emmanuel Cecchet <manu(at)asterdata(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-10 23:55:36
Message-ID: 4AA991F8.3080608@asterdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hackers,

Let me try to give more context on some of the things discussed.
Feedback is appreciated.

Thanks

- A

Emmanuel Cecchet wrote:
> Josh,
>> BTW, some of the questions were for -hackers in general to give
>> feedback. Don't take just my responses as final "what you have to do";
>> other contributors will have opinions, some of which will be more
>> informed than mine.
>>
> Understood.
>>>> A) Why would someone want to turn error_logging on, but leave
>>>> error_logging_skip_tuples off? The pg_log already logs errors which
>>>> copy throws by default.
>>>>
>>>>
>>> When error_logging is on and skip_tuples is off, errors are logged in
>>> the error table. If skip_tuples is on, tuples are not logged in the
>>> error table.
>>>
>> Although if you're not skipping tuples, presumably only the first error
>> is logged, yes? At that point, COPY would stop.
>>
> No, when error_logging is on, COPY never stops. If skip_tuples is on,
> the faulty tuples are just ignored and not logged, if it is off (it does
> not skip) it logs every faulty tuple in the error table and continues
> until the end. If error_logging is off, COPY stops on the first error
> and aborts.
>>>> B) As I mentioned earlier, we'll want to provide the option of logging
>>>> to a file instead of to a table. That's not a reason to reject this
>>>> patch, but probably a TODO for 8.5.
>>>>
>>>>
>>> Ok but what should be the format of that file?
>>>
>> I'd say a CSV version of the table you have is the simplest way to go.
>>
> Ok.
>>>> C) Are we sure we want to handle this via GUCs rather than extensions to
>>>> COPY syntax? It seems like fairly often users would want to log
>>>> different COPY sources to different tables/files.
>>>>
>>>>
>>> I agree that new COPY options could be easier to use, the implementation
>>> is just more complex. However, the labels allows you to select the
>>> tuples related to specific COPY commands.
>>>
>> Yes, and GUCs allow users to retrofit this approach onto existing
>> infrastructure without changing their COPY commands. So there's
>> advantages and disadvantages. My question was really for the -hackers
>> at large: is this the design we want? Or, more directly, is the GUC
>> approach anathema to anyone?
>>
> As it is a new feature, users will either have to add the proper set
> commands or modify their COPY commands. Both ways look good to me.

The implementation of this feature at Aster required us to set options
at the PostgreSQL backend according to the user input, which comes from
an enhanced COPY command (just as suggested above). I agree that from a
user's perspective it makes more sense to do something like this:

COPY foo from '/tmp/foo.txt' with csv LOG ERRORS INTO my_error_table;

>>>> E) What is error_logging_tuple_label for? You don't explain/give
>>>> examples. And how is error_logging_tuple_partition_key used?
>>>>

We populate this field with a session id coming from our system. This
will allow to identify malformed tuples coming from different sessions.
In the case of PostgreSQL we can use a similar field (e.g. transaction
id or session id).

>>>>
>>> We use the label and partition key in Aster products to easily retrieve
>>> which COPY command on which partition did generate the bad tuples. By
>>> default, the tuple_label contains the COPY command that was executed
>>> (see example on Wiki) and the key contains the index of the tuple in the
>>> source file (see example on Wiki).
>>>
>> Ah, ok, let me suggest a modified table format then (btw, the table
>> format you give doesn't match your examples):
>>
> The order of the columns does not matter, just the name.
>> CREATE TABLE pg_copy_errors(
>> session_id TEXT -- session id from PostgreSQL
>> tupletimestamp TIMESTAMP WITH TIME ZONE,
>> targettable TEXT, -- table being copied to
>> errmessage TEXT, -- full error message encountered
>> sqlerrcode CHAR(5), -- sql error code
>> statement TEXT, -- the full copy statement which failed
>> label TEXT, -- optional user-supplied label
>> rawdata BYTEA, -- the failed data
>> constraint pg_copy_errors_pk primary key (session_id, tupletimestamp,
>> targettable)
>> );
>>
>> .. the label would be supplied as copy_logging_label.
>>
>> This would require you to collect the session_id, of course. Or the
>> pid, or something else. But, the table as you laid it out has no
>> natural key, which is a problem for debugging. Also, see discussion below.
>>
>> I still don't understand how error_logging_tuple_partition_key is used.
>> Please give more explicit examples.
>>
> I am not really sure why you need a natural key.
> By default, the partition_key contains the index of the faulty entry and
> label the copy command. This could be your key.
> If you look at the example at
> http://wiki.postgresql.org/wiki/Error_logging_in_COPY, input_file.txt
> has 5 rows out of which only 1 and 5 are correct (2, 3 and 4 are bad).
> The key indicates the row that caused the problem (2, 3, 4) and label
> contains the full COPY statement. I am not sure to understand what the
> pid or session_id would bring in that scenario.
>

Again, the key field makes a lot of sense in the case where you have
multiple PostgreSQL instances acting as different "shards" in a cluster
setup. The key column would then contain which shard logged the error to
it's local table. We could just drop this column in the single
PostgreSQL case.

>>>> G) We should probably have a default for error_logging_table_name, such
>>>> as pg_copy_errors. Does that table get automatically created if it
>>>> doesn't exist?
>>>>
>>>>
>>> Yes, as indicated on the wiki the table is created automatically (see
>>> config variable section).
>>>
>> As long as you use CREATE IF NOT EXISTS, that should work ok.
>>
> Actually the code checks if the table exists with a try_heap_openrv and
> creates the table if it fails.
>>>> H) Finally, one request of the TODO is some way to halt import after a
>>>> specified number of bad tuples because it probably means you have the
>>>> wrong file or wrong table. Do we still want that?
>>>>
>>>>
>>> We can still do that. It can be another GUC variable or an option to
>>> COPY. If the COPY command fails, everything gets rolled back (data in
>>> the destination table and error table). That would be harder to
>>> implement with a file (the rollback part).
>>>
>> With a logging file, you wouldn't worry about rollback. You'd just log
>> a statement to the file that it was rolled back.
>>
> Ok.

Where would this file live then? Since you are doing all the error
logging at the backend, I am assuming it would need to live "on the
server node". As opposed to a file which lies in the same directory from
where the original file was loaded via, let's say, psql.

>> I) Aster's current implementation has the advantage of being able to log
>> to any user-defined table, giving users the flexibility to log different
>> COPYs to different tables, or even put them on various tablespaces. Is
>> that what we want, though? Clearly it would make things simpler for
>> most (but not all) users to have just a canonical pg_copy_errors table
>> which was in pg_catalog. It would also presumably remove some code from
>> the patch (usually a good thing) What do people think?
>>
> The code that would be removed would just be the table creation and the
> GUC variable for the table name. That would not remove much code.
>> J) One option I think we need if we're going to support logging to "any
>> defined logging table" is the option to make that table a temporary table.
>>
> Actually you can create the table beforehand (check unit tests for an
> example) and so it is possible to make it a temporary table if you want
> (I would just have to test it, I did not try it yet).
>>>> O) Is this capable of dealing with partitioning by more than one column,
>>>> or by an expression?
>>>>
>>>>
>>> Yes, we just use a brute force technique where we try all child tables
>>> 1-by-1 and rely on the existing Postgres constraint checking mechanism
>>> (no new or duplicated code there).
>>>
>> Sounds disastrous performance-wise. There's no easy way to test the
>> expression without attempting an actual insert?
>>
> An option that was proposed at PGCon by someone in the audience (sorry I
> don't remember who), was to build a query plan and find out from there
> which child table should get the tuple. It will be disastrous
> performance-wise if you always have misses and need to check a lot of
> constraints (we can always build a worst case scenario test). However,
> in my tests, even with completely random data, routing is faster than a
> script doing sorting in multiple files + manual inserts in each child
> table. In practice, even a small cache size usually works very well as
> soon as you have enough locality in your source data.
> The problem is that faulty tuples will have to check all constraints to
> figure out that there is no child table for them. Note that if you have
> a deep tree hierarchy, you don't necessarily have to insert at the top
> of the tree, you can start anywhere down the tree.
> Once we have the full partitioning support (in 8.5?), we can probably
> re-use some of their mechanism to route directly the tuple to the right
> child.
>
>
> Thanks for your great feedback,
> Emmanuel
>


From: Emmanuel Cecchet <manu(at)asterdata(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-10 23:58:01
Message-ID: 4AA99289.2060806@asterdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> It might be time to switch COPY over to a more easily extensible
> option syntax, such as we just adopted for EXPLAIN.
>
Could you point me to the thread detailing the extensible option syntax
for EXPLAIN? Is that new to 8.5?

Thanks
Emmanuel

--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-11 02:53:27
Message-ID: 603c8f070909101953y44c262ffn4ea744d235054ab3@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Sep 10, 2009 at 6:34 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Josh Berkus <josh(at)agliodbs(dot)com> writes:
>> Yes, and GUCs allow users to retrofit this approach onto existing
>> infrastructure without changing their COPY commands.  So there's
>> advantages and disadvantages.  My question was really for the -hackers
>> at large: is this the design we want?  Or, more directly, is the GUC
>> approach anathema to anyone?
>
> Half a dozen interrelated GUCs to control a single command fairly
> screams "bad design" to me; especially the ones that specifically bear
> on the command semantics, rather than being performance settings that
> you could reasonably have system-wide defaults for.  Could we please
> look at doing it via COPY options instead?
>
> It might be time to switch COPY over to a more easily extensible
> option syntax, such as we just adopted for EXPLAIN.

I like this idea, perhaps not surprisingly (for those not following at
home: that was my patch). Unfortunately, it looks to me like there is
no way to do this without overhauling the syntax. If the existing
syntax required a comma between options (i.e. "copy blah to stdout
binary, oids" rather than "copy to stdout binary oids", this would be
pretty straightforward; but it doesn't even allow one).

I wonder if we should consider allowing COPY options to be
comma-separated beginning with 8.5, and then require it in 8.6. Other
options include continuing to support the old syntax for the existing
options, but allowing some new syntax as well and requiring its use
for all new options (this is what we did with EXPLAIN, but there were
only two pre-existing options there), and just changing the syntax
incompatibly and telling users to suck it up. But I'm not sure I like
either of those choices.

...Robert


From: Emmanuel Cecchet <manu(at)asterdata(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-11 13:16:19
Message-ID: 4AAA4DA3.4010006@asterdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi Robert,

> I like this idea, perhaps not surprisingly (for those not following at
> home: that was my patch). Unfortunately, it looks to me like there is
> no way to do this without overhauling the syntax. If the existing
> syntax required a comma between options (i.e. "copy blah to stdout
> binary, oids" rather than "copy to stdout binary oids", this would be
> pretty straightforward; but it doesn't even allow one).
>
Well some options like CSV FORCE ... take a comma separated list of
columns. This would require all options to become reserved keywords or
force parenthesis around option parameters.
> I wonder if we should consider allowing COPY options to be
> comma-separated beginning with 8.5, and then require it in 8.6. Other
> options include continuing to support the old syntax for the existing
> options, but allowing some new syntax as well and requiring its use
> for all new options (this is what we did with EXPLAIN, but there were
> only two pre-existing options there), and just changing the syntax
> incompatibly and telling users to suck it up. But I'm not sure I like
> either of those choices.
>
We could keep the current syntax for backward compatibility only (can be
dropped in a future release) and have a new syntax (starting in 8.5). To
avoid confusion between both, we could just replace WITH with something
else (or just drop it) to indicate that this is the new syntax.

The new syntax could look like:

COPY /tablename/ [ ( /column/ [, ...] ) ]
FROM { '/filename/' | STDIN }
[ [, BINARY ]
[, OIDS ]
[, DELIMITER [ AS ] '/delimiter/' ]
[, NULL [ AS ] '/null string/' ]
[, CSV [ HEADER ]
[ QUOTE [ AS ] '/quote/' ]
[ ESCAPE [ AS ] '/escape/' ]
[ FORCE NOT NULL (/column/ [, ...]) ]
[, ERRORS { SKIP |
LOG INTO { tablename | 'filename' }
[ LABEL label_name ]
[ KEY key_name ]
[ MAX ERRORS /count/ ] } ]

Is this what you had in mind?

Emmanuel

--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Emmanuel Cecchet <manu(at)asterdata(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-11 14:18:51
Message-ID: 19938.1252678731@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Emmanuel Cecchet <manu(at)asterdata(dot)com> writes:
> The new syntax could look like:

> COPY /tablename/ [ ( /column/ [, ...] ) ]
> FROM { '/filename/' | STDIN }
> [ [, BINARY ]
> [, OIDS ]
> [, DELIMITER [ AS ] '/delimiter/' ]
> [, NULL [ AS ] '/null string/' ]
> [, CSV [ HEADER ]
> [ QUOTE [ AS ] '/quote/' ]
> [ ESCAPE [ AS ] '/escape/' ]
> [ FORCE NOT NULL (/column/ [, ...]) ]
> [, ERRORS { SKIP |
> LOG INTO { tablename | 'filename' }
> [ LABEL label_name ]
> [ KEY key_name ]
> [ MAX ERRORS /count/ ] } ]

> Is this what you had in mind?

No. because that doesn't do a darn thing to make the option set less
hard-wired into the syntax. I was thinking of a strict keyword/value
format with non-wired-in keywords ... and only *one* keyword per value.
See EXPLAIN.

regards, tom lane


From: Emmanuel Cecchet <manu(at)asterdata(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-11 14:53:33
Message-ID: 4AAA646D.5080103@asterdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom,

I looked at EXPLAIN
(http://www.postgresql.org/docs/current/interactive/sql-explain.html)
and there is not a single line of what you are talking about.
And the current syntax is just EXPLAIN [ ANALYZE ] [ VERBOSE ] /statement
/
If I try to decrypt what you said, you are looking at something like

COPY /tablename/ [ ( /column/ [, ...] ) ]
FROM { '/filename/' | STDIN }
[option1=value[,...]]

That would give something like:
COPY foo FROM 'input.txt' binary=on, oids=on, errors=skip, max_errors=10

If this is not what you are thinking, please provide an example.

Emmanuel
/

/
> Emmanuel Cecchet <manu(at)asterdata(dot)com> writes:
>
>> The new syntax could look like:
>>
>
>
>> COPY /tablename/ [ ( /column/ [, ...] ) ]
>> FROM { '/filename/' | STDIN }
>> [ [, BINARY ]
>> [, OIDS ]
>> [, DELIMITER [ AS ] '/delimiter/' ]
>> [, NULL [ AS ] '/null string/' ]
>> [, CSV [ HEADER ]
>> [ QUOTE [ AS ] '/quote/' ]
>> [ ESCAPE [ AS ] '/escape/' ]
>> [ FORCE NOT NULL (/column/ [, ...]) ]
>> [, ERRORS { SKIP |
>> LOG INTO { tablename | 'filename' }
>> [ LABEL label_name ]
>> [ KEY key_name ]
>> [ MAX ERRORS /count/ ] } ]
>>
>
>
>> Is this what you had in mind?
>>
>
> No. because that doesn't do a darn thing to make the option set less
> hard-wired into the syntax. I was thinking of a strict keyword/value
> format with non-wired-in keywords ... and only *one* keyword per value.
> See EXPLAIN.
>
> regards, tom lane
>

--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Emmanuel Cecchet <manu(at)asterdata(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-11 15:04:06
Message-ID: 603c8f070909110804r424ea05fhfe3f7852146756ad@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Sep 11, 2009 at 10:53 AM, Emmanuel Cecchet <manu(at)asterdata(dot)com> wrote:
> Tom,
>
> I looked at EXPLAIN
> (http://www.postgresql.org/docs/current/interactive/sql-explain.html) and
> there is not a single line of what you are talking about.
> And the current syntax is just EXPLAIN [ ANALYZE ] [ VERBOSE ] /statement
> /

http://developer.postgresql.org/pgdocs/postgres/sql-explain.html

Or look at your CVS/git checkout.

...Robert


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Emmanuel Cecchet <manu(at)asterdata(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-11 15:11:47
Message-ID: 603c8f070909110811p14b8912td9b8b7e66d3356e6@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Sep 11, 2009 at 10:18 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Emmanuel Cecchet <manu(at)asterdata(dot)com> writes:
>> The new syntax could look like:
>
>> COPY /tablename/ [ ( /column/ [, ...] ) ]
>>     FROM { '/filename/' | STDIN }
>>     [ [, BINARY ]
>>       [, OIDS ]
>>       [, DELIMITER [ AS ] '/delimiter/' ]
>>       [, NULL [ AS ] '/null string/' ]
>>       [, CSV [ HEADER ]
>>              [ QUOTE [ AS ] '/quote/' ]
>>              [ ESCAPE [ AS ] '/escape/' ]
>>              [ FORCE NOT NULL (/column/ [, ...]) ]
>>       [, ERRORS { SKIP |
>>                   LOG INTO { tablename | 'filename' }
>>                     [ LABEL label_name ]
>>                     [ KEY key_name ]
>>                     [ MAX ERRORS /count/ ] } ]
>
>> Is this what you had in mind?
>
> No. because that doesn't do a darn thing to make the option set less
> hard-wired into the syntax. I was thinking of a strict keyword/value
> format with non-wired-in keywords ... and only *one* keyword per value.
> See EXPLAIN.

I was thinking something like:

COPY tablename [ ( column [, ...] ) ] FROM { 'filename' | STDIN }
[WITH] [option [, ...]]

Where:

option := ColId [Sconst] | FORCE NOT NULL (column [,...])

I don't see any reasonable way to sandwhich the FORCE NOT NULL syntax
into a keyword/value notation.

...Robert


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Emmanuel Cecchet <manu(at)asterdata(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-11 15:13:29
Message-ID: 20680.1252682009@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> Or look at your CVS/git checkout.

The important point is to look at the grammar, which doesn't have any
idea what the specific options are in the list. (Well, okay, it had
to have special cases for ANALYZE and VERBOSE because those are reserved
words :-(. But future additions will not need to touch the grammar.
In the case of COPY that also means not having to touch psql \copy.)

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Emmanuel Cecchet <manu(at)asterdata(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-11 15:26:16
Message-ID: 20849.1252682776@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> I don't see any reasonable way to sandwhich the FORCE NOT NULL syntax
> into a keyword/value notation.

Any number of ways, for example "force_not_null = true" or multiple
occurrences of "force_not_null = column_name". Andrew was on the verge
of admitting we don't need that option anymore anyway ;-), so I don't
think we should allow it to drive an exception to the simplified syntax.

regards, tom lane


From: Pierre Frédéric Caillaud <lists(at)peufeu(dot)com>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Emmanuel Cecchet" <manu(at)asterdata(dot)com>, "Josh Berkus" <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-11 15:32:40
Message-ID: op.uz3msqm9cke6l8@soyouz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> I was thinking something like:
>
> COPY tablename [ ( column [, ...] ) ] FROM { 'filename' | STDIN }
> [WITH] [option [, ...]]
>
> Where:
>
> option := ColId [Sconst] | FORCE NOT NULL (column [,...])
>
> I don't see any reasonable way to sandwhich the FORCE NOT NULL syntax
> into a keyword/value notation.

Postgres has a hstore data type which seems well suited for passing
key/value option pairs...
Why another syntax to remember, another parser to code, when almost
everything is already there ?

Think about plpgsql code which generates some SQL COPY command string,
then this is parsed and executed... wouldn't it be a lot simpler to just
manipulate parameters in a hstore ?...


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Emmanuel Cecchet <manu(at)asterdata(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-11 15:33:38
Message-ID: 603c8f070909110833s443dc263g7447bcf181cc634b@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Sep 11, 2009 at 11:26 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> I don't see any reasonable way to sandwhich the FORCE NOT NULL syntax
>> into a keyword/value notation.
>
> Any number of ways, for example "force_not_null = true" or multiple
> occurrences of "force_not_null = column_name".  Andrew was on the verge
> of admitting we don't need that option anymore anyway ;-), so I don't
> think we should allow it to drive an exception to the simplified syntax.

While I'm at least as big a fan of generic options as the next person,
syntax is cheap. I don't see any reason to get worked up about one
exception to a generic options syntax. If the feature is useless, of
course we can rip it out, but that's a separate discussion. For what
it's worth, I think your proposed alternative is ugly and an abuse of
the idea of keyword-value pairs. In the EXPLAIN-world, a later value
for the same option overrides a previous assignment earlier in the
list, and I am in favor of sticking with that approach.

...Robert


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Pierre Frédéric Caillaud <lists(at)peufeu(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-11 15:34:04
Message-ID: 603c8f070909110834l4cc950baj108684d306f3ee0@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2009/9/11 Pierre Frédéric Caillaud <lists(at)peufeu(dot)com>:
>> I was thinking something like:
>>
>> COPY tablename [ ( column [, ...] ) ] FROM { 'filename' | STDIN }
>> [WITH] [option [, ...]]
>>
>> Where:
>>
>> option := ColId [Sconst] | FORCE NOT NULL (column [,...])
>>
>> I don't see any reasonable way to sandwhich the FORCE NOT NULL syntax
>> into a keyword/value notation.
>
>        Postgres has a hstore data type which seems well suited for passing
> key/value option pairs...
>        Why another syntax to remember, another parser to code, when almost
> everything is already there ?
>
>        Think about plpgsql code which generates some SQL COPY command
> string, then this is parsed and executed... wouldn't it be a lot simpler to
> just manipulate parameters in a hstore ?...

I doubt it very much.

...Robert


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Emmanuel Cecchet <manu(at)asterdata(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-11 15:37:33
Message-ID: 21068.1252683453@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> While I'm at least as big a fan of generic options as the next person,
> syntax is cheap.

No, it *isn't* cheap. Particularly not for COPY, for which we need an
ad-hoc parser in psql.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-11 15:38:00
Message-ID: 4AAA6ED8.5090309@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>
>> I don't see any reasonable way to sandwhich the FORCE NOT NULL syntax
>> into a keyword/value notation.
>>
>
> Any number of ways, for example "force_not_null = true" or multiple
> occurrences of "force_not_null = column_name". Andrew was on the verge
> of admitting we don't need that option anymore anyway ;-), so I don't
> think we should allow it to drive an exception to the simplified syntax.
>
>
>

We won't need it if we can use an expression on the source, like

coalesce(t[4],"")

that gets applied on the way in. Until then it is useful, if ugly.

cheers

andrew


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Pierre Frédéric Caillaud <lists(at)peufeu(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-11 15:49:21
Message-ID: 4AAA7181.7050505@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas wrote:
>> Postgres has a hstore data type which seems well suited for passing
>> key/value option pairs...
>>
>>
>

Quite apart from any other reason, it is not builtin, so there is no way
that any builtin thing can use it.

cheers

andrew


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Emmanuel Cecchet <manu(at)asterdata(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-11 16:09:26
Message-ID: 603c8f070909110909h7d94ac02rd60433cef0385880@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Sep 11, 2009 at 11:37 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> While I'm at least as big a fan of generic options as the next person,
>> syntax is cheap.
>
> No, it *isn't* cheap.  Particularly not for COPY, for which we need an
> ad-hoc parser in psql.

:-( That's definitely a complication, although it seems to me it will
need substantial work no matter what option we decide on.

...Robert


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Emmanuel Cecchet <manu(at)asterdata(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-11 16:28:03
Message-ID: 21841.1252686483@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Fri, Sep 11, 2009 at 11:37 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> No, it *isn't* cheap. Particularly not for COPY, for which we need an
>> ad-hoc parser in psql.

> :-( That's definitely a complication, although it seems to me it will
> need substantial work no matter what option we decide on.

True :-(. But I'm hoping we only have to revise it once more, not every
time somebody thinks of another COPY option.

regards, tom lane


From: David Fetter <david(at)fetter(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-11 17:42:05
Message-ID: 20090911174205.GA12016@fetter.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Sep 11, 2009 at 11:37:33AM -0400, Tom Lane wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> > While I'm at least as big a fan of generic options as the next
> > person, syntax is cheap.
>
> No, it *isn't* cheap. Particularly not for COPY, for which we need
> an ad-hoc parser in psql.

Is there some way we can use the regular parser, as plpgsql is moving
to, or is there just too much wound into the server?

Cheers,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: David Fetter <david(at)fetter(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-11 17:50:35
Message-ID: 23261.1252691435@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

David Fetter <david(at)fetter(dot)org> writes:
> On Fri, Sep 11, 2009 at 11:37:33AM -0400, Tom Lane wrote:
>> No, it *isn't* cheap. Particularly not for COPY, for which we need
>> an ad-hoc parser in psql.

> Is there some way we can use the regular parser, as plpgsql is moving
> to, or is there just too much wound into the server?

The reason that's an option for plpgsql is it's running inside the
server. The amount of baggage gram.y would carry along is daunting.

regards, tom lane


From: Emmanuel Cecchet <manu(at)asterdata(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-11 18:55:43
Message-ID: 4AAA9D2F.7070501@asterdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>
>> Or look at your CVS/git checkout.
>>
>
> The important point is to look at the grammar, which doesn't have any
> idea what the specific options are in the list. (Well, okay, it had
> to have special cases for ANALYZE and VERBOSE because those are reserved
> words :-(. But future additions will not need to touch the grammar.
> In the case of COPY that also means not having to touch psql \copy.)
>
I understand the convenience from a developer perspective but I wonder
how this improves the user experience. If options are not explicitly
part of the grammar:
- you cannot do automated testing based on the grammar
- it seems that it will be harder to document
- it still requires the same amount of work in psql and 3rd party tools
to support command-completion and so on?
- why only COPY or EXPLAIN would use that syntax? what is the good limit
between an option and something that is part of the grammar?

It looks like passing the current GUC variables as options to COPY.
Isn't there a design problem with the parser if it is so hard to add a
new option to a command? In all cases, both the client and the server
will have to support the new extension (and it will have to be
documented) so it should not make a big difference whether it is
explicitly part of the command grammar or a set of generic options.

manu

--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Emmanuel Cecchet <manu(at)asterdata(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-11 19:08:31
Message-ID: 8865.1252696111@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Emmanuel Cecchet <manu(at)asterdata(dot)com> writes:
> Tom Lane wrote:
>> The important point is to look at the grammar, which doesn't have any
>> idea what the specific options are in the list.

> I understand the convenience from a developer perspective but I wonder
> how this improves the user experience.

It's all in the eye of the beholder I suppose, but I don't find random
pseudo-English phrases to be particularly great to remember or work with
either. The existing COPY syntax is a complete mess from a user's
viewpoint already, IMO.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Emmanuel Cecchet <manu(at)asterdata(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-11 19:44:52
Message-ID: 603c8f070909111244n54892c59qd6fca9097dad3568@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Sep 11, 2009 at 12:28 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Fri, Sep 11, 2009 at 11:37 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> No, it *isn't* cheap.  Particularly not for COPY, for which we need an
>>> ad-hoc parser in psql.
>
>> :-(  That's definitely a complication, although it seems to me it will
>> need substantial work no matter what option we decide on.
>
> True :-(.  But I'm hoping we only have to revise it once more, not every
> time somebody thinks of another COPY option.

Yes, I completely agree. But a special case for one existing option
won't have that result, so long as we're resolved not to accept any
more (and perhaps eventually to remove the special case once we're
confident the functionality isn't needed any longer).

Another approach would be to generalize what is allowable as an
optional parameter to include a parenthesized column list, but I don't
really think that has a lot to recommend it.

...Robert


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Emmanuel Cecchet <manu(at)asterdata(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-11 20:02:33
Message-ID: 9919.1252699353@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> Another approach would be to generalize what is allowable as an
> optional parameter to include a parenthesized column list, but I don't
> really think that has a lot to recommend it.

Well, maybe it's worth doing. If you believe that somebody might think
of a new per-column COPY behavior in the future, then the same issue is
going to come up again.

regards, tom lane


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Pierre Frédéric Caillaud <lists(at)peufeu(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-11 20:56:11
Message-ID: 20090911205611.GN17756@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

* Andrew Dunstan (andrew(at)dunslane(dot)net) wrote:
> Robert Haas wrote:
>>> Postgres has a hstore data type which seems well suited for passing
>>> key/value option pairs...
>
> Quite apart from any other reason, it is not builtin, so there is no way
> that any builtin thing can use it.

Clearly, that's fixable.. I think it's an interesting concept, but I
don't know that I'd advocate it (using hstore for this in some way).

Thanks,

Stephen


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Emmanuel Cecchet <manu(at)asterdata(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-11 21:07:15
Message-ID: 603c8f070909111407x5395a7f2mf7cdb5666fd04e2d@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Sep 11, 2009 at 4:02 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> Another approach would be to generalize what is allowable as an
>> optional parameter to include a parenthesized column list, but I don't
>> really think that has a lot to recommend it.
>
> Well, maybe it's worth doing.  If you believe that somebody might think
> of a new per-column COPY behavior in the future, then the same issue is
> going to come up again.

I can't immediately think of one, but I wouldn't bet against someone
else dreaming one up.

The biggest problem I have with this change is that it's going to
massively break anyone who is using the existing COPY syntax. Really
simple examples might be OK (like if they're using 0 or 1 options),
but more complex things are going to just break. How much do we care
about that?

...Robert


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Pierre Frédéric Caillaud <lists(at)peufeu(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-11 21:21:12
Message-ID: 603c8f070909111421x1858a05ew542eccc5b50b5b33@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2009/9/11 Stephen Frost <sfrost(at)snowman(dot)net>:
>>>>        Postgres has a hstore data type which seems well suited for passing
>>>> key/value option pairs...
>>
>> Quite apart from any other reason, it is not builtin, so there is no way
>> that any builtin thing can use it.
>
> Clearly, that's fixable..  I think it's an interesting concept, but I
> don't know that I'd advocate it (using hstore for this in some way).

Unless I'm missing something, which is possible, this whole line of
conversation is based on a misunderstanding. Data types, like hstore,
are things that have input/output functions, index methods, and
on-disk representations. In-memory data structures require none of
these things, and can use techniques not suitable for on-disk
representations, such as pointers. The parser already has an object
called a DefElem which is well-suited for exactly the kind of option
handling we're talking about here, and hstore would not be, not only
because it's the wrong kind of object (a data type rather than an
in-memory data structure), but because a DefElem can do things that
hstore can't, like store as the associated value a list of parse
nodes.

The original reference to hstore was a suggestion that it might be
possible to pass an hstore argument to COPY rather than having to
build up a command string and pass it to EXECUTE. That may or may not
be a useful innovation - personally, I tend to think not - but it
seems to me that it would require COPY to execute an arbitrary
subquery and use the results as options. We have no other commands
that work that way to my knowledge, but beyond that, Pierre Frédéric
Caillaud seemed to be suggesting this would be an "easier" way to
implement an options syntax. Integrating hstore into core and then
making COPY able to execute a subquery to get its options is certainly
not easier than a straightforward grammar modification; it's taking a
small project and turning it into several big ones.

...Robert


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-11 21:27:19
Message-ID: alpine.GSO.2.01.0909111651530.7278@westnet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 11 Sep 2009, Tom Lane wrote:

> If you believe that somebody might think of a new per-column COPY
> behavior in the future, then the same issue is going to come up again.

While Andrew may have given up on a quick hack to work around his recent
request, I don't have that luxury. We've already had to add two new
behaviors here to COPY in our version and I expect more in the future.
The performance of every path to get data into the database besides COPY
is too miserable for us to use anything else, and the current
inflexibility makes it useless for anything but the cleanest input data.

The full set of new behavior here I'd like to see allows adjusting:

-Accept or reject rows with extra columns?
-Accept or reject rows that are missing columns at the end?
--Fill them with the default for the column (if available) or NULL?
-Save rejected rows?
--To a single system table?
--To a user-defined table?
--To the database logs?

The user-defined table for rejects is obviously exclusive of the system
one, either of those would be fine from my perspective.

I wasn't really pleased with the "if it's not the most general solution
possible we're not interested" tone of Andrew's other COPY-change thread
this week. I don't think there's *that* many common requests here that
they can't all be handled by specific implementations, and the scope creep
of launching into a general framework for adding them is just going to
lead to nothing useful getting committed. If you want something really
complicated, drop into a PL-based solution. The stuff I list above I see
regular requests for at *every* PG installation I've ever been involved
in, and it would be fantastic if they were available out of the box.

But I think it's quite reasonable to say the COPY syntax needs to be
overhauled to handle all these. The two changes we've made at Truviso
both use GUCs to control their behavior, and I'm guessing Aster did that
too for the same reasons we did: it's easier to do and makes for cleaner
upstream merges. That approach doesn't really scale well though to many
options, and when considered for core the merge concerns obviously go
away. (The main reason I haven't pushed for us to submit our
customizations here is that I know perfectly well the GUC-based UI isn't
acceptable, but I haven't been able to get a better one done yet)

This auto-partioning stuff is interesting if the INSERT performance of it
can be made reasonable. I think Emmanuel is too new to the community
process here to realize that there's little hope of those getting
committed or even reviewed together. If I were reviewing this I'd just
kick it back as "separate these cleanly into separate patches where the
partitioning one depends on the logging one" before even starting to look
at the code, it's too much stuff to consume properly in one gulp.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Emmanuel Cecchet <manu(at)asterdata(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-11 21:32:42
Message-ID: 11527.1252704762@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> The biggest problem I have with this change is that it's going to
> massively break anyone who is using the existing COPY syntax.

Why? We'd certainly still support the old syntax for existing options,
just as we did with EXPLAIN.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Emmanuel Cecchet <manu(at)asterdata(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-11 21:45:07
Message-ID: 603c8f070909111445y3862fc56yc9354d4c9ec7d371@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Sep 11, 2009 at 5:32 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> The biggest problem I have with this change is that it's going to
>> massively break anyone who is using the existing COPY syntax.
>
> Why?  We'd certainly still support the old syntax for existing options,
> just as we did with EXPLAIN.

None of the syntax proposals upthread had that property, which doesn't
mean we can't do it. However, we'd need some way to differentiate the
old syntax from the new one. I guess we could throw an unnecessary set
of parentheses around the option list (blech), but you have to be able
to tell from the first token which kind of list you're reading if you
want to support both sets of syntax.

...Robert


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Emmanuel Cecchet <manu(at)asterdata(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-11 21:52:50
Message-ID: 11860.1252705970@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Fri, Sep 11, 2009 at 5:32 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Why? We'd certainly still support the old syntax for existing options,
>> just as we did with EXPLAIN.

> None of the syntax proposals upthread had that property, which doesn't
> mean we can't do it. However, we'd need some way to differentiate the
> old syntax from the new one. I guess we could throw an unnecessary set
> of parentheses around the option list (blech), but you have to be able
> to tell from the first token which kind of list you're reading if you
> want to support both sets of syntax.

No, you just have to be able to tell it before the first difference in
grammar reduction path. If we did the syntax as keyword = value, for
instance, I believe the first equal sign would be a sufficient cue for
bison.

Not that parentheses would be such a terrible thing, especially if your
thoughts are leaning towards making COPY-embedded-in-SELECT be special
syntax rather than trying to force it into SRF syntax.

regards, tom lane


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: COPY enhancements
Date: 2009-09-11 22:04:06
Message-ID: 4AAAC956.10908@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg,

> The performance of every path to get data into the database besides COPY
> is too miserable for us to use anything else, and the current
> inflexibility makes it useless for anything but the cleanest input data.

One potential issue we're facing down this road is that current COPY has
a dual purpose: for database restore, and for importing and exporting
data. At some point, we may want to separate those two behaviors,
because we'll be adding bells and fringes to import/export which slow
down overall performance or add bugs.

> The user-defined table for rejects is obviously exclusive of the system
> one, either of those would be fine from my perspective.

I've been thinking about it, and can't come up with a really strong case
for wanting a user-defined table if we settle the issue of having a
strong key for pg_copy_errors. Do you have one?

> I wasn't really pleased with the "if it's not the most general solution
> possible we're not interested" tone of Andrew's other COPY-change thread
> this week.

As someone who uses (and abuses) COPY constantly, I didn't leap at
Andrew's suggestion either because it wasn't *obviously* generally
applicable. We don't want to accept patches which are designed only to
solve the specific problems faced by one user. So for a feature
suggestion as specific as Andrew's, it's worth discussion ... out of
which came some interesting ideas, like copy to TEXT[].

Certainly we're not the project to add "quick hacks" where we can do better.

After some thought, I think that Andrew's feature *is* generally
applicable, if done as IGNORE COLUMN COUNT (or, more likely,
column_count=ignore). I can think of a lot of data sets where column
count is jagged and you want to do ELT instead of ETL. But I had to
give it some thought; as initially presented, the feature seemed very
single-user-specific.

> I don't think there's *that* many common requests here that
> they can't all be handled by specific implementations,

I disagree. That way lies maintenance hell.

> and the scope
> creep of launching into a general framework for adding them is just
> going to lead to nothing useful getting committed.

As opposed to Tom, Peter and Heikki vetoing things because the feature
gain doesn't justify the maintnenance burden? That's your real choice.
Adding a framework for manageable syntax extensions means that we can
be more liberal about what we justify as an extension.

There is a database which allows unrestricted addition of ah-hoc
features. It's called MySQL. They have double the code lines count we
do, and around 100x the outstanding bugs.

> If you want
> something really complicated, drop into a PL-based solution. The stuff
> I list above I see regular requests for at *every* PG installation I've
> ever been involved in, and it would be fantastic if they were available
> out of the box.

I don't think that anyone is talking about not adding this to core.
It's just a question of how we add it. In fact, it's mostly a question
of syntax.

> obviously go away. (The main reason I haven't pushed for us to submit
> our customizations here is that I know perfectly well the GUC-based UI
> isn't acceptable, but I haven't been able to get a better one done yet)

Well, now you can help Aster. ;-)

> If I were reviewing this I'd just
> kick it back as "separate these cleanly into separate patches where the
> partitioning one depends on the logging one" before even starting to
> look at the code, it's too much stuff to consume properly in one gulp.

Well, Bruce was supposed to be helping them submit it. And why *aren't*
you reviewing it?

--
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: COPY enhancements
Date: 2009-09-11 22:13:17
Message-ID: 4AAACB7D.6080708@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Josh Berkus wrote:
> Greg,
>
>
>> The performance of every path to get data into the database besides COPY
>> is too miserable for us to use anything else, and the current
>> inflexibility makes it useless for anything but the cleanest input data.
>>
>
> One potential issue we're facing down this road is that current COPY has
> a dual purpose: for database restore, and for importing and exporting
> data. At some point, we may want to separate those two behaviors,
> because we'll be adding bells and fringes to import/export which slow
> down overall performance or add bugs.
>
>

Nothing that has been proposed will slow down existing behaviour AFAIK.
The new behaviour will be slower in most cases, but that's a different
matter.

cheers

andrew


From: Emmanuel Cecchet <manu(at)asterdata(dot)com>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-11 22:35:46
Message-ID: 4AAAD0C2.2060800@asterdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Smith wrote:
> The full set of new behavior here I'd like to see allows adjusting:
>
> -Accept or reject rows with extra columns?
> -Accept or reject rows that are missing columns at the end?
> --Fill them with the default for the column (if available) or NULL?
> -Save rejected rows?
> --To a single system table?
> --To a user-defined table?
> --To the database logs?
The proposed patch save all rejected rows (with extra or missing
columns) to a user-defined table (that can be created automatically). If
you want to handle these bad rows on the fly, I guess you could have a
trigger on the error table that does the appropriate processing
depending on the data you are processing. In that case, having multiple
error tables allows you to plug different triggers to handle possible
error cases differently. The other option is to process the error table
after COPY to handle the bad rows.
I guess the problem with extra or missing columns is to make sure that
you know exactly which data belongs to which column so that you don't
put data in the wrong columns which is likely to happen if this is fully
automated.

I will try to re-read the thread on your proposal to better understand
how you figure out which rows are missing or extra.

Emmanuel

--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Pierre Frédéric Caillaud <lists(at)peufeu(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-11 22:36:56
Message-ID: 20090911223656.GO17756@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

* Robert Haas (robertmhaas(at)gmail(dot)com) wrote:
> Integrating hstore into core and then
> making COPY able to execute a subquery to get its options is certainly
> not easier than a straightforward grammar modification; it's taking a
> small project and turning it into several big ones.

To be honest, my comment was more intended to support hstore in core in
general than this proposal. I would like to see it happen, if possible,
because I see alot of value in hstore.

Thanks,

Stephen


From: Emmanuel Cecchet <manu(at)asterdata(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Emmanuel Cecchet <manu(at)asterdata(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-11 22:56:42
Message-ID: 4AAAD5AA.8080706@asterdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas wrote:
> http://developer.postgresql.org/pgdocs/postgres/sql-explain.html
>
Just out of curiosity, it looks like I could write something like:
EXPLAIN (ANALYZE TRUE, COSTS FALSE, VERBOSE TRUE, COSTS TRUE) statement

What is the expected behavior if someone puts multiple time the same
option with different values. The last value prevails?
I know that this example looks stupid but when you have a lot of options
it sometimes happen that you put twice an option with different values
(that happens with some JDBC driver options...).

manu

--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Emmanuel Cecchet <manu(at)asterdata(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-11 23:04:30
Message-ID: 4AAAD77E.2000907@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 9/11/09 3:56 PM, Emmanuel Cecchet wrote:
> Robert Haas wrote:
>> http://developer.postgresql.org/pgdocs/postgres/sql-explain.html
>>
> Just out of curiosity, it looks like I could write something like:
> EXPLAIN (ANALYZE TRUE, COSTS FALSE, VERBOSE TRUE, COSTS TRUE) statement
>
> What is the expected behavior if someone puts multiple time the same
> option with different values. The last value prevails?

Yes.

--
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Emmanuel Cecchet <manu(at)asterdata(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-11 23:04:48
Message-ID: 603c8f070909111604y4f5d295ekbea61b3a9aae62cd@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Sep 11, 2009 at 6:56 PM, Emmanuel Cecchet <manu(at)asterdata(dot)com> wrote:
> Robert Haas wrote:
>>
>> http://developer.postgresql.org/pgdocs/postgres/sql-explain.html
>>
>
> Just out of curiosity, it looks like I could write something like:
> EXPLAIN (ANALYZE TRUE, COSTS FALSE, VERBOSE TRUE, COSTS TRUE) statement
>
> What is the expected behavior if someone puts multiple time the same option
> with different values. The last value prevails?

Yes.

...Robert


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: COPY enhancements
Date: 2009-09-12 07:07:13
Message-ID: 4AAB48A1.80007@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Josh Berkus wrote:
>> The performance of every path to get data into the database besides COPY
>> is too miserable for us to use anything else, and the current
>> inflexibility makes it useless for anything but the cleanest input data.
>
> One potential issue we're facing down this road is that current COPY has
> a dual purpose: for database restore, and for importing and exporting
> data. At some point, we may want to separate those two behaviors,
> because we'll be adding bells and fringes to import/export which slow
> down overall performance or add bugs.

+1. There is an infinite number of bells and whistles we could add to
COPY, and there's also a number of further optimizations that would make
the loading faster. But the code is quite a mess already, because it's
already highly optimized at the expense of readibility. We need to
separate the input parsing from the fast bulk insertion.

Letting external modules replace the input parsing part would allow you
to a write parser for any input format you like. You could even get the
input from a different source altogether, like from another database via
dblink, in a binary format of some sort.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: COPY enhancements
Date: 2009-09-12 07:12:23
Message-ID: 4AAB49D7.50002@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Josh Berkus wrote:
>> The user-defined table for rejects is obviously exclusive of the system
>> one, either of those would be fine from my perspective.
>
> I've been thinking about it, and can't come up with a really strong case
> for wanting a user-defined table if we settle the issue of having a
> strong key for pg_copy_errors. Do you have one?

A user-defined error callback function would be nice. The function could
insert the erroneous rows to a table, write to a file, print a warning,
perform further checks, or whatever. This assumes that errors are seldom
enough that the error-path is not performance-critical.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: COPY enhancements
Date: 2009-09-12 08:22:01
Message-ID: alpine.GSO.2.01.0909120324010.9961@westnet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 11 Sep 2009, Josh Berkus wrote:

> I've been thinking about it, and can't come up with a really strong case
> for wanting a user-defined table if we settle the issue of having a
> strong key for pg_copy_errors. Do you have one?

No, but I'd think that if the user table was only allowed to be the exact
same format as the system one it wouldn't be that hard to implement--once
the COPY syntax is expanded at least. I'm reminded of how Oracle EXPLAIN
PLANs get logged into the PLAN_TABLE by default, but you can specify "INTO
table" to put them somewhere else. You'd basically doing the same thing
but with a different destination relation.

> After some thought, I think that Andrew's feature *is* generally
> applicable, if done as IGNORE COLUMN COUNT (or, more likely,
> column_count=ignore). I can think of a lot of data sets where column
> count is jagged and you want to do ELT instead of ETL.

Exactly, the ELT approach gives you so many more options for cleaning up
the data that I think it would be used more if it weren't so hard to
do in Postgres right now.

> As opposed to Tom, Peter and Heikki vetoing things because the feature
> gain doesn't justify the maintnenance burden? That's your real choice.
> Adding a framework for manageable syntax extensions means that we can be
> more liberal about what we justify as an extension.

I think you're not talking at the distinction I was trying to make. The
work to make the *syntax* for COPY easier to extend is an unfortunate
requirement for all these new bits; no arguments from me that using GUCs
for everything is just too painful

What I was suggesting is that the first set of useful features required
for what you're calling the ELT load path is both small and well
understood. An implementation of the stuff I see a constant need for
could get banged out so fast that trying to completely generalize it on
the first pass has a questionable return.

While complicated, COPY is a pretty walled off command of around 3500
lines of code, and the hackery required here is pretty small. For
example, it turns out we do already have the code to get it to ignore
column overruns here, and it's all of 50 new lines--much of which is
shared with code that does other error ignoring bits too. It's easy to
make a case for a grand future extensibility cleanup here, but it's really
not necessary to provide a significant benefit here for the cases I
mentioned. And I would guess the maintenance burden of a more general
solution has to be higher than a simple implementation of the feature list
I gave in my last message.

In short: there's a presumption that adding any error-ignoring code would
require significant contortions. I don't think that's really true though,
and would like to keep open the possibilty of accepting some simple but
useful ad-hoc features in this area, even if they don't solve every
possible problem in this space just yet.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Emmanuel Cecchet <manu(at)asterdata(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-12 08:43:46
Message-ID: alpine.GSO.2.01.0909120430040.9961@westnet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 11 Sep 2009, Emmanuel Cecchet wrote:

> I guess the problem with extra or missing columns is to make sure that
> you know exactly which data belongs to which column so that you don't
> put data in the wrong columns which is likely to happen if this is fully
> automated.

Allowing the extra column case is easy: everwhere in copy.c you find the
error message "extra data after last expected column", just ignore the
overflow fields rather than rejecting the line just based on that. And
the default information I mentioned you might want to substitute for
missing columns is already being collected by the code block with the
comment "Get default info if needed".

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: COPY enhancements
Date: 2009-09-12 15:13:39
Message-ID: 4AABBAA3.30604@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Smith wrote:
>> After some thought, I think that Andrew's feature *is* generally
>> applicable, if done as IGNORE COLUMN COUNT (or, more likely,
>> column_count=ignore). I can think of a lot of data sets where column
>> count is jagged and you want to do ELT instead of ETL.
>
> Exactly, the ELT approach gives you so many more options for cleaning
> up the data that I think it would be used more if it weren't so hard
> to do in Postgres right now.
>
>

+1. That's exactly what my client wants to do. They know perfectly well
that they get junk data. They want to get it into the database with a
minimum of fuss where they will have the right tools for checking and
cleaning it. If they have to spend effort whacking it into shape just to
get it into the database, then their cleanup effort essentially has to
be done in two pieces, part inside and part outside the database.

>
> While complicated, COPY is a pretty walled off command of around 3500
> lines of code, and the hackery required here is pretty small. For
> example, it turns out we do already have the code to get it to ignore
> column overruns here, and it's all of 50 new lines--much of which is
> shared with code that does other error ignoring bits too. It's easy to
> make a case for a grand future extensibility cleanup here, but it's
> really not necessary to provide a significant benefit here for the
> cases I mentioned. And I would guess the maintenance burden of a more
> general solution has to be higher than a simple implementation of the
> feature list I gave in my last message.
>
> In short: there's a presumption that adding any error-ignoring code
> would require significant contortions. I don't think that's really
> true though, and would like to keep open the possibilty of accepting
> some simple but useful ad-hoc features in this area, even if they
> don't solve every possible problem in this space just yet.
>
>

Right. What I proposed would not have been terribly invasive or
difficult, certainly less so than what seems to be our direction by an
order of magnitude at least. I don't for a moment accept the assertion
that we can get a general solution for the same effort.

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: COPY enhancements
Date: 2009-09-12 15:23:57
Message-ID: 5843.1252769037@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Right. What I proposed would not have been terribly invasive or
> difficult, certainly less so than what seems to be our direction by an
> order of magnitude at least. I don't for a moment accept the assertion
> that we can get a general solution for the same effort.

And at the same time, Greg's list of minimum requirements was far
longer than what you proposed to do. We can *not* just implement
those things one at a time with no thought towards what the full
solution looks like --- at least not if we want the end result to
look like it was intelligently designed, not merely accreted.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: COPY enhancements
Date: 2009-09-12 15:44:01
Message-ID: 4AABC1C1.20205@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>
>> Right. What I proposed would not have been terribly invasive or
>> difficult, certainly less so than what seems to be our direction by an
>> order of magnitude at least. I don't for a moment accept the assertion
>> that we can get a general solution for the same effort.
>>
>
> And at the same time, Greg's list of minimum requirements was far
> longer than what you proposed to do. We can *not* just implement
> those things one at a time with no thought towards what the full
> solution looks like --- at least not if we want the end result to
> look like it was intelligently designed, not merely accreted.
>
>
>

I don't disagree with that.

At the same time, I think it's probably not a good thing that users who
deal with very large amounts of data would be forced off the COPY fast
path by a need for something like input support for non-rectangular
data. It probably won't affect my clients too much in this instance, but
then their largest loads are usually of the order of only 50,000 records
or so. I understand Truviso has handled this by a patch that does the
sort of stuff Greg was talking about.

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: COPY enhancements
Date: 2009-09-12 15:57:00
Message-ID: 6412.1252771020@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> At the same time, I think it's probably not a good thing that users who
> deal with very large amounts of data would be forced off the COPY fast
> path by a need for something like input support for non-rectangular
> data.

[ shrug... ] Everybody in the world is going to want their own little
problem to be handled in the fast path. And soon it won't be so fast
anymore. I think it is perfectly reasonable to insist that the fast
path is only for "clean" data import.

regards, tom lane


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: COPY enhancements
Date: 2009-09-12 21:51:49
Message-ID: alpine.GSO.2.01.0909121742180.18054@westnet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, 12 Sep 2009, Tom Lane wrote:

> Everybody in the world is going to want their own little problem to be
> handled in the fast path. And soon it won't be so fast anymore. I
> think it is perfectly reasonable to insist that the fast path is only
> for "clean" data import.

The extra overhead is that when you hit the checks that are already in the
code, where the row would normally be rejected, there's a second check as
to whether that particular problem is considered OK or not. There won't
be any additional overhead for clean imports. As I was pointing out in
one of the messages in this thread, all of the expensive things you need
are already being done.

As for "everybody in the world" wanting a specific fix for their private
problems, I assure that everything I suggested comes up constantly on
every legacy data conversion or import job I see. This is not stuff that
fits a one-off need, these are things that make it harder for people to
adopt PostgreSQL all the time. I wouldn't care about this one bit if
these particular issues didn't ruin my day constantly. Consider the fact
that we're looking at three samples of people who have either already
written a patch in this area or considered writing one showing up just
among people on the hackers list. That should be hint as to how common
these requests are.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: COPY enhancements
Date: 2009-09-13 21:17:51
Message-ID: 4AAD617F.2070104@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom,

> [ shrug... ] Everybody in the world is going to want their own little
> problem to be handled in the fast path. And soon it won't be so fast
> anymore. I think it is perfectly reasonable to insist that the fast
> path is only for "clean" data import.

Why?

No, really.

It's not as if we don't have the ability to measure performance impact.
It's reasonable to make a requirement that new options to COPY
shouldn't slow it down noticeably if those options aren't used. And we
can test that, and even make such testing part of the patch review.

But ... fault-tolerant COPY is one of our biggest user
requests/complaints. At user group meetings and the like, I get asked
about it probably every third gathering of users I'm at. While it's not
as critical as log-based replication, it's also not nearly as hard to
integrate and review.

I fully support the idea that we need to have the extended syntax for
these new COPY options. But we should make COPY take an alternate path
for fault-tolerant COPY only if it's shown that adding these options
slows down database restore.

--
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: COPY enhancements
Date: 2009-09-13 21:33:59
Message-ID: 20524.1252877639@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Josh Berkus <josh(at)agliodbs(dot)com> writes:
> It's not as if we don't have the ability to measure performance impact.
> It's reasonable to make a requirement that new options to COPY
> shouldn't slow it down noticeably if those options aren't used. And we
> can test that, and even make such testing part of the patch review.

Really? Where is your agreed-on, demonstrated-to-be-reproducible
benchmark for COPY speed?

My experience is that reliably measuring performance costs in the
percent-or-so range is *hard*. It's only after you've added a few of
them and they start to mount up that it becomes obvious that all those
insignificant additions really did cost something.

But in any case, I think that having a clear distinction between
"straight data import" and "data transformation" features is a good
thing. COPY is already pretty much of an unmanageable monstrosity,
and continuing to accrete features into it without any sort of structure
is something we are going to regret.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: COPY enhancements
Date: 2009-09-14 00:17:45
Message-ID: 4AAD8BA9.6050800@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Josh Berkus <josh(at)agliodbs(dot)com> writes:
>
>> It's not as if we don't have the ability to measure performance impact.
>> It's reasonable to make a requirement that new options to COPY
>> shouldn't slow it down noticeably if those options aren't used. And we
>> can test that, and even make such testing part of the patch review.
>>
>
> Really? Where is your agreed-on, demonstrated-to-be-reproducible
> benchmark for COPY speed?
>
> My experience is that reliably measuring performance costs in the
> percent-or-so range is *hard*. It's only after you've added a few of
> them and they start to mount up that it becomes obvious that all those
> insignificant additions really did cost something.
>

Well, I strongly suspect that the cost of the patch I'm working with is
not in the percent-or-so range, and much more likely to be in the
tiny-fraction-of-a-percent range. The total overhead in the non-ragged
case is one extra test per field, plus one per null field, plus two
tests per line.

But since you raise the question I'll conduct some tests and then you
can criticize those. Ruling out tests a priori seems a bit extreme.

The current patch is attached for information (and in case anyone else
wants to try some testing).

cheers

andrew

Attachment Content-Type Size
raggedcopy.patch text/x-patch 8.4 KB

From: Emmanuel Cecchet <manu(at)asterdata(dot)com>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: Emmanuel Cecchet <manu(at)asterdata(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-14 18:41:33
Message-ID: 4AAE8E5D.1030203@asterdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Smith wrote:
> On Fri, 11 Sep 2009, Emmanuel Cecchet wrote:
>
>> I guess the problem with extra or missing columns is to make sure
>> that you know exactly which data belongs to which column so that you
>> don't put data in the wrong columns which is likely to happen if this
>> is fully automated.
>
> Allowing the extra column case is easy: everwhere in copy.c you find
> the error message "extra data after last expected column", just ignore
> the overflow fields rather than rejecting the line just based on
> that. And the default information I mentioned you might want to
> substitute for missing columns is already being collected by the code
> block with the comment "Get default info if needed".
If I understand it well, you expect the garbage to be after the last
column. But what if the extra or missing column is somewhere upfront or
in the middle? Sometimes you might have a type conflict problem that
will help you detect the problem, sometimes you will just insert
garbage. This might call for another mechanism that would log the lines
that are automatically 'adjusted' to be able to rollback any mistake
that might happen during this automated process.

Emmanuel

--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Emmanuel Cecchet <manu(at)asterdata(dot)com>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-14 19:37:53
Message-ID: 4AAE9B91.80805@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Emmanuel Cecchet wrote:
> Greg Smith wrote:
>> On Fri, 11 Sep 2009, Emmanuel Cecchet wrote:
>>
>>> I guess the problem with extra or missing columns is to make sure
>>> that you know exactly which data belongs to which column so that you
>>> don't put data in the wrong columns which is likely to happen if
>>> this is fully automated.
>>
>> Allowing the extra column case is easy: everwhere in copy.c you find
>> the error message "extra data after last expected column", just
>> ignore the overflow fields rather than rejecting the line just based
>> on that. And the default information I mentioned you might want to
>> substitute for missing columns is already being collected by the code
>> block with the comment "Get default info if needed".
> If I understand it well, you expect the garbage to be after the last
> column. But what if the extra or missing column is somewhere upfront
> or in the middle? Sometimes you might have a type conflict problem
> that will help you detect the problem, sometimes you will just insert
> garbage. This might call for another mechanism that would log the
> lines that are automatically 'adjusted' to be able to rollback any
> mistake that might happen during this automated process.
>
>

Garbage off to the right is exactly the case that we have. Judging from
what I'm hearing a number of other people are too.

Nobody suggests that a facility to ignore extra columns will handle
every case. It will handle what increasingly appears to be a common case.

cheers

andrew


From: Emmanuel Cecchet <manu(at)asterdata(dot)com>
To: Emmanuel Cecchet <manu(at)asterdata(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-18 04:14:18
Message-ID: 4AB3091A.5020506@asterdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Here is a new version of error logging and autopartitioning in COPY
based on the latest COPY patch that provides the new syntax for copy
options (this patch also includes the COPY option patch).

New features compared to previous version:
- removed the GUC variables for error logging and use copy options
instead (renamed options as suggested by Josh)
- added documentation and examples (see copy.sgml)
- partitioning also activated by copy option rather than GUC variable
(renamed as well)
- added a new ERROR_LOGGING_MAX_ERRORS option that allows to abort the
COPY operation if a certain number of bad rows has been encountered.
- updated unit tests

I also tried to update the wiki pages but it's late and the doc is
probably better for now.

This addresses most of the comments so far except for the format of the
error table (lack of natural key) and a possible pg_error_table as a
default name.

Emmanuel

Emmanuel Cecchet wrote:
> Hi all,
>
> Finally the error logging and autopartitioning patches for COPY that I
> presented at PGCon are here!
> Error logging is described here:
> http://wiki.postgresql.org/wiki/Error_logging_in_COPY
> Autopartitioning is described here:
> http://wiki.postgresql.org/wiki/Auto-partitioning_in_COPY
>
> The attached patch contains both contribs as well as unit tests. I
> will submit shortly the new patch at commitfest.postgresql.org.
>
> Thanks in advance for your feedback,
> Emmanuel
>
> ------------------------------------------------------------------------
>
>
>

--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com

Attachment Content-Type Size
aster-copy-newsyntax-patch-8.5v4context.txt.gz application/x-gzip 43.2 KB

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: COPY enhancements
Date: 2009-09-19 19:55:28
Message-ID: 200909191955.n8JJtSd17929@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Josh Berkus <josh(at)agliodbs(dot)com> writes:
> > It's not as if we don't have the ability to measure performance impact.
> > It's reasonable to make a requirement that new options to COPY
> > shouldn't slow it down noticeably if those options aren't used. And we
> > can test that, and even make such testing part of the patch review.
>
> Really? Where is your agreed-on, demonstrated-to-be-reproducible
> benchmark for COPY speed?
>
> My experience is that reliably measuring performance costs in the
> percent-or-so range is *hard*. It's only after you've added a few of
> them and they start to mount up that it becomes obvious that all those
> insignificant additions really did cost something.
>
> But in any case, I think that having a clear distinction between
> "straight data import" and "data transformation" features is a good
> thing. COPY is already pretty much of an unmanageable monstrosity,
> and continuing to accrete features into it without any sort of structure
> is something we are going to regret.

I have read up on this thread and the new copy syntax thread. I think
there is clearly documented demand for such extensions to COPY.

We are definitely opening the floodgates by allowing COPY to process
invalid data. I think everyone admits COPY is already quite
complicated, both in its API and C code.

If we are going to add to COPY, I think we need to do it in a way that
has a clean user API, doesn't make the C code any more complicated, and
doesn't introduce a performance impact for people not using these new
features. If we don't do that, we are going to end up like 'bcp' that
is perpetually buggy, as someone explained.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Emmanuel Cecchet <manu(at)asterdata(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-25 02:38:46
Message-ID: 603c8f070909241938kb41865o2ae9fc46526183c3@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Sep 18, 2009 at 12:14 AM, Emmanuel Cecchet <manu(at)asterdata(dot)com> wrote:
> Here is a new version of error logging and autopartitioning in COPY based on
> the latest COPY patch that provides the new syntax for copy options (this
> patch also includes the COPY option patch).
>
> New features compared to previous version:
> - removed the GUC variables for error logging and use copy options instead
> (renamed options as suggested by Josh)
> - added documentation and examples (see copy.sgml)
> - partitioning also activated by copy option rather than GUC variable
> (renamed as well)
> - added a new ERROR_LOGGING_MAX_ERRORS option that allows to abort the COPY
> operation if a certain number of bad rows has been encountered.
> - updated unit tests
>
> I also tried to update the wiki pages but it's late and the doc is probably
> better for now.
>
> This addresses most of the comments so far except for the format of the
> error table (lack of natural key) and a possible pg_error_table as a default
> name.
>
> Emmanuel
>
> Emmanuel Cecchet wrote:
>>
>> Hi all,
>>
>> Finally the error logging and autopartitioning patches for COPY that I
>> presented at PGCon are here!
>> Error logging is described here:
>> http://wiki.postgresql.org/wiki/Error_logging_in_COPY
>> Autopartitioning is described here:
>> http://wiki.postgresql.org/wiki/Auto-partitioning_in_COPY
>>
>> The attached patch contains both contribs as well as unit tests. I will
>> submit shortly the new patch at commitfest.postgresql.org.
>>
>> Thanks in advance for your feedback,
>> Emmanuel

This no longer applies.

...Robert


From: Emmanuel Cecchet <manu(at)asterdata(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-25 03:00:35
Message-ID: 4ABC3253.6010407@asterdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Yes, I have to update the patch following what Tom already integrated of
the COPY patch.
I will get a new version posted as soon as I can.

Emmanuel

Robert Haas wrote:
> On Fri, Sep 18, 2009 at 12:14 AM, Emmanuel Cecchet <manu(at)asterdata(dot)com> wrote:
>
>> Here is a new version of error logging and autopartitioning in COPY based on
>> the latest COPY patch that provides the new syntax for copy options (this
>> patch also includes the COPY option patch).
>>
>> New features compared to previous version:
>> - removed the GUC variables for error logging and use copy options instead
>> (renamed options as suggested by Josh)
>> - added documentation and examples (see copy.sgml)
>> - partitioning also activated by copy option rather than GUC variable
>> (renamed as well)
>> - added a new ERROR_LOGGING_MAX_ERRORS option that allows to abort the COPY
>> operation if a certain number of bad rows has been encountered.
>> - updated unit tests
>>
>> I also tried to update the wiki pages but it's late and the doc is probably
>> better for now.
>>
>> This addresses most of the comments so far except for the format of the
>> error table (lack of natural key) and a possible pg_error_table as a default
>> name.
>>
>> Emmanuel
>>
>> Emmanuel Cecchet wrote:
>>
>>> Hi all,
>>>
>>> Finally the error logging and autopartitioning patches for COPY that I
>>> presented at PGCon are here!
>>> Error logging is described here:
>>> http://wiki.postgresql.org/wiki/Error_logging_in_COPY
>>> Autopartitioning is described here:
>>> http://wiki.postgresql.org/wiki/Auto-partitioning_in_COPY
>>>
>>> The attached patch contains both contribs as well as unit tests. I will
>>> submit shortly the new patch at commitfest.postgresql.org.
>>>
>>> Thanks in advance for your feedback,
>>> Emmanuel
>>>
>
> This no longer applies.
>
> ...Robert
>

--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com


From: Emmanuel Cecchet <manu(at)asterdata(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-25 14:01:27
Message-ID: 4ABCCD37.4000304@asterdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert,

Here is the new version of the patch that applies to CVS HEAD as of this
morning.

Emmanuel

> On Fri, Sep 18, 2009 at 12:14 AM, Emmanuel Cecchet <manu(at)asterdata(dot)com> wrote:
>
>> Here is a new version of error logging and autopartitioning in COPY based on
>> the latest COPY patch that provides the new syntax for copy options (this
>> patch also includes the COPY option patch).
>>
>> New features compared to previous version:
>> - removed the GUC variables for error logging and use copy options instead
>> (renamed options as suggested by Josh)
>> - added documentation and examples (see copy.sgml)
>> - partitioning also activated by copy option rather than GUC variable
>> (renamed as well)
>> - added a new ERROR_LOGGING_MAX_ERRORS option that allows to abort the COPY
>> operation if a certain number of bad rows has been encountered.
>> - updated unit tests
>>
>> I also tried to update the wiki pages but it's late and the doc is probably
>> better for now.
>>
>> This addresses most of the comments so far except for the format of the
>> error table (lack of natural key) and a possible pg_error_table as a default
>> name.
>>
>> Emmanuel
>>
>> Emmanuel Cecchet wrote:
>>
>>> Hi all,
>>>
>>> Finally the error logging and autopartitioning patches for COPY that I
>>> presented at PGCon are here!
>>> Error logging is described here:
>>> http://wiki.postgresql.org/wiki/Error_logging_in_COPY
>>> Autopartitioning is described here:
>>> http://wiki.postgresql.org/wiki/Auto-partitioning_in_COPY
>>>
>>> The attached patch contains both contribs as well as unit tests. I will
>>> submit shortly the new patch at commitfest.postgresql.org.
>>>
>>> Thanks in advance for your feedback,
>>> Emmanuel
>>>
>
> This no longer applies.
>
> ...Robert
>

--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com

Attachment Content-Type Size
aster-copy-newsyntax-patch-8.5v5context.txt.gz application/x-gzip 35.0 KB

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Emmanuel Cecchet <manu(at)asterdata(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-04 17:43:18
Message-ID: 1254678198.25650.2.camel@jdavis
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 2009-09-25 at 10:01 -0400, Emmanuel Cecchet wrote:
> Robert,
>
> Here is the new version of the patch that applies to CVS HEAD as of this
> morning.

I just started looking at this now. It seems to fail "make check", diffs
attached. I haven't looked into the cause of the failure yet.

Regards,
Jeff Davis

Attachment Content-Type Size
regression.diffs text/x-patch 2.0 KB

From: Selena Deckelmann <selenamarie(at)gmail(dot)com>
To: Emmanuel Cecchet <manu(at)asterdata(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-04 21:32:15
Message-ID: 2b5e566d0910041432g4ebc17feq6f49998d4622fa94@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi!

On Fri, Sep 25, 2009 at 7:01 AM, Emmanuel Cecchet <manu(at)asterdata(dot)com> wrote:

> Here is the new version of the patch that applies to CVS HEAD as of this
> morning.

Cool features!

This is my first pass at the error logging portion of this patch. I'm
going to take a break and try to go through the partitioning logic as
well later this afternoon.

caveat: I'm not familiar with most of the code paths that are being
touched by this patch.

Overall:
* I noticed '\see' included in the comments in your code. These should
be removed.
* The regression is failling, as Jeff indicated, and I didn't figure
out why yet either. Hopefully will have a look closer this afternoon.

Comments:
* copy.c: Better formatting, maybe rewording needed for comment
starting on line 1990.
** Maybe say: "Check that errorData->sqlerrcode only logged tuples
that are malformed. This ensures that we let other errors pass
through."
* copy.c: line: 2668 -> need to fix that comment :) (/* Regular code
*/) -- this needs to be more descriptive of what is actually
happening. We fell through the partitioning logic, right, and back to
the default behavior?
* copy.c: line 2881, maybe say instead: "Mark that we have not read a
line yet. This is necessary so that we can perform error logging on
complete lines only."

Formatting:
* copy.c: whitespace is maybe a little off at line: 2004-2009
* NEW FILES: src/backend/utils/misc/errorlogging.c & errorlogging.h need headers

Code:
* copy.c: line 1990 -> cur_lineno_str[11] & related: why is this
conversion necessary? (sorry if that is a dumb question)
* copy.c: line 2660 -> what if we are error logging for copy? Does
this get logged properly?
* errorlogging.c: Noticed the comment at 503 -> 'note that we always
enable wal logging'.. Thinking this through - the reasoning is that
you won't be rolling back the error logging no matter what, right?
* src/include/commands/copy.c and copy.h: struct CopyStateData was
moved from copy.c to copy.h; this made sense to me, just noting. That
move made it a little tricky to find the changes that were made. There
were 10 items added.
** super nit pick: 'readlineOk' uses camel-case instead of underscores
like the rest of the new variables
* errorlogging.c: could move currentCommand pg_stat call in Init
routine into MalformedTupleIs, or maybe move the setting of that
parameter into the Init routine for consistency's sake.

Documentation:
* doc/src/sgml/ref/copy.sgml: line 313: 'failif' needs a space
** Also: The error table log examples have relations shown in a
different order than the actual CREATE TABLE declaration in the code.
* src/test/regress/sql/copyselect.sql: Format of options changed..
added parenthesis. Is this change documented elsewhere? (sorry if I
just missed this in the rest of the thread/patch)

--
http://chesnok.com/daily - me
http://endpoint.com - work


From: Emmanuel Cecchet <manu(at)asterdata(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-05 02:08:11
Message-ID: 4AC9550B.8080009@asterdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

The problem comes from the foo_malformed_terminator.data file.
It is supposed to have a malformed terminator that was not catch by
patch. The second line should look like:
2 two^M

If it does not, you can edit it with emacs, go at the end of the second
line and press Ctrl+q followed by Ctrl+m and that will do the trick.
I am attaching a copy of the file, hoping that the attachment will
arrive with the malformed terminator in your inbox!

manu

Jeff Davis wrote:
> On Fri, 2009-09-25 at 10:01 -0400, Emmanuel Cecchet wrote:
>
>> Robert,
>>
>> Here is the new version of the patch that applies to CVS HEAD as of this
>> morning.
>>
>
> I just started looking at this now. It seems to fail "make check", diffs
> attached. I haven't looked into the cause of the failure yet.
>
> Regards,
> Jeff Davis
>

--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com

Attachment Content-Type Size
foo_malformed_terminator.data text/plain 35 bytes

From: Emmanuel Cecchet <manu(at)asterdata(dot)com>
To: Selena Deckelmann <selenamarie(at)gmail(dot)com>
Cc: Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-05 15:30:48
Message-ID: 4ACA1128.9000903@asterdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi Selena,

> This is my first pass at the error logging portion of this patch. I'm
> going to take a break and try to go through the partitioning logic as
> well later this afternoon.
>
> caveat: I'm not familiar with most of the code paths that are being
> touched by this patch.
>
> Overall:
> * I noticed '\see' included in the comments in your code. These should
> be removed.
>
Should we avoid any doxygen tags in general in the Postgres code?
> * The regression is failling, as Jeff indicated, and I didn't figure
> out why yet either. Hopefully will have a look closer this afternoon.
>
Let me know if the response I sent to Jeff works for you.
> Comments:
> * copy.c: Better formatting, maybe rewording needed for comment
> starting on line 1990.
>
Some of the bad formatting has been left on purpose to minimize the size
of the patch otherwise there would have been many tab/white spaces
changes due to the indentation in the PG_TRY blocks. I suggest that
whoever is going to commit the code runs pg_indent or I can fix the
formatting once the reviewing is done.
> ** Maybe say: "Check that errorData->sqlerrcode only logged tuples
> that are malformed. This ensures that we let other errors pass
> through."
>
Ok.
> * copy.c: line: 2668 -> need to fix that comment :) (/* Regular code
> */) -- this needs to be more descriptive of what is actually
> happening. We fell through the partitioning logic, right, and back to
> the default behavior?
>
Yes.
> * copy.c: line 2881, maybe say instead: "Mark that we have not read a
> line yet. This is necessary so that we can perform error logging on
> complete lines only."
>
Ok.
> Formatting:
> * copy.c: whitespace is maybe a little off at line: 2004-2009
> * NEW FILES: src/backend/utils/misc/errorlogging.c & errorlogging.h need headers
>
I was not sure what is auto-generated by SVN commit scripts. I'd be
happy to add headers. Is there a specification somewhere or should I
just copy/paste from another file?
> Code:
> * copy.c: line 1990 -> cur_lineno_str[11] & related: why is this
> conversion necessary? (sorry if that is a dumb question)
>
This conversion is necessary if you log in the error table the index of
the row that is causing the error (this is what is logged by default in
ERROR_LOGGING_KEY).
> * copy.c: line 2660 -> what if we are error logging for copy? Does
> this get logged properly?
>
Yes, this is in a try/catch block that performs error logging.
> * errorlogging.c: Noticed the comment at 503 -> 'note that we always
> enable wal logging'.. Thinking this through - the reasoning is that
> you won't be rolling back the error logging no matter what, right?
>
I think that this was the original idea but we should probably rollback
the error logging if the command has been rolled back. It might be more
consistent to use the same hi_options as the copy command. Any idea what
would be best?
> * src/include/commands/copy.c and copy.h: struct CopyStateData was
> moved from copy.c to copy.h; this made sense to me, just noting. That
> move made it a little tricky to find the changes that were made. There
> were 10 items added.
>
Yes, patch can be a little bit lost when you move a big data structure
like this one.
> ** super nit pick: 'readlineOk' uses camel-case instead of underscores
> like the rest of the new variables
>
Yes, queryDesc also has camel-case. I will fix readlineOk.
> * errorlogging.c: could move currentCommand pg_stat call in Init
> routine into MalformedTupleIs, or maybe move the setting of that
> parameter into the Init routine for consistency's sake.
>
The advantage of calling pg_stat in InitializeErrorLogging is that it
is evaluated only once for the entire copy command (and it's not going
to change during the execution of the command). I am not sure to
understand what your second suggestion is since currentCommand is set
and initialized in Init.
> Documentation:
> * doc/src/sgml/ref/copy.sgml: line 313: 'failif' needs a space
>
ok
> ** Also: The error table log examples have relations shown in a
> different order than the actual CREATE TABLE declaration in the code.
>
The order of the columns does not really matter but for consistency sake
we can put the same order.
> * src/test/regress/sql/copyselect.sql: Format of options changed..
> added parenthesis. Is this change documented elsewhere? (sorry if I
> just missed this in the rest of the thread/patch)
>
Yes this has already been committed by Tom. The new format of options
has been introduced just before this patch.

I am attaching a revised version of the patch.
Emmanuel

--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com

Attachment Content-Type Size
aster-copy-newsyntax-patch-8.5v6context.txt.gz application/x-gzip 35.3 KB

From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Emmanuel Cecchet <manu(at)asterdata(dot)com>
Cc: Selena Deckelmann <selenamarie(at)gmail(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-05 17:01:20
Message-ID: 4ACA2660.9080600@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Emmanuel,

> I think that this was the original idea but we should probably rollback
> the error logging if the command has been rolled back. It might be more
> consistent to use the same hi_options as the copy command. Any idea what
> would be best?

Well, if we're logging to a file, you wouldn't be *able* to roll them
back. Also, presumbly, if you abort a COPY because of errors, you
probably want to keep the errors around for later analysis. No?

--Josh Berkus


From: Emmanuel Cecchet <manu(at)frogthinker(dot)org>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Emmanuel Cecchet <manu(at)asterdata(dot)com>, Selena Deckelmann <selenamarie(at)gmail(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-07 02:07:31
Message-ID: 4ACBF7E3.3090306@frogthinker.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I just realized that I forgot to CC the list when I answered to Josh... resending!

Josh,

>> >> I think that this was the original idea but we should probably rollback
>> >> the error logging if the command has been rolled back. It might be more
>> >> consistent to use the same hi_options as the copy command. Any idea what
>> >> would be best?
>> >>
>>
> >
> > Well, if we're logging to a file, you wouldn't be *able* to roll them
> > back. Also, presumbly, if you abort a COPY because of errors, you
> > probably want to keep the errors around for later analysis. No?
> >
>
You are right about the file (though this functionality is not
implemented yet).
I don't have any strong opinion about the right behavior if COPY aborts.
The error log table would contain tuples that were not inserted anyway.
Now knowing whether the bad tuples come from a successful COPY command
or not will be left to the user.

Emmanuel
-- Emmanuel Cecchet Aster Data Systems Web: http://www.asterdata.com


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Emmanuel Cecchet <manu(at)asterdata(dot)com>, Selena Deckelmann <selenamarie(at)gmail(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-07 07:17:20
Message-ID: alpine.GSO.2.01.0910070255190.16948@westnet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, 5 Oct 2009, Josh Berkus wrote:

>> I think that this was the original idea but we should probably rollback
>> the error logging if the command has been rolled back. It might be more
>> consistent to use the same hi_options as the copy command. Any idea what
>> would be best?
>
> Well, if we're logging to a file, you wouldn't be *able* to roll them
> back. Also, presumbly, if you abort a COPY because of errors, you
> probably want to keep the errors around for later analysis. No?

Absolutely, that's the whole point of logging to a file in the first
place. What needs to happen here is that when one is aborted, you need to
make sure that fact is logged, and with enough information (the pid?) to
tie it to the COPY that failed. Then someone can crawl the logs to figure
out what happened and what data did and didn't get loaded. Ideally you'd
want to have that as database table information instead, to allow
automated recovery and re-commit in the cases where the error wasn't
inherent in the data but instead some other type of database failure.

I know this patch is attracting more reviewers lately, is anyone tracking
the general architecture of the code yet? Emmanuel's work is tough to
review just because there's so many things mixed together, and there's
other inputs I think should be considered at the same time while we're all
testing in there (such as the COPY patch Andrew Dunstan put together).

What I'd like to see is for everything to get broken more into component
chunks that can get commited and provide something useful one at a time,
because I doubt taskmaster Robert is going to let this one linger around
with scope creep for too long before being pushed out to the next
CommitFest. It would be nice to have a clear game plan that involves the
obvious 3 smaller subpatches that can be commited one at a time as they're
ready to focus the work on, so that something might be polished enough for
this CF. And, yes, I'm suggesting work only because I now have some time
to help with that if the idea seems reasonable to persue. Be glad to set
up a public git repo or something to serve as an integration point for
dependency merge management and testing that resists bit-rot while
splitting things up functionally.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Selena Deckelmann <selenamarie(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-07 10:37:54
Message-ID: 1254911874.26302.213.camel@ebony.2ndQuadrant
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On Wed, 2009-10-07 at 03:17 -0400, Greg Smith wrote:
> On Mon, 5 Oct 2009, Josh Berkus wrote:
>
> Also, presumbly, if you abort a COPY because of errors, you
> > probably want to keep the errors around for later analysis. No?
>
> Absolutely, that's the whole point of logging to a file in the first
> place.

Yes, essential.

(Not read patch, so some later comments may misunderstand where we're
at)

> What needs to happen here is that when one is aborted, you need to
> make sure that fact is logged, and with enough information (the pid?) to
> tie it to the COPY that failed. Then someone can crawl the logs to figure
> out what happened and what data did and didn't get loaded. Ideally you'd
> want to have that as database table information instead, to allow
> automated recovery and re-commit in the cases where the error wasn't
> inherent in the data but instead some other type of database failure.

Adding something to the logs is the wrong place for that IMHO. If we do
write a log message it should be a NOTICE not a LOG.

If the user specifies an error file then it should be straightforward
for them to look directly in that error file afterwards to see if there
are rejects. It's typical to only create the error file if rejects
exist, to allow a simple if-file-exists test after the COPY.

I don't recommend having the server automatically generate an error file
name because that will encourage conflicts between multiple users on
production systems. If the user wants an errorfile they should specify
it, that way they will know the name.

It will be best to have the ability to have a specific rejection reason
for each row rejected. That way we will be able to tell the difference
between uniqueness violation errors, invalid date format on col7, value
fails check constraint on col22 etc..

An implicit "I got an error on this record" isn't enough information and
can lead to chaos, since you may be getting more than one error on a
record and we need to be able to fix them one at a time. If we can't
tell what the error message is then we might think our first valid fix
actually failed and start looking for other solutions.

There are security implications of writing stuff to an error file, so
will the error file option still be available when we do \copy or COPY
FROM STDIN, and if so how will it work? With what restrictions?

--
Simon Riggs www.2ndQuadrant.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Selena Deckelmann <selenamarie(at)gmail(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-07 11:07:36
Message-ID: 603c8f070910070407l25d4be7cgf902d5347f77dc3e@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Oct 7, 2009 at 3:17 AM, Greg Smith <gsmith(at)gregsmith(dot)com> wrote:
> I know this patch is attracting more reviewers lately, is anyone tracking
> the general architecture of the code yet?  Emmanuel's work is tough to
> review just because there's so many things mixed together, and there's other
> inputs I think should be considered at the same time while we're all testing
> in there (such as the COPY patch Andrew Dunstan put together).

I hadn't realized this was an issue, but I think it's a good point: a
patch that does one thing well is much more likely to get accepted
than a patch that does two things well, let alone two things poorly.
It's just much easier to review and verify. Or maybe the name of the
patch maybe should have tipped me off: "COPY enhancements" vs. "make
COPY have feature X".

> What I'd like to see is for everything to get broken more into component
> chunks that can get commited and provide something useful one at a time,
> because I doubt taskmaster Robert is going to let this one linger around
> with scope creep for too long before being pushed out to the next
> CommitFest.

I'm can't decide whether to feel good or bad about that appelation, so
I'm going with both. But in all seriousness if this patch needs
substantial reworking (which it sounds like it does) we should
postpone it to the next CF; we are quickly running out of days, and
it's not fair to reviewers or committers to ask for new reviews of
substantially revised code with a only a week to go.

...Robert


From: Emmanuel Cecchet <manu(at)asterdata(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, Selena Deckelmann <selenamarie(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-07 13:12:44
Message-ID: 4ACC93CC.1050509@asterdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi all,

I think there is a misunderstanding about what the current patch is about.
The patch includes 2 things:
- error logging in a table for bad tuples in a COPY operation (see
http://wiki.postgresql.org/wiki/Error_logging_in_COPY for an example;
the error message, command and so on are automatically logged)
- auto-partitioning in a hierarchy of child table if the COPY targets a
parent table.

The patch does NOT include:
- logging errors into a file (a feature we can add later on (next commit
fest?))

I can put the auto-partitioning in a separate patch if that helps but
this patch relies on error logging to log possible errors in the routing
of tuples.

I think that the way to move forward is first to have a basic error
logging facility that logs into a database table (like the current patch
does) and then add enhancements. I don't think we should try to do the
file logging at the same time. If you prefer to postpone the
auto-partitioning to the next commit fest, I can strip it from the
current patch and re-submit it for the next fest (but it's just 2
isolated methods really easy to review).

Emmanuel

Robert Haas wrote:
> On Wed, Oct 7, 2009 at 3:17 AM, Greg Smith <gsmith(at)gregsmith(dot)com> wrote:
>
>> I know this patch is attracting more reviewers lately, is anyone tracking
>> the general architecture of the code yet? Emmanuel's work is tough to
>> review just because there's so many things mixed together, and there's other
>> inputs I think should be considered at the same time while we're all testing
>> in there (such as the COPY patch Andrew Dunstan put together).
>>
>
> I hadn't realized this was an issue, but I think it's a good point: a
> patch that does one thing well is much more likely to get accepted
> than a patch that does two things well, let alone two things poorly.
> It's just much easier to review and verify. Or maybe the name of the
> patch maybe should have tipped me off: "COPY enhancements" vs. "make
> COPY have feature X".
>
>
>> What I'd like to see is for everything to get broken more into component
>> chunks that can get commited and provide something useful one at a time,
>> because I doubt taskmaster Robert is going to let this one linger around
>> with scope creep for too long before being pushed out to the next
>> CommitFest.
>>
>
> I'm can't decide whether to feel good or bad about that appelation, so
> I'm going with both. But in all seriousness if this patch needs
> substantial reworking (which it sounds like it does) we should
> postpone it to the next CF; we are quickly running out of days, and
> it's not fair to reviewers or committers to ask for new reviews of
> substantially revised code with a only a week to go.
>
> ...Robert
>

--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Emmanuel Cecchet <manu(at)asterdata(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, Selena Deckelmann <selenamarie(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-07 13:28:02
Message-ID: 4ACC9762.9070903@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Emmanuel Cecchet wrote:
> If you prefer to postpone the auto-partitioning to the next commit
> fest, I can strip it from the current patch and re-submit it for the
> next fest (but it's just 2 isolated methods really easy to review).
>
>

I certainly think this should be separated out. In general it is not a
good idea to roll distinct features together. It complicates both the
reviewing process and the discussion.

cheers

andrew


From: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Selena Deckelmann <selenamarie(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-07 13:33:01
Message-ID: 87iqer1g8i.fsf@hi-media-techno.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> It will be best to have the ability to have a specific rejection reason
> for each row rejected. That way we will be able to tell the difference
> between uniqueness violation errors, invalid date format on col7, value
> fails check constraint on col22 etc..

In case that helps, what pgloader does is logging into two files, named
after the table name (not scalable to server-side solution):
table.rej --- lines it could not load, straight from source file
table.rej.log --- errors as given by the server, plus pgloader comment

The pgloader comment is necessary for associating each log line to the
source file line, as it's operating by dichotomy, the server always
report error on line 1.

The idea of having two errors file could be kept though, the aim is to
be able to fix the setup then COPY again the table.rej file when it
happens the errors are not on the file content. Or for loading into
another table, with all columns as text or bytea, then clean data from a
procedure.

Regards,
--
dim


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
Cc: Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Selena Deckelmann <selenamarie(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-07 13:56:41
Message-ID: 1254923801.26302.231.camel@ebony.2ndQuadrant
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On Wed, 2009-10-07 at 15:33 +0200, Dimitri Fontaine wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> > It will be best to have the ability to have a specific rejection reason
> > for each row rejected. That way we will be able to tell the difference
> > between uniqueness violation errors, invalid date format on col7, value
> > fails check constraint on col22 etc..
>
> In case that helps, what pgloader does is logging into two files, named
> after the table name (not scalable to server-side solution):
> table.rej --- lines it could not load, straight from source file
> table.rej.log --- errors as given by the server, plus pgloader comment
>
> The pgloader comment is necessary for associating each log line to the
> source file line, as it's operating by dichotomy, the server always
> report error on line 1.
>
> The idea of having two errors file could be kept though, the aim is to
> be able to fix the setup then COPY again the table.rej file when it
> happens the errors are not on the file content. Or for loading into
> another table, with all columns as text or bytea, then clean data from a
> procedure.

I like it.

--
Simon Riggs www.2ndQuadrant.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Emmanuel Cecchet <manu(at)asterdata(dot)com>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, Selena Deckelmann <selenamarie(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-07 14:15:31
Message-ID: 603c8f070910070715u3be478d5l1ae4e9783782dbe3@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Oct 7, 2009 at 9:12 AM, Emmanuel Cecchet <manu(at)asterdata(dot)com> wrote:
> Hi all,
>
> I think there is a misunderstanding about what the current patch is about.
> The patch includes 2 things:
> - error logging in a table for bad tuples in a COPY operation (see
> http://wiki.postgresql.org/wiki/Error_logging_in_COPY for an example; the
> error message, command and so on are automatically logged)
> - auto-partitioning in a hierarchy of child table if the COPY targets a
> parent table.
> The patch does NOT include:
> - logging errors into a file (a feature we can add later on (next commit
> fest?))

My failure to have read the patch is showing here, but it seems to me
that error logging to a table could be problematic: if the transaction
aborts, we'll lose the log. If this is in fact a problem, we should
be implementing logging to a file (or stdout) FIRST.

...Robert


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Emmanuel Cecchet <manu(at)asterdata(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, Selena Deckelmann <selenamarie(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-07 14:29:08
Message-ID: 9109.1254925748@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Emmanuel Cecchet wrote:
>> If you prefer to postpone the auto-partitioning to the next commit
>> fest, I can strip it from the current patch and re-submit it for the
>> next fest (but it's just 2 isolated methods really easy to review).

> I certainly think this should be separated out. In general it is not a
> good idea to roll distinct features together. It complicates both the
> reviewing process and the discussion.

I think though that Greg was suggesting that we need some more thought
about the overall road map. Agglomerating "independent" features onto
COPY one at a time is going to lead to a mess, unless they fit into an
overall design plan.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Emmanuel Cecchet <manu(at)asterdata(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, Selena Deckelmann <selenamarie(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-07 14:45:45
Message-ID: 4ACCA999.1090502@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>
>> Emmanuel Cecchet wrote:
>>
>>> If you prefer to postpone the auto-partitioning to the next commit
>>> fest, I can strip it from the current patch and re-submit it for the
>>> next fest (but it's just 2 isolated methods really easy to review).
>>>
>
>
>> I certainly think this should be separated out. In general it is not a
>> good idea to roll distinct features together. It complicates both the
>> reviewing process and the discussion.
>>
>
> I think though that Greg was suggesting that we need some more thought
> about the overall road map. Agglomerating "independent" features onto
> COPY one at a time is going to lead to a mess, unless they fit into an
> overall design plan.
>
>
>

I don't disagree with that. But even if we get a roadmap of some set of
features we want to implement, rolling them all together isn't a good
way to go.

cheers

andrew


From: Emmanuel Cecchet <manu(at)asterdata(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Selena Deckelmann <selenamarie(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-07 15:39:37
Message-ID: 4ACCB639.7050400@asterdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas wrote:
> On Wed, Oct 7, 2009 at 9:12 AM, Emmanuel Cecchet <manu(at)asterdata(dot)com> wrote:
>
>> Hi all,
>>
>> I think there is a misunderstanding about what the current patch is about.
>> The patch includes 2 things:
>> - error logging in a table for bad tuples in a COPY operation (see
>> http://wiki.postgresql.org/wiki/Error_logging_in_COPY for an example; the
>> error message, command and so on are automatically logged)
>> - auto-partitioning in a hierarchy of child table if the COPY targets a
>> parent table.
>> The patch does NOT include:
>> - logging errors into a file (a feature we can add later on (next commit
>> fest?))
>>
>
> My failure to have read the patch is showing here, but it seems to me
> that error logging to a table could be problematic: if the transaction
> aborts, we'll lose the log. If this is in fact a problem, we should
> be implementing logging to a file (or stdout) FIRST.
>
I don't think this is really a problem. You can always do a SELECT in
the error table after the COPY operation if you want to diagnose what
happened before the transaction rollbacks. I think that using a file to
process the bad tuples is actually less practical than a table if you
want to re-insert these tuples in the database.
But anyway, the current implementation captures the tuple with all the
needed information for logging and delegate the logging to a specific
method. If we want to log to a file (or stdout), we can just provide
another method to log the already captured info in a file.

I think that the file approach is a separate feature but can be easily
integrated in the current design. There is just extra work to make sure
concurrent COPY commands writing to the same error file are using proper
locking.

Emmanuel

--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Emmanuel Cecchet <manu(at)asterdata(dot)com>
Cc: Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Selena Deckelmann <selenamarie(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-07 15:42:11
Message-ID: 603c8f070910070842l732de14eq80e6c76eb675df6d@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Oct 7, 2009 at 11:39 AM, Emmanuel Cecchet <manu(at)asterdata(dot)com> wrote:
> Robert Haas wrote:
>>
>> On Wed, Oct 7, 2009 at 9:12 AM, Emmanuel Cecchet <manu(at)asterdata(dot)com>
>> wrote:
>>
>>>
>>> Hi all,
>>>
>>> I think there is a misunderstanding about what the current patch is
>>> about.
>>> The patch includes 2 things:
>>> - error logging in a table for bad tuples in a COPY operation (see
>>> http://wiki.postgresql.org/wiki/Error_logging_in_COPY for an example; the
>>> error message, command and so on are automatically logged)
>>> - auto-partitioning in a hierarchy of child table if the COPY targets a
>>> parent table.
>>> The patch does NOT include:
>>> - logging errors into a file (a feature we can add later on (next commit
>>> fest?))
>>>
>>
>> My failure to have read the patch is showing here, but it seems to me
>> that error logging to a table could be problematic: if the transaction
>> aborts, we'll lose the log.  If this is in fact a problem, we should
>> be implementing logging to a file (or stdout) FIRST.
>>
>
> I don't think this is really a problem. You can always do a SELECT in the
> error table after the COPY operation if you want to diagnose what happened
> before the transaction rollbacks.

Uhhh.... if the transaction has aborted, no new commands (including
SELECT) will be accepted until you roll back.

...Robert


From: Emmanuel Cecchet <manu(at)asterdata(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Selena Deckelmann <selenamarie(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-07 15:43:07
Message-ID: 4ACCB70B.7090308@asterdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

The roadmap I would propose for the current list of enhancements to COPY
is as follows:
1. new syntax for COPY options (already committed)
2. error logging in a table
3. auto-partitioning (just relies on basic error logging, so can be
scheduled anytime after 2)
4. error logging in a file

manu

Andrew Dunstan wrote:
> Tom Lane wrote:
>
>> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>>
>>
>>> Emmanuel Cecchet wrote:
>>>
>>>
>>>> If you prefer to postpone the auto-partitioning to the next commit
>>>> fest, I can strip it from the current patch and re-submit it for the
>>>> next fest (but it's just 2 isolated methods really easy to review).
>>>>
>>>>
>>
>>
>>> I certainly think this should be separated out. In general it is not a
>>> good idea to roll distinct features together. It complicates both the
>>> reviewing process and the discussion.
>>>
>>>
>> I think though that Greg was suggesting that we need some more thought
>> about the overall road map. Agglomerating "independent" features onto
>> COPY one at a time is going to lead to a mess, unless they fit into an
>> overall design plan.
>>
>>
>>
>>
>
> I don't disagree with that. But even if we get a roadmap of some set of
> features we want to implement, rolling them all together isn't a good
> way to go.
>
> cheers
>
> andrew
>
>
>

--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com


From: Emmanuel Cecchet <manu(at)asterdata(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Selena Deckelmann <selenamarie(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-07 15:45:11
Message-ID: 4ACCB787.3070207@asterdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas wrote:
> On Wed, Oct 7, 2009 at 11:39 AM, Emmanuel Cecchet <manu(at)asterdata(dot)com> wrote:
>
>> Robert Haas wrote:
>>
>>> On Wed, Oct 7, 2009 at 9:12 AM, Emmanuel Cecchet <manu(at)asterdata(dot)com>
>>> wrote:
>>>
>>>
>>>> Hi all,
>>>>
>>>> I think there is a misunderstanding about what the current patch is
>>>> about.
>>>> The patch includes 2 things:
>>>> - error logging in a table for bad tuples in a COPY operation (see
>>>> http://wiki.postgresql.org/wiki/Error_logging_in_COPY for an example; the
>>>> error message, command and so on are automatically logged)
>>>> - auto-partitioning in a hierarchy of child table if the COPY targets a
>>>> parent table.
>>>> The patch does NOT include:
>>>> - logging errors into a file (a feature we can add later on (next commit
>>>> fest?))
>>>>
>>>>
>>> My failure to have read the patch is showing here, but it seems to me
>>> that error logging to a table could be problematic: if the transaction
>>> aborts, we'll lose the log. If this is in fact a problem, we should
>>> be implementing logging to a file (or stdout) FIRST.
>>>
>>>
>> I don't think this is really a problem. You can always do a SELECT in the
>> error table after the COPY operation if you want to diagnose what happened
>> before the transaction rollbacks.
>>
>
> Uhhh.... if the transaction has aborted, no new commands (including
> SELECT) will be accepted until you roll back.
>
You are suggesting then that it is the COPY command that aborts the
transaction. That would only happen if you had set a limit on the number
of errors that you want to accept in a COPY command (in which case you
know that there is something really wrong with your input file and you
don't want the server to process that file).

manu

--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com


From: Emmanuel Cecchet <manu(at)asterdata(dot)com>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Selena Deckelmann <selenamarie(at)gmail(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-07 15:50:05
Message-ID: 4ACCB8AD.2010803@asterdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Smith wrote:
> Absolutely, that's the whole point of logging to a file in the first
> place. What needs to happen here is that when one is aborted, you
> need to make sure that fact is logged, and with enough information
> (the pid?) to tie it to the COPY that failed. Then someone can crawl
> the logs to figure out what happened and what data did and didn't get
> loaded. Ideally you'd want to have that as database table information
> instead, to allow automated recovery and re-commit in the cases where
> the error wasn't inherent in the data but instead some other type of
> database failure.
If one is aborted, nothing gets loaded anyway. Now if you want a
separate log of the successful commands, I don't think that should apply
just to COPY but to any operation (insert, update, DDL, ...) if you want
to build some automatic retry mechanism. But this seems independent from
COPY to me.

manu

--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Emmanuel Cecchet <manu(at)asterdata(dot)com>
Cc: Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Selena Deckelmann <selenamarie(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-07 16:25:28
Message-ID: 603c8f070910070925n1719c38aj7c589a909db2eb16@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Oct 7, 2009 at 11:45 AM, Emmanuel Cecchet <manu(at)asterdata(dot)com> wrote:
> You are suggesting then that it is the COPY command that aborts the
> transaction. That would only happen if you had set a limit on the number of
> errors that you want to accept in a COPY command (in which case you know
> that there is something really wrong with your input file and you don't want
> the server to process that file).

But I still want my log even if the COPY bombs out. I want to report
the first group of errors back to my user...

...Robert


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Emmanuel Cecchet <manu(at)asterdata(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, Selena Deckelmann <selenamarie(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-07 23:45:38
Message-ID: alpine.GSO.2.01.0910071923070.19867@westnet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, 7 Oct 2009, Emmanuel Cecchet wrote:

> I think there is a misunderstanding about what the current patch is
> about...the patch does NOT include logging errors into a file (a feature
> we can add later on (next commit fest?))

I understand that (as one of the few people who has read the patch it
seems), but as you can might guess from the feedback here that's not the
way I expect your patch is actually going to be handled. Something that
logs only to a table without the interface for the file output at least
specified isn't the sort of thing that this group tends to go along with
very well. Since as it is we haven't even nailed down how the file output
is even going to look or work yet, the table logging isn't very likely to
go in unless it's at least clear that the future plans there will cleanly
stack on top. That's what people are alluding to mentioning the whole
roadmap concept for all this.

I get the impression you were hoping to get another chunk of this commited
on this round. I would guess that realistically it's going to take at
least a few more weeks for the rest of this to get nailed down, and that a
really solid feature should be ready by the next CF instead. You should
actually be proud of the progress you spurred on the COPY options stuff
that's in there now, that idea has been floating around for a while now
but until your patch there wasn't any momentum behind doing something
about it. The problem you're facing now is that so many people have been
thinking about this particular area for so long without any code to talk
about that you're now stuck with all that pent up uncoded design to clear.

> I can put the auto-partitioning in a separate patch if that helps but this
> patch relies on error logging to log possible errors in the routing of
> tuples.

Understood. I know you've gotten some other coding feedback here, and
you've been very good about taking that and producing updated versions
which is appreciated. I would recommend that the next time you resubmit
this, if you could break the logging and partitioning patches into a pair
that depend on one another, that would make life easier for everyone else
(albeit probably a harder for you). At that point we should be set to
have some others take over some of that tweaking, with myself, Selena, and
Jeff all interested and capable of helping out here.

> If you prefer to postpone the auto-partitioning to the next commit fest,
> I can strip it from the current patch and re-submit it for the next fest
> (but it's just 2 isolated methods really easy to review).

That was one of points I was trying to make--that would be an easier to
review by itself, but as it is people interested in it can't consider it
without also staring at the logging stuff. And people who are focusing on
the logging bits find it distracting, so nobody is really happy with the
current combined patch.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Selena Deckelmann <selenamarie(at)gmail(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-07 23:52:15
Message-ID: alpine.GSO.2.01.0910071946150.19867@westnet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, 7 Oct 2009, Robert Haas wrote:

> On Wed, Oct 7, 2009 at 3:17 AM, Greg Smith <gsmith(at)gregsmith(dot)com> wrote:
>
>> I doubt taskmaster Robert is going to let this one linger around with
>> scope creep for too long before being pushed out to the next
>> CommitFest.
>
> I'm can't decide whether to feel good or bad about that appelation, so
> I'm going with both.

That was a compliment on your project management skills. Keeping the CF
work moving forward steadily is both unglamorous and extremely valuable,
and I don't think anyone else even understands why you've volunteered to
handle so much of it. But I know I appreciate it.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Selena Deckelmann <selenamarie(at)gmail(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-08 01:07:50
Message-ID: 603c8f070910071807p8fe7568mcd2fed02a584d1f4@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Oct 7, 2009 at 7:52 PM, Greg Smith <gsmith(at)gregsmith(dot)com> wrote:
> On Wed, 7 Oct 2009, Robert Haas wrote:
>
>> On Wed, Oct 7, 2009 at 3:17 AM, Greg Smith <gsmith(at)gregsmith(dot)com> wrote:
>>
>>> I doubt taskmaster Robert is going to let this one linger around with
>>> scope creep for too long before being pushed out to the next CommitFest.
>>
>> I'm can't decide whether to feel good or bad about that appelation, so
>> I'm going with both.
>
> That was a compliment on your project management skills.  Keeping the CF
> work moving forward steadily is both unglamorous and extremely valuable, and
> I don't think anyone else even understands why you've volunteered to handle
> so much of it.  But I know I appreciate it.

Thanks. I'm not sure I completely understand it, either. I'm sort of
hoping someone else will step up to the plate, but in the absence of
such a person I kind of hate to leave it hanging. Some defective part
of my brain enjoys seeing things run smoothly more than it enjoys
being lazy.

...Robert


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Emmanuel Cecchet <manu(at)asterdata(dot)com>
Cc: Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-08 02:30:57
Message-ID: 603c8f070910071930t42c2a416tbef4d57b7fe3ca7f@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Sep 25, 2009 at 10:01 AM, Emmanuel Cecchet <manu(at)asterdata(dot)com> wrote:
> Robert,
>
> Here is the new version of the patch that applies to CVS HEAD as of this
> morning.
>
> Emmanuel

I took a look at this patch tonight and, having now read through some
of it, I have some more detailed comments.

It seems quite odd to me that when COPY succeeds but there are errors,
the transaction commits. The only indication that some of my data
didn't end up in the table is that the output says "COPY n" where n is
less than the total number of rows I attempted to copy. On the other
hand, it would be equally bad if the transaction aborted, because then
my error logging table would not get populated - I note that that's
the behavior we do get if the max errors threshold is exceeded. I'm
not sure what the right answer is here, but it needs some thought and
discussion. I think at a minimum the caller needs some indication of
the number of FAILED rows, not just the number of succesful ones.

What's really bad about this is that a flag called "error_logging" is
actually changing the behavior of the command in a way that is far
more dramatic than (and doesn't actually have much to do with) error
logging. It's actually making a COPY command succeed that would
otherwise have failed. That's not just a logging change.

I think the underlying problem is that there is more than one feature
here, and they're kind of mixed together. The fact that the
partitioning code needs to be separated from the error logging stuff
has already been discussed, but I think it actually needs to be broken
down even further. I think there are three separate features in the
error logging code:

A. Ability to ignore a certain number of errors (of certain types?)
and still get the other tuples into the table anyway.
B. Ability to return error information in a structured format rather
than as an exception message.
C. Ability to direct the structured error messages to a table.

Each of those features deserves a separate discussion to decide
whether we want it and how best to implement it. Personally, I think
we should skip (C), at least as a starting point. Instead of logging
to a table, I think we should consider making COPY return the tuples
indicating the error as a query result, the same way EXPLAIN returns
the query plan. It looks like this would eliminate the need for at
least three of the new COPY options without losing much functionality.

I also think that, if possible, (A) and (B) should be implemented as
separate features, so that each one can be used separately or both can
be used in combination.

Some other notes:

1. The copy_errorlogging.out regression test fails, at least on my
machine. I guess this problem was discussed previously, but perhaps
you should change the setup of the test in some way so that you can
generate a patch that doesn't require manual fiddling to get it the
regression tests to pass.

2. The error logging doesn't seem to work in all cases.

rhaas=# copy foo from '/home/rhaas/x' (error_logging,
error_logging_table_name 'errlogging');
ERROR: duplicate key value violates unique constraint "foo_pkey"
DETAIL: Key (id)=(1) already exists.
CONTEXT: COPY foo, line 1: "1 Robert"

I assume that this fails because the code is trained to catch some
specific category of errors and redirect those to the error logging
table while letting everything take its natural course. I don't feel
that that's a terribly useful behavior, and on the other hand, I'm not
sure it's going to be possible to fix it without killing performance.

3. The option checking for the error_logging options is insufficient.
For example, if ERROR_LOGGING_TABLE_NAME is specified but
ERROR_LOGGING is not specified, COPY still works (and just ignores the
ERROR_LOGGING_TABLE_NAME). Note that this is quite different from what
HEADER does, for example.

4. Specifiying an error_logging_table_name with spaces or other funny
characters in it fails.

COPY foo FROM STDIN (ERROR_LOGGING, ERROR_LOGGING_TABLE_NAME 'foo bar baz');

5. The assumption that the public schema is an appropriate default
does not seem right to me. I would think that we would want to
default to the first schema in the user's search path.

6. errorlogging.c is poorly named, because it has specifically to do
with logging errors in COPY, rather than with error logging in
general; there's also no reason for it to live in utils/misc. Also,
it does not include the same headers as the other files in the same
directory.

7. There's no need for the OidLinkedList structure; we already have a
system for handling lists of OIDs. See pg_list.h.

8. The OID cache is interesting; have you benchmarked this to see how
much it improves performance? If so, you should describe your
methodology and performance results. It might be worthwhile to leave
this out of the first version of the partitioning patch for simplicity
and add submit as a follow-on performance patch. Do we really need a
GUC to size this cache, or is there a size beyond which it doesn't
help much?

9. The style of the code is not consistent with what PostgreSQL does
elsewhere. The use of '\see' is very distracting. The style of
comments does not consistently patch the PostgreSQL style: in places
comments start with "/**", and there's one in copy.c that reads "/*
<--- Error logging function ---> */". Also, the standard way to write
a multiline comment is with the text starting on the line following
"/*", not the "/*" line itself. Also, some variables are declared as
"type * name" rather than "type *name".

10. This patch twiddles with the regression tests in ways that are not
related to the functionality it is adding; consistent with the idea
that a patch should do one thing and do it well, you should revert
this part.

Overall, I think that this patch is not very close to being
committable and accordingly I am going to mark it as Returned with
Feedback. However, I hope that the discussion will continue and that
we can get consensus on some changes for next CommitFest.

...Robert


From: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Emmanuel Cecchet <manu(at)asterdata(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-08 08:42:09
Message-ID: 87vdiqgtum.fsf@hi-media-techno.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> What's really bad about this is that a flag called "error_logging" is
> actually changing the behavior of the command in a way that is far
> more dramatic than (and doesn't actually have much to do with) error
> logging. It's actually making a COPY command succeed that would
> otherwise have failed. That's not just a logging change.

That's what has been asked for, a COPY that is able to load the good
rows in the presence of bad rows. I'd rather change the option name than
the behavior. Pretty please.

Regards,
--
dim


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
Cc: Emmanuel Cecchet <manu(at)asterdata(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-08 10:43:30
Message-ID: 603c8f070910080343w791a688aq6431f9bc48badfed@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Oct 8, 2009 at 4:42 AM, Dimitri Fontaine <dfontaine(at)hi-media(dot)com> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> What's really bad about this is that a flag called "error_logging" is
>> actually changing the behavior of the command in a way that is far
>> more dramatic than (and doesn't actually have much to do with) error
>> logging.  It's actually making a COPY command succeed that would
>> otherwise have failed.  That's not just a logging change.
>
> That's what has been asked for, a COPY that is able to load the good
> rows in the presence of bad rows. I'd rather change the option name than
> the behavior. Pretty please.

I'm a little mystified by this response since I spent several
paragraphs following the one that you have quoted here explaining how
I think we should approach the problem of providing the features that
are currently all encapsulated under the mantle of "error logging". I
don't think changing the name is going to be sufficient by itself,
but, well, go back and read what I suggested (and comment on it if you
have any thoughts or just want to say +/-1).

Lest there be any unclarity, I am NOT trying to shoot down this
feature with my laser-powered bazooka. What I AM trying to do is
point out - as early as possible - things that I believe that a
hypothetical committer would likely also object to. That's the major
point of having non-committer reviewers, at least AIUI. I am not
opposed to the underlying features except to the extent that they have
unfixable design problems. I believe that they CURRENTLY have design
problems, but I'm hopeful that, with some discussion, we can agree on
a way forward. I think, though, that we should try to keep the
discussion technical (how well does this work?) rather than a
referendum on the underlying feature (which someone might object to,
but the sheer level of interest in this patch is a fairly compelling
argument that there is support for at least some of what it is trying
to do).

...Robert


From: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Emmanuel Cecchet <manu(at)asterdata(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-08 12:34:33
Message-ID: 87pr8ydpye.fsf@hi-media-techno.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> I'm a little mystified by this response since I spent several
> paragraphs following the one that you have quoted here explaining how
> I think we should approach the problem of providing the features that
> are currently all encapsulated under the mantle of "error logging".

Yeah sorry I was to quick to answer. Basically having the bad rows
returned to the client the same way EXPLAIN does feels strange. Not very
scalable and sounds uneasy to manage client side...

So it feels to me like when you talk about postponing the bad lines
rejection handling you're in fact postponing it all, as that's the thing
we're wanting this patch for. I couldn't really parse if your proposal
is a step towards having a better rejection handling, though.

Hope this clarifies it, sorry again for bad try at being terse,

Regards,
--
Dimitri Fontaine
PostgreSQL DBA, Architecte


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Emmanuel Cecchet <manu(at)asterdata(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-08 12:40:36
Message-ID: 1255005636.26302.357.camel@ebony.2ndQuadrant
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On Wed, 2009-10-07 at 22:30 -0400, Robert Haas wrote:
> On Fri, Sep 25, 2009 at 10:01 AM, Emmanuel Cecchet <manu(at)asterdata(dot)com> wrote:
> > Robert,
> >
> > Here is the new version of the patch that applies to CVS HEAD as of this
> > morning.
> >
> > Emmanuel
>
> I took a look at this patch tonight and, having now read through some
> of it, I have some more detailed comments.
>
> It seems quite odd to me that when COPY succeeds but there are errors,
> the transaction commits. The only indication that some of my data
> didn't end up in the table is that the output says "COPY n" where n is
> less than the total number of rows I attempted to copy. On the other
> hand, it would be equally bad if the transaction aborted, because then
> my error logging table would not get populated - I note that that's
> the behavior we do get if the max errors threshold is exceeded. I'm
> not sure what the right answer is here, but it needs some thought and
> discussion. I think at a minimum the caller needs some indication of
> the number of FAILED rows, not just the number of succesful ones.
>
> What's really bad about this is that a flag called "error_logging" is
> actually changing the behavior of the command in a way that is far
> more dramatic than (and doesn't actually have much to do with) error
> logging. It's actually making a COPY command succeed that would
> otherwise have failed. That's not just a logging change.
>
> I think the underlying problem is that there is more than one feature
> here, and they're kind of mixed together. The fact that the
> partitioning code needs to be separated from the error logging stuff
> has already been discussed, but I think it actually needs to be broken
> down even further. I think there are three separate features in the
> error logging code:
>
> A. Ability to ignore a certain number of errors (of certain types?)
> and still get the other tuples into the table anyway.
> B. Ability to return error information in a structured format rather
> than as an exception message.
> C. Ability to direct the structured error messages to a table.
>
> Each of those features deserves a separate discussion to decide
> whether we want it and how best to implement it. Personally, I think
> we should skip (C), at least as a starting point. Instead of logging
> to a table, I think we should consider making COPY return the tuples
> indicating the error as a query result, the same way EXPLAIN returns
> the query plan. It looks like this would eliminate the need for at
> least three of the new COPY options without losing much functionality.
>
> I also think that, if possible, (A) and (B) should be implemented as
> separate features, so that each one can be used separately or both can
> be used in combination.

I don't see any logical issues with what has been proposed. Current COPY
allows k=0 cumulative errors before it aborts. This patch allows us to
provide a parameter to vary k. If we get nerrors > k then the COPY
should abort. If nerrors <= k then the COPY can commit. Exactly what we
need. Perhaps the parameter should simply be called max_errors rather
than error_logging, so the meaning is clear.

We must have detailed error reporting for each row individually,
otherwise we will not be able to correct the errors. 125346 rows loaded,
257 errors is not really useful information, so I wouldn't bother trying
to change the return info to supply the number of errors. If you defined
an error file you know you need to check it. If it doesn't exist, no
errors. Simple.

For me, A and B are inseparable. Dimitri's idea to have an error file
and a error reason file seems very good to me.

C may be important because of security concerns regarding writing error
files but it isn't critical in first commit.

(Not really in reply to Robert: Most database loaders define
max_num_bad_rows as a percentage of the size of the file. Otherwise you
need to read the file to see how big it is before you decide an
appropriate setting, which if it is very big is a bad idea).

--
Simon Riggs www.2ndQuadrant.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
Cc: Emmanuel Cecchet <manu(at)asterdata(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-08 13:42:28
Message-ID: 603c8f070910080642p3d832df7o2dd7717be04b0e1f@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Oct 8, 2009 at 8:34 AM, Dimitri Fontaine <dfontaine(at)hi-media(dot)com> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> I'm a little mystified by this response since I spent several
>> paragraphs following the one that you have quoted here explaining how
>> I think we should approach the problem of providing the features that
>> are currently all encapsulated under the mantle of "error logging".
>
> Yeah sorry I was to quick to answer. Basically having the bad rows
> returned to the client the same way EXPLAIN does feels strange. Not very
> scalable and sounds uneasy to manage client side...

I was thinking of returning the error messages, rather than the rows,
same as what is getting logged to the side table by the current patch.

As for table vs. result-set, let me just say that a patch that returns
result-set will be easier and more likely to be committed, and a
follow-on patch can add a feature to redirect it to a table (or maybe
we'll come up with another solution to that part, like WITH
copy_results AS (COPY ...) INSERT INTO ... SELECT ... FROM
copy_results), which would actually be far more powerful and with many
fewer definitional challenges).

> So it feels to me like when you talk about postponing the bad lines
> rejection handling you're in fact postponing it all, as that's the thing
> we're wanting this patch for. I couldn't really parse if your proposal
> is a step towards having a better rejection handling, though.
>
> Hope this clarifies it, sorry again for bad try at being terse,

Well, right now we are postponing EVERYTHING, because there is a week
left in the CommitFest and this patch isn't close to being
committable. The next time someone takes a crack at this, IMHO, it
should be broken into significantly smaller pieces. Exactly which of
those pieces gets done first is up to the patch author, of course, but
I don't see why someone couldn't have a workable patch with an
interesting subset of this functionality done in time for next CF,
especially given this code to hack on for a starting point.

...Robert


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Emmanuel Cecchet" <manu(at)asterdata(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Emmanuel Cecchet" <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-08 14:22:29
Message-ID: 4ACDAF55020000250002B72D@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> It seems quite odd to me that when COPY succeeds but there are
> errors, the transaction commits. The only indication that some of
> my data didn't end up in the table is that the output says "COPY n"
> where n is less than the total number of rows I attempted to copy.
> On the other hand, it would be equally bad if the transaction
> aborted, because then my error logging table would not get populated
> - I note that that's the behavior we do get if the max errors
> threshold is exceeded. I'm not sure what the right answer is here,
> but it needs some thought and discussion. I think at a minimum the
> caller needs some indication of the number of FAILED rows, not just
> the number of succesful ones.

When the COPY fails due to a high error count, we should be able to
determine:

(1) How many rows were read.
(2) How many of the rows read had errors.
(3) Images of failed rows with errors found on each.

On success, we need the above, plus the failed rows in a format suable
for editing and re-applying as needed.

> Instead of logging to a table, I think we should consider making
> COPY return the tuples indicating the error as a query result, the
> same way EXPLAIN returns the query plan.

This seems attractive, particularly if it can be fed to an INSERT
statement (like INSERT ... SELECT does). The only problem I see with
it is that PostgreSQL seems to not want to return rows from statements
which fail. (I know this is generally considered a feature, but it
sometimes has unfortunate consequences.) Is there a way to satisfy
(3) above on a failed COPY statement if we go this route?

-Kevin


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-08 15:01:53
Message-ID: 23958.1255014113@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> Lest there be any unclarity, I am NOT trying to shoot down this
> feature with my laser-powered bazooka.

Well, if you need somebody to do that --- I took a quick look through
this patch, and it is NOT going to get committed. Not in anything
approximately like its current form. The questions about how the
logging should act don't come anywhere near addressing the fundamental
problem with the patch, which is that IT DOESN'T WORK. You can *not*
suppose that you can just put a PG_TRY block around some processing
and catch any random error and keep going. I see that the patch tries
to avoid this by only catching certain major errcode categories, which
merely makes it useless while still being untrustworthy.

As an example of a case that anyone would expect to work that cannot
work with this approach, I submit unique-index violations. When the
index throws the error, the bad row has already been inserted in the
table (and maybe some other indexes too). The *only* way to clean up
is to abort the transaction/subtransaction so that the row will not be
considered good.

More generally, since we are calling user-defined BEFORE INSERT triggers
in there, we have to assume that absolutely anything at all could have
been done by the triggers. PG_CATCH doesn't even pretend to cope with
that.

So as far as I can see, the only form of COPY error handling that
wouldn't be a cruel joke is to run a separate subtransaction for each
row, and roll back the subtransaction on error. Of course the problems
with that are (a) speed, (b) the 2^32 limit on command counter IDs
would mean a max of 2^32 rows per COPY, which is uncomfortably small
these days. Previous discussions of the problem have mentioned trying
to batch multiple rows per subtransaction to alleviate both issues.
Not easy of course, but that's why it's not been done yet. With a
patch like this you'd also have (c) how to avoid rolling back the
insertions into the logging table.

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Selena Deckelmann <selenamarie(at)gmail(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-08 15:29:01
Message-ID: 20091008152901.GC5510@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas escribió:

> Some defective part of my brain enjoys seeing things run smoothly more
> than it enjoys being lazy.

Strangely, that seems to say you'd make a bad Perl programmer, per Larry
Wall's three virtues.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-08 15:32:17
Message-ID: 603c8f070910080832o3b83a332p63575301a44c4c23@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Oct 8, 2009 at 11:01 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> Lest there be any unclarity, I am NOT trying to shoot down this
>> feature with my laser-powered bazooka.
>
> Well, if you need somebody to do that

Well, I'm trying not to demoralize people who have put in hard work,
however much it may not be usable. Still, your points are well taken.
I did raise some of them (with a lot less technical detail) in my
review of last night.

> So as far as I can see, the only form of COPY error handling that
> wouldn't be a cruel joke is to run a separate subtransaction for each
> row, and roll back the subtransaction on error.  Of course the problems
> with that are (a) speed, (b) the 2^32 limit on command counter IDs
> would mean a max of 2^32 rows per COPY, which is uncomfortably small
> these days.  Previous discussions of the problem have mentioned trying
> to batch multiple rows per subtransaction to alleviate both issues.
> Not easy of course, but that's why it's not been done yet.  With a
> patch like this you'd also have (c) how to avoid rolling back the
> insertions into the logging table.

Yeah. I think it's going to be hard to make this work without having
standalone transactions. One idea would be to start a subtransaction,
insert tuples until one fails, then rollback the subtransaction and
start a new one, and continue on until the error limit is reached. At
the end, if the number of rollbacks is > 0, then roll back the final
subtransaction also. This wouldn't have the property of getting the
unerrorred data into the table, but at least it would let you report
all the errors in a single pass, hopefully without being gratingly
slow. Subcommitting every single row is going to be really painful,
especially after Hot Standby goes in and we have to issue a WAL record
after every 64 subtransactions (AIUI).

Another possible approach, which isn't perfect either, is the idea of
allowing COPY to generate a single column of output of type text[].
That greatly reduces the number of possible error cases, and at least
gets the data into the DB where you can hack on it. But it's still
going to be painful for some use cases.

...Robert


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Selena Deckelmann <selenamarie(at)gmail(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-08 15:33:49
Message-ID: 603c8f070910080833g43154ac1wa31c8a813c39309@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Oct 8, 2009 at 11:29 AM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:
> Robert Haas escribió:
>
>> Some defective part of my brain enjoys seeing things run smoothly more
>> than it enjoys being lazy.
>
> Strangely, that seems to say you'd make a bad Perl programmer, per Larry
> Wall's three virtues.

Don't worry. I have a strong preference for them running smoothly
without me whenever possible. It's only that my hubris and impatience
are capable of overriding my laziness from time to time.

...Robert


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-08 15:50:19
Message-ID: 24759.1255017019@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> Subcommitting every single row is going to be really painful,
> especially after Hot Standby goes in and we have to issue a WAL record
> after every 64 subtransactions (AIUI).

Yikes ... I had not been following that discussion, but that sure sounds
like a deal-breaker. For HS, not this. But back to topic:

> Another possible approach, which isn't perfect either, is the idea of
> allowing COPY to generate a single column of output of type text[].
> That greatly reduces the number of possible error cases, and at least
> gets the data into the DB where you can hack on it. But it's still
> going to be painful for some use cases.

Yeah, that connects to the previous discussion about refactoring COPY
into a series of steps that the user can control.

Ultimately, there's always going to be a tradeoff between speed and
flexibility. It may be that we should just say "if you want to import
dirty data, it's gonna cost ya" and not worry about the speed penalty
of subtransaction-per-row. But that still leaves us with the 2^32
limit. I wonder whether we could break down COPY into sub-sub
transactions to work around that...

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-08 15:59:19
Message-ID: 603c8f070910080859mc4cc2fp7998d3d6c7d5c6c7@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Oct 8, 2009 at 11:50 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Another possible approach, which isn't perfect either, is the idea of
>> allowing COPY to generate a single column of output of type text[].
>> That greatly reduces the number of possible error cases, and at least
>> gets the data into the DB where you can hack on it.  But it's still
>> going to be painful for some use cases.
>
> Yeah, that connects to the previous discussion about refactoring COPY
> into a series of steps that the user can control.
>
> Ultimately, there's always going to be a tradeoff between speed and
> flexibility.  It may be that we should just say "if you want to import
> dirty data, it's gonna cost ya" and not worry about the speed penalty
> of subtransaction-per-row.  But that still leaves us with the 2^32
> limit.  I wonder whether we could break down COPY into sub-sub
> transactions to work around that...

How would that work? Don't you still need to increment the command counter?

...Robert


From: Rod Taylor <rod(dot)taylor(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-08 16:12:40
Message-ID: 751261b20910080912md3176e8ga671c768b4fee055@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Yeah. I think it's going to be hard to make this work without having
> standalone transactions. One idea would be to start a subtransaction,
> insert tuples until one fails, then rollback the subtransaction and
> start a new one, and continue on until the error limit is reached.
>

I've found performance is reasonable, for data with low numbers of errors
(say 1 per 100,000 records or less) doing the following:

SAVEPOINT bulk;
Insert 1000 records using COPY.

If there is an error, rollback to bulk, and step through each line
individually within its own "individual" subtransaction. All good lines are
kept and bad lines are logged; client side control makes logging trivial.

The next set of 1000 records is done in bulk again.

1000 records per savepoint seems to be a good point for my data without too
much time lost to overhead or too many records to retry due to a failing
record. Of course, it is controlled by the client side rather than server
side so reporting back broken records is trivial.

It may be possible to boost performance by:

1) Having copy remember which specific line caused the error. So it can
replace lines 1 through 487 in a subtransaction since it knows those are
successful. Run 488 in its on subtransaction. Run 489 through ... in a new
subtransaction.
2) Increasing the number of records per subtransaction if data is clean. It
wouldn't take long until you were inserting millions of records per
subtransaction for a large data set. This should make the subtransaction
overhead minimal. Small imports would still run slower but very large
imports of clean data should be essentially the same speed in the end.


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-08 16:16:35
Message-ID: 1255018595.11374.292.camel@jd-desktop.unknown.charter.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, 2009-10-08 at 11:59 -0400, Robert Haas wrote:
> On Thu, Oct 8, 2009 at 11:50 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> Another possible approach, which isn't perfect either, is the idea of
> >> allowing COPY to generate a single column of output of type text[].
> >> That greatly reduces the number of possible error cases, and at least
> >> gets the data into the DB where you can hack on it. But it's still
> >> going to be painful for some use cases.
> >
> > Yeah, that connects to the previous discussion about refactoring COPY
> > into a series of steps that the user can control.
> >
> > Ultimately, there's always going to be a tradeoff between speed and
> > flexibility. It may be that we should just say "if you want to import
> > dirty data, it's gonna cost ya" and not worry about the speed penalty
> > of subtransaction-per-row. But that still leaves us with the 2^32
> > limit. I wonder whether we could break down COPY into sub-sub
> > transactions to work around that...
>
> How would that work? Don't you still need to increment the command counter?

Couldn't you just commit each range of subtransactions based on some
threshold?

COPY foo from '/tmp/bar/' COMMIT_THRESHOLD 1000000;

It counts to 1mil, commits starts a new transaction. Yes there would be
1million sub transactions but once it hits those clean, it commits.

?

Joshua D. Drake

>
> ...Robert
>
--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering
If the world pushes look it in the eye and GRR. Then push back harder. - Salamander


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-08 16:21:00
Message-ID: 25557.1255018860@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Thu, Oct 8, 2009 at 11:50 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> I wonder whether we could break down COPY into sub-sub
>> transactions to work around that...

> How would that work? Don't you still need to increment the command counter?

Actually, command counter doesn't help because incrementing the CC
doesn't give you a rollback boundary between rows inserted before it
and afterwards. What I was vaguely imaging was

-- outer transaction for whole COPY

-- sub-transactions that are children of outer transaction

-- sub-sub-transactions that are children of sub-transactions

You'd eat a sub-sub-transaction per row, and start a new sub-transaction
every 2^32 rows.

However, on second thought this really doesn't get us anywhere, it just
moves the 2^32 restriction somewhere else. Once the outer transaction
gets to be more than 2^31 XIDs old, the database is going to stop
because of XID wraparound.

So really we have to find some way to only expend one XID per failure,
not one per row.

Another approach that was discussed earlier was to divvy the rows into
batches. Say every thousand rows you sub-commit and start a new
subtransaction. Up to that point you save aside the good rows somewhere
(maybe a tuplestore). If you get a failure partway through a batch,
you start a new subtransaction and re-insert the batch's rows up to the
bad row. This could be pretty awful in the worst case, but most of the
time it'd probably perform well. You could imagine dynamically adapting
the batch size depending on how often errors occur ...

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: jd(at)commandprompt(dot)com
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-08 16:37:11
Message-ID: 25829.1255019831@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
> Couldn't you just commit each range of subtransactions based on some
> threshold?

> COPY foo from '/tmp/bar/' COMMIT_THRESHOLD 1000000;

> It counts to 1mil, commits starts a new transaction. Yes there would be
> 1million sub transactions but once it hits those clean, it commits.

Hmm, if we were willing to break COPY into multiple *top level*
transactions, that would avoid my concern about XID wraparound.
The issue here is that if the COPY does eventually fail (and there
will always be failure conditions, eg out of disk space), then some
of the previously entered rows would still be there; but possibly
not all of them, depending on whether we batch rows. The latter
property actually bothers me more than the former, because it would
expose an implementation detail to the user. Thoughts?

Also, this does not work if you want the copy to be part of a bigger
transaction, viz
BEGIN;
do something;
COPY ...;
do something else;
COMMIT;

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: jd(at)commandprompt(dot)com, Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-08 16:46:43
Message-ID: 603c8f070910080946o1fbe2245re5b11be4d6e4199c@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Oct 8, 2009 at 12:37 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> "Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
>> Couldn't you just commit each range of subtransactions based on some
>> threshold?
>
>> COPY foo from '/tmp/bar/' COMMIT_THRESHOLD 1000000;
>
>> It counts to 1mil, commits starts a new transaction. Yes there would be
>> 1million sub transactions but once it hits those clean, it commits.
>
> Hmm, if we were willing to break COPY into multiple *top level*
> transactions, that would avoid my concern about XID wraparound.
> The issue here is that if the COPY does eventually fail (and there
> will always be failure conditions, eg out of disk space), then some
> of the previously entered rows would still be there; but possibly
> not all of them, depending on whether we batch rows.

Yeah, I don't feel good about that.

> Also, this does not work if you want the copy to be part of a bigger
> transaction, viz
>        BEGIN;
>        do something;
>        COPY ...;
>        do something else;
>        COMMIT;

Or that.

...Robert


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: <jd(at)commandprompt(dot)com>,"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Emmanuel Cecchet" <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, "Emmanuel Cecchet" <manu(at)asterdata(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Dimitri Fontaine" <dfontaine(at)hi-media(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-08 16:49:33
Message-ID: 4ACDD1CD020000250002B751@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Hmm, if we were willing to break COPY into multiple *top level*
> transactions, that would avoid my concern about XID wraparound.
> The issue here is that if the COPY does eventually fail (and there
> will always be failure conditions, eg out of disk space), then some
> of the previously entered rows would still be there; but possibly
> not all of them, depending on whether we batch rows. The latter
> property actually bothers me more than the former, because it would
> expose an implementation detail to the user. Thoughts?
>
> Also, this does not work if you want the copy to be part of a bigger
> transaction, viz
> BEGIN;
> do something;
> COPY ...;
> do something else;
> COMMIT;

I was considering suggesting multiple top-level transactions, partly
based on the fact that other bulk-load utilities that I've used
support that. Then I realized that those were *client* side
applications, and didn't have to worry about being part of an
enclosing transaction. It seems wrong to break transactional
semantics in a single statement execution. Multiple top-level
transactions could be fine in a bulk-load *application*; but not, in
my view, in the COPY *statement*.

-Kevin


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-08 16:50:41
Message-ID: 603c8f070910080950n1d409d38v40cd4147c961b1dd@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Oct 8, 2009 at 12:21 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Thu, Oct 8, 2009 at 11:50 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> I wonder whether we could break down COPY into sub-sub
>>> transactions to work around that...
>
>> How would that work?  Don't you still need to increment the command counter?
>
> Actually, command counter doesn't help because incrementing the CC
> doesn't give you a rollback boundary between rows inserted before it
> and afterwards.  What I was vaguely imaging was

Oh, right.

> So really we have to find some way to only expend one XID per failure,
> not one per row.

Agreed.

> Another approach that was discussed earlier was to divvy the rows into
> batches.  Say every thousand rows you sub-commit and start a new
> subtransaction.  Up to that point you save aside the good rows somewhere
> (maybe a tuplestore).  If you get a failure partway through a batch,
> you start a new subtransaction and re-insert the batch's rows up to the
> bad row.  This could be pretty awful in the worst case, but most of the
> time it'd probably perform well.  You could imagine dynamically adapting
> the batch size depending on how often errors occur ...

Yeah, I think that's promising. There is of course the possibility
that a row which previously succeeded could fail the next time around,
but most of the time that shouldn't happen, and it should be possible
to code it so that it still behaves somewhat sanely if it does.

...Robert


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-08 17:09:34
Message-ID: alpine.GSO.2.01.0910081304160.25300@westnet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, 8 Oct 2009, Tom Lane wrote:

> It may be that we should just say "if you want to import dirty data,
> it's gonna cost ya" and not worry about the speed penalty of
> subtransaction-per-row.

This goes along with the response I gave on objections to adding other
bits of overhead into COPY. If the performance only suffers when you're
targeting unclean data, the users this feature targets will glady accept
that trade-off. You're still way ahead of the other options here at the
finish line.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Rod Taylor <rod(dot)taylor(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-08 17:19:03
Message-ID: alpine.GSO.2.01.0910081310300.25300@westnet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, 8 Oct 2009, Rod Taylor wrote:

> 1) Having copy remember which specific line caused the error. So it can
> replace lines 1 through 487 in a subtransaction since it knows those are
> successful. Run 488 in its on subtransaction. Run 489 through ... in a
> new subtransaction.

This is the standard technique used in other bulk loaders I'm aware of.

> 2) Increasing the number of records per subtransaction if data is clean.
> It wouldn't take long until you were inserting millions of records per
> subtransaction for a large data set.

You can make it adaptive in both directions with some boundaries. If you
double the batch size every time there's a clean commit, and halve it
every time there's an error, start batching at 1024 and bound to the range
[1,1048576]. That's close to optimal behavior here if combined with the
targeted retry described in (1).

The retry scheduling and batch size parts are the trivial and well
understood parts here. Actually getting all this to play nicely with
transactions and commit failures (rather than just bad data failures) is
what's difficult.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-08 17:26:10
Message-ID: 26693.1255022770@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Thu, Oct 8, 2009 at 12:21 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Another approach that was discussed earlier was to divvy the rows into
>> batches. Say every thousand rows you sub-commit and start a new
>> subtransaction. Up to that point you save aside the good rows somewhere
>> (maybe a tuplestore). If you get a failure partway through a batch,
>> you start a new subtransaction and re-insert the batch's rows up to the
>> bad row. This could be pretty awful in the worst case, but most of the
>> time it'd probably perform well. You could imagine dynamically adapting
>> the batch size depending on how often errors occur ...

> Yeah, I think that's promising. There is of course the possibility
> that a row which previously succeeded could fail the next time around,
> but most of the time that shouldn't happen, and it should be possible
> to code it so that it still behaves somewhat sanely if it does.

Actually, my thought was that failure to reinsert a previously good
tuple should cause us to abort the COPY altogether. This is a
cheap-and-easy way of avoiding sorceror's apprentice syndrome.
Suppose the failures are coming from something like out of disk space,
transaction timeout, whatever ... a COPY that keeps on grinding no
matter what is *not* ideal.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-08 18:43:51
Message-ID: 603c8f070910081143h2232509agd137f4023a4c6315@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Oct 8, 2009 at 1:26 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Thu, Oct 8, 2009 at 12:21 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Another approach that was discussed earlier was to divvy the rows into
>>> batches.  Say every thousand rows you sub-commit and start a new
>>> subtransaction.  Up to that point you save aside the good rows somewhere
>>> (maybe a tuplestore).  If you get a failure partway through a batch,
>>> you start a new subtransaction and re-insert the batch's rows up to the
>>> bad row.  This could be pretty awful in the worst case, but most of the
>>> time it'd probably perform well.  You could imagine dynamically adapting
>>> the batch size depending on how often errors occur ...
>
>> Yeah, I think that's promising.  There is of course the possibility
>> that a row which previously succeeded could fail the next time around,
>> but most of the time that shouldn't happen, and it should be possible
>> to code it so that it still behaves somewhat sanely if it does.
>
> Actually, my thought was that failure to reinsert a previously good
> tuple should cause us to abort the COPY altogether.  This is a
> cheap-and-easy way of avoiding sorceror's apprentice syndrome.
> Suppose the failures are coming from something like out of disk space,
> transaction timeout, whatever ... a COPY that keeps on grinding no
> matter what is *not* ideal.

I think you handle that by putting a cap on the total number of errors
you're willing to accept (and in any event you'll always skip the row
that failed, so forward progress can't cease altogether). For out of
disk space or transaction timeout, sure, but you might also have
things like a serialization error that occurs on the reinsert that
didn't occur on the original. You don't want that to kill the whole
bulk load, I would think.

...Robert


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Selena Deckelmann <selenamarie(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-08 22:23:03
Message-ID: 200910082223.n98MN3Y28389@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Dimitri Fontaine wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> > It will be best to have the ability to have a specific rejection reason
> > for each row rejected. That way we will be able to tell the difference
> > between uniqueness violation errors, invalid date format on col7, value
> > fails check constraint on col22 etc..
>
> In case that helps, what pgloader does is logging into two files, named
> after the table name (not scalable to server-side solution):
> table.rej --- lines it could not load, straight from source file
> table.rej.log --- errors as given by the server, plus pgloader comment
>
> The pgloader comment is necessary for associating each log line to the
> source file line, as it's operating by dichotomy, the server always
> report error on line 1.
>
> The idea of having two errors file could be kept though, the aim is to
> be able to fix the setup then COPY again the table.rej file when it
> happens the errors are not on the file content. Or for loading into
> another table, with all columns as text or bytea, then clean data from a
> procedure.

What would be _cool_ would be to add the ability to have comments in the
COPY files, like \#, and then the copy data lines and errors could be
adjacent. (Because of the way we control COPY escaping, adding \# would
not be a problem. We have \N for null, for example.)

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Selena Deckelmann <selenamarie(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-08 22:35:05
Message-ID: 4ACE6919.3080609@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruce Momjian wrote:
> What would be _cool_ would be to add the ability to have comments in the
> COPY files, like \#, and then the copy data lines and errors could be
> adjacent. (Because of the way we control COPY escaping, adding \# would
> not be a problem. We have \N for null, for example.)
>
>

That wouldn't be of any use with CSV files, which in my experience are
far more likely to be sources of data errors than input files in
Postgres Text format.

cheers

andrew


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Emmanuel Cecchet <manu(at)asterdata(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-08 22:38:40
Message-ID: 200910082238.n98Mcet02140@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas wrote:
> Each of those features deserves a separate discussion to decide
> whether we want it and how best to implement it. Personally, I think
> we should skip (C), at least as a starting point. Instead of logging
> to a table, I think we should consider making COPY return the tuples
> indicating the error as a query result, the same way EXPLAIN returns
> the query plan. It looks like this would eliminate the need for at
> least three of the new COPY options without losing much functionality.

Can we capture EXPLAIN output into a table? If so, can the COPY error
output be sent to a table so we don't need to add this functionality to
COPY specifically?

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Selena Deckelmann <selenamarie(at)gmail(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-08 22:48:00
Message-ID: 200910082248.n98Mm0W03552@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas wrote:
> > That was a compliment on your project management skills. ?Keeping the CF
> > work moving forward steadily is both unglamorous and extremely valuable, and
> > I don't think anyone else even understands why you've volunteered to handle
> > so much of it. ?But I know I appreciate it.
>
> Thanks. I'm not sure I completely understand it, either. I'm sort of
> hoping someone else will step up to the plate, but in the absence of
> such a person I kind of hate to leave it hanging. Some defective part
> of my brain enjoys seeing things run smoothly more than it enjoys
> being lazy.

I am the same.

Also, I do get asked the "Where did this Robert Haas come from" question
occasionally. ;-)

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-08 23:15:14
Message-ID: 1255043714.6335.6.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, 2009-10-08 at 12:21 -0400, Tom Lane wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> > On Thu, Oct 8, 2009 at 11:50 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> I wonder whether we could break down COPY into sub-sub
> >> transactions to work around that...
>
> > How would that work? Don't you still need to increment the command counter?
>
> Actually, command counter doesn't help because incrementing the CC
> doesn't give you a rollback boundary between rows inserted before it
> and afterwards. What I was vaguely imaging was
>
> -- outer transaction for whole COPY
>
> -- sub-transactions that are children of outer transaction
>
> -- sub-sub-transactions that are children of sub-transactions
>
> You'd eat a sub-sub-transaction per row, and start a new sub-transaction
> every 2^32 rows.
>
> However, on second thought this really doesn't get us anywhere, it just
> moves the 2^32 restriction somewhere else. Once the outer transaction
> gets to be more than 2^31 XIDs old, the database is going to stop
> because of XID wraparound.
>
> So really we have to find some way to only expend one XID per failure,
> not one per row.

I discovered a few days back that ~550 subtransactions is sufficient to
blow max_stack_depth. 1 subtransaction per error doesn't allow many
errors.

--
Simon Riggs www.2ndQuadrant.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Selena Deckelmann <selenamarie(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-08 23:30:50
Message-ID: 1255044650.6335.15.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, 2009-10-08 at 18:23 -0400, Bruce Momjian wrote:
> Dimitri Fontaine wrote:
> > Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> > > It will be best to have the ability to have a specific rejection reason
> > > for each row rejected. That way we will be able to tell the difference
> > > between uniqueness violation errors, invalid date format on col7, value
> > > fails check constraint on col22 etc..
> >
> > In case that helps, what pgloader does is logging into two files, named
> > after the table name (not scalable to server-side solution):
> > table.rej --- lines it could not load, straight from source file
> > table.rej.log --- errors as given by the server, plus pgloader comment
> >
> > The pgloader comment is necessary for associating each log line to the
> > source file line, as it's operating by dichotomy, the server always
> > report error on line 1.
> >
> > The idea of having two errors file could be kept though, the aim is to
> > be able to fix the setup then COPY again the table.rej file when it
> > happens the errors are not on the file content. Or for loading into
> > another table, with all columns as text or bytea, then clean data from a
> > procedure.
>
> What would be _cool_ would be to add the ability to have comments in the
> COPY files, like \#, and then the copy data lines and errors could be
> adjacent. (Because of the way we control COPY escaping, adding \# would
> not be a problem. We have \N for null, for example.)

That was my idea also until I heard Dimitri's two file approach.

Having a pristine data file and a matching error file means you can
potentially just resubmit the error file again. Often you need to do
things like trap RI errors and then resubmit them at a later time once
the master rows have entered the system.

--
Simon Riggs www.2ndQuadrant.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-09 06:29:04
Message-ID: 1255069744.6335.32.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 2009-10-09 at 00:15 +0100, Simon Riggs wrote:
> On Thu, 2009-10-08 at 12:21 -0400, Tom Lane wrote:
> >
> > You'd eat a sub-sub-transaction per row, and start a new sub-transaction
> > every 2^32 rows.
> >
> > However, on second thought this really doesn't get us anywhere, it just
> > moves the 2^32 restriction somewhere else. Once the outer transaction
> > gets to be more than 2^31 XIDs old, the database is going to stop
> > because of XID wraparound.
> >
> > So really we have to find some way to only expend one XID per failure,
> > not one per row.
>
> I discovered a few days back that ~550 subtransactions is sufficient to
> blow max_stack_depth. 1 subtransaction per error doesn't allow many
> errors.

Not meaning to come up with problems, nor direct them at Tom, this is
just a convenient place to put in a few thoughts.

Another thing that has occurred to me is that RI checks are currently
resolved at end of statement and could end up rejecting any/all rows
loaded. If we break down the load into subtransaction pieces we would
really want the RI checks on the rows to be performed during the
subtransaction that makes them. The good thing about that is that it
would lend itself to holding successful checks in a hash table to allow
a fast path optimization of continual re-checks of same values.

--
Simon Riggs www.2ndQuadrant.com


From: Hannu Krosing <hannu(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-09 09:10:55
Message-ID: 1255079455.981.1.camel@hvost1700
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, 2009-10-08 at 11:32 -0400, Robert Haas wrote:

> Another possible approach, which isn't perfect either, is the idea of
> allowing COPY to generate a single column of output of type text[].
> That greatly reduces the number of possible error cases,

maybe make it bytea[] to further reduce error cases caused by charset
incompatibilities ?

> and at least
> gets the data into the DB where you can hack on it. But it's still
> going to be painful for some use cases.
>
> ...Robert

--
Hannu Krosing http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
Services, Consulting and Training


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-09 13:37:34
Message-ID: 6859.1255095454@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> On Thu, 2009-10-08 at 12:21 -0400, Tom Lane wrote:
>> So really we have to find some way to only expend one XID per failure,
>> not one per row.

> I discovered a few days back that ~550 subtransactions is sufficient to
> blow max_stack_depth. 1 subtransaction per error doesn't allow many
> errors.

I assume you are talking about 550 nested levels, not 550 sequential
subtransactions, so that doesn't seem particularly relevant.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-09 13:44:42
Message-ID: 6960.1255095882@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> Another thing that has occurred to me is that RI checks are currently
> resolved at end of statement and could end up rejecting any/all rows
> loaded. If we break down the load into subtransaction pieces we would
> really want the RI checks on the rows to be performed during the
> subtransaction that makes them. The good thing about that is that it
> would lend itself to holding successful checks in a hash table to allow
> a fast path optimization of continual re-checks of same values.

If we did that it would guarantee that cases like self-referential RI
would fail. (Think parent-child links in a tree, for example.)
I see the point about wishing that such checks would be part of the
per-row error handling, but we can't just unconditionally do things
that way.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Hannu Krosing <hannu(at)2ndQuadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-09 13:52:18
Message-ID: 7082.1255096338@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hannu Krosing <hannu(at)2ndQuadrant(dot)com> writes:
> On Thu, 2009-10-08 at 11:32 -0400, Robert Haas wrote:
>> Another possible approach, which isn't perfect either, is the idea of
>> allowing COPY to generate a single column of output of type text[].
>> That greatly reduces the number of possible error cases,

> maybe make it bytea[] to further reduce error cases caused by charset
> incompatibilities ?

That seems likely to be considerably less convenient and more error
prone, as it now puts it on the user to do the correct conversion.

It does bring up an interesting point for error handling though, which
is what do we do with rows that fail encoding conversion? For logging
to a file we could/should just decree that we write out the original,
allegedly-in-the-client-encoding data. I'm not sure what we do about
logging to a table though. The idea of storing bytea is pretty
unpleasant but there might be little choice.

regards, tom lane


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Hannu Krosing <hannu(at)2ndQuadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-09 15:42:19
Message-ID: alpine.GSO.2.01.0910091133490.29520@westnet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 9 Oct 2009, Tom Lane wrote:

> what do we do with rows that fail encoding conversion? For logging to a
> file we could/should just decree that we write out the original,
> allegedly-in-the-client-encoding data. I'm not sure what we do about
> logging to a table though. The idea of storing bytea is pretty
> unpleasant but there might be little choice.

I think this detail can get punted as documented and the error logged, but
not actually handled perfectly. In most use cases I've seen here, saving
the rows to the "reject" file/table is a convenience rather than a hard
requirement anyway. You can always dig them back out of the original
again if you see an encoding error in the logs, and it's rare you can
completely automate that anyway.

The main purpose of the reject file/table is to accumulate things you
might fix by hand or systematic update (i.e. add ",\N" for a missing
column when warranted) before trying a re-import for review. I suspect
the users of this feature would be OK with knowing that can't be 100%
accurate in the face of encoding errors. It's more important that in the
usual case, things like bad delimiters and missing columns, that you can
easily manipulate the rejects as simple text. Making that harder just for
this edge case wouldn't match the priorities of the users of this feature
I've encountered.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Hot Standby and 64+ subxids (was COPY enhancements)
Date: 2009-10-12 13:06:33
Message-ID: 1255352793.15590.82.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, 2009-10-08 at 11:50 -0400, Tom Lane wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> > Subcommitting every single row is going to be really painful,
> > especially after Hot Standby goes in and we have to issue a WAL record
> > after every 64 subtransactions (AIUI).
>
> Yikes ... I had not been following that discussion, but that sure sounds
> like a deal-breaker. For HS, not this.

Probably worth expanding this thought...

HS writes a WAL record for subtransactions at the point that the subxid
cache overflows for any single transaction. Current cache size = 64.
Top-level transaction then writes one additional WAL record every
additional 64 subxids after that. These are known as xid assignment
records.

If we execute transactions that completely fit in subxid cache we don't
write any WAL records at all. There is no cumulative effect. So in most
applications, we never write xid assignment records at all.

Does that cover your objection, or do you see other issues?

--
Simon Riggs www.2ndQuadrant.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-12 13:13:15
Message-ID: 1255353195.15590.93.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, 2009-10-08 at 11:01 -0400, Tom Lane wrote:

> So as far as I can see, the only form of COPY error handling that
> wouldn't be a cruel joke is to run a separate subtransaction for each
> row, and roll back the subtransaction on error. Of course the
> problems
> with that are (a) speed, (b) the 2^32 limit on command counter IDs
> would mean a max of 2^32 rows per COPY, which is uncomfortably small
> these days. Previous discussions of the problem have mentioned trying
> to batch multiple rows per subtransaction to alleviate both issues.
> Not easy of course, but that's why it's not been done yet. With a
> patch like this you'd also have (c) how to avoid rolling back the
> insertions into the logging table.

(d) using too many xids will force the system to begin immediate
wraparound-avoidance vacuuming to freeze rows.

Dimitri's pgloader is looking even more attractive, not least because it
exists and it works. (And is the reason I personally stopped considering
the COPY-error-logging feature as important).

--
Simon Riggs www.2ndQuadrant.com


From: Emmanuel Cecchet <manu(at)frogthinker(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-13 13:57:44
Message-ID: 4AD48758.7090502@frogthinker.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Ultimately, there's always going to be a tradeoff between speed and
> flexibility. It may be that we should just say "if you want to import
> dirty data, it's gonna cost ya" and not worry about the speed penalty
> of subtransaction-per-row. But that still leaves us with the 2^32
> limit. I wonder whether we could break down COPY into sub-sub
> transactions to work around that...
>
Regarding that tradeoff between speed and flexibility I think we could
propose multiple options:
- maximum speed: current implementation fails on first error
- speed with error logging: copy command fails if there is an error but
continue to log all errors
- speed with error logging best effort: no use of sub-transactions but
errors that can safely be trapped with pg_try/catch (no index violation,
no before insert trigger, etc...) are logged and command can complete
- pre-loading (2-phase copy): phase 1: copy good tuples into a [temp]
table and bad tuples into an error table. phase 2: push good tuples to
destination table. Note that if phase 2 fails, it could be retried since
the temp table would be dropped only on success of phase 2.
- slow but flexible: have every row in a sub-transaction -> is there any
real benefits compared to pg_loader?

Tom was also suggesting 'refactoring COPY into a series of steps that
the user can control'. What would these steps be? Would that be per row
and allow to discard a bad tuple?

Emmanuel

--
Emmanuel Cecchet
FTO @ Frog Thinker
Open Source Development & Consulting
--
Web: http://www.frogthinker.org
email: manu(at)frogthinker(dot)org
Skype: emmanuel_cecchet


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Emmanuel Cecchet <manu(at)frogthinker(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-13 14:35:15
Message-ID: 28375.1255444515@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Emmanuel Cecchet <manu(at)frogthinker(dot)org> writes:
> - speed with error logging best effort: no use of sub-transactions but
> errors that can safely be trapped with pg_try/catch (no index violation,

There aren't any. You can *not* put a try/catch around arbitrary code
without a subtransaction. Don't even think about it.

regards, tom lane


From: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
To: Emmanuel Cecchet <manu(at)frogthinker(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-13 16:30:27
Message-ID: 87aazv8dek.fsf@hi-media-techno.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Emmanuel Cecchet <manu(at)frogthinker(dot)org> writes:
> Tom was also suggesting 'refactoring COPY into a series of steps that the
> user can control'. What would these steps be? Would that be per row and
> allow to discard a bad tuple?

The idea is to have COPY usable from a general SELECT query so that the
user control what happens. Think of an SRF returning bytea[] or some
variation on the theme.

Maybe WITH to the rescue:

WITH csv AS (
-- no error here as the destination table is in memory tuple store,
-- assuming we have adunstan patch to ignore rows with too few or
-- too many columns
COPY csv(a, b, c, d) FROM STDIN WITH CSV HEADER --- and said options
)
INSERT INTO destination
SELECT a, b, f(a + b - d), strange_timestamp_reader(c)
FROM csv
WHERE validity_check_passes(a, b, c, d);

That offers complete control to the user about the stages that transform
the data. In a previous thread some ideas I forgot the details offered
to the users some more control, but I don't have the time right now to
search in the archives.

Regards,
--
Dimitri Fontaine
PostgreSQL DBA, Architecte


From: Emmanuel Cecchet <manu(at)asterdata(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-13 16:34:31
Message-ID: 4AD4AC17.9010606@asterdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Emmanuel Cecchet <manu(at)frogthinker(dot)org> writes:
>
>> - speed with error logging best effort: no use of sub-transactions but
>> errors that can safely be trapped with pg_try/catch (no index violation,
>>
>
> There aren't any. You can *not* put a try/catch around arbitrary code
> without a subtransaction. Don't even think about it.
>
Well then why the tests provided with the patch are working?
I hear you when you say that it is not a generally applicable idea, but
it seems that at least a couple of errors can be trapped with this
mechanism.

Emmanuel

--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Emmanuel Cecchet <manu(at)asterdata(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-13 17:10:05
Message-ID: 9416.1255453805@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Emmanuel Cecchet <manu(at)asterdata(dot)com> writes:
> Tom Lane wrote:
>> There aren't any. You can *not* put a try/catch around arbitrary code
>> without a subtransaction. Don't even think about it.
>>
> Well then why the tests provided with the patch are working?

Because they carefully exercise only a tiny fraction of the system.
The key word in my sentence above is "arbitrary". You don't know what
a datatype input function might try to do, let alone triggers or other
functions that COPY might have to invoke. They might do things that
need to be cleaned up after, and subtransaction rollback is the only
mechanism we have for that.

regards, tom lane


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-19 04:40:30
Message-ID: 9362e74e0910182140t53d817ber2eaea815a4de4948@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Actually i thought of a solution for the wrap-around sometime back. Let me
try to put my initial thoughts into it. May be it would get refined over
conversation.

Transaction wrap-around failure

Actually this problem is present even in today's transaction id scenario and
the only way we avoid is by using freezing. Can we use a similar approach?
This freezing should mean that we are freezing the sub-transaction in order
to avoid the sub-transaction wrap around failure. I think when we use a
sub-transaction, the tuple would have xmin as the sub-transaction id(correct
me, if i am wrong). If the tuple insertion becomes successful, we will make
it equal to the parent transaction id. This way, we get a chance to re-use
the sub-transaction id, previously used. This would mean accessing the tuple
again after the sub-transaction completes

On the recovery front, the freezing should get logged into redo. With this
approach, we need only one sub-transaction id for the entire copy. If insert
gets successful for a copied row, we goto the tuple again and sub-freeze a
tuple. If it gets un-successful, we rollback the sub-transaction. But for
every un-successful transaction, we need a transaction id. May be in order
to avoid this, we can have one transaction id to mark the failures and
freeze all the failed rows for that transaction id.

I am just throwing out an idea for consideration.

Thanks,
Gokul.


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-19 15:21:48
Message-ID: 20091019152148.GA3352@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Gokulakannan Somasundaram escribió:

> Actually this problem is present even in today's transaction id scenario and
> the only way we avoid is by using freezing. Can we use a similar approach?
> This freezing should mean that we are freezing the sub-transaction in order
> to avoid the sub-transaction wrap around failure.

This would mean we would have to go over the data inserted by the
subtransaction and mark it as "subxact frozen". Some sort of sub-vacuum
if you will (because it obviously needs to work inside a transaction).
Doesn't sound real workable to me.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-19 15:34:30
Message-ID: 603c8f070910190834u6169c3f9jed2ad2fecf24e574@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Oct 19, 2009 at 11:21 AM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:
> Gokulakannan Somasundaram escribió:
>
>> Actually this problem is present even in today's transaction id scenario and
>> the only way we avoid is by using freezing. Can we use a similar approach?
>> This freezing should mean that we are freezing the sub-transaction in order
>> to avoid the sub-transaction wrap around failure.
>
> This would mean we would have to go over the data inserted by the
> subtransaction and mark it as "subxact frozen".  Some sort of sub-vacuum
> if you will (because it obviously needs to work inside a transaction).
> Doesn't sound real workable to me.

Especially because the XID consumed by the sub-transaction would still
be consumed, advancing the global XID counter. Reclaiming the XIDs
after the fact doesn't fix anything as far as I can see.

...Robert


From: Emmanuel Cecchet <manu(at)asterdata(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-20 19:00:47
Message-ID: 4ADE08DF.7010605@asterdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom,
> Emmanuel Cecchet <manu(at)asterdata(dot)com> writes:
>
>> Tom Lane wrote:
>>
>>> There aren't any. You can *not* put a try/catch around arbitrary code
>>> without a subtransaction. Don't even think about it
>> Well then why the tests provided with the patch are working?
>>
> Because they carefully exercise only a tiny fraction of the system.
> The key word in my sentence above is "arbitrary". You don't know what
> a datatype input function might try to do, let alone triggers or other
> functions that COPY might have to invoke. They might do things that
> need to be cleaned up after, and subtransaction rollback is the only
> mechanism we have for that.
>
Is it possible to use the try/catch mechanism to detect errors and log
them and only abort the command at the end of the processing?
This would have the advantage of being to be able to log all errors
without the overhead of subtransactions and even add some additional
information on where the error happened (parsing, constraints, trigger,
...) for further automated treatment. The result would still be clean
since we would abort the COPY command in case of an error as it does
currently (but we would not stop on the first error).
I guess that it would make more sense to log to a file than to a table
in that case.

Emmanuel

--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Emmanuel Cecchet <manu(at)asterdata(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-20 19:17:13
Message-ID: 3922.1256066233@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Emmanuel Cecchet <manu(at)asterdata(dot)com> writes:
> Tom Lane wrote:
>> The key word in my sentence above is "arbitrary". You don't know what
>> a datatype input function might try to do, let alone triggers or other
>> functions that COPY might have to invoke. They might do things that
>> need to be cleaned up after, and subtransaction rollback is the only
>> mechanism we have for that.

> Is it possible to use the try/catch mechanism to detect errors and log
> them and only abort the command at the end of the processing?

No, I wouldn't trust that. The point here is that various backend
subsystems (eg, locks, buffers) might need to be cleaned up after an
error before they can be used further. It might look like it works,
until you stumble across the cases where it doesn't. An example of the
sort of situation I'm concerned about is somebody throwing an elog
while holding a buffer lock. If you try to keep processing and clean
up later, it will work fine ... until the day comes that the subsequent
processing chances to try to lock that same buffer again, and then the
backend will freeze up.

We've got twenty years worth of code development that assumes that
transaction abort will clean up anything that was left hanging when
an error was thrown. It was difficult enough getting that to work
for subtransaction abort as well as main transaction abort. It's
not happening for anything less than a subtransaction abort.

This approach is a good tradeoff most of the time: the code is simpler,
much more reliable, and faster in the normal case than it would be if
we tried to do post-error cleanup differently. Error-tolerant COPY
is going to pay for all that, but we're not revisiting the decision.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby and 64+ subxids (was COPY enhancements)
Date: 2009-11-10 01:01:51
Message-ID: 4AF8BB7F.1030305@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs wrote:
> On Thu, 2009-10-08 at 11:50 -0400, Tom Lane wrote:
>
>> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>>
>>> Subcommitting every single row is going to be really painful,
>>> especially after Hot Standby goes in and we have to issue a WAL record
>>> after every 64 subtransactions (AIUI).
>>>
>> Yikes ... I had not been following that discussion, but that sure sounds
>> like a deal-breaker. For HS, not this.
>>
>
> Probably worth expanding this thought...
>
> HS writes a WAL record for subtransactions at the point that the subxid
> cache overflows for any single transaction. Current cache size = 64.
> Top-level transaction then writes one additional WAL record every
> additional 64 subxids after that. These are known as xid assignment
> records.
>
> If we execute transactions that completely fit in subxid cache we don't
> write any WAL records at all. There is no cumulative effect. So in most
> applications, we never write xid assignment records at all.
>
> Does that cover your objection, or do you see other issues?
>
>

I don't recall seeing an answer to this, and I can't find one on the
list archives either. Is it no longer an issue?

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby and 64+ subxids (was COPY enhancements)
Date: 2009-11-10 03:03:22
Message-ID: 16276.1257822202@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Simon Riggs wrote:
>> HS writes a WAL record for subtransactions at the point that the subxid
>> cache overflows for any single transaction. Current cache size = 64.
>> Top-level transaction then writes one additional WAL record every
>> additional 64 subxids after that. These are known as xid assignment
>> records.

> I don't recall seeing an answer to this, and I can't find one on the
> list archives either. Is it no longer an issue?

I'm still concerned about it, but realistically the subxids would be
writing other WAL records too, so it's probably not as bad as I thought
at first.

regards, tom lane