Re: COPY enhancements

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Emmanuel Cecchet <manu(at)asterdata(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-08 12:40:36
Message-ID: 1255005636.26302.357.camel@ebony.2ndQuadrant
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


On Wed, 2009-10-07 at 22:30 -0400, Robert Haas wrote:
> On Fri, Sep 25, 2009 at 10:01 AM, Emmanuel Cecchet <manu(at)asterdata(dot)com> wrote:
> > Robert,
> >
> > Here is the new version of the patch that applies to CVS HEAD as of this
> > morning.
> >
> > Emmanuel
>
> I took a look at this patch tonight and, having now read through some
> of it, I have some more detailed comments.
>
> It seems quite odd to me that when COPY succeeds but there are errors,
> the transaction commits. The only indication that some of my data
> didn't end up in the table is that the output says "COPY n" where n is
> less than the total number of rows I attempted to copy. On the other
> hand, it would be equally bad if the transaction aborted, because then
> my error logging table would not get populated - I note that that's
> the behavior we do get if the max errors threshold is exceeded. I'm
> not sure what the right answer is here, but it needs some thought and
> discussion. I think at a minimum the caller needs some indication of
> the number of FAILED rows, not just the number of succesful ones.
>
> What's really bad about this is that a flag called "error_logging" is
> actually changing the behavior of the command in a way that is far
> more dramatic than (and doesn't actually have much to do with) error
> logging. It's actually making a COPY command succeed that would
> otherwise have failed. That's not just a logging change.
>
> I think the underlying problem is that there is more than one feature
> here, and they're kind of mixed together. The fact that the
> partitioning code needs to be separated from the error logging stuff
> has already been discussed, but I think it actually needs to be broken
> down even further. I think there are three separate features in the
> error logging code:
>
> A. Ability to ignore a certain number of errors (of certain types?)
> and still get the other tuples into the table anyway.
> B. Ability to return error information in a structured format rather
> than as an exception message.
> C. Ability to direct the structured error messages to a table.
>
> Each of those features deserves a separate discussion to decide
> whether we want it and how best to implement it. Personally, I think
> we should skip (C), at least as a starting point. Instead of logging
> to a table, I think we should consider making COPY return the tuples
> indicating the error as a query result, the same way EXPLAIN returns
> the query plan. It looks like this would eliminate the need for at
> least three of the new COPY options without losing much functionality.
>
> I also think that, if possible, (A) and (B) should be implemented as
> separate features, so that each one can be used separately or both can
> be used in combination.

I don't see any logical issues with what has been proposed. Current COPY
allows k=0 cumulative errors before it aborts. This patch allows us to
provide a parameter to vary k. If we get nerrors > k then the COPY
should abort. If nerrors <= k then the COPY can commit. Exactly what we
need. Perhaps the parameter should simply be called max_errors rather
than error_logging, so the meaning is clear.

We must have detailed error reporting for each row individually,
otherwise we will not be able to correct the errors. 125346 rows loaded,
257 errors is not really useful information, so I wouldn't bother trying
to change the return info to supply the number of errors. If you defined
an error file you know you need to check it. If it doesn't exist, no
errors. Simple.

For me, A and B are inseparable. Dimitri's idea to have an error file
and a error reason file seems very good to me.

C may be important because of security concerns regarding writing error
files but it isn't critical in first commit.

(Not really in reply to Robert: Most database loaders define
max_num_bad_rows as a percentage of the size of the file. Otherwise you
need to read the file to see how big it is before you decide an
appropriate setting, which if it is very big is a bad idea).

--
Simon Riggs www.2ndQuadrant.com

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2009-10-08 13:42:28 Re: COPY enhancements
Previous Message Marko Tiikkaja 2009-10-08 12:40:10 Re: Writeable CTEs and side effects