Re: COPY enhancements

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Emmanuel Cecchet <manu(at)asterdata(dot)com>
Cc: Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-08 02:30:57
Message-ID: 603c8f070910071930t42c2a416tbef4d57b7fe3ca7f@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Sep 25, 2009 at 10:01 AM, Emmanuel Cecchet <manu(at)asterdata(dot)com> wrote:
> Robert,
>
> Here is the new version of the patch that applies to CVS HEAD as of this
> morning.
>
> Emmanuel

I took a look at this patch tonight and, having now read through some
of it, I have some more detailed comments.

It seems quite odd to me that when COPY succeeds but there are errors,
the transaction commits. The only indication that some of my data
didn't end up in the table is that the output says "COPY n" where n is
less than the total number of rows I attempted to copy. On the other
hand, it would be equally bad if the transaction aborted, because then
my error logging table would not get populated - I note that that's
the behavior we do get if the max errors threshold is exceeded. I'm
not sure what the right answer is here, but it needs some thought and
discussion. I think at a minimum the caller needs some indication of
the number of FAILED rows, not just the number of succesful ones.

What's really bad about this is that a flag called "error_logging" is
actually changing the behavior of the command in a way that is far
more dramatic than (and doesn't actually have much to do with) error
logging. It's actually making a COPY command succeed that would
otherwise have failed. That's not just a logging change.

I think the underlying problem is that there is more than one feature
here, and they're kind of mixed together. The fact that the
partitioning code needs to be separated from the error logging stuff
has already been discussed, but I think it actually needs to be broken
down even further. I think there are three separate features in the
error logging code:

A. Ability to ignore a certain number of errors (of certain types?)
and still get the other tuples into the table anyway.
B. Ability to return error information in a structured format rather
than as an exception message.
C. Ability to direct the structured error messages to a table.

Each of those features deserves a separate discussion to decide
whether we want it and how best to implement it. Personally, I think
we should skip (C), at least as a starting point. Instead of logging
to a table, I think we should consider making COPY return the tuples
indicating the error as a query result, the same way EXPLAIN returns
the query plan. It looks like this would eliminate the need for at
least three of the new COPY options without losing much functionality.

I also think that, if possible, (A) and (B) should be implemented as
separate features, so that each one can be used separately or both can
be used in combination.

Some other notes:

1. The copy_errorlogging.out regression test fails, at least on my
machine. I guess this problem was discussed previously, but perhaps
you should change the setup of the test in some way so that you can
generate a patch that doesn't require manual fiddling to get it the
regression tests to pass.

2. The error logging doesn't seem to work in all cases.

rhaas=# copy foo from '/home/rhaas/x' (error_logging,
error_logging_table_name 'errlogging');
ERROR: duplicate key value violates unique constraint "foo_pkey"
DETAIL: Key (id)=(1) already exists.
CONTEXT: COPY foo, line 1: "1 Robert"

I assume that this fails because the code is trained to catch some
specific category of errors and redirect those to the error logging
table while letting everything take its natural course. I don't feel
that that's a terribly useful behavior, and on the other hand, I'm not
sure it's going to be possible to fix it without killing performance.

3. The option checking for the error_logging options is insufficient.
For example, if ERROR_LOGGING_TABLE_NAME is specified but
ERROR_LOGGING is not specified, COPY still works (and just ignores the
ERROR_LOGGING_TABLE_NAME). Note that this is quite different from what
HEADER does, for example.

4. Specifiying an error_logging_table_name with spaces or other funny
characters in it fails.

COPY foo FROM STDIN (ERROR_LOGGING, ERROR_LOGGING_TABLE_NAME 'foo bar baz');

5. The assumption that the public schema is an appropriate default
does not seem right to me. I would think that we would want to
default to the first schema in the user's search path.

6. errorlogging.c is poorly named, because it has specifically to do
with logging errors in COPY, rather than with error logging in
general; there's also no reason for it to live in utils/misc. Also,
it does not include the same headers as the other files in the same
directory.

7. There's no need for the OidLinkedList structure; we already have a
system for handling lists of OIDs. See pg_list.h.

8. The OID cache is interesting; have you benchmarked this to see how
much it improves performance? If so, you should describe your
methodology and performance results. It might be worthwhile to leave
this out of the first version of the partitioning patch for simplicity
and add submit as a follow-on performance patch. Do we really need a
GUC to size this cache, or is there a size beyond which it doesn't
help much?

9. The style of the code is not consistent with what PostgreSQL does
elsewhere. The use of '\see' is very distracting. The style of
comments does not consistently patch the PostgreSQL style: in places
comments start with "/**", and there's one in copy.c that reads "/*
<--- Error logging function ---> */". Also, the standard way to write
a multiline comment is with the text starting on the line following
"/*", not the "/*" line itself. Also, some variables are declared as
"type * name" rather than "type *name".

10. This patch twiddles with the regression tests in ways that are not
related to the functionality it is adding; consistent with the idea
that a patch should do one thing and do it well, you should revert
this part.

Overall, I think that this patch is not very close to being
committable and accordingly I am going to mark it as Returned with
Feedback. However, I hope that the discussion will continue and that
we can get consensus on some changes for next CommitFest.

...Robert

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Chernow 2009-10-08 02:31:03 Re: postgres 8.3.8 and Solaris 10_x86 64 bit problems?
Previous Message Andrew Dunstan 2009-10-08 02:07:29 Re: Concurrency testing