Re: COPY enhancements

From: Emmanuel Cecchet <manu(at)asterdata(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-09-10 21:32:12
Message-ID: 4AA9705C.5020108@asterdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Josh,

See the answers inlined.
> Thank you for tackling this very long-time TODO.
>
>> Error logging is described here:
>> http://wiki.postgresql.org/wiki/Error_logging_in_COPY
>>
>
> Questions & Comments:
>
> A) Why would someone want to turn error_logging on, but leave
> error_logging_skip_tuples off? The pg_log already logs errors which copy throws by default.
>
When error_logging is on and skip_tuples is off, errors are logged in
the error table. If skip_tuples is on, tuples are not logged in the
error table.
> B) As I mentioned earlier, we'll want to provide the option of logging
> to a file instead of to a table. That's not a reason to reject this
> patch, but probably a TODO for 8.5.
>
Ok but what should be the format of that file?
> C) Are we sure we want to handle this via GUCs rather than extensions to
> COPY syntax? It seems like fairly often users would want to log
> different COPY sources to different tables/files.
>
I agree that new COPY options could be easier to use, the implementation
is just more complex. However, the labels allows you to select the
tuples related to specific COPY commands.
> D) These GUCs are userset, I hope? (haven't dug into the code far
> enough to tell yet).
>
Yes.
> E) What is error_logging_tuple_label for? You don't explain/give
> examples. And how is error_logging_tuple_partition_key used?
>
We use the label and partition key in Aster products to easily retrieve
which COPY command on which partition did generate the bad tuples. By
default, the tuple_label contains the COPY command that was executed
(see example on Wiki) and the key contains the index of the tuple in the
source file (see example on Wiki).
> F) Rawdata for rejected tuples is presumably BYTEA?
>
Yes. I forgot to put back the table description that can be seen in the
unit tests. I have updated the Wiki with the table definition.
> G) We should probably have a default for error_logging_table_name, such
> as pg_copy_errors. Does that table get automatically created if it
> doesn't exist?
>
Yes, as indicated on the wiki the table is created automatically (see
config variable section).
> H) Finally, one request of the TODO is some way to halt import after a
> specified number of bad tuples because it probably means you have the
> wrong file or wrong table. Do we still want that?
>
We can still do that. It can be another GUC variable or an option to
COPY. If the COPY command fails, everything gets rolled back (data in
the destination table and error table). That would be harder to
implement with a file (the rollback part).
>> Autopartitioning is described here:
>> http://wiki.postgresql.org/wiki/Auto-partitioning_in_COPY
>>
>
> M) tuple_routing_in_copy should take "on" or "off", not 0 or 1.
>
Ok.
> N) Have you measured the overhead & speed of this kind of COPY as
> opposed to COPY into a single table? Have you checked the overhead if
> tuple_routing_in_copy is on, but you are not loading into a partitioned
> table?
>
Yes. There is no noticeable overhead if there is no routing to do (but
routing is on).
If routing is involved, the overhead depends on how sorted your input
data is. If it all goes to the same partition, the caching effect works
well and there is no noticeable overhead. The cost is in the constraint
check and it depends on the complexity of the constraint. The more
constraints you have to check and the more complex they are, the more
overhead on each tuple routing.
> O) Is this capable of dealing with partitioning by more than one column,
> or by an expression?
>
Yes, we just use a brute force technique where we try all child tables
1-by-1 and rely on the existing Postgres constraint checking mechanism
(no new or duplicated code there).
> Finally, I'm going to suggest different names for the GUCs, as the names
> you've chosen don't group well and would likely cause confusion. Here
> are my suggestions, which all begin with "copy_" for prefix matching:
>
> error_logging --> probaby not needed, see able
> error_logging_skip_tuples --> copy_skip_bad_rows
> error_logging_schema_name --> copy_logging_schema_name
> error_logging_relation_name --> copy_logging_table_name
> error_logging_tuple_label --> don't know what this is for, see above
> error_logging_tuple_partition_key --> don't know what this is for, see above
>
> tuple_routing_in_copy --> copy_partitioning
> tuple_routing_cache_size --> copy_partitioning_cache_size
>
This makes sense. I'll add that on my todo list.

Emmanuel

--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Josh Berkus 2009-09-10 22:11:03 Re: COPY enhancements
Previous Message Pavel Stehule 2009-09-10 21:13:21 Re: RfD: more powerful "any" types