From: longlong <asfnuts(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: COPY issue(gsoc project)
Date: 2008-03-11 12:56:47
Message-ID: d9f0a46b0803110556w29795e10p40c9e5439eb4f59d@mail.gmail.com
Lists: pgsql-hackers

I want to participate in GSoC 2008.

Here is my plan. Parts of what follows (the paragraphs originally marked in
red) came from Greg Smith; thanks to Greg Smith for his advice.

1. Release 8.2 made COPY TO able to copy the output of an arbitrary SELECT
statement, so I think COPY FROM could likewise take data from such output
and insert it into designated columns. The exact format of the command would
need to be discussed.

This would be a nice feature. Right now there are often applications
where there is a data loading or staging table that ends up being merged
with a larger table after some cleanup. Moving that data from the
preparation area into the final table right now is most easily done with
INSERT INTO X (SELECT A,B FROM C) type actions. This is slow because
INSERT takes much longer than COPY. Adding support for COPY X FROM
(SELECT A,B FROM C) would make this problem go away.

It is possible to do this right now with some clever use of STDIN/OUT like
the below, but having a pure SQL solution would be more widely applicable.
The overhead of having to pass everything through the client (as STDIN/OUT
do) is certainly not zero.
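
For concreteness, here is a sketch (with made-up table and database names) of
the kind of client-side STDIN/STDOUT round trip being referred to, next to
the pure-SQL statement it works around:

    -- done entirely inside the server today:
    INSERT INTO x SELECT a, b FROM c;

    -- the workaround that routes every row through the client:
    psql -d mydb -c "COPY (SELECT a, b FROM c) TO STDOUT" \
      | psql -d mydb -c "COPY x FROM STDIN"

A native COPY x FROM (SELECT a, b FROM c) would keep the data inside the
server and avoid the client round trip entirely.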

2. This one comes from the TODO list. COPY always behaves like a unit of work
consisting of a series of insert commands: if there is any error, the whole
thing rolls back. But sometimes we only care that the good data gets
inserted. In that situation I used to insert row by row with
"try....catch...." to skip the bad rows, because examining every row in
advance would take too much time. So:
Allow COPY to report error lines and continue.
This is a good idea.

This is a long standing request and many people would be happy to see it
implemented. You do want to make sure the implementation easily allows
pushing all the lines that didn't commit into what's commonly called a
"reject file".

----------------------------------------------------------------------------

Are these ideas feasible?
Which one should I choose for my proposal, or should I propose both?


From: Neil Conway <neilc(at)samurai(dot)com>
To: longlong <asfnuts(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: COPY issue(gsoc project)
Date: 2008-03-11 22:18:46
Message-ID: 1205273926.23742.25.camel@dell.linuxdev.us.dell.com
Lists: pgsql-hackers

On Tue, 2008-03-11 at 20:56 +0800, longlong wrote:
> This would be a nice feature. Right now there are often applications
> where there is a data loading or staging table that ends up being
> merged with a larger table after some cleanup. Moving that data from
> the preparation area into the final table right now is most easily
> done with INSERT INTO X (SELECT A,B FROM C) type actions. This is
> slow because INSERT takes much longer than COPY.

Why would INSERT INTO ... SELECT be any slower than COPY ... FROM
SELECT?

> 2. This one comes from the TODO list. COPY always behaves like a unit of
> work consisting of a series of insert commands: if there is any error, the
> whole thing rolls back. But sometimes we only care that the good data gets
> inserted. In that situation I used to insert row by row with
> "try....catch...." to skip the bad rows, because examining every row in
> advance would take too much time. So:
> Allow COPY to report error lines and continue.
> This is a good idea.

Search the archives for prior discussions of this idea; the
implementation will require some careful thought. This is a relevant
thread:

http://markmail.org/message/y3atxu56s2afgidg

Note also that pg_bulkload currently does something analogous to this
outside of the DBMS proper:

http://pgbulkload.projects.postgresql.org/

> Which one should I choose for my proposal, or should I propose both?

FWIW, error handling for COPY sounds like a more useful project to me.

-Neil


From: Neil Conway <neilc(at)samurai(dot)com>
To: longlong <asfnuts(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: COPY issue(gsoc project)
Date: 2008-03-11 22:57:44
Message-ID: 1205276264.23742.27.camel@dell.linuxdev.us.dell.com
Lists: pgsql-hackers

On Tue, 2008-03-11 at 15:18 -0700, Neil Conway wrote:
> Note also that pg_bulkload currently does something analogous to this
> outside of the DBMS proper:
>
> http://pgbulkload.projects.postgresql.org/

Sorry, wrong project. I mean pgloader:

http://pgfoundry.org/projects/pgloader/

-Neil


From: longlong <asfnuts(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Fwd: COPY issue(gsoc project)
Date: 2008-03-12 01:43:33
Message-ID: d9f0a46b0803111843t6e4f3b99mf7b1532fab3714ff@mail.gmail.com
Lists: pgsql-hackers

---------- Forwarded message ----------
From: longlong <asfnuts(at)gmail(dot)com>
Date: 2008-3-12 9:42
Subject: Re: [HACKERS]COPY issue(gsoc project)
To: Neil Conway <neilc(at)samurai(dot)com>

The first feature has already been implemented: I used to skip the pid column
with the command "copy tablename (columns) from ...." a long time ago, and I
had simply forgotten. That's my fault.

Allow COPY FROM to create index entries in bulk
http://archives.postgresql.org/pgsql-hackers/2008-02/msg00811.php

I learned about pg_bulkload from the link above. I don't know what the
difference is between pg_bulkload and pgloader. Right now I am trying to
figure out how COPY works and to gather more information from the archives on
prior discussions of this subject; meanwhile I will look into pgloader.


From: longlong <asfnuts(at)gmail(dot)com>
To: "Neil Conway" <neilc(at)samurai(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: COPY issue(gsoc project)
Date: 2008-03-14 02:45:34
Message-ID: d9f0a46b0803131945x691c9626ma673628340b2e012@mail.gmail.com
Lists: pgsql-hackers

2008/3/12, Neil Conway <neilc(at)samurai(dot)com>:
>
> I don't see why creating index entries in bulk has anything to do with
> COPY vs. INSERT: if a lot of rows are being loaded into the table in a
> single command, it would be a win to create the index entries in bulk,
> regardless of whether COPY or INSERT ... SELECT is being used.
>
> In any case, the "create indexes in bulk" hasn't actually been
> implemented in mainline Postgres...
>
> I mentioned pgloader just as an example of an existing implementation of
> the "error recovery in COPY" idea. The issues with doing an
> implementation of error recovery in the backend that I see are:
>
> * in order to be sure that you can recover from an error, you
> need to abort the current subtransaction
>
> * starting and committing a subtransaction for every row of the COPY
> would be too expensive
>
> * therefore, start and commit a subtransaction for every "n" rows
> of input. If an error occurs, you lose at most "n-1" rows of
> valid input, which you need to backup and reinsert. There are
> various approaches to choosing "n" (statically, based on the
> error rate of previous batches in the same load, etc.).
>
>
> -Neil
>
>
I think this is a better idea, from NikhilS:
http://archives.postgresql.org/pgsql-hackers/2007-12/msg00584.php

But instead of using a per-insert or a batch-insert subtransaction, I am
thinking that we can start off a subtransaction and continue it till we
encounter a failure. The moment an error is encountered, since we have the
offending (already in heap) tuple around, we can call a simple_heap_delete
on the same and commit (instead of aborting) this subtransaction after doing
some minor cleanup. This current input data row can also be logged into a
bad file. Recall that we need to only handle those errors in which the
simple_heap_insert is successful, but the index insertion or the after row
insert trigger causes an error. The rest of the load then can go ahead with
the start of a new subtransaction.

The simplest things are often the best. I suspect NikhilS's approach must be
hard to implement, or have some other deficiency, since you prefer a
subtransaction for every "n" rows instead.

You mentioned in the thread "Re: VLDB Features" that 'n' can be adjusted
according to when and where the errors happen. This is a bit like the
congestion-control mechanisms in TCP. But you can't ignore the time wasted in
a subtransaction before it hits an error, especially when 'n' is big.

I don't know the exact cost of a subtransaction (begin and commit) or of
reading a copy line (CopyReadLine()), so I just estimate the number of
subtransactions:

f(n) = (1 - (1-p)^n) * m + m/n

where m is the number of lines, n is the number of lines per subtransaction,
and p is the probability that a given row causes an error. (Roughly, the
first term counts the per-row re-inserts needed for failed batches, and the
second counts the per-batch subtransactions.)
A big 'n' reduces the number of batch subtransactions (m/n), but it also
increases the probability that each batch hits an error. Unless p is
extremely small, choosing a big 'n' is a big mistake.
In practice errors tend to cluster together (in my experience), so the
situation may be a little better.
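
To make the trade-off concrete, with purely made-up numbers (m = 1,000,000
lines and p = 0.001), f(n) can be tabulated with a quick query such as:

    SELECT n,
           round((1 - power(1 - 0.001, n)) * 1000000 + 1000000.0 / n)
               AS expected_subtransactions
    FROM (VALUES (10), (100), (1000), (10000)) AS batch(n);

With these numbers the estimate is roughly 110,000 subtransactions at n = 10,
105,000 at n = 100 and over 630,000 at n = 1000, so a large 'n' only pays off
when p is very small.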

However, the idea (from NikhilS) that I started with looks like the ideal
solution to me. Yes, I have seen the thread "Re: VLDB Features" in the email
archives, and I noticed some disagreement about commit problems and so on,
but this shouldn't be a problem, since so many similar problems have already
been solved in pg.


From: NikhilS <nikkhils(at)gmail(dot)com>
To: longlong <asfnuts(at)gmail(dot)com>
Cc: "Neil Conway" <neilc(at)samurai(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: COPY issue(gsoc project)
Date: 2008-03-14 08:26:26
Message-ID: d3c4af540803140126i5c23ba0an7f4f717e9dd2cf84@mail.gmail.com
Lists: pgsql-hackers

Hi Longlong,

> I think this is a better idea, from NikhilS:
> http://archives.postgresql.org/pgsql-hackers/2007-12/msg00584.php
> But instead of using a per-insert or a batch-insert subtransaction, I am
> thinking that we can start off a subtransaction and continue it till we
> encounter a failure. The moment an error is encountered, since we have the
> offending (already in heap) tuple around, we can call a simple_heap_delete
> on the same and commit (instead of aborting) this subtransaction after doing
> some minor cleanup. This current input data row can also be logged into a
> bad file. Recall that we need to only handle those errors in which the
> simple_heap_insert is successful, but the index insertion or the after row
> insert trigger causes an error. The rest of the load then can go ahead with
> the start of a new subtransaction.
> The simplest things are often the best. I suspect NikhilS's approach must be
> hard to implement, or have some other deficiency, since you prefer a
> subtransaction for every "n" rows instead.
>

Yeah, simpler things are often the best, but as folks are mentioning, we need
a carefully thought out approach here. The reply from Tom to my posting there
raises issues which need to be taken care of. Although I still think that if
we carry out *sanity* checks before starting the load for the presence of
triggers, constraints, fkey constraints etc., and if others do not have any
issues with the approach, the simple_heap_delete idea should work in some
cases. Although the term I used, "after some minor cleanup", might need some
thought too now that I think more about it..

Also if fkey checks or complex triggers are around, maybe we can fall back
to a subtransaction per row insert too as a worst-case measure..

Regards,
Nikhils
--
EnterpriseDB http://www.enterprisedb.com


From: longlong <asfnuts(at)gmail(dot)com>
To: NikhilS <nikkhils(at)gmail(dot)com>
Cc: "Neil Conway" <neilc(at)samurai(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: COPY issue(gsoc project)
Date: 2008-03-20 06:00:31
Message-ID: d9f0a46b0803192300r2d1b9759sa6bf8670625f8265@mail.gmail.com
Lists: pgsql-hackers

hi NikhilS.

2008/3/14, NikhilS <nikkhils(at)gmail(dot)com>:
>
> Hi Longlong,
>
>
> > I think this is a better idea, from NikhilS:
> > http://archives.postgresql.org/pgsql-hackers/2007-12/msg00584.php
> > But instead of using a per-insert or a batch-insert subtransaction, I am
> > thinking that we can start off a subtransaction and continue it till we
> > encounter a failure. The moment an error is encountered, since we have the
> > offending (already in heap) tuple around, we can call a simple_heap_delete
> > on the same and commit (instead of aborting) this subtransaction after
> > doing some minor cleanup. This current input data row can also be logged
> > into a bad file. Recall that we need to only handle those errors in which
> > the simple_heap_insert is successful, but the index insertion or the after
> > row insert trigger causes an error. The rest of the load then can go ahead
> > with the start of a new subtransaction.
> > The simplest things are often the best. I suspect NikhilS's approach must
> > be hard to implement, or have some other deficiency, since you prefer a
> > subtransaction for every "n" rows instead.
> >
>
>
> Yeah, simpler things are often the best, but as folks are mentioning, we
> need a carefully thought out approach here. The reply from Tom to my posting
> there raises issues which need to be taken care of. Although I still think
> that if we carry out *sanity* checks before starting the load for the
> presence of triggers, constraints, fkey constraints etc., and if others do
> not have any issues with the approach, the simple_heap_delete idea should
> work in some cases. Although the term I used, "after some minor cleanup",
> might need some thought too now that I think more about it..
>
> Also if fkey checks or complex triggers are around, maybe we can fall back
> to a subtransaction per row insert too as a worst-case measure..
>
> Regards,
> Nikhils
>
> --
> EnterpriseDB http://www.enterprisedb.com

I had already seen every email in that thread. I didn't think triggers,
constraints and fkey constraints would be a problem, because at that time I
didn't know much about when and how those triggers are fired. :P
Now I believe it all comes down to efficiency (of course error handling is a
base requirement), so I think a simple implementation should be done first;
it can serve as a benchmark against which more advanced and efficient
implementations are measured. That will help in finding a final solution.

Subtransactions seem to me to be the only easy and feasible approach, as
described by Neil earlier:

* in order to be sure that you can recover from an error, you
need to abort the current subtransaction

* starting and committing a subtransaction for every row of the COPY
would be too expensive

* therefore, start and commit a subtransaction for every "n" rows
of input. If an error occurs, you lose at most "n-1" rows of
valid input, which you need to backup and reinsert. There are
various approaches to choosing "n" (statically, based on the
error rate of previous batches in the same load, etc.).
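
As a client-side analogue of this scheme (only a sketch, with a made-up table
and data; the real work would of course happen inside the COPY code in the
backend), one subtransaction per batch of n rows looks roughly like this:

    CREATE TABLE loadtab (id int PRIMARY KEY, val text NOT NULL);

    BEGIN;

    SAVEPOINT batch_1;                 -- one subtransaction for the first n rows
    INSERT INTO loadtab VALUES (1, 'a'), (2, 'b'), (3, 'c');
    RELEASE SAVEPOINT batch_1;         -- batch was clean: keep it

    SAVEPOINT batch_2;                 -- next n rows, one of them bad
    INSERT INTO loadtab VALUES (4, 'd'), (5, NULL), (6, 'f');
    ROLLBACK TO SAVEPOINT batch_2;     -- whole batch lost (up to n-1 good rows)
    -- ... then re-insert rows 4 and 6 one by one, each under its own
    -- SAVEPOINT, and write row 5 to the reject file

    COMMIT;

Choosing n is then exactly the f(n) trade-off discussed above: larger batches
mean fewer savepoints, but more good rows to back up and re-insert whenever a
batch fails.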

I want to implement this feature as a Google Summer of Code project. Error
handling is the primary goal; test cases will be generated for efficiency
analysis.

Any suggestions are welcome.