Re: Bulkloading using COPY - ignore duplicates?

From: "Jim Buttafuoco" <jim(at)buttafuoco(dot)net>
To: Lee Kindness <lkindness(at)csl(dot)co(dot)uk>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Lee Kindness <lkindness(at)csl(dot)co(dot)uk>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jim Buttafuoco <jim(at)buttafuoco(dot)net>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bulkloading using COPY - ignore duplicates?
Date: 2001-12-16 14:12:14
Message-ID: 200112161412.fBGECER20364@dual.buttafuoco.net
Lists: pgsql-hackers

I agree with Lee. I also like Oracle's option of a discard file, so
you can look at what was rejected, fix your problem, and if necessary
reload just the rejects.

Jim

> Peter Eisentraut writes:
> > I think allowing this feature would open up a world of new
> > dangerous ideas, such as ignoring check constraints or foreign keys
> > or magically massaging other tables so that the foreign keys are
> > satisfied, or ignoring default values, or whatever. The next step
> > would then be allowing the same optimizations in INSERT. I feel
> > COPY should load the data and that's it. If you don't like the
> > data you have then you have to fix it first.
>
> I agree that PostgreSQL's checks during COPY are a bonus and I
> wouldn't dream of not having them. Many database systems provide a
> fast bulk load by ignoring these constraints and cross references -
> that's a tricky/horrid situation.
>
> However, I suppose the question is whether such 'invalid data' should
> abort the transaction - it seems a bit drastic...
>
> I suppose I'm not really after an IGNORE DUPLICATES option, but rather
> a CONTINUE ON ERROR kind of thing.
>
> Regards, Lee.
>
>


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Jim Buttafuoco <jim(at)buttafuoco(dot)net>
Cc: Lee Kindness <lkindness(at)csl(dot)co(dot)uk>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bulkloading using COPY - ignore duplicates?
Date: 2001-12-16 22:24:04
Message-ID: Pine.LNX.4.30.0112161731350.641-100000@peter.localdomain
Lists: pgsql-hackers

Jim Buttafuoco writes:

> I agree with Lee. I also like Oracle's option of a discard file, so
> you can look at what was rejected, fix your problem, and if necessary
> reload just the rejects.

How do you know which one is the duplicate and which one is the good one?
More likely you will have to fix the entire thing. Anything else would
undermine the general data model except in specific use cases.

--
Peter Eisentraut peter_e(at)gmx(dot)net


From: Lee Kindness <lkindness(at)csl(dot)co(dot)uk>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Jim Buttafuoco <jim(at)buttafuoco(dot)net>, Lee Kindness <lkindness(at)csl(dot)co(dot)uk>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bulkloading using COPY - ignore duplicates?
Date: 2001-12-17 12:43:45
Message-ID: 15389.59521.558937.59993@elsick.csl.co.uk
Lists: pgsql-hackers

Peter Eisentraut writes:
> Jim Buttafuoco writes:
> > I agree with Lee. I also like Oracle's option of a discard file, so
> > you can look at what was rejected, fix your problem, and if necessary
> > reload just the rejects.
> How do you know which one is the duplicate and which one is the good one?
> More likely you will have to fix the entire thing. Anything else would
> undermine the general data model except in specific use cases.

In the general case most data is sequential, so it would be normal to
assume that the first record is the definitive one. Most database
systems go with this assumption, apart from MySQL, which gives the
user a choice between IGNORE and UPDATE...
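
(For illustration only, the MySQL behaviour being referred to is roughly
the following - a sketch from memory, with a made-up file and table name,
and with syntax that may differ between MySQL versions:)

LOAD DATA INFILE '/tmp/nav_data.txt'
    IGNORE
    INTO TABLE shot_record
    FIELDS TERMINATED BY ',';
-- with IGNORE, input rows that would duplicate an existing unique key
-- are skipped; REPLACE in its place makes later rows overwrite earlier
-- ones instead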

Lee.


From: Lee Kindness <lkindness(at)csl(dot)co(dot)uk>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Jim Buttafuoco <jim(at)buttafuoco(dot)net>, Lee Kindness <lkindness(at)csl(dot)co(dot)uk>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bulkloading using COPY - ignore duplicates?
Date: 2001-12-17 12:48:30
Message-ID: 15389.59806.534505.201283@elsick.csl.co.uk
Lists: pgsql-hackers

Peter Eisentraut writes:
> Jim Buttafuoco writes:
> > I agree with Lee. I also like Oracle's option of a discard file, so
> > you can look at what was rejected, fix your problem, and if necessary
> > reload just the rejects.
> How do you know which one is the duplicate and which one is the good one?
> More likely you will have to fix the entire thing. Anything else would
> undermine the general data model except in specific use cases.

Consider SELECT DISTINCT - which is the 'duplicate' and which one is
the good one?

Lee.


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Lee Kindness <lkindness(at)csl(dot)co(dot)uk>
Cc: Jim Buttafuoco <jim(at)buttafuoco(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bulkloading using COPY - ignore duplicates?
Date: 2001-12-17 20:59:01
Message-ID: Pine.LNX.4.30.0112171817590.642-100000@peter.localdomain
Lists: pgsql-hackers

Lee Kindness writes:

> Consider SELECT DISTINCT - which is the 'duplicate' and which one is
> the good one?

It's not the same thing. SELECT DISTINCT only eliminates rows that are
completely the same, not rows that are merely equal on their unique
constraints.

Maybe you're thinking of SELECT DISTINCT ON (). Observe the big warning
that the results of that statement are random unless ORDER BY is used. --
But that's not the same thing either. We've never claimed that the COPY
input has an ordering assumption. In fact you're asking for a bit more
than an ordering assumption: you're saying that the earlier data is better
than the later data. I think in a random use case that is more likely
*not* to be the case, because the data at the end is newer.
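
(To make the distinction concrete - a sketch against a hypothetical
table t(k, payload) where only k carries the unique constraint:)

SELECT DISTINCT k, payload FROM t;
-- drops a row only when both k and payload match another row exactly

SELECT DISTINCT ON (k) k, payload FROM t;
-- keeps one row per k; without an ORDER BY it is unpredictable which one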

Btw., here's another concern about this proposed feature: If I do a
client-side COPY, how will you send the "ignored" rows back to the client?

--
Peter Eisentraut peter_e(at)gmx(dot)net


From: Lee Kindness <lkindness(at)csl(dot)co(dot)uk>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Lee Kindness <lkindness(at)csl(dot)co(dot)uk>, Jim Buttafuoco <jim(at)buttafuoco(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bulkloading using COPY - ignore duplicates?
Date: 2001-12-18 10:09:14
Message-ID: 15391.5578.336203.295826@elsick.csl.co.uk
Lists: pgsql-hackers

Peter Eisentraut writes:
> Lee Kindness writes:
> > Consider SELECT DISTINCT - which is the 'duplicate' and which one is
> > the good one?
> It's not the same thing. SELECT DISTINCT only eliminates rows that are
> completely the same, not rows that are merely equal on their unique
> constraints.
> Maybe you're thinking of SELECT DISTINCT ON (). Observe the big warning
> that the results of that statement are random unless ORDER BY is used. --
> But that's not the same thing either. We've never claimed that the COPY
> input has an ordering assumption. In fact you're asking for a bit more
> than an ordering assumption: you're saying that the earlier data is better
> than the later data. I think in a random use case that is more likely
> *not* to be the case, because the data at the end is newer.

You're right - I meant 'SELECT DISTINCT ON ()'. However I'm only
using it as an example of where the database is choosing (be it
randomly) the data to be discarded. While I've said in this thread that
'COPY FROM IGNORE DUPLICATES' would ignore later duplicates, I'm not
really that concerned about what it ignores: first, later, random,
... I agree that if it were a concern then the data should be pre-processed.

> Btw., here's another concern about this proposed feature: If I do
> a client-side COPY, how will you send the "ignored" rows back to
> the client?

Again, a number of different ideas have been mixed up in this
discussion. Oracle's logging option was only given as an example of
how other database systems deal with this situation - if it isn't
explicitly requested then it's reasonable to just discard the extra
information.

What would really be nice in the SQL world is a standardised COPY
statement...

Best regards, Lee Kindness.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Lee Kindness <lkindness(at)csl(dot)co(dot)uk>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, Jim Buttafuoco <jim(at)buttafuoco(dot)net>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bulkloading using COPY - ignore duplicates?
Date: 2001-12-18 15:04:08
Message-ID: 13185.1008687848@sss.pgh.pa.us
Lists: pgsql-hackers

Lee Kindness <lkindness(at)csl(dot)co(dot)uk> writes:
> You're right - I meant 'SELECT DISTINCT ON ()'. However I'm only
> using it as an example of where the database is choosing (be it
> randomly) the data to be discarded.

Not a good example to support your argument. The entire point of
DISTINCT ON (imho) is that the rows that are kept or discarded are
*not* random, but can be selected by the user by specifying additional
sort columns. DISTINCT ON would be pretty useless if it weren't for
that flexibility. The corresponding concept in COPY will need to
provide flexible means for deciding which row to keep and which to
drop, else it'll be pretty useless.
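
(Concretely, it is the extra sort columns that pin down which row
survives - a sketch with a hypothetical readings(sensor, ts, val) table:)

SELECT DISTINCT ON (sensor) sensor, ts, val
FROM readings
ORDER BY sensor, ts DESC;
-- per sensor, the row with the latest ts is kept; change DESC to ASC,
-- or sort on another column, to change which row survives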

regards, tom lane


From: Lee Kindness <lkindness(at)csl(dot)co(dot)uk>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Lee Kindness <lkindness(at)csl(dot)co(dot)uk>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Jim Buttafuoco <jim(at)buttafuoco(dot)net>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bulkloading using COPY - ignore duplicates?
Date: 2001-12-18 15:51:50
Message-ID: 15391.26134.348061.302157@elsick.csl.co.uk
Lists: pgsql-hackers

Tom Lane writes:
> Lee Kindness <lkindness(at)csl(dot)co(dot)uk> writes:
> > You're right - I meant 'SELECT DISTINCT ON ()'. However I'm only
> > using it as an example of where the database is choosing (be it
> > randomly) the data to be discarded.
> Not a good example to support your argument. The entire point of
> DISTINCT ON (imho) is that the rows that are kept or discarded are
> *not* random, but can be selected by the user by specifying additional
> sort columns. DISTINCT ON would be pretty useless if it weren't for
> that flexibility. The corresponding concept in COPY will need to
> provide flexible means for deciding which row to keep and which to
> drop, else it'll be pretty useless.

At which point it becomes quicker to resort to INSERT...

Here's the crux question - how can I get management to go with
PostgreSQL when a core operation (import of data into a transient
database) is at least 6 times slower than the current version?

With a lot of work investigating the incoming data, the number of
incoming duplicates has been massively reduced by fixing/tackling them
at source. However rogue values do still crop up (the data originates
from a real-time system with multiple hardware inputs from multiple
hardware vendors) and when they do (even just one) the performance
dies. Add to this terabytes of legacy data...

While you may see the option of ignoring duplicates in COPY as 'pretty
useless', it obviously has its place/use, otherwise every other
database system wouldn't have support for it! (not that following the
pack is always a good idea)

In an ideal world 'COPY FROM' would only be used with data output by
'COPY TO' and it would be nice and sanitised. However in some fields
this often is not a possibility due to performance constraints!
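
(For comparison, one obvious workaround - sketched here with made-up
table and file names, and assuming it is acceptable to keep an arbitrary
row per duplicated key - is to COPY into an unconstrained staging table
and de-duplicate inside the database; the catch is that it roughly
doubles the load time:)

CREATE TEMP TABLE shot_staging AS
    SELECT * FROM shot_record LIMIT 0;   -- same columns, no unique index

COPY shot_staging FROM '/tmp/nav_data.txt';

INSERT INTO shot_record
    SELECT DISTINCT ON (shot_id) *
    FROM shot_staging
    ORDER BY shot_id;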

Best regards,

--
Lee Kindness, Senior Software Engineer, Concept Systems Limited.
http://services.csl.co.uk/ http://www.csl.co.uk/


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Lee Kindness <lkindness(at)csl(dot)co(dot)uk>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, Jim Buttafuoco <jim(at)buttafuoco(dot)net>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bulkloading using COPY - ignore duplicates?
Date: 2001-12-18 15:59:12
Message-ID: 13570.1008691152@sss.pgh.pa.us
Lists: pgsql-hackers

Lee Kindness <lkindness(at)csl(dot)co(dot)uk> writes:
> In an ideal world 'COPY FROM' would only be used with data output by
> 'COPY TO' and it would be nice and sanitised. However in some fields
> this often is not a possibility due to performance constraints!

Of course, the more bells and whistles we add to COPY, the slower it
will get, which rather defeats the purpose no?

regards, tom lane


From: Lee Kindness <lkindness(at)csl(dot)co(dot)uk>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Lee Kindness <lkindness(at)csl(dot)co(dot)uk>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Jim Buttafuoco <jim(at)buttafuoco(dot)net>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bulkloading using COPY - ignore duplicates?
Date: 2001-12-18 16:04:13
Message-ID: 15391.26877.931767.773950@elsick.csl.co.uk
Lists: pgsql-hackers

Tom Lane writes:
> Lee Kindness <lkindness(at)csl(dot)co(dot)uk> writes:
> > In an ideal world 'COPY FROM' would only be used with data output by
> > 'COPY TO' and it would be nice and sanitised. However in some fields
> > this often is not a possibility due to performance constraints!
> Of course, the more bells and whistles we add to COPY, the slower it
> will get, which rather defeats the purpose no?

Indeed, but as I've mentioned earlier in this thread, the code
path for COPY FROM already does a check against the unique index (if
there is one) but bombs out rather than handling it...

It wouldn't add any execution time if there were no duplicates in the
input!
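
(That behaviour is easy to demonstrate - a sketch with a throwaway
table and data file; the exact error text is approximate:)

CREATE TABLE shot (id integer PRIMARY KEY, val text);

-- /tmp/shots.txt holds three tab-separated rows with ids 1, 1 and 2
COPY shot FROM '/tmp/shots.txt';
-- ERROR: Cannot insert a duplicate key into unique index shot_pkey
-- The single duplicate aborts the COPY, and none of the rows - not even
-- the perfectly good ones - ends up in the table.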

regards, Lee.


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Lee Kindness <lkindness(at)csl(dot)co(dot)uk>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Jim Buttafuoco <jim(at)buttafuoco(dot)net>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bulkloading using COPY - ignore duplicates?
Date: 2002-01-02 21:09:36
Message-ID: 200201022109.g02L9aW27520@candle.pha.pa.us
Lists: pgsql-hackers

Lee Kindness wrote:
> Tom Lane writes:
> > Lee Kindness <lkindness(at)csl(dot)co(dot)uk> writes:
> > > In an ideal world 'COPY FROM' would only be used with data output by
> > > 'COPY TO' and it would be nice and sanitised. However in some fields
> > > this often is not a possibility due to performance constraints!
> > Of course, the more bells and whistles we add to COPY, the slower it
> > will get, which rather defeats the purpose no?
>
> Indeed, but as I've mentioned earlier in this thread, the code
> path for COPY FROM already does a check against the unique index (if
> there is one) but bombs out rather than handling it...
>
> It wouldn't add any execution time if there were no duplicates in the
> input!

I know many purists object to allowing COPY to discard invalid rows in
COPY input, but it seems we have lots of requests for this feature, with
few workarounds except pre-processing the flat file. Of course, if they
use INSERT, they will get errors that they can just ignore. I don't see
how allowing errors in COPY is any more illegal, except that COPY is one
command while multiple INSERTs are separate commands.
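
(That is, outside an explicit transaction block each INSERT is its own
transaction, so a failing row just produces an error message and the
rest still load - a sketch, assuming a table shot(id integer PRIMARY
KEY, val text):)

INSERT INTO shot VALUES (1, 'one');       -- succeeds
INSERT INTO shot VALUES (1, 'one again'); -- fails with a duplicate-key error
INSERT INTO shot VALUES (2, 'two');       -- succeeds; rows 1 and 2 are loaded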

Seems we need to allow such a capability, if only crudely. I don't
think we can create a discard file because of the problem with remote
COPY.

I think we can allow something like:

COPY FROM '/tmp/x' WITH ERRORS 2

meaning we will allow at most two errors and will report the error line
numbers to the user. I think this syntax clearly indicates that errors
are being accepted in the input. An alternate syntax would allow an
unlimited number of errors:

COPY FROM '/tmp/x' WITH ERRORS

The errors can be non-unique errors, or even CHECK constraint errors.

Unless I hear complaints, I will add it to TODO.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Lee Kindness <lkindness(at)csl(dot)co(dot)uk>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Jim Buttafuoco <jim(at)buttafuoco(dot)net>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bulkloading using COPY - ignore duplicates?
Date: 2002-01-02 21:49:27
Message-ID: 6298.1010008167@sss.pgh.pa.us
Lists: pgsql-hackers

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> I think we can allow something like:
> COPY FROM '/tmp/x' WITH ERRORS 2

This is not going to happen, at least not until after there's a
wholesale revision of error handling. As things stand we do not
have a choice: elog(ERROR) must abort the transaction, because we
can't guarantee that things are in good enough shape to continue.
See the archives for previous discussions.

regards, tom lane


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Lee Kindness <lkindness(at)csl(dot)co(dot)uk>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Jim Buttafuoco <jim(at)buttafuoco(dot)net>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bulkloading using COPY - ignore duplicates?
Date: 2002-01-02 22:02:26
Message-ID: 200201022202.g02M2QJ01823@candle.pha.pa.us
Lists: pgsql-hackers

Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > I think we can allow something like:
> > COPY FROM '/tmp/x' WITH ERRORS 2
>
> This is not going to happen, at least not until after there's a
> wholesale revision of error handling. As things stand we do not
> have a choice: elog(ERROR) must abort the transaction, because we
> can't guarantee that things are in good enough shape to continue.
> See the archives for previous discussions.

Yes, I realize we need subtransactions or something, but we should add
it to the TODO list if it is a valid request, right?

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Lee Kindness <lkindness(at)csl(dot)co(dot)uk>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Jim Buttafuoco <jim(at)buttafuoco(dot)net>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bulkloading using COPY - ignore duplicates?
Date: 2002-01-02 22:18:27
Message-ID: 7632.1010009907@sss.pgh.pa.us
Lists: pgsql-hackers

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> I think we can allow something like:
> COPY FROM '/tmp/x' WITH ERRORS 2

> Yes, I realize we need subtransactions or something, but we should add
> it to the TODO list if it is a valid request, right?

Well, I don't like that particular API in any case. Why would I think
that 2 errors are okay and 3 are not, if I'm loading a
many-thousand-line COPY file? Wouldn't it matter *what* the errors
are, at least as much as how many there are? "Discard duplicate rows"
is one thing, but "ignore bogus data" (eg, unrecognizable timestamps)
is not the same animal at all.

As someone already remarked, the correct, useful form of such a feature
is to echo the rejected lines to some sort of output file that I can
look at afterwards. How many errors there are is not the issue.

regards, tom lane


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Lee Kindness <lkindness(at)csl(dot)co(dot)uk>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Jim Buttafuoco <jim(at)buttafuoco(dot)net>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bulkloading using COPY - ignore duplicates?
Date: 2002-01-02 23:40:20
Message-ID: 200201022340.g02NeKd12500@candle.pha.pa.us
Lists: pgsql-hackers

Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > I think we can allow something like:
> > COPY FROM '/tmp/x' WITH ERRORS 2
>
> > Yes, I realize we need subtransactions or something, but we should add
> > it to the TODO list if it is a valid request, right?
>
> Well, I don't like that particular API in any case. Why would I think
> that 2 errors are okay and 3 are not, if I'm loading a
> many-thousand-line COPY file? Wouldn't it matter *what* the errors

I threw the count idea in as a possible compromise. :-)

> are, at least as much as how many there are? "Discard duplicate rows"
> is one thing, but "ignore bogus data" (eg, unrecognizable timestamps)
> is not the same animal at all.

Yes, when we have error codes, it would be nice to specify certain
errors to ignore.

> As someone already remarked, the correct, useful form of such a feature
> is to echo the rejected lines to some sort of output file that I can
> look at afterwards. How many errors there are is not the issue.

How about for TODO:

* Allow COPY to report error lines and continue; requires
nested transactions; optionally allow error codes to be specified

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Lee Kindness <lkindness(at)csl(dot)co(dot)uk>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Jim Buttafuoco <jim(at)buttafuoco(dot)net>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bulkloading using COPY - ignore duplicates?
Date: 2002-01-03 00:05:17
Message-ID: 8164.1010016317@sss.pgh.pa.us
Lists: pgsql-hackers

> How about for TODO:
> * Allow COPY to report error lines and continue; requires
> nested transactions; optionally allow error codes to be specified

Okay, that seems reasonable.

regards, tom lane


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Lee Kindness <lkindness(at)csl(dot)co(dot)uk>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Jim Buttafuoco <jim(at)buttafuoco(dot)net>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bulkloading using COPY - ignore duplicates?
Date: 2002-01-03 03:24:26
Message-ID: 200201030324.g033OQe25713@candle.pha.pa.us
Lists: pgsql-hackers

Tom Lane wrote:
> > How about for TODO:
> > * Allow COPY to report error lines and continue; requires
> > nested transactions; optionally allow error codes to be specified
>
> Okay, that seems reasonable.

Good. Now that I think of it, nested transactions don't seem required.
We already allow pg_dump to dump a database using INSERTs, and we don't
put those inserts in a single transaction when we load them:

CREATE TABLE "test" (
"x" integer
);

INSERT INTO "test" VALUES (1);
INSERT INTO "test" VALUES (2);

Should we be wrapping these INSERTs in a transaction? Can we do COPY
with each row being its own transaction?

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Lee Kindness <lkindness(at)csl(dot)co(dot)uk>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Jim Buttafuoco <jim(at)buttafuoco(dot)net>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bulkloading using COPY - ignore duplicates?
Date: 2002-01-03 18:24:26
Message-ID: 200201031824.g03IOQN23254@candle.pha.pa.us
Lists: pgsql-hackers

Bruce Momjian wrote:
> Tom Lane wrote:
> > > How about for TODO:
> > > * Allow COPY to report error lines and continue; requires
> > > nested transactions; optionally allow error codes to be specified
> >
> > Okay, that seems reasonable.
>
> Good. Now that I think of it, nested transactions don't seem required.
> We already allow pg_dump to dump a database using INSERTs, and we don't
> put those inserts in a single transaction when we load them:
>
> CREATE TABLE "test" (
> "x" integer
> );
>
> INSERT INTO "test" VALUES (1);
> INSERT INTO "test" VALUES (2);
>
> Should we be wrapping these INSERTs in a transaction? Can we do COPY
> with each row being its own transaction?

OK, added to TODO:

o Allow COPY to report error lines and continue; optionally
allow error codes to be specified

Seems nested transactions are not required if we load each COPY line in
its own transaction, like we do with INSERT from pg_dump.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Lee Kindness <lkindness(at)csl(dot)co(dot)uk>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Jim Buttafuoco <jim(at)buttafuoco(dot)net>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bulkloading using COPY - ignore duplicates?
Date: 2002-01-03 20:35:31
Message-ID: 15540.1010090131@sss.pgh.pa.us
Lists: pgsql-hackers

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> Seems nested transactions are not required if we load each COPY line in
> its own transaction, like we do with INSERT from pg_dump.

I don't think that's an acceptable answer. Consider

BEGIN;
other stuff;
COPY ....;
other stuff;
ROLLBACK;

regards, tom lane


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Lee Kindness <lkindness(at)csl(dot)co(dot)uk>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Jim Buttafuoco <jim(at)buttafuoco(dot)net>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bulkloading using COPY - ignore duplicates?
Date: 2002-01-03 20:42:05
Message-ID: 200201032042.g03Kg5p15566@candle.pha.pa.us
Lists: pgsql-hackers

Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > Seems nested transactions are not required if we load each COPY line in
> > its own transaction, like we do with INSERT from pg_dump.
>
> I don't think that's an acceptable answer. Consider

Oh, very good point. "Requires nested transactions" added to TODO.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026