Insert or Replace or \copy (bulkload)

Lists: pgsql-general
From: Ow Mun Heng <Ow(dot)Mun(dot)Heng(at)wdc(dot)com>
To: Postgresql General List <pgsql-general(at)postgresql(dot)org>
Subject: Insert or Replace or \copy (bulkload)
Date: 2007-08-14 09:33:39
Message-ID: 1187084019.21825.31.camel@neuromancer.home.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

I'm seeing an obstacle in my aim to migrate from mysql to PG mainly from
the manner in which PG handles duplicate entries either from primary
keys or unique entries.

Data is taken from perl DBI into (right now) CSV based files to be used
via psql's \copy command to insert into the table.

In MySql, I was using mysqlimport --replace which essentially provided
the means to load data into the DB, while at the same time, would
provide the necessary logic to replace the entire row if there was a
duplicate instead of dying.

Under PG, I've yet to found something similar to this functionality and
searches via google has uncovered that this is one thing which _still_
has not found its way into PG. (anyone knows why? Standards?)

Anyway, I found a workaround, but, to me, even though it provides a
means to an end, it still looks like it'll end up as a maintenance
nightmare each time a table has any additional columns added.

Solution is taken from this site:

http://www.pointwriter.com/blog/index.php?/archives/6-REPLACE-in-PostgreSQL.html

[..snip..]
For the sake of this example we'll use a very simple table called 'map'
with a two fields: 'key' and 'value'. Not surprisingly, key is the
primary key.

Let's further assume that every insert into this table should actually
be a replace. Then all we need is this simple statement and we're done:

CREATE RULE "replace_map" AS
ON INSERT TO "map_example"
WHERE
EXISTS(SELECT 1 FROM map_example WHERE key=NEW.key)
DO INSTEAD
(UPDATE map_example SET value=NEW.value WHERE key=NEW.key)

[...snip...]

Populate with some test data
==========================
XMMS=> insert into map_example(key,value,value2) values (1,1,1);
INSERT 0 1
XMMS=> insert into map_example(key,value,value2) values (2,2,2);
INSERT 0 1
XMMS=> select * from map_example
;
key | value | value2
-----+-------+--------
1 | 1 | 1
2 | 2 | 2
(2 rows)

Try to insert some duplicate data
+=================================
XMMS=>insert into map_example(key,value,value2) values (2,20,20);
INSERT 0 0
XMMS=> select * from map_example
;
key | value | value2
-----+-------+--------
1 | 1 | 1
2 | 20 | ====>2 <===== NOT Updated
(2 rows)

Add the new rule once we added the new columns
==============================================
XMMS=>CREATE OR REPLACE RULE "replace_map" AS
ON INSERT TO "map_example"
WHERE
EXISTS(SELECT 1 FROM map_example WHERE key=NEW.key)
DO INSTEAD
(UPDATE map_example SET value=NEW.value,value2=NEW.value2 WHERE
key=NEW.key)
;

XMMS=> insert into map_example(key,value,value2) values (2,20,20);
INSERT 0 0
XMMS=> select * from map_example
;
key | value | value2
-----+-------+--------
1 | 1 | 1
2 | 20 | 20
(2 rows)

Can anyone tell me if this won't turn out to be a maintenance nightmare?
So, the pertinent question is, is there a better mousetrap available?


From: "Scott Marlowe" <scott(dot)marlowe(at)gmail(dot)com>
To: "Ow Mun Heng" <Ow(dot)Mun(dot)Heng(at)wdc(dot)com>
Cc: "Postgresql General List" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Insert or Replace or \copy (bulkload)
Date: 2007-08-14 15:16:44
Message-ID: dcc563d10708140816x34037a80u84c240e3d5c49da2@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On 8/14/07, Ow Mun Heng <Ow(dot)Mun(dot)Heng(at)wdc(dot)com> wrote:
> I'm seeing an obstacle in my aim to migrate from mysql to PG mainly from
> the manner in which PG handles duplicate entries either from primary
> keys or unique entries.
>
> Data is taken from perl DBI into (right now) CSV based files to be used
> via psql's \copy command to insert into the table.
>
> In MySql, I was using mysqlimport --replace which essentially provided
> the means to load data into the DB, while at the same time, would
> provide the necessary logic to replace the entire row if there was a
> duplicate instead of dying.
>
> Under PG, I've yet to found something similar to this functionality and
> searches via google has uncovered that this is one thing which _still_
> has not found its way into PG. (anyone knows why? Standards?)
>
> Anyway, I found a workaround, but, to me, even though it provides a
> means to an end, it still looks like it'll end up as a maintenance
> nightmare each time a table has any additional columns added.
>
> Solution is taken from this site:
>
> http://www.pointwriter.com/blog/index.php?/archives/6-REPLACE-in-PostgreSQL.html

Example code snipped for brevity

> Can anyone tell me if this won't turn out to be a maintenance nightmare?
> So, the pertinent question is, is there a better mousetrap available?

I don't see why it would be a maintenance nightmare. Looks pretty
much like you just create it and go. Once it's in place it should
just work. There are other ways to skin this particular cat, but that
one seems as good as any.


From: Ow Mun Heng <Ow(dot)Mun(dot)Heng(at)wdc(dot)com>
To: Postgresql General List <pgsql-general(at)postgresql(dot)org>
Subject: Re: Insert or Replace or \copy (bulkload)
Date: 2007-08-15 02:06:50
Message-ID: 1187143610.9737.3.camel@neuromancer.home.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Tue, 2007-08-14 at 10:16 -0500, Scott Marlowe wrote:
> On 8/14/07, Ow Mun Heng <Ow(dot)Mun(dot)Heng(at)wdc(dot)com> wrote:
> >
> > In MySql, I was using mysqlimport --replace which essentially provided
> > the means to load data into the DB, while at the same time, would
> > provide the necessary logic to replace the entire row if there was a
> > duplicate instead of dying.

> > Anyway, I found a workaround, but, to me, even though it provides a
> > means to an end, it still looks like it'll end up as a maintenance
> > nightmare each time a table has any additional columns added.
> >
> > Solution is taken from this site:
> >
> > http://www.pointwriter.com/blog/index.php?/archives/6-REPLACE-in-PostgreSQL.html
>
> Example code snipped for brevity
>
> > Can anyone tell me if this won't turn out to be a maintenance nightmare?
> > So, the pertinent question is, is there a better mousetrap available?
>
> I don't see why it would be a maintenance nightmare. Looks pretty
> much like you just create it and go. Once it's in place it should
> just work. There are other ways to skin this particular cat, but that
> one seems as good as any.

That would be true only if I didn't have to (remember to) add a alter
the rule each time a new column is added. At the rate of things, it
might be quite an often procedure. (unless of course, I script it, which
is an idea by itself)

Ps : Is it this list's norm to have the OP/sender in the "to" list and
mailing list on the "CC" list?


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Ow Mun Heng <Ow(dot)Mun(dot)Heng(at)wdc(dot)com>
Cc: Postgresql General List <pgsql-general(at)postgresql(dot)org>
Subject: Re: Insert or Replace or \copy (bulkload)
Date: 2007-08-15 02:35:03
Message-ID: 20772.1187145303@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Ow Mun Heng <Ow(dot)Mun(dot)Heng(at)wdc(dot)com> writes:
> Ps : Is it this list's norm to have the OP/sender in the "to" list and
> mailing list on the "CC" list?

Yes. If you don't like that you can try including a "Reply-To: <list>"
header in what you send to the list; or perhaps better, I think there's
a way to tell the list bot to not send you copies of messages you're
cc'd on. But the old-timers are all accustomed to this behavior. The
PG lists are big enough that turnaround time is O(minutes) even when
everything is going well, and sometimes it isn't --- so direct cc's to
the people involved in a specific thread help keep the conversation
moving.

Also, some of us find a backup copy sent via the list to be a Good
Thing. I for one have a reputation of running spam filters that eat
pets and small children ... so if you want to be sure to get through
to me, don't forget to cc: the list.

regards, tom lane


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Ow Mun Heng" <Ow(dot)Mun(dot)Heng(at)wdc(dot)com>, "Postgresql General List" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Insert or Replace or \copy (bulkload)
Date: 2007-08-15 12:45:56
Message-ID: 87sl6lq83f.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Ow Mun Heng <Ow(dot)Mun(dot)Heng(at)wdc(dot)com> writes:
>> Ps : Is it this list's norm to have the OP/sender in the "to" list and
>> mailing list on the "CC" list?
>
> Yes. If you don't like that you can try including a "Reply-To: <list>"
> header in what you send to the list; or perhaps better, I think there's
> a way to tell the list bot to not send you copies of messages you're
> cc'd on.

Beware that setting Reply-To is actually "incorrect". It will cause people who
think they're sending you private email to unexpectedly send their private
mail to the list.

There's a correct header to set, I think Mail-Followup-To or something like
that. But sadly not all mail readers support it.

> I for one have a reputation of running spam filters that eat pets and small
> children ... so if you want to be sure to get through to me, don't forget to
> cc: the list.

They eat all my emails, but I'm sure that's intentional :)

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: "Scott Marlowe" <scott(dot)marlowe(at)gmail(dot)com>
To: "Gregory Stark" <stark(at)enterprisedb(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Ow Mun Heng" <Ow(dot)Mun(dot)Heng(at)wdc(dot)com>, "Postgresql General List" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Insert or Replace or \copy (bulkload)
Date: 2007-08-15 14:09:31
Message-ID: dcc563d10708150709q752e5288v2ba32caa0972d964@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On 8/15/07, Gregory Stark <stark(at)enterprisedb(dot)com> wrote:
> "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> > I for one have a reputation of running spam filters that eat pets and small
> > children ... so if you want to be sure to get through to me, don't forget to
> > cc: the list.
>
> They eat all my emails, but I'm sure that's intentional :)

I was wondering why my kids were so quiet...


From: Ow Mun Heng <Ow(dot)Mun(dot)Heng(at)wdc(dot)com>
To: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Cc: Postgresql General List <pgsql-general(at)postgresql(dot)org>
Subject: Re: Insert or Replace or \copy (bulkload)
Date: 2007-08-27 03:55:00
Message-ID: 1188186900.7562.14.camel@neuromancer.home.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Tue, 2007-08-14 at 10:16 -0500, Scott Marlowe wrote:
> On 8/14/07, Ow Mun Heng <Ow(dot)Mun(dot)Heng(at)wdc(dot)com> wrote:
> > I'm seeing an obstacle in my aim to migrate from mysql to PG mainly from
> > the manner in which PG handles duplicate entries either from primary
> > keys or unique entries.
> >
> > Data is taken from perl DBI into (right now) CSV based files to be used
> > via psql's \copy command to insert into the table.
> >
> > In MySql, I was using mysqlimport --replace which essentially provided
> > the means to load data into the DB, while at the same time, would
> > provide the necessary logic to replace the entire row if there was a
> > duplicate instead of dying.
> >
> > Under PG, I've yet to found something similar to this functionality and
> > searches via google has uncovered that this is one thing which _still_
> > has not found its way into PG. (anyone knows why? Standards?)
> >
> > Anyway, I found a workaround, but, to me, even though it provides a
> > means to an end, it still looks like it'll end up as a maintenance
> > nightmare each time a table has any additional columns added.
> >
> > Solution is taken from this site:
> >
> > http://www.pointwriter.com/blog/index.php?/archives/6-REPLACE-in-PostgreSQL.html
>
> Example code snipped for brevity
>
> > Can anyone tell me if this won't turn out to be a maintenance nightmare?
> > So, the pertinent question is, is there a better mousetrap available?
>
> I don't see why it would be a maintenance nightmare. Looks pretty
> much like you just create it and go. Once it's in place it should
> just work. There are other ways to skin this particular cat, but that
> one seems as good as any.

I just ran into trouble with this. This rule seems to work when I do
simple inserts, but as what I will be doing will be doing \copy
bulkloads, it will balk and fail.

Is there a suitable/replacement rule which can be used for this purpose
which will work bulkload wise?

Now would be a good idea to teach me how to skin the cat differently.


From: Ow Mun Heng <Ow(dot)Mun(dot)Heng(at)wdc(dot)com>
To: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Cc: Postgresql General List <pgsql-general(at)postgresql(dot)org>
Subject: \copy ignoring Rules Was [Re: Insert or Replace or \copy (bulkload)]
Date: 2007-08-27 05:14:35
Message-ID: 1188191675.7562.30.camel@neuromancer.home.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Mon, 2007-08-27 at 11:55 +0800, Ow Mun Heng wrote:
> I just ran into trouble with this. This rule seems to work when I do
> simple inserts, but as what I will be doing will be doing \copy
> bulkloads, it will balk and fail.
> Now would be a good idea to teach me how to skin the cat differently.

Base on the arhives, seems like this was raised as BUG 2437 close to a
year ago and it was marked as "correct behaviour" and the suggested
alternative is to either

1. Use pgloader (DL'ed it, but have not gone indepth to see if it suits
my needs)

2. COPY into temp table, and then perform an insert into.

=>create table foo (a int, b int, c int, d varchar(1) e bigint);
=>alter table foo add primary key (a,b,c,e);
=>insert into foo values(1,1,1,'A',1);
=>create rule replace_foo as on insert to foo where (exists(select 1
from foo where foo.a = new.a and foo.b = new.b and foo.c = new.c and
foo.e = new.e)) do instead update foo set d = new.d where foo.a = new.a
and foo.b = new.b and foo.c = new.c and foo.e = new.e
=> \d foo;
Table "xmms.foo"
Column | Type | Modifiers
--------+----------------------+-----------
a | integer | not null
b | integer | not null
c | integer | not null
d | character varying(1) |
e | bigint | not null
Indexes:
"foo_pkey" PRIMARY KEY, btree (a, b, c, e)
Rules:
replace_foo AS
ON INSERT TO foo
WHERE (EXISTS ( SELECT 1
FROM foo
WHERE foo.a = new.a AND foo.b = new.b AND foo.c = new.c AND
foo.e = new.e)) DO INSTEAD UPDATE foo SET d = new.d
WHERE foo.a = new.a AND foo.b = new.b AND foo.c = new.c AND foo.e =
new.e
=>create table temp_table_foo as select * from foo where 1=0;

=> select * from foo;
a | b | c | d | e
---+---+---+---+---
1 | 1 | 1 | A | 1

=>insert into temp_table_foo values(1,1,1,'Z',1);

=> select * from foo;
a | b | c | d | e
---+---+---+---+---
1 | 1 | 1 | Z | 1

=> truncate table foo

Seems like an awful lot of step and may even make the \copy process even
longer than it should.. Is there any other method? Seems to me pgloader
also does more or less the same thing..

What about pgbulkload? Anyone can provide any new insights?

Thanks...

http://archives.postgresql.org/pgsql-bugs/2006-05/msg00073.php