Re: delete vs insert vs update due to primary key dups -> which is better

From: Ow Mun Heng <Ow(dot)Mun(dot)Heng(at)wdc(dot)com>
To: pgsql-general(at)postgreSQL(dot)org
Subject: delete vs insert vs update due to primary key dups -> which is better
Date: 2007-08-28 05:21:28
Message-ID: 1188278488.3046.54.camel@neuromancer.home.net
Lists: pgsql-general

Continuing with my efforts to get functionality similar to mysql's
mysqlimport --replace, I want to ask for the list's opinion on which of
the following is better.

What currently happens:
1. select from mssql (into CSV via Perl DBI)
2. psql \copy into PG
3. PG chokes on duplicate pkeys as there's no --replace option
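
For reference, step 2 is just the usual psql copy (file name
illustrative):

\copy main_table from 'export.csv' with csv

and step 3 is the unique-constraint violation that aborts the load
whenever the file contains a pkey that already exists in main_table.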

alternatives:
1. pgloader -> does not help to resolve dup pkeys. (My limited testing
also found that it doesn't write the rejects to a specified location.)
(see previous posts)

2. rules on tables (create rule replace as on insert to xxx where exists
(select 1 from xxx ...) do instead update xxx ...) to switch from insert
to update, loading from a temp_table (\copy to temp_table).
--> Does not help for \copy, as \copy bypasses rules.
--> Harder to maintain due to the complexity of rules, and based on what
I've read/heard (IRC) it is not as good an option.
(see previous posts)

3. \copy to temp_table, then do
--> update main_table set value=t2.value from temp_table t2 where
main_table.pkey = t2.pkey
(an alternative and better way, per depesz, is to append "and
(main_table.value is distinct from t2.value)", which touches only rows
whose values actually differ)
--> insert into main_table select * from temp_table t2 where not exists
(select * from main_table where main_table.pkey = t2.pkey)

4. \copy to temp_table, then do
--> delete from main_table where main_table.pkey in (select t2.pkey from
temp_table t2)
--> insert into main_table select * from temp_table
(both options are spelled out in full in the sketch below)
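
Spelled out in full, options 3 and 4 would look roughly like this (a
sketch only; main_table(pkey, value) and the file name are illustrative,
not the real schema):

-- stage the CSV into a temp table shaped like main_table
CREATE TEMP TABLE temp_table (LIKE main_table);
\copy temp_table from 'import.csv' with csv

-- option 3: update existing rows, then insert only the new ones
BEGIN;
UPDATE main_table
   SET value = t2.value
  FROM temp_table t2
 WHERE main_table.pkey = t2.pkey
   AND main_table.value IS DISTINCT FROM t2.value;  -- depesz's refinement
INSERT INTO main_table
SELECT t2.*
  FROM temp_table t2
 WHERE NOT EXISTS
       (SELECT 1 FROM main_table m WHERE m.pkey = t2.pkey);
COMMIT;

-- option 4: delete the duplicate pkeys, then insert everything
BEGIN;
DELETE FROM main_table
 WHERE pkey IN (SELECT pkey FROM temp_table);
INSERT INTO main_table SELECT * FROM temp_table;
COMMIT;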

btw, pkey is actually a multiple-column composite key which I've
concatenated together as varchar:

eg: select a+cast(b as varchar)+cast(c as
varchar)+cast(datediff(s,'19700101',timestamp) as varchar) as pkey
from mssql_server_table

to make the insertion/update/deletion simpler

Options 1 and 2 are already out, which leaves options 3 and 4.

Option 3 seems better because it will leave fewer dead tuples due to
MVCC, but the update statement will end up being a bit of a maintenance
nightmare each time a new column is added. (And columns are added quite
frequently!! alas!)

Option 4 is simpler: delete all the duplicate pkeys in main_table
before inserting the newest data (temp_table will _always_ have the
newest unique data for that table).

The only thing I'm worried about with option 4 is MVCC (I guess): the
dead tuples and the additional space.

Vacuum often???

Please, I would like your comments on this. Thanks in Advance.

On average, each day will bring between 5K and 250K unique rows.


From: Richard Huxton <dev(at)archonet(dot)com>
To: Ow Mun Heng <Ow(dot)Mun(dot)Heng(at)wdc(dot)com>
Cc: pgsql-general(at)postgreSQL(dot)org
Subject: Re: delete vs insert vs update due to primary key dups -> which is better
Date: 2007-08-28 07:19:17
Message-ID: 46D3CC75.5060502@archonet.com
Lists: pgsql-general

Ow Mun Heng wrote:
> Continuing with my efforts to get functionality similar to mysql's
> mysqlimport --replace, I want to ask for the list's opinion on which of
> the following is better.

I would suggest #3

> 3. \copy to temp_table, then do
> --> update main_table set value=t2.value from temp_table t2 where
> main_table.pkey = t2.pkey
> (an alternative and better way, per depesz, is to append "and
> (main_table.value is distinct from t2.value)", which touches only rows
> whose values actually differ)
> --> insert into main_table select * from temp_table t2 where not exists
> (select * from main_table where main_table.pkey = t2.pkey)

Don't forget to run a DELETE phase too, for rows that are no longer in
the import but are in your database.
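
Something like this would do it (a sketch, and it assumes temp_table
holds the complete current dataset rather than an incremental batch):

DELETE FROM main_table
 WHERE NOT EXISTS
       (SELECT 1 FROM temp_table t2 WHERE t2.pkey = main_table.pkey);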

> btw, pkey is actually a multiple-column composite key which I've
> concatenated together as varchar:
>
> eg: select a+cast(b as varchar)+cast(c as
> varchar)+cast(datediff(s,'19700101',timestamp) as varchar) as pkey
> from mssql_server_table
>
> to make the insertion/update/deletion simpler

Not sure I understand this bit. What gets complicated with a
multi-column key?

> Option 3 seems better because it will leave fewer dead tuples due to
> MVCC

It's also better because it's what you actually want to do (in a logical
sense, you want to update existing rows). That means if you ever attach
triggers to the table they will behave as you expect.

> but the update statement will end up being a bit of a maintenance
> nightmare each time a new column is added. (And columns are added quite
> frequently!! alas!)

If you're trying to keep two databases in sync then you need a process
to manage that. Don't try to make your import handle it - deal with it
properly. Scan the table definitions from both servers and compare them
column-by-column, adding and removing as required. THEN copy the data over.
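
On the PG side the current column list is easy to pull from the
information schema, and MSSQL exposes an INFORMATION_SCHEMA too, so your
Perl script can diff the two lists (table name illustrative):

SELECT column_name, data_type
  FROM information_schema.columns
 WHERE table_name = 'main_table'
 ORDER BY ordinal_position;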

> Option 4 is simpler: delete all the duplicate pkeys in main_table
> before inserting the newest data (temp_table will _always_ have the
> newest unique data for that table).
>
> The only thing I'm worried about with option 4 is MVCC (I guess): the
> dead tuples and the additional space.
>
> Vacuum often???
>
> Please, I would like your comments on this. Thanks in Advance.
>
> On average, each day will bring between 5K and 250K unique rows.

Yes, you should make sure you are vacuuming at least once per import
cycle. However, as long as you have enough space in your free-space-map,
any dead-tuple-space will be re-used.

The other option would be #5:
1. Copy whole table from MSSQL to PG, but under new name (t1_tmp)
2. Rename t1 => t1_old, t1_tmp => t1
3. Drop t1_old
If you don't have any foreign keys referring to the table, or views
depending on it then that is probably simplest, although least efficient.
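
The swap itself is just (table names as in the steps above):

BEGIN;
ALTER TABLE t1 RENAME TO t1_old;
ALTER TABLE t1_tmp RENAME TO t1;
DROP TABLE t1_old;
COMMIT;

and since DDL is transactional in PostgreSQL, the rename and drop either
all happen at commit or not at all.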

--
Richard Huxton
Archonet Ltd


From: Ow Mun Heng <Ow(dot)Mun(dot)Heng(at)wdc(dot)com>
To: Richard Huxton <dev(at)archonet(dot)com>
Cc: pgsql-general(at)postgreSQL(dot)org
Subject: Re: delete vs insert vs update due to primary key dups -> which is better
Date: 2007-08-28 07:41:45
Message-ID: 1188286905.3046.81.camel@neuromancer.home.net
Lists: pgsql-general

On Tue, 2007-08-28 at 08:19 +0100, Richard Huxton wrote:
> Ow Mun Heng wrote:
> > Continuing with my efforts to get functionality similar to mysql's
> > mysqlimport --replace, I want to ask for the list's opinion on which of
> > the following is better.
>
> I would suggest #3
>
> > 3. \copy ; update ; insert
>
> Don't forget to run a DELETE phase too, for rows that are no longer in
> the import but are in your database.

NO delete will _ever_ be done (hopefully, if there's enough disk
space), since the data is kept for archival purposes, tracking, etc.

Theoretically, all the imported rows are either new rows or updated
rows.
>
> > btw, pkey is actually a multiple-column composite key which I've
> > concatenated together as varchar:
> > to make the insertion/update/deletion simpler
>
> What gets complicated with a multi-column key?

Less typing per insert/update statement, so it'll be
"where a.pkey = b.pkey" instead of "a.key1 = b.key1 and a.key2 = b.key2
and ..." all the way up to key5.

> > Option 3 seems better because it will leave fewer dead tuples due
> > to MVCC
>
> It's also better because it's what you actually want to do (in a logical
> sense, you want to update existing rows). That means if you ever attach
> triggers to the table they will behave as you expect.

Actually, what I want is mysqlimport --replace functionality, which
just replaces the whole row w/ new data. This DB will never have to deal
w/ updates to specific rows/columns.

No triggering is in the plan. If there is any triggering, it will be to
denormalise the table once the data import is finished. This, if it
comes down to it, will be triggered via Perl DBI per import cycle.

which is why I like option 4

> If you're trying to keep two databases in sync then you need a process
> to manage that. Don't try and make your import handle it - deal with it
> properly.

Nope.. not trying to do that. Not any more, anyway.. The new DB will be
just a subset of the main_db's data: whatever is needed by the engineers
and nothing more. (new columns = new requests for "new" columns to be
added)

>
> > Option 4 is simpler: delete all the duplicate pkeys in the
> > main_table before inserting the newest data (temp_table will
> > _always_ have the newest unique data for that table)

Besides being the illogical way of doing things, what else is wrong
with option 4?

> >
> > On average, each day will bring between 5K and 250K unique rows.
>
> Yes, you should make sure you are vacuuming at least once per import
> cycle. However, as long as you have enough space in your free-space-map,
> any dead-tuple-space will be re-used.

An import cycle will pull between 100 and 1000 rows. I don't think a
vacuum per import cycle is needed. Do I?

>
> The other option would be #5:
> 1. Copy whole table from MSSQL to PG, but under new name (t1_tmp)
> 2. Rename t1 => t1_old, t1_tmp => t1
> 3. Drop t1_old
> If you don't have any foreign keys referring to the table, or views
> depending on it then that is probably simplest, although least efficient.

Actually, can't do that. Main DB data is being deleted every day. I
want to archive the old data (to be used for analysis etc) as well.


From: Richard Huxton <dev(at)archonet(dot)com>
To: Ow Mun Heng <Ow(dot)Mun(dot)Heng(at)wdc(dot)com>
Cc: pgsql-general(at)postgreSQL(dot)org
Subject: Re: delete vs insert vs update due to primary key dups -> which is better
Date: 2007-08-28 08:10:50
Message-ID: 46D3D88A.8070703@archonet.com
Lists: pgsql-general

Ow Mun Heng wrote:
> On Tue, 2007-08-28 at 08:19 +0100, Richard Huxton wrote:
>> Ow Mun Heng wrote:
>>> Continuing with my efforts to get functionality similar to mysql's
>>> mysqlimport --replace, I want to ask for the list's opinion on which of
>>> the following is better.
>> I would suggest #3
>>
>>> 3. \copy ; update ; insert
>> Don't forget to run a DELETE phase too, for rows that are no longer in
>> the import but are in your database.
>
> NO delete will _ever_ be done (hopefully, if there's enough disk
> space), since the data is kept for archival purposes, tracking, etc.
>
> Theoretically, all the imported rows are either new rows or updated
> rows.

OK, so it's more a history table.

>>> btw, pkey is actually a multiple-column composite key which I've
>>> concatenated together as varchar:
>>> to make the insertion/update/deletion simpler
>> What gets complicated with a multi-column key?
>
> Less typing per insert/update statement, so it'll be
> "where a.pkey = b.pkey" instead of "a.key1 = b.key1 and a.key2 = b.key2
> and ..." all the way up to key5.

I'd still leave it alone, but it's your database.

>>> Option 3 seems better because it will leave fewer dead tuples due
>>> to MVCC
>> It's also better because it's what you actually want to do (in a logical
>> sense, you want to update existing rows). That means if you ever attach
>> triggers to the table they will behave as you expect.
>
> Actually, what I want is mysqlimport --replace functionality, which
> just replaces the whole row w/ new data. This DB will never have to deal
> w/ updates to specific rows/columns.
>
> No triggering is in the plan. If there is any triggering, it will be to
> denormalise the table once the data import is finished. This, if it
> comes down to it, will be triggered via Perl DBI per import cycle.
>
> which is why I like option 4
>
>> If you're trying to keep two databases in sync then you need a process
>> to manage that. Don't try and make your import handle it - deal with it
>> properly.
>
> Nope.. not trying to do that. Not any more, anyway.. The new DB will be
> just a subset of the main_db's data: whatever is needed by the engineers
> and nothing more. (new columns = new requests for "new" columns to be
> added)

I still think you'll need a system to handle this. What do you do when
you "add" a column but the rows are deleted from the mssql server?
You'll want some human-level rules at least.

>>> Option 4 is simpler: delete all the duplicate pkeys in the
>>> main_table before inserting the newest data (temp_table will
>>> _always_ have the newest unique data for that table)
>
> Besides being the illogical way of doing things, what else is wrong
> with option 4?

It'll do the job just fine.

>>> On average, each day will bring between 5K and 250K unique rows.
>> Yes, you should make sure you are vacuuming at least once per import
>> cycle. However, as long as you have enough space in your free-space-map,
>> any dead-tuple-space will be re-used.
>
> An import cycle will pull between 100 and 1000 rows. I don't think a
> vacuum per import cycle is needed. Do I?

Well, on a recent version I'd hope autovacuum is handling it for you.
Basically there's no point in vacuuming more than once per import
(assuming you don't have any more updates). If you vacuum once per day
then you'll need to allow space for 5000+ rows.
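
If you do want an explicit cleanup per import, a plain

VACUUM ANALYZE main_table;

after the insert/update step is enough, and the ANALYZE keeps the
planner's statistics current for the freshly loaded rows.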

>> The other option would be #5:
>> 1. Copy whole table from MSSQL to PG, but under new name (t1_tmp)
>> 2. Rename t1 => t1_old, t1_tmp => t1
>> 3. Drop t1_old
>> If you don't have any foreign keys referring to the table, or views
>> depending on it then that is probably simplest, although least efficient.
>
> Actually, can't do that. Main DB data is being deleted everyday. I want
> to archive the old data (to be used for analysis etc) as well.

OK - I think I see what you're doing now.

Regarding the "new" columns - I'd be tempted to copy all the columns
over and perhaps hide them with a view at the PG end of things.

--
Richard Huxton
Archonet Ltd


From: Michael Glaesemann <grzm(at)seespotcode(dot)net>
To: Richard Huxton <dev(at)archonet(dot)com>
Cc: Ow Mun Heng <Ow(dot)Mun(dot)Heng(at)wdc(dot)com>, pgsql-general(at)postgreSQL(dot)org
Subject: Re: delete vs insert vs update due to primary key dups -> which is better
Date: 2007-08-28 13:06:32
Message-ID: 2FA9844F-C3C7-4183-A310-9F7D5C1C1C8C@seespotcode.net
Lists: pgsql-general


On Aug 28, 2007, at 3:10 , Richard Huxton wrote:

> Ow Mun Heng wrote:
>> Less typing per insert/update statement, so it'll be "where a.pkey =
>> b.pkey" instead of "a.key1 = b.key1 and a.key2 = b.key2
>> and ..." all the way up to key5.
>
> I'd still leave it alone, but it's your database.

And you can use the row comparison syntax:

WHERE (a.key1, a.key2, a.key3, a.key4, a.key5) = (b.key1, b.key2,
b.key3, b.key4, b.key5)

which I find very convenient as it makes the composite key much
clearer than ANDed equalities.
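
Applied to the option 3 update from earlier in the thread, that would
look something like this (a sketch; key1..key5 stand for the five
component columns):

UPDATE main_table
   SET value = t2.value
  FROM temp_table t2
 WHERE (main_table.key1, main_table.key2, main_table.key3,
        main_table.key4, main_table.key5)
     = (t2.key1, t2.key2, t2.key3, t2.key4, t2.key5)
   AND main_table.value IS DISTINCT FROM t2.value;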

Michael Glaesemann
grzm seespotcode net