Re: SQL Diff ?

Lists: pgsql-general
From: Kevin Kempter <kevin(at)kevinkempterllc(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: SQL Diff ?
Date: 2007-08-26 02:51:59
Message-ID: 200708252051.59528.kevin@kevinkempterllc.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Hi List;

I have a very large table (52million rows) - I'm creating a copy of it to rid
it of 35G worth of dead space, then I'll do a sync, drop the original table
and rename table2.

Once I have the table2 as a copy of table1 what's the best way to select all
rows that have been changed, modified in table1 since the initial laod from
table1 into table2?

Also I'll need to delete any rows in table2 that no longer remain in table1.

There is no change date column
I could do something like select * from table1 where col1 || col2 || col3 etc
not in (select col1 || col2 || col3 etc from table2)

but this would be ineffecient & slow.

Anyone have a suggestion to do this in an efficient manner?

Thanks in advance

/Kevin


From: Ron Johnson <ron(dot)l(dot)johnson(at)cox(dot)net>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: SQL Diff ?
Date: 2007-08-26 03:10:19
Message-ID: 46D0EF1B.6010303@cox.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 08/25/07 21:51, Kevin Kempter wrote:
> Hi List;
>
> I have a very large table (52million rows) - I'm creating a copy of it to rid
> it of 35G worth of dead space, then I'll do a sync, drop the original table
> and rename table2.

What is your definition of "dead space"?

Bad rows, duplicate rows, old rows? Something else?

> Once I have the table2 as a copy of table1 what's the best way to select all
> rows that have been changed, modified in table1 since the initial laod from
> table1 into table2?
>
> Also I'll need to delete any rows in table2 that no longer remain in table1.
>
> There is no change date column
> I could do something like select * from table1 where col1 || col2 || col3 etc
> not in (select col1 || col2 || col3 etc from table2)
>
> but this would be ineffecient & slow.
>
> Anyone have a suggestion to do this in an efficient manner?
>
> Thanks in advance

- --
Ron Johnson, Jr.
Jefferson LA USA

Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFG0O8bS9HxQb37XmcRAsWdAJoDI/WvdaGSO7CUkUa74xHoRgycIwCguLXv
Msw0TdJyYI4keoq+ofu4j3c=
=Gi/f
-----END PGP SIGNATURE-----


From: Kevin Kempter <kevin(at)kevinkempterllc(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: SQL Diff ?
Date: 2007-08-26 03:21:09
Message-ID: 200708252121.10042.kevin@kevinkempterllc.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Saturday 25 August 2007 21:10:19 Ron Johnson wrote:
> On 08/25/07 21:51, Kevin Kempter wrote:
> > Hi List;
> >
> > I have a very large table (52million rows) - I'm creating a copy of it to
> > rid it of 35G worth of dead space, then I'll do a sync, drop the original
> > table and rename table2.
>
> What is your definition of "dead space"?
>
> Bad rows, duplicate rows, old rows? Something else?

deleted rows that should have been cleaned up with vacuum, problem is the
client let it go so long that now I cant get a vacuum to finish cause it
impacts the day2day operations too much. Long story, see my recent questions
on the performance list for more info.

>
> > Once I have the table2 as a copy of table1 what's the best way to select
> > all rows that have been changed, modified in table1 since the initial
> > laod from table1 into table2?
> >
> > Also I'll need to delete any rows in table2 that no longer remain in
> > table1.
> >
> > There is no change date column
> > I could do something like select * from table1 where col1 || col2 || col3
> > etc not in (select col1 || col2 || col3 etc from table2)
> >
> > but this would be ineffecient & slow.
> >
> > Anyone have a suggestion to do this in an efficient manner?
> >
> > Thanks in advance


From: brian <brian(at)zijn-digital(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: SQL Diff ?
Date: 2007-08-26 04:17:48
Message-ID: 46D0FEEC.4030605@zijn-digital.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Kevin Kempter wrote:
> Hi List;
>
> I have a very large table (52million rows) - I'm creating a copy of it to rid
> it of 35G worth of dead space, then I'll do a sync, drop the original table
> and rename table2.
>
> Once I have the table2 as a copy of table1 what's the best way to select all
> rows that have been changed, modified in table1 since the initial laod from
> table1 into table2?
>
> Also I'll need to delete any rows in table2 that no longer remain in table1.
>

Set log_statement to 'mod' (or 'ddl' if you expect any changes to the
schema itself) in postgresql.conf. Then it's just a matter of parsing
out the mods in the log.


From: Ron Johnson <ron(dot)l(dot)johnson(at)cox(dot)net>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: SQL Diff ?
Date: 2007-08-26 05:49:39
Message-ID: 46D11473.5050103@cox.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 08/25/07 22:21, Kevin Kempter wrote:
> On Saturday 25 August 2007 21:10:19 Ron Johnson wrote:
>> On 08/25/07 21:51, Kevin Kempter wrote:
>>> Hi List;
>>>
>>> I have a very large table (52million rows) - I'm creating a copy of it to
>>> rid it of 35G worth of dead space, then I'll do a sync, drop the original
>>> table and rename table2.
>> What is your definition of "dead space"?
>>
>> Bad rows, duplicate rows, old rows? Something else?
>
> deleted rows that should have been cleaned up with vacuum, problem is the
> client let it go so long that now I cant get a vacuum to finish cause it
> impacts the day2day operations too much. Long story, see my recent questions
> on the performance list for more info.

OK.

>
>>> Once I have the table2 as a copy of table1 what's the best way to select
>>> all rows that have been changed, modified in table1 since the initial
>>> laod from table1 into table2?

Is this a 24x7 database?

>>> Also I'll need to delete any rows in table2 that no longer remain in
>>> table1.
>>>
>>> There is no change date column
>>> I could do something like select * from table1 where col1 || col2 || col3
>>> etc not in (select col1 || col2 || col3 etc from table2)
>>>
>>> but this would be ineffecient & slow.
>>>
>>> Anyone have a suggestion to do this in an efficient manner?
>>>
>>> Thanks in advance

- --
Ron Johnson, Jr.
Jefferson LA USA

Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFG0RRzS9HxQb37XmcRAtuhAJ0TEBIHJwvcR58iU4MvjV2CYMvvfQCdFnkJ
ThDbGY8dXJ2MoqOaNHInqdU=
=GF05
-----END PGP SIGNATURE-----


From: "Dawid Kuroczko" <qnex42(at)gmail(dot)com>
To: "Kevin Kempter" <kevin(at)kevinkempterllc(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: SQL Diff ?
Date: 2007-08-26 14:02:02
Message-ID: 758d5e7f0708260702w61cdce5awd295897efb97d672@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On 8/26/07, Kevin Kempter <kevin(at)kevinkempterllc(dot)com> wrote:
> On Saturday 25 August 2007 21:10:19 Ron Johnson wrote:
> > On 08/25/07 21:51, Kevin Kempter wrote:
> > > Hi List;
> > >
> > > I have a very large table (52million rows) - I'm creating a copy of it to
> > > rid it of 35G worth of dead space, then I'll do a sync, drop the original
> > > table and rename table2.
> >
> > What is your definition of "dead space"?
> >
> > Bad rows, duplicate rows, old rows? Something else?
>
> deleted rows that should have been cleaned up with vacuum, problem is the
> client let it go so long that now I cant get a vacuum to finish cause it
> impacts the day2day operations too much. Long story, see my recent questions
> on the performance list for more info.

In your place I would do something like Slony-I does, when
it replicates the tables. Create on insert/update/delete triggers
on table1 which will log operations on table1 to some table1_log
table. Then copy table1 to table2. Then replay table1_log on
table2, then BEGIN;LOCK tablel1;finish replaying the lock;DROP table1;
alter table rename...;commit;

Or perhaps actually use Slony-I for the above steps? Should work
quite nicely... Or perhaps use SkyTools for it (I've never used it)?

Regarda,
Dawid


From: Kevin Kempter <kevin(at)kevinkempterllc(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: SQL Diff ?
Date: 2007-08-26 15:45:01
Message-ID: 200708260945.02351.kevin@kevinkempterllc.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Saturday 25 August 2007 23:49:39 Ron Johnson wrote:
> On 08/25/07 22:21, Kevin Kempter wrote:
> > On Saturday 25 August 2007 21:10:19 Ron Johnson wrote:
> >> On 08/25/07 21:51, Kevin Kempter wrote:
> >>> Hi List;
> >>>
> >>> I have a very large table (52million rows) - I'm creating a copy of it
> >>> to rid it of 35G worth of dead space, then I'll do a sync, drop the
> >>> original table and rename table2.
> >>
> >> What is your definition of "dead space"?
> >>
> >> Bad rows, duplicate rows, old rows? Something else?
> >
> > deleted rows that should have been cleaned up with vacuum, problem is the
> > client let it go so long that now I cant get a vacuum to finish cause it
> > impacts the day2day operations too much. Long story, see my recent
> > questions on the performance list for more info.
>
> OK.
>
> >>> Once I have the table2 as a copy of table1 what's the best way to
> >>> select all rows that have been changed, modified in table1 since the
> >>> initial laod from table1 into table2?
>
> Is this a 24x7 database?

Yes. with little room for extra overhead

>
> >>> Also I'll need to delete any rows in table2 that no longer remain in
> >>> table1.
> >>>
> >>> There is no change date column
> >>> I could do something like select * from table1 where col1 || col2 ||
> >>> col3 etc not in (select col1 || col2 || col3 etc from table2)
> >>>
> >>> but this would be ineffecient & slow.
> >>>
> >>> Anyone have a suggestion to do this in an efficient manner?
> >>>
> >>> Thanks in advance


From: Erik Jones <erik(at)myemma(dot)com>
To: Dawid Kuroczko <qnex42(at)gmail(dot)com>
Cc: "Kevin Kempter" <kevin(at)kevinkempterllc(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: SQL Diff ?
Date: 2007-08-27 02:34:30
Message-ID: F370BEA6-D9E4-4B03-99E7-16A042A8B820@myemma.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general


On Aug 26, 2007, at 9:02 AM, Dawid Kuroczko wrote:

> On 8/26/07, Kevin Kempter <kevin(at)kevinkempterllc(dot)com> wrote:
>> On Saturday 25 August 2007 21:10:19 Ron Johnson wrote:
>>> On 08/25/07 21:51, Kevin Kempter wrote:
>>>> Hi List;
>>>>
>>>> I have a very large table (52million rows) - I'm creating a copy
>>>> of it to
>>>> rid it of 35G worth of dead space, then I'll do a sync, drop the
>>>> original
>>>> table and rename table2.
>>>
>>> What is your definition of "dead space"?
>>>
>>> Bad rows, duplicate rows, old rows? Something else?
>>
>> deleted rows that should have been cleaned up with vacuum, problem
>> is the
>> client let it go so long that now I cant get a vacuum to finish
>> cause it
>> impacts the day2day operations too much. Long story, see my
>> recent questions
>> on the performance list for more info.
>
> In your place I would do something like Slony-I does, when
> it replicates the tables. Create on insert/update/delete triggers
> on table1 which will log operations on table1 to some table1_log
> table. Then copy table1 to table2. Then replay table1_log on
> table2, then BEGIN;LOCK tablel1;finish replaying the lock;DROP table1;
> alter table rename...;commit;
>
> Or perhaps actually use Slony-I for the above steps? Should work
> quite nicely... Or perhaps use SkyTools for it (I've never used it)?
>

Yeah, for trigger based replication it'd be simpler to just use Slony-
I or Skytools. However, if you're on 8.2, with row-wise comparisons,
you could do something like:

begin;
lock table1;

insert into table2
select *
from table1
where id not in (select id from test2);

drop table1;
alter table2 rename to table1;
commit;

Here id is your primary key. Note that if your ids are generated by
a sequence you'll need to use setval on the sequence to get it
"caught up" before that commit or you'll get duplicate key errors
immediately.

Erik Jones

Software Developer | Emma®
erik(at)myemma(dot)com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com


From: Alban Hertroys <alban(at)magproductions(dot)nl>
To: Kevin Kempter <kevin(at)kevinkempterllc(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: SQL Diff ?
Date: 2007-08-27 12:48:04
Message-ID: 46D2C804.8000201@magproductions.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Kevin Kempter wrote:
> Hi List;
>
> I have a very large table (52million rows) - I'm creating a copy of it to rid
> it of 35G worth of dead space, then I'll do a sync, drop the original table
> and rename table2.
>
> Once I have the table2 as a copy of table1 what's the best way to select all
> rows that have been changed, modified in table1 since the initial laod from
> table1 into table2?

I think you could get smart having a few rules for insert/update/delete
on 'table' that "keep track" of what happens during your work on table2.

--
Alban Hertroys
alban(at)magproductions(dot)nl

magproductions b.v.

T: ++31(0)534346874
F: ++31(0)534346876
M:
I: www.magproductions.nl
A: Postbus 416
7500 AK Enschede

// Integrate Your World //