Need to update all my 60 million rows at once without transactional integrity

From: christian_behrens(at)gmx(dot)net
To: pgsql-general(at)postgresql(dot)org
Subject: Need to update all my 60 million rows at once without transactional integrity
Date: 2008-04-20 22:19:34
Message-ID: 20080420221934.13270@gmx.net
Lists: pgsql-general

Hi!

How can I update a column in a very large table for all rows without using double the amount of disk space, and without needing an atomic operation?

I have a very large table with about 60 million rows. I sometimes need to do a simple update to ALL rows that resets a status-flag to zero.

I don't need transactional integrity for this (though of course if the system crashes there must be no data corruption; a separate flag in the file system can record the fact that the bulk update was in progress). I don't care about, or want, an abort or the "all or nothing" semantics that SQL mandates. The idea is basically that this update either succeeds or it succeeds - there is no "not". It must succeed. It must be retried until it works. If the update is halfway finished, that's okay.

If I just do an
UPDATE table SET flag=0;
then Pg will make a copy of every row, which must later be cleaned up by vacuum. I understand - and, for this specific problem, don't like - that PG is an MVCC database and will keep the old versions of the rows from before the update alive. This is very bad.

If I do a batched loop like this:
UPDATE table SET flag=0 where id>=0 and id <200;
UPDATE table SET flag=0 where id>=200 and id <400;
UPDATE table SET flag=0 where id>=400 and id <600;
...

then PG will seek all over my hard drive, I think.

It would be much better if it could just start at the beginning of the table and work its way towards the end. But which sort criteria would guarantee this? I don't think there is any SQL that does something like that.

Another idea (and I think it's quite a good one) would be to
drop the column and recreate it with a new default value.

But the problem is that this is not actually MY database but an appliance (whose hard drive does not have double the amount of space available, by the way), and it has to work reliably whenever something special happens.

And I don't think I should keep creating new columns all the time (the old ones would only be hidden, and their internal column IDs lost, I think) - that might run into a limit.

Can I maybe do it every day??

Is there any other way to go?

I would really like to kind of "skip transactions" - of course keeping basic integrity of the on-disk structures, but without atomicity for this bulk update!

I read that PG has an internal command language that is used to bootstrap a database, when the system tables that hold e.g. table names are not present yet.

Could I use that to hack my way around transactions?

Basically I can do anything to this PG installation; this is an extremely controlled, locked-down environment. I don't need to care about anyone else - it's always the same schema, the same version, the same OS, etc., and I could happily apply any hack that solves the problem.

Many thanks for any pointers or ideas how to solve this problem!
Christian


From: "A(dot) Kretschmer" <andreas(dot)kretschmer(at)schollglas(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Need to update all my 60 million rows at once without transactional integrity
Date: 2008-04-23 08:22:55
Message-ID: 20080423082255.GE8401@a-kretschmer.de
Lists: pgsql-general

On Mon, 21.04.2008 at 00:19:34 +0200, christian_behrens(at)gmx(dot)net wrote the following:
> Hi!
>
> How can I update a column in a very large table for all rows without using double the amount of disk space, and without needing an atomic operation?
>
> I have a very large table with about 60 million rows. I sometimes need to do a simple update to ALL rows that resets a status-flag to zero.
>
> I don't need transactional integrity for this (though of course if the system crashes there must be no data corruption; a separate flag in the file system can record the fact that the bulk update was in progress). I don't care about, or want, an abort or the "all or nothing" semantics that SQL mandates. The idea is basically that this update either succeeds or it succeeds - there is no "not". It must succeed. It must be retried until it works. If the update is halfway finished, that's okay.
>
> If I just do an
> UPDATE table SET flag=0;
> then Pg will make a copy of every row, which must later be cleaned up by vacuum. I understand - and, for this specific problem, don't like - that PG is an MVCC database and will keep the old versions of the rows from before the update alive. This is very bad.
>
> If I do a batched loop like this:
> UPDATE table SET flag=0 where id>=0 and id <200;
> UPDATE table SET flag=0 where id>=200 and id <400;
> UPDATE table SET flag=0 where id>=400 and id <600;

Don't forget to VACUUM after every UPDATE...

>
> Is there any other way to go?

Upgrade to 8.3 and profit from the new HOT feature (wild guess: there is
no index on this flag column)

Andreas
--
Andreas Kretschmer
Contact: Heynitz: 035242/47150, D1: 0160/7141639 (more: -> header)
GnuPG-ID: 0x3FFF606C, privat 0x7F4584DA http://wwwkeys.de.pgp.net


From: "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>
To: "A(dot) Kretschmer" <andreas(dot)kretschmer(at)schollglas(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Need to update all my 60 million rows at once without transactional integrity
Date: 2008-04-23 09:48:41
Message-ID: 2e78013d0804230248o58740a66l81363398758d6f37@mail.gmail.com
Lists: pgsql-general

On Wed, Apr 23, 2008 at 1:52 PM, A. Kretschmer
<andreas(dot)kretschmer(at)schollglas(dot)com> wrote:
> On Mon, 21.04.2008 at 00:19:34 +0200, christian_behrens(at)gmx(dot)net wrote the following:
>
> >
> > If I do a batched loop like this:
> > UPDATE table SET flag=0 where id>=0 and id <200;
> > UPDATE table SET flag=0 where id>=200 and id <400;
> > UPDATE table SET flag=0 where id>=400 and id <600;
>
>
> Don't forget to VACUUM after every UPDATE...
>

VACUUMing a large table so often could be a problem. But if disk space is
the only limitation and you don't care much about IO and CPU usage,
it's not a bad idea.

>
> >
> > Is there any other way to go?
>
> Upgrade to 8.3 and profit from the new HOT feature (wild guess: there is
> no index on this flag column)
>

HOT may not help a lot in this case. HOT needs free space in the same
block to put the new version. It can recycle previously updated
rows and thus free up space, but only if those rows were updated in an
older (now committed) transaction. Now, if you are doing batch
updates, there is a chance that HOT may be able to recycle rows
updated in one of the previous batches. But if the updates happen
sequentially, the blocks that were updated previously will never be
touched again, and hence no space will be freed.

If you are updating one row at a time (each in a separate transaction),
or if the batch updates are somewhat scattered, then HOT can reuse the
dead tuples and limit the bloat.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


From: Tomasz Ostrowski <tometzky(at)batory(dot)org(dot)pl>
To: christian_behrens(at)gmx(dot)net
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Need to update all my 60 million rows at once without transactional integrity
Date: 2008-04-23 11:49:30
Message-ID: 480F224A.50706@batory.org.pl
Lists: pgsql-general

On 2008-04-21 00:19, christian_behrens(at)gmx(dot)net wrote:

> I have a very large table with about 60 million rows. I sometimes
> need to do a simple update to ALL rows that resets a status-flag to
> zero.
> UPDATE table SET flag=0;

First optimization:
UPDATE table SET flag=0 where flag!=0;

Second optimization:

> If I do a batched loop like this:
> UPDATE table SET flag=0 where id>=0 and id <200;
> UPDATE table SET flag=0 where id>=200 and id <400;
> UPDATE table SET flag=0 where id>=400 and id <600;
> then PG will seek all over my hard drive, I think.

Loop like this (in pseudo-code):

non0 = select count(*) from table where flag != 0;
batchsize = 1000000;
for (i = 0; i < non0; i += batchsize) {
    update table set flag = 0 where id in
        (select id from table where flag != 0 limit batchsize);
    commit;
    vacuum table;
    analyze table;
}

You'll use only about 1/60th of the extra space (one batch's worth of
dead rows at a time). It will need about 3*60 table scans, but it
should not seek too much.

Regards
Tometzky
--
...although Eating Honey was a very good thing to do, there was a
moment just before you began to eat it which was better than when you
were...
Winnie the Pooh


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: christian_behrens(at)gmx(dot)net
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Need to update all my 60 million rows at once without transactional integrity
Date: 2008-04-23 13:04:29
Message-ID: 20080423130429.GD4768@alvh.no-ip.org
Lists: pgsql-general

christian_behrens(at)gmx(dot)net wrote:

> How can I update a column in a very large table for all rows
> without using double the amount of disk space, and without needing
> an atomic operation?
>
> I have a very large table with about 60 million rows. I sometimes need
> to do a simple update to ALL rows that resets a status-flag to zero.

Perhaps you should rethink your data model.

> I would really like to kind of "skip transactions" - of course keeping
> basic integrity of the on-disk structures, but without atomicity for
> this bulk update!

There is no such thing in Postgres as "not a transaction".

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>
To: christian_behrens(at)gmx(dot)net
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Need to update all my 60 million rows at once without transactional integrity
Date: 2008-04-23 15:18:28
Message-ID: 2e78013d0804230818v9e68a89oba1c30a5d4ed861e@mail.gmail.com
Lists: pgsql-general

On Mon, Apr 21, 2008 at 3:49 AM, <christian_behrens(at)gmx(dot)net> wrote:

>
> Could I use that to hack my way around transactions?
>

Since you are asking for trouble, maybe there is something you can do
with BEFORE UPDATE triggers and heap_inplace_update(). Before you try
this out: I must say, *I have no idea if this will work in all
scenarios, and I give no guarantee of data consistency*. So do it
at your own risk :-) Obviously, transaction integrity and MVCC are
compromised. But I think crash recovery should work fine, because
heap_inplace_update() takes care of WAL logging.

Write a BEFORE UPDATE trigger in C, something like this:

#include "postgres.h"
#include "fmgr.h"
#include "access/heapam.h"
#include "commands/trigger.h"

PG_FUNCTION_INFO_V1(inplace_update_trigger);
extern Datum inplace_update_trigger(PG_FUNCTION_ARGS);

Datum
inplace_update_trigger(PG_FUNCTION_ARGS)
{
    TriggerData *trigdata = (TriggerData *) fcinfo->context;

    /* Aim the new tuple at the old tuple's location and overwrite it
     * in place, bypassing the normal MVCC row copy. */
    trigdata->tg_newtuple->t_self = trigdata->tg_trigtuple->t_self;
    heap_inplace_update(trigdata->tg_relation, trigdata->tg_newtuple);

    /* Returning NULL from a BEFORE trigger suppresses the real UPDATE. */
    return PointerGetDatum(NULL);
}

CREATE OR REPLACE FUNCTION inplace_update_trigger()
RETURNS TRIGGER
AS 'trigger.so', 'inplace_update_trigger'
LANGUAGE C STRICT;

CREATE TRIGGER inplace_update_trigger BEFORE UPDATE ON <tblname>
FOR EACH ROW EXECUTE PROCEDURE inplace_update_trigger();

Now whenever you update a row in the table, the BEFORE UPDATE trigger
will update the old tuple in place and return NULL. That ensures
that the actual UPDATE operation is not performed, but the changes are
permanently recorded on the old tuple. In case of a crash or transaction
abort, the updates cannot be rolled back. Also, you may want to take
an exclusive lock on the relation before you start the update.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


From: "Merlin Moncure" <mmoncure(at)gmail(dot)com>
To: "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>
Cc: christian_behrens(at)gmx(dot)net, pgsql-general(at)postgresql(dot)org
Subject: Re: Need to update all my 60 million rows at once without transactional integrity
Date: 2008-04-23 15:27:14
Message-ID: b42b73150804230827w3a2fb591p9c2839f255903f54@mail.gmail.com
Lists: pgsql-general

On Wed, Apr 23, 2008 at 9:04 AM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:
> christian_behrens(at)gmx(dot)net wrote:
>
>
> > How can I update a column in a very large table for all rows
> > without using double the amount of disk space, and without needing
> > an atomic operation?
> >
> > I have a very large table with about 60 million rows. I sometimes need
> > to do a simple update to ALL rows that resets a status-flag to zero.
>
> Perhaps you should rethink your data model.

for example:
*) change the status-flag to a timestamp
*) make a new table (status_reset or something) with one column, one row... and
*) make a view (perhaps swapping names with your original table) that
joins the two tables and preserves the 1/0 status-flag column
appearance, so you don't have to change the app (a sketch follows below).
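
A minimal sketch of that idea, with all table and column names invented
here for illustration:

-- the big table gets a timestamp instead of a 0/1 flag
ALTER TABLE items ADD COLUMN touched_at timestamptz;
-- the app "sets the flag" with: UPDATE items SET touched_at = now() WHERE ...;

-- one-column, one-row table recording the last global reset
CREATE TABLE status_reset (reset_at timestamptz NOT NULL);
INSERT INTO status_reset VALUES (now());

-- a view preserving the old 1/0 flag appearance for the app
CREATE VIEW items_flagged AS
    SELECT i.*,
           CASE WHEN i.touched_at > r.reset_at THEN 1 ELSE 0 END AS flag
    FROM items i, status_reset r;

-- "reset all 60 million flags" is now a single-row update
UPDATE status_reset SET reset_at = now();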

Consider the fundamental tenet: 'If PostgreSQL can't do it well, you
are probably doing it the wrong way' :-)

merlin


From: Steve Crawford <scrawford(at)pinpointresearch(dot)com>
To: christian_behrens(at)gmx(dot)net
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Need to update all my 60 million rows at once without transactional integrity
Date: 2008-04-23 16:32:02
Message-ID: 480F6482.3020005@pinpointresearch.com
Lists: pgsql-general

christian_behrens(at)gmx(dot)net wrote:
> Hi!
>
> How can I update a column in a very large table for all rows without using double the amount of disk space, and without needing an atomic operation?
>
> I have a very large table with about 60 million rows. I sometimes need to do a simple update to ALL rows that resets a status-flag to zero.
>
Without knowing details of your app, here are some random thoughts:

1. Use a WHERE clause. If the number of non-zero status flags is a small
portion of the table, then the table will only grow by the number of
flags that need to be reset, not by the whole table.

2. Split the flag into a separate table. You will have index overhead, but:

2a. You can reset by a simple TRUNCATE followed by an INSERT INTO
flagtable (id, flag) SELECT rowid, 0 FROM yourmaintable.

2b. Even if (not recommended) you did a full update of the flagtable,
you would only be growing the usage by the size of the flagtable.

2c. You may be able to have the flagtable store only non-zero flags, in
which case you could use coalesce((SELECT flag FROM flagtable WHERE
flagtable.id = maintable.id), 0) to fetch the flag. Then a reset is just
a near-instantaneous truncate (see the sketch after this list).

3. Partition your table - use inheritance to create a main table
consisting of many children containing the data. Depending on your app,
there may be other benefits to partitioning. But in any case, you can
update one child table at a time, following the update of each sub-table
with a CLUSTER, which is far faster than VACUUM FULL.
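
A minimal sketch of option 2c, with names invented for illustration
(this assumes rowid is the primary key of yourmaintable):

-- side table storing only the non-zero flags
CREATE TABLE flagtable (
    id integer PRIMARY KEY REFERENCES yourmaintable (rowid),
    flag integer NOT NULL
);

-- fetch a row's flag, defaulting to 0 when no entry exists
SELECT m.rowid,
       coalesce((SELECT f.flag FROM flagtable f
                 WHERE f.id = m.rowid), 0) AS flag
FROM yourmaintable m;

-- resetting every flag to zero is near-instantaneous
TRUNCATE flagtable;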

Cheers,
Steve


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: christian_behrens(at)gmx(dot)net
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Need to update all my 60 million rows at once without transactional integrity
Date: 2008-04-23 18:15:52
Message-ID: 1208974552.4259.1410.camel@ebony.site
Lists: pgsql-general

On Mon, 2008-04-21 at 00:19 +0200, christian_behrens(at)gmx(dot)net wrote:

> How can I update a column in a very large table for all rows
> without using double the amount of disk space, and without needing
> an atomic operation?
>
> I have a very large table with about 60 million rows. I sometimes need
> to do a simple update to ALL rows that resets a status-flag to zero.
>
> I don't need transactional integrity for this (though of course if the
> system crashes there must be no data corruption.

No such thing. Without transactions you have no sensible definition of
what constitutes data corruption.

> A separate flag in the file system can record the fact that the
> bulk update was in progress). I don't care about, or want, an abort or
> the "all or nothing" semantics that SQL mandates. The idea is basically
> that this update either succeeds or it succeeds - there is no "not".
> It must succeed. It must be retried until it works. If the update is
> halfway finished, that's okay.

Don't reset them to zero; just redefine the meaning of the counter. Take
the max value in the table, and then have the app understand that
anything <= that previous max value means the same thing as whatever
"status = 0" means now. The app would need to read the new baseline
value before performing any work.

This is roughly the technique used by Slony to avoid needing to update
every row in the log table to show that it has successfully replicated
it. It's also the technique MVCC relies upon internally.

It's less work and crash safe in all cases.
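
A minimal sketch of the baseline idea, with all names invented for
illustration (assume rows carry an ever-increasing status stamp,
e.g. taken from a sequence, instead of a 0/1 flag):

-- single-row table holding the current baseline
CREATE TABLE baseline (max_done bigint NOT NULL);
INSERT INTO baseline VALUES (0);

-- "reset all flags" becomes one tiny, crash-safe update
UPDATE baseline SET max_done = (SELECT max(status) FROM bigtable);

-- the app treats status <= max_done as "flag = 0", so the rows
-- still flagged are simply:
SELECT * FROM bigtable, baseline WHERE status > max_done;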

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


From: Tom Allison <tom(at)tacocat(dot)net>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: "christian_behrens(at)gmx(dot)net" <christian_behrens(at)gmx(dot)net>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Need to update all my 60 million rows at once without transactional integrity
Date: 2008-04-23 21:35:14
Message-ID: BB3A6D34-4DBD-4347-BAA5-9947FE132BA9@tacocat.net
Lists: pgsql-general

Far from being an expert on Postgres, but here are two ideas -
assuming that you cannot afford the time it would take to simply
UPDATE and wait...

Write a script to update all the rows, one at a time. Lowest impact on
operations, but it would take a very long time.

Assuming you have a sequential primary key value on each row, update by
ID blocks on the order of 10,000s or 100,000s at a time (or more).
This is a balancing act between time to complete and immediate load on
the server.

I've used both options but I don't think I've exceeded 15 million rows.

Sent from my iPhone.

On Apr 23, 2008, at 2:15 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> On Mon, 2008-04-21 at 00:19 +0200, christian_behrens(at)gmx(dot)net wrote:
>
>> How can I update a column in a very large table for all rows
>> without using double the amount of disk space, and without needing
>> an atomic operation?
>>
>> I have a very large table with about 60 million rows. I sometimes
>> need to do a simple update to ALL rows that resets a status-flag
>> to zero.
>>
>> I don't need transactional integrity for this (though of course if
>> the system crashes there must be no data corruption.
>
> No such thing. Without transactions you have no sensible definition of
> what constitutes data corruption.
>
>> A separate flag in the file system can record the fact that the
>> bulk update was in progress). I don't care about, or want, an abort
>> or the "all or nothing" semantics that SQL mandates. The idea is
>> basically that this update either succeeds or it succeeds - there
>> is no "not". It must succeed. It must be retried until it works.
>> If the update is halfway finished, that's okay.
>
> Don't reset them to zero, just redefine the meaning of the counter.
> Take
> the max value in the table and then have the app understand that
> anything <= the previous max value means the same thing as whatever
> "status = 0" means now. The app would need to read the new baseline
> value before performing any work.
>
> This is roughly the technique used by Slony to avoid needing to update
> every row in the log table to show that it has successfully replicated
> it. It's also the technique MVCC relies upon internally.
>
> It's less work and crash safe in all cases.
>
> --
> Simon Riggs
> 2ndQuadrant http://www.2ndQuadrant.com


From: Chris Browne <cbbrowne(at)acm(dot)org>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Need to update all my 60 million rows at once without transactional integrity
Date: 2008-04-23 22:19:53
Message-ID: 60tzhs8b92.fsf@dba2.int.libertyrms.com
Lists: pgsql-general

christian_behrens(at)gmx(dot)net writes:
> How can I update a column in a very large table for all rows
> without using double the amount of disk space, and without needing
> an atomic operation?

You may need to redefine the problem.

> I have a very large table with about 60 million rows. I sometimes
> need to do a simple update to ALL rows that resets a status-flag to
> zero.

That seems like a pretty useless update to me...

Why not instead redefine the "status-we-need-to-have" to be zero?
[e.g. - don't try to turn off the Sun; instead, define "brightness" as
the standard ;-)]

> I don't need transactional integrity for this (though of course if the system crashes there must be no data corruption; a separate flag in the file system can record the fact that the bulk update was in progress). I don't care about, or want, an abort or the "all or nothing" semantics that SQL mandates. The idea is basically that this update either succeeds or it succeeds - there is no "not". It must succeed. It must be retried until it works. If the update is halfway finished, that's okay.
>
> If I just do an
> UPDATE table SET flag=0;
> then Pg will make a copy of every row, which must later be cleaned up by vacuum. I understand - and, for this specific problem, don't like - that PG is an MVCC database and will keep the old versions of the rows from before the update alive. This is very bad.
>
> If I do a batched loop like this:
> UPDATE table SET flag=0 where id>=0 and id <200;
> UPDATE table SET flag=0 where id>=200 and id <400;
> UPDATE table SET flag=0 where id>=400 and id <600;
> ...
>
> then PG will seek all over my harddrive I think.

If I *were* going to do this, I'd want to do:

update table set flag=0 where flag <> 0 and id in (select id from table where flag <> 0 limit 50000);

which will go through the table 50K rows at a time.

> It would be much better if it could just start at the beginning of
> the table and work its way towards the end. But which sort criteria
> would guarantee this? I don't think there is any SQL that does
> something like that.

You couldn't guarantee physical ordering anyways, so that seems
futile.

> Another idea (and I think it's quite a good one) would be to
> drop the column and recreate it with a new default value.

The "column drop" would be virtually instantaneous; it would
essentially be hidden from view.

But the addition of the column would rewrite the table, doubling its
size :-(.

> And I don't think I should keep creating new columns all the time
> (the old ones would only be hidden, and their internal column IDs
> lost, I think) - that might run into a limit.

That's correct. It will eventually cause a problem.

> Is there any other way to go?

Step back to what value you're putting into that column, and why.

Perhaps you never need to change this value.

In the Slony-I replication system, we have something sort of similar;
we collect a table of entries that need to get applied to another
database. (Table called sl_log_1.) We *NEVER* alter its contents;
that is not necessary to indicate that data has been replicated (which
is the status of interest).

Instead of updating tuples to mark that they are processed, we instead
store information in another table that indicates up to which point in
time we have finished replicating. (It's more complex and indirect
than that, but nonetheless, it's still a fair characterization...)

So, perhaps you should be storing a per-day value in the table, and
store, somewhere else, what point you're "done up to."
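
A minimal sketch of that scheme, with all names invented for
illustration:

-- single-row table recording how far processing has gotten
CREATE TABLE progress (done_up_to date NOT NULL);
INSERT INTO progress VALUES ('2000-01-01');

-- rows still needing work carry a day stamp newer than the mark
SELECT * FROM worktable w, progress p WHERE w.batch_day > p.done_up_to;

-- marking a whole day as processed never touches worktable
UPDATE progress SET done_up_to = current_date;
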
--
let name="cbbrowne" and tld="linuxdatabases.info" in name ^ "@" ^ tld;;
http://linuxdatabases.info/info/emacs.html
Last night I played a blank tape at full blast. The mime next door
went nuts.


From: Alban Hertroys <dalroi(at)solfertje(dot)student(dot)utwente(dot)nl>
To: christian_behrens(at)gmx(dot)net
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Need to update all my 60 million rows at once without transactional integrity
Date: 2008-04-23 22:36:49
Message-ID: DC4DF14B-E45B-4E61-87DD-0353FDDB2DC6@solfertje.student.utwente.nl
Lists: pgsql-general

On Apr 21, 2008, at 12:19 AM, christian_behrens(at)gmx(dot)net wrote:

> Hi!
>
> How can I update a column in a very large table for all rows
> without using double the amount of disk space, and without needing
> an atomic operation?
>
> I have a very large table with about 60 million rows. I sometimes
> need to do a simple update to ALL rows that resets a status-flag to
> zero.

I'll assume they're wide rows then.

> I don't need transactional integrity for this (though of course if
> the system crashes there must be no data corruption; a separate flag
> in the file system can record the fact that the bulk update was in
> progress). I don't care about, or want, an abort or the "all or
> nothing" semantics that SQL mandates. The idea is basically that this
> update either succeeds or it succeeds - there is no "not". It must
> succeed. It must be retried until it works. If the update is halfway
> finished, that's okay.
>
> If I just do an
> UPDATE table SET flag=0;
> then Pg will make a copy of every row, which must later be cleaned
> up by vacuum. I understand - and, for this specific problem, don't
> like - that PG is an MVCC database and will keep the old versions
> of the rows from before the update alive. This is very bad.

I'm not sure what you're trying to do exactly, but updating a flag on
every row you changed will double your data size, as an update means an
insert/delete combination in MVCC. If there was no reason to update
the row except for changing the flag, that's probably not what you want.

You could instead use referential integrity to do this job for you.
Create a (temporary) table that has a foreign key to your records
(define it as both PRIMARY KEY and FOREIGN KEY to keep a 1-to-1
relation). Setting the flag is done by inserting a record into the
temp table referencing the record you "updated". Existence of the
record in the temp table signifies that the row was changed.

This way you'll only have inserts of small rows (into the temp
table), and fewer than 60 million of them if you didn't succeed.
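
A minimal sketch, with all names invented for illustration:

-- existence of a row here means "the flag is set" for that id
CREATE TABLE changed_rows (
    id integer PRIMARY KEY REFERENCES bigtable (id)
);

-- "setting the flag" for one row
INSERT INTO changed_rows VALUES (42);

-- reading a row's flag
SELECT EXISTS (SELECT 1 FROM changed_rows WHERE id = 42) AS flag;

-- resetting all flags at once
TRUNCATE changed_rows;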

That still leaves the problem of disabling atomicity of that update
of course. I think you actually mean to also update data in the rows
you want to set those flags for, in which case my suggestion is of
little help...

> If I do a batched loop like this:
> UPDATE table SET flag=0 where id>=0 and id <200;
> UPDATE table SET flag=0 where id>=200 and id <400;
> UPDATE table SET flag=0 where id>=400 and id <600;
> ...
>
> then PG will seek all over my harddrive I think.
>
> It would be much better if it could just start at the beginning of
> the table and work its way towards the end. But which sort criteria
> would guarantee this? I don't think there is any SQL that does
> something like that.

If that table is clustered on an index, that would be a good
candidate, as clustering means the data is mostly ordered on disk
according to that index.

> Another idea (and I think it's quite a good one) would be to
> drop the column and recreate it with a new default value.
>
> But the problem is that this is not actually MY database, but an
> appliance (which has a harddrive that does not have the double
> amount of space available btw) and it has to work reliably whenever
> something special happens.
>
> And I don't think I should keep creating new columns all the time
> (the old ones would only be hidden, and their internal column IDs
> lost, I think) - that might run into a limit.
>
> Can I do it maybe every day??
>
>
> Is there any other way to go?
>
> I would really like to kind of "skip transactions" - of course
> keeping basic integrity of the on-disk structures, but without
> atomicity for this bulk update!
>
> I read that PG has an internal command language that is used to
> bootstrap a database, when the system tables that hold e.g. table
> names are not present yet.
>
> Could I use that to hack my way around transactions?
>
> Basically I can do anything to this PG installation; this is an
> extremely controlled, locked-down environment. I don't need to care
> about anyone else - it's always the same schema, the same version,
> the same OS, etc., and I could happily apply any hack that solves
> the problem.
>
> Many thanks for any pointers or ideas how to solve this problem!
> Christian

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.



From: "Dann Corbit" <DCorbit(at)connx(dot)com>
To: "Chris Browne" <cbbrowne(at)acm(dot)org>, <pgsql-general(at)postgresql(dot)org>
Subject: Re: Need to update all my 60 million rows at once without transactional integrity
Date: 2008-04-24 00:02:30
Message-ID: D425483C2C5C9F49B5B7A41F8944154701000DEE@postal.corporate.connx.com
Lists: pgsql-general

> -----Original Message-----
> From: pgsql-general-owner(at)postgresql(dot)org [mailto:pgsql-general-
> owner(at)postgresql(dot)org] On Behalf Of Chris Browne
> Sent: Wednesday, April 23, 2008 3:20 PM
> To: pgsql-general(at)postgresql(dot)org
> Subject: Re: [GENERAL] Need to update all my 60 million rows at once
> without transactional integrity
>
> christian_behrens(at)gmx(dot)net writes:
> > How can I update a column in a very large table for all rows
> > without using double the amount of disk space, and without needing
> > an atomic operation?
>
> You may need to redefine the problem.
>
> > I have a very large table with about 60 million rows. I sometimes
> > need to do a simple update to ALL rows that resets a status-flag to
> > zero.
>
> That seems like a pretty useless update to me...
>
> Why not instead redefine the "status-we-need-to-have" to be zero?
> [e.g. - don't try to turn off the Sun; instead, define "brightness" as
> the standard ;-)]
>
> > I don't need transactional integrity for this (though of course if
> > the system crashes there must be no data corruption; a separate flag
> > in the file system can record the fact that the bulk update was in
> > progress). I don't care about, or want, an abort or the "all or
> > nothing" semantics that SQL mandates. The idea is basically that
> > this update either succeeds or it succeeds - there is no "not". It
> > must succeed. It must be retried until it works. If the update is
> > halfway finished, that's okay.
> >
> > If I just do an
> > UPDATE table SET flag=0;
> > then Pg will make a copy of every row, which must later be cleaned
> > up by vacuum. I understand - and, for this specific problem, don't
> > like - that PG is an MVCC database and will keep the old versions
> > of the rows from before the update alive. This is very bad.
> >
> > If I do a batched loop like this:
> > UPDATE table SET flag=0 where id>=0 and id <200;
> > UPDATE table SET flag=0 where id>=200 and id <400;
> > UPDATE table SET flag=0 where id>=400 and id <600;
> > ...
> >
> > then PG will seek all over my harddrive I think.
>
> If I *were* going to do this, I'd want to do:
>
> update table set flag=0 where flag <> 0 and id in
>     (select id from table where flag <> 0 limit 50000);
>
> which will go through the table 50K rows at a time.
>
> > It would be much better if it could just start at the beginning of
> > the table and work its way towards the end. But which sort criteria
> > would guarantee this? I don't think there is any SQL that does
> > something like that.
>
> You couldn't guarantee physical ordering anyways, so that seems
> futile.
>
> > Another idea (and I think it's quite a good one) would be to
> > drop the column and recreate it with a new default value.
>
> The "column drop" would be virtually instantaneous; it would
> essentially be hidden from view.
>
> But the addition of the column would rewrite the table, doubling its
> size :-(.
>
> > And I don't think I should keep creating new columns all the time
> > (the old ones would only be hidden, and their internal column IDs
> > lost, I think) - that might run into a limit.
>
> That's correct. It will eventually cause a problem.
>
> > Is there any other way to go?
>
> Step back to what value you're putting into that column, and why.
>
> Perhaps you never need to change this value.
>
> In the Slony-I replication system, we have something sort of similar;
> we collect a table of entries that need to get applied to another
> database. (Table called sl_log_1.) We *NEVER* alter its contents;
> that is not necessary to indicate that data has been replicated (which
> is the status of interest).
>
> Instead of updating tuples to mark that they are processed, we instead
> store information in another table that indicates up to which point in
> time we have finished replicating. (It's more complex and indirect
> than that, but nonetheless, it's still a fair characterization...)
>
> So, perhaps you should be storing a per-day value in the table, and
> store, somewhere else, what point you're "done up to."

I have a notion here...

If there is a table which gets some subset of its columns completely
rewritten every day, then I suggest taking the completely rewritten
columns and making them into a child table. Take the primary key of the
current table with the problem and clone that primary key for the child
table.

E.g.:

Table foo has columns:
a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z
We update p, t, y every day, 100%.
The primary key of foo is (a, b).

I suggest creating bar (a, b, p, t, y), removing p, t, y from the main
table foo, and renaming table foo to foo_1.

Create a view that has the same name as foo and the columns of the
original foo table by joining tables foo_1 and bar.

Every time you repopulate the table, drop and recreate bar, COPY INTO
it, and then recreate bar's index on (a, b).

Your applications will work the same because the view foo has the same
properties as the original table foo.

You could even populate the bar table offline, rename the old bar table
to bar_old, and rename the newly populated table from bar_new to bar, so
even the apparent downtime would be very small.
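
A minimal sketch of that split, using the example names above (column
types are invented for illustration):

ALTER TABLE foo RENAME TO foo_1;

CREATE TABLE bar (
    a integer, b integer,
    p integer, t integer, y integer,
    PRIMARY KEY (a, b)
);
ALTER TABLE foo_1 DROP COLUMN p, DROP COLUMN t, DROP COLUMN y;

-- the view restores the original shape of foo for the applications
CREATE VIEW foo AS
    SELECT f.a, f.b, f.c, /* ...the remaining foo_1 columns... */
           b.p, b.t, b.y
    FROM foo_1 f JOIN bar b ON (b.a = f.a AND b.b = f.b);

-- the daily 100% rewrite then only ever touches bar:
-- DROP TABLE bar; CREATE TABLE bar (...); COPY bar FROM '...'; etc.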