Re: Need to update all my 60 million rows at once without transactional integrity

From: Chris Browne <cbbrowne(at)acm(dot)org>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Need to update all my 60 million rows at once without transactional integrity
Date: 2008-04-23 22:19:53
Message-ID: 60tzhs8b92.fsf@dba2.int.libertyrms.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

christian_behrens(at)gmx(dot)net writes:
> How can I make a Update of a column in a very large table for all
> rows without using the double amount of disc space and without any
> need for atomic operation?

You may need to redefine the problem.

> I have a very large table with about 60 million rows. I sometimes
> need to do a simple update to ALL rows that resets a status-flag to
> zero.

That seems like a pretty useless update to me...

Why not instead redefine the "status-we-need-to-have" to be zero?
[e.g. - don't try to turn off the Sun; instead, define "brightness" as
the standard ;-)]

> I don't need to have transactional integrity (but of course if the system crashes, there should be no data corruption. A separate flag in the file system can well save the fact that that bulk update was in progress) for this, I don't care or want a abort or "all or nothing" like SQL mandates. The idea is basically that either this update succeeds or it succeeds or - there is no "not". It must succeed. It must be tried until it works. If the update is halfway finished, that's okay.
>
> If I just do an
> UPDATE table SET flag=0;
> then Pg will make a copy of every row which must be cleaned up by vaccuum. I understand - and don't like during this specific problem - that PG is a MVCC database and will keep my old versions of the rows before the update alive. This is very bad.
>
> If I do a batched loop like this:
> UPDATE table SET flag=0 where id>=0 and id <200;
> UPDATE table SET flag=0 where id>=200 and id <400;
> UPDATE table SET flag=0 where id>=400 and id <600;
> ...
>
> then PG will seek all over my harddrive I think.

If I *were* going to do this, I'd want to do:

update table set flag=0 where flag <> 0 and id in (select id from table where flag <> 0 limit 50000);

which will go through the table 50K rows at a time.

> It would be much better if it could just start in the beginning of
> the table and work it's way towards the end. But which sort-criteria
> would satisfy this? I don't think that there is any SQL that does
> something like that.

You couldn't guarantee physical ordering anyways, so that seems
futile.

> Another ideas (and I think it's a quite good idea) would be to
> drop the column and recreate it with a new default value.

The "column drop" would be virtually instantaneous; it would
essentially be hidden from view.

But the addition of the column would rewrite the table, doubling its
size :-(.

> And I don't think I should create new columns (the old one would be
> hidden and their internal column ids lost I think) all the time,
> that might have a limit.

That's correct. It will eventually cause a problem.

> Is there any other way to go?

Step back to what value you're putting into that column, and why.

Perhaps you never need to change this value.

In the Slony-I replication system, we have something sort of similar;
we collect a table of entries that need to get applied to another
database. (Table called sl_log_1.) We *NEVER* alter its contents;
that is not necessary to indicate that data has been replicated (which
is the status of interest).

Instead of updating tuples to mark that they are processed, we instead
store information in another table that indicates up to which point in
time we have finished replicating. (It's more complex and indirect
than that, but nonetheless, it's still a fair characterization...)

So, perhaps you should be storing a per-day value in the table, and
store, somewhere else, what point you're "done up to."
--
let name="cbbrowne" and tld="linuxdatabases.info" in name ^ "@" ^ tld;;
http://linuxdatabases.info/info/emacs.html
Last night I played a blank tape at full blast. The mime next door
went nuts.

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Alban Hertroys 2008-04-23 22:36:49 Re: Need to update all my 60 million rows at once without transactional integrity
Previous Message Rob Wultsch 2008-04-23 21:56:24 Re: How to modify ENUM datatypes?