Re: Massive table (500M rows) update nightmare

From: Kevin Kempter <kevink(at)consistentstate(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Massive table (500M rows) update nightmare
Date: 2010-01-07 17:46:24
Message-ID: 201001071046.24305.kevink@consistentstate.com
Lists: pgsql-performance

On Thursday 07 January 2010 09:57:59 Kevin Grittner wrote:
> Ludwik Dylag <ldylag(at)gmail(dot)com> wrote:
> > I would suggest:
> > 1. turn off autovacuum
> > 1a. optionally tune the db for better performance for this kind of
> > operation (can't help here)
> > 2. restart database
> > 3. drop all indexes
> > 4. update
> > 5. vacuum full table
> > 6. create indexes
> > 7. turn on autovacuum
>
> I've only ever attempted something like that with a few tens of
> millions of rows. I gave up on waiting for the VACUUM FULL step
> after a few days.
>
> If some scheduled down time is acceptable (with "some" kind of hard
> to estimate accurately) the best bet would be to add the column with
> the USING clause to fill in the value. (I think that would cause a
> table rewrite; if not, then add something to the ALTER TABLE which
> would.) My impression is that the OP would rather stretch out the
> implementation than to suffer down time, which can certainly be a
> valid call.
>
> If that is the goal, then the real question is whether there's a way
> to tune the incremental updates to speed that phase. Carlo, what
> version of PostgreSQL is this? Can you show us the results of an
> EXPLAIN ANALYZE for the run of one iteration of the UPDATE?
> Information on the OS, hardware, PostgreSQL build configuration, and
> the contents of postgresql.conf (excluding all comments) could help
> us spot possible techniques to speed this up.
>
> -Kevin
>
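
For reference, the down-time option Kevin describes would be a single rewriting
ALTER. A minimal sketch, assuming a hypothetical table/column name and target
type (the USING expression computes the new value during the rewrite):

    -- Hypothetical names; this rewrites the whole table in one pass
    -- while holding an exclusive lock, so it is pure down time.
    ALTER TABLE curr_tab
        ALTER COLUMN some_col TYPE bigint
        USING some_col::bigint;

One pass over 500M rows under lock is usually much faster in total than
incremental batched UPDATEs, if the outage can be tolerated.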

If you can come up with an effective method of tracking inserts, updates, and
deletes, such as a trigger that writes the PK to a separate table upon any
insert, update, or delete to the table, you could do something like this:

1) create new table (no indexes) with the structure you want the table to have
at the end of the process (i.e. the post-altered state) [new_tab]

2) create the insert,update,delete triggers mentioned above on the existing
table [curr_tab] and write all the PK id's that change into a 3rd table
[changed_keys]

3) kick off a process that simply does a select from curr_tab into new_tab and
populates/back-fills the new column as part of the query

4) let it run as long as it takes

5) once it's complete do this:

Create all the indexes on new_tab

BEGIN;

LOCK TABLE curr_tab;

DELETE from new_tab
where pk_id in (select distinct pk_id from changed_keys);

INSERT into new_tab
select * from curr_tab
where curr_tab.pk_id in (select distinct pk_id from changed_keys);

ALTER TABLE curr_tab RENAME to old_tab;

ALTER TABLE new_tab RENAME to curr_tab;

COMMIT;
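
The tracking pieces from steps 2 and 3 could be sketched like this (all names,
types, and the back-fill expression are hypothetical; adjust to your schema):

    -- Step 2: log every changed PK. For DELETEs only OLD exists;
    -- this assumes the PK itself is never updated.
    CREATE TABLE changed_keys (pk_id integer);

    CREATE OR REPLACE FUNCTION log_changed_key() RETURNS trigger AS $$
    BEGIN
        IF TG_OP = 'DELETE' THEN
            INSERT INTO changed_keys (pk_id) VALUES (OLD.pk_id);
        ELSE
            INSERT INTO changed_keys (pk_id) VALUES (NEW.pk_id);
        END IF;
        RETURN NULL;  -- AFTER trigger; return value is ignored
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER track_changes
        AFTER INSERT OR UPDATE OR DELETE ON curr_tab
        FOR EACH ROW EXECUTE PROCEDURE log_changed_key();

    -- Step 3: one long-running back-fill into the unindexed copy,
    -- computing the new column as part of the select.
    INSERT INTO new_tab
    SELECT t.*, t.some_col::bigint  -- hypothetical new-column expression
    FROM curr_tab t;

Since new_tab has no indexes and no triggers during the back-fill, the big
insert runs at close to sequential-write speed.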

Also you might want to consider partitioning this table in the process...
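
If you do partition, a rough sketch using inheritance (the usual approach in
current releases; ranges and names here are made up):

    -- Hypothetical: range-partition new_tab on pk_id via inheritance.
    CREATE TABLE new_tab_p0 (
        CHECK (pk_id >= 0 AND pk_id < 250000000)
    ) INHERITS (new_tab);

    CREATE TABLE new_tab_p1 (
        CHECK (pk_id >= 250000000 AND pk_id < 500000000)
    ) INHERITS (new_tab);

    -- You'd also need to route inserts to the right child (trigger or
    -- per-partition inserts) and set constraint_exclusion = on so the
    -- planner can skip partitions that can't match.
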

Once you're confident you no longer need the old table [old_tab] you can drop
it.
