Re: reclaiming disk space after major updates

Lists: pgsql-adminpgsql-performance
From: Dan Harris <fbsd(at)drivefaster(dot)net>
To: pgsql-admin(at)postgresql(dot)org
Subject: reclaiming disk space after major updates
Date: 2007-06-06 22:04:44
Message-ID: 46672F7C.4020400@drivefaster.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-performance

Our usage pattern has recently left me with some very bloated database clusters.
I have, in the past, scheduled downtime to run VACUUM FULL and tried CLUSTER
as well, followed by a REINDEX on all tables. This does work, however the
exclusive lock has become a real thorn in my side. As our system grows, I am
having trouble scheduling enough downtime for either of these operations or a
full dump/reload. I do run VACUUM regularly, it's just that sometimes we need
to go back and update a huge percentage of rows in a single batch due to
changing customer requirements, leaving us with significant table bloat.

So within the last few days my db cluster has grown from 290GB to 370GB and
because of some other major data updates on my TO-DO list, I expect this to
double and I'll be bumping up against my storage capacity.

The root of my question is due to my not understanding why the tables can't be
in read-only mode while one of these is occurring? Since most of our usage is
OLAP, this really wouldn't matter much as long as the users could still query
their data while it was running. Is there some way I can allow users read-only
access to this data while things are cleaned up in the background? INSERTs can
wait, SELECTs cannot.

So how do other people handle such a problem when downtime is heavily frowned
upon? We have 24/7 access ( but again, the users only read data ).


From: Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
To: pgsql-admin(at)postgresql(dot)org
Subject: Re: reclaiming disk space after major updates
Date: 2007-06-07 19:20:25
Message-ID: 20070607192025.GO14087@phlogiston.dyndns.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-performance

On Wed, Jun 06, 2007 at 04:04:44PM -0600, Dan Harris wrote:
> of these operations or a full dump/reload. I do run VACUUM regularly, it's
> just that sometimes we need to go back and update a huge percentage of rows
> in a single batch due to changing customer requirements, leaving us with
> significant table bloat.

Do you need to update those rows in one transaction (i.e. is the
requirement that they all get updated such that the change only
becomes visible at once)? If not, you can do this in batches and
vacuum in between. Batch updates are the prime sucky area in
Postgres.

Another trick, if the table is otherwise mostly static, is to do the
updating in a copy of the table, and then use the transactional DDL
features of postgres to change the table names.

A

--
Andrew Sullivan | ajs(at)crankycanuck(dot)ca
Everything that happens in the world happens at some place.
--Jane Jacobs


From: Dan Harris <fbsd(at)drivefaster(dot)net>
To: pgsql-admin(at)postgresql(dot)org
Subject: Re: reclaiming disk space after major updates
Date: 2007-06-07 21:26:56
Message-ID: 46687820.9000604@drivefaster.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-performance

Andrew Sullivan wrote:
> On Wed, Jun 06, 2007 at 04:04:44PM -0600, Dan Harris wrote:
>> of these operations or a full dump/reload. I do run VACUUM regularly, it's
>> just that sometimes we need to go back and update a huge percentage of rows
>> in a single batch due to changing customer requirements, leaving us with
>> significant table bloat.
>
> Do you need to update those rows in one transaction (i.e. is the
> requirement that they all get updated such that the change only
> becomes visible at once)? If not, you can do this in batches and
> vacuum in between. Batch updates are the prime sucky area in
> Postgres.

They don't always have to be in a single transaction, that's a good idea to
break it up and vacuum in between, I'll consider that. Thanks

>
> Another trick, if the table is otherwise mostly static, is to do the
> updating in a copy of the table, and then use the transactional DDL
> features of postgres to change the table names.

I thought of this, but it seems to break other application logic that feeds a
steady streams of inserts into the tables.

Thanks again for your thoughts. I guess I'll just have to work around this
problem in application logic.


From: Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
To: pgsql-admin(at)postgresql(dot)org
Subject: Re: reclaiming disk space after major updates
Date: 2007-06-08 14:03:54
Message-ID: 20070608140354.GV17144@phlogiston.dyndns.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-performance

On Thu, Jun 07, 2007 at 03:26:56PM -0600, Dan Harris wrote:
>
> They don't always have to be in a single transaction, that's a good idea to
> break it up and vacuum in between, I'll consider that. Thanks

If you can do it this way, it helps _a lot_. I've had to do this
sort of thing, and breaking into groups of a couple thousand or so
really made the difference.

A

--
Andrew Sullivan | ajs(at)crankycanuck(dot)ca
Unfortunately reformatting the Internet is a little more painful
than reformatting your hard drive when it gets out of whack.
--Scott Morris


From: Dan Harris <fbsd(at)drivefaster(dot)net>
To: PostgreSQL Performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: [ADMIN] reclaiming disk space after major updates
Date: 2007-06-08 14:29:24
Message-ID: 466967C4.70606@drivefaster.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-performance

Andrew Sullivan wrote:
> On Thu, Jun 07, 2007 at 03:26:56PM -0600, Dan Harris wrote:
>> They don't always have to be in a single transaction, that's a good idea to
>> break it up and vacuum in between, I'll consider that. Thanks
>
> If you can do it this way, it helps _a lot_. I've had to do this
> sort of thing, and breaking into groups of a couple thousand or so
> really made the difference.
>
> A
>

One more point in my original post.. For my own education, why does VACUUM FULL
prevent reads to a table when running (I'm sure there's a good reason)? I can
certainly understand blocking writes, but if I could still read from it, I'd
have no problems at all!

-Dan


From: Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: [ADMIN] reclaiming disk space after major updates
Date: 2007-06-08 15:10:57
Message-ID: 20070608151057.GG17144@phlogiston.dyndns.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-performance

On Fri, Jun 08, 2007 at 08:29:24AM -0600, Dan Harris wrote:
>
> One more point in my original post.. For my own education, why does VACUUM
> FULL prevent reads to a table when running (I'm sure there's a good
> reason)? I can certainly understand blocking writes, but if I could still
> read from it, I'd have no problems at all!

It has to take an exclusive lock, because it actually moves the bits
around on disk. Since your SELECT query could be asking for data
that is actually in-flight, you lose. This is conceptually similar
to the way defrag works on old FAT-type filesystems: if you used one,
you'll remember that when you were defragging your disk, if you did
anything else on that disk the defrag would keep restarting. This
was because the OS was trying to move bits around, and when you did
stuff, you screwed up its optimization. The database works
differently, by taking an exclusive lock, but the basic conceptual
problem is the same.

A

--
Andrew Sullivan | ajs(at)crankycanuck(dot)ca
Unfortunately reformatting the Internet is a little more painful
than reformatting your hard drive when it gets out of whack.
--Scott Morris