Re: Eliminating VACUUM FULL WAS: remove flatfiles.c

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, daveg(at)sonic(dot)net, Andrew Dunstan <andrew(at)dunslane(dot)net>, Greg Stark <gsstark(at)mit(dot)edu>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Eliminating VACUUM FULL WAS: remove flatfiles.c
Date: 2009-09-05 01:28:17
Message-ID: 603c8f070909041828s80f6cf3kf3f7de7ca3a07fe2@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Sep 4, 2009 at 4:01 PM, Tom Lane<tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I wrote:
>> Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
>>> No problem, just CLUSTER that table same as today.
>
>> Uh, no, that was Josh's point: you can't CLUSTER pg_class, because you
>> can't change its relfilenode.  If you do, backends won't know where to
>> read pg_class to find out its relfilenode.
>
>> I was wondering whether maintenance operations like "vacuum rewrite"
>> could get away with filling a new table file and then moving it into
>> place with rename(2), which is guaranteed atomic (at least on sane
>> filesystems).  The idea doesn't work right off because (1) you need
>> to atomically install the updated indexes too, and (2) the table
>> might span more than one segment file.  But maybe we could think of
>> something.
>
> Hmm ... reading that over again, it seems like there is a pretty
> obvious solution.  The sticking point --- not only for pg_class,
> but for shared catalogs such as pg_database --- is the lack of a
> way to track relfilenode if it ever changes.  What if we keep
> the relfilenode of these critical tables someplace else?  For
> instance, we could have a "map" file in each database holding
> the relfilenode of pg_class, and one in $PGDATA/global holding
> the relfilenodes of the shared catalogs and indexes.  It'd be
> possible to update a map file atomically via the rename(2) trick.
> Then we teach relcache or some similar place to believe the map
> files over the contents of pg_class.
>
> This looks sort of like a reversion to flat files, but it has a couple
> of saving graces:
> 1. The set of entries is fixed and small, so there's no performance
> issue looming with database growth.
> 2. We could not synchronize updates with transaction commit, which was
> always the real Achilles' heel of the flat files.  But I think we don't
> need to, if we restrict the set of operations that can change the
> relfilenodes of critical tables to "maintenance" operations that only
> rewrite the table contents and don't make any logical changes in the
> contents.  Given that restriction, transaction commit isn't actually
> important; it will be the act of moving the updated map file into place
> that effectively is commit for these operations.  If you crash after
> that, your change is still effective.

This doesn't seem totally horrible. But, before you go do it, do we
have a clearly-defined plan for the rest of the project? Because we
only need this if we're absolutely confident that rewriting the table
in place is just not an option worth keeping around. It's unclear to
me that everyone is convinced of that, and even if they are, it's
unclear to me what we plan to implement instead.

...Robert

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2009-09-05 01:37:34 Re: Eliminating VACUUM FULL WAS: remove flatfiles.c
Previous Message Robert Haas 2009-09-05 01:02:10 match_unsorted_outer() vs. cost_nestloop()