Re: Eliminating VACUUM FULL WAS: remove flatfiles.c

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, daveg(at)sonic(dot)net, Andrew Dunstan <andrew(at)dunslane(dot)net>, Greg Stark <gsstark(at)mit(dot)edu>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Eliminating VACUUM FULL WAS: remove flatfiles.c
Date: 2009-09-04 20:01:00
Message-ID: 19750.1252094460@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I wrote:
> Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
>> No problem, just CLUSTER that table same as today.

> Uh, no, that was Josh's point: you can't CLUSTER pg_class, because you
> can't change its relfilenode. If you do, backends won't know where to
> read pg_class to find out its relfilenode.

> I was wondering whether maintenance operations like "vacuum rewrite"
> could get away with filling a new table file and then moving it into
> place with rename(2), which is guaranteed atomic (at least on sane
> filesystems). The idea doesn't work right off because (1) you need
> to atomically install the updated indexes too, and (2) the table
> might span more than one segment file. But maybe we could think of
> something.

Hmm ... reading that over again, it seems like there is a pretty
obvious solution. The sticking point --- not only for pg_class,
but for shared catalogs such as pg_database --- is the lack of a
way to track relfilenode if it ever changes. What if we keep
the relfilenode of these critical tables someplace else? For
instance, we could have a "map" file in each database holding
the relfilenode of pg_class, and one in $PGDATA/global holding
the relfilenodes of the shared catalogs and indexes. It'd be
possible to update a map file atomically via the rename(2) trick.
Then we teach relcache or some similar place to believe the map
files over the contents of pg_class.

This looks sort of like a reversion to flat files, but it has a couple
of saving graces:
1. The set of entries is fixed and small, so there's no performance
issue looming with database growth.
2. We could not synchronize updates with transaction commit, which was
always the real Achilles' heel of the flat files. But I think we don't
need to, if we restrict the set of operations that can change the
relfilenodes of critical tables to "maintenance" operations that only
rewrite the table contents and don't make any logical changes in the
contents. Given that restriction, transaction commit isn't actually
important; it will be the act of moving the updated map file into place
that effectively is commit for these operations. If you crash after
that, your change is still effective.

Thoughts?

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Grzegorz Jaskiewicz 2009-09-04 20:29:42 Re: clang's static checker report.
Previous Message Tom Lane 2009-09-04 19:36:37 Re: Eliminating VACUUM FULL WAS: remove flatfiles.c