heap metapages

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: heap metapages
Date: 2012-05-21 17:56:52
Message-ID: CA+TgmoZ9zn-eaa9Vt7zTd13G1JujcQtaXwG9xNAsm9EyVEs8UQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

At dinner on Friday night at PGCon, the end of the table that included
Tom Lane, Stephen Frost, and myself got to talking about the idea of
including some kind of metapage in every relation, including heap
relations. At least some index relations already have something like
this (cf _bt_initmetapage, _hash_metapinit). I believe that adding
this for all relations, including heaps, would allow us to make
improvements in several areas.

1. Tom was interested in the idea of trying to make the system catalog
entries which describe the system catalogs themselves completely
immutable, so that they can potentially be shared between databases.
For example, we might have shared catalogs pg_class_shared and
pg_attribute_shared, describing the structure of all the system
catalogs; and then we might also have pg_class and pg_attribute within
each database, describing the structure of tables which exist only
within that database. Right now, this is not possible, because values
like relpages, reltuples, and relfrozenxid can vary from database to
database. However, if those values were stored in a metapage
associated with the heap relation rather than in the system catalogs,
then potentially we could make this work. The most obvious benefit of
this is that it would reduce the on-disk footprint of a new database,
but there are other possible benefits as well. For example, a process
not bound to a database could read a shared catalog even if it weren't
nailed, and if we ever implement a prefork system for backends, they'd
be able to do more of their initialization steps before learning which
database they were to target.

2. I'm interested in having a cleaner way to associate
non-transactional state with a relation. This has come up a few
times. We currently handle this by having lazy VACUUM do in-place
heap updates to replace values like relpages, reltuples, and
relfrozenxid, but this feels like a kludge. It's particularly scary
to think about relying on this for anything critical given that
non-inplace heap updates can be happening simultaneously, and the
consequences of losing an update to relfrozenxid in particular are
disastrous. Plus, it requires hackery in pg_upgrade to preserve the
value between the old and new clusters; we've already had to fix two
data-destroying bugs in that logic. There are several other things
that we might want to do that have similar requirements. For example,
Pavan's idea of folding VACUUM's second heap pass into the next vacuum
cycle requires a relation-wide piece of state which can probably be
represented as a single bit, but putting that bit in pg_class would
require the same sorts of hacks there that we already have for
relfrozenxid, with similar consequences if it's not properly
preserved. Making unlogged tables logged or the other way around
appears to require some piece of relation-level state *that can be
accessed during recovery*, and pg_class is not going to work for that.
Page checksums have a similar requirement if the granularity for
turning them on and off is anything less than the entire cluster.
Whenever we decide to roll out a new page version, we'll want a place
to record the oldest page version that might be present in a
particular relation, so that we can easily check whether a cluster can
be upgraded to a new release that has dropped support for an old page
version. Having a common framework for all of these things seems like
it will probably be easier than solving each problem individually, and
a metapage is a good place to store non-transactional state.

3. Right now, a new table uses up a minimum of 3 inodes, even if it
has no indexes: one for the main fork, one for the visibility map, and
one for the free space map. For people who have lots and lots of
little tiny tables, this is quite inefficient. The amount of
information we'd have to store in a heap metapage would presumably not
be very big, so we could potentially move the first, say, 1K of the
visibility map into the heap metapage, meaning that tables less than
64MB would no longer require a separate visibility map fork.
Something similar could possibly be done with the free-space map,
though I am unsure of the details. Right now, a relation containing
just one tuple consumes 5 8k blocks on disk (1 for the main fork, 3
for the FSM, and 1 for the VM) and 3 inodes; getting that down to 8kB
and 1 inode would be very nice. The case of a completely-empty
relation is a bit annoying; that right now takes 1 inode and 0 blocks
and I suspect we'd end up with 1 inode and 1 block, but I think it
might still be a win overall.

4. Every once in a while, somebody's database ends up in pieces in
lost+found. We could make this a bit easier to recover from by
including the database OID, relfilenode, and table OID in the
metapage. This wouldn't be perfect, since a relation over one GB
would still only have one metapage, so additional relation segments
would still be a problem. But it would be still be a huge improvement
over the status quo: some very large percentage of the work of putting
everything back where it goes could probably be done by a Perl script
that read all the metapages, and if you needed to know, say, which
file contained pg_class, that would be a whole lot easier, too.

Now, there are a couple of obvious problems here, the biggest of which
is probably that we want to avoid breaking pg_upgrade. I don't have a
great solution to that problem. The most realistic option I can think
of at the moment is to fudge things so that existing features can
continue to work even if the metapage isn't present. Any table
rewrite would add a metapage; if you want to use a new feature that
requires a metapage, you have to rewrite the table first to get one.
However, that's pretty unfortunate in terms of goal #1 and some parts
of goal #2, because if you can't be certain of having the metapage
present then you can't really store data there in lieu of pg_class;
the best you'll be able to do is have it both places, at least until
you're ready to deprecate upgrades from releases that don't contain
metapage support. Hopefully someone has a better idea...

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2012-05-21 17:59:02 Re: read() returns ERANGE in Mac OS X
Previous Message Josh Berkus 2012-05-21 17:44:59 Re: Why is indexonlyscan so darned slow?