WIP: relation metapages

Lists: pgsql-hackers
From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: WIP: relation metapages
Date: 2012-06-14 15:01:45
Message-ID: CA+Tgmoa13Ou22KU5bYT6hwArqH=cXRNEph4qyOfaQ6qqM4JfbQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Here's a WIP patch implementing metapages for all relations, somewhat
along lines previously discussed:

http://archives.postgresql.org/pgsql-hackers/2012-05/msg00860.php

It turns out that doing this for indexes was pretty easy and didn't
obviously break anything; doing it for heaps was harder and broke a
lot of stuff. If you apply the patch as attached here, you'll find
that we fail a whole bunch of regression tests, mostly due to plan
changes. It seems that having N+1 pages in the heap changes the
optimal way to do... everything. Of course, the extra page need not
be included in seq-scans, so you'd think this was mostly a matter of
adjusting the costing functions to reduce the number of pages by 1 for
costing purposes. However, so far I haven't been able to hack the
costing to make the plan changes go away, though, which may be a sign
that I've broken something else. I can't seem to make Merge Append
work at all, which is maybe a better sign that I've broken something.
If you want to see the patch pass regression tests, hack
heap_create_storage not to emit a metapage for heaps and all the
regression test failures disappear.

What I'm really looking for at this stage of the game is feedback on
the design decisions I made. The intention here is that it should be
possible to read old-format heaps and indexes transparently, but that
when we create or rewrite a relation, we add a new-style metapage.
For all index types except gist, this is really just a format change
for the metapage that already existed: the new data that gets stored
for all relation types is added at the beginning of the page, just
following the page header, and then the AM-specific stuff is moved
further down the page. For GiST, it means adding a metapage that
wasn't there before, but that went smoothly too. For some AMs, I had
to rejigger the WAL-logging a little; review of those changes would be
good. The basic idea is that we don't want to have to try to
reconstruct what the metapage should have been during recovery
(indeed, we can't) so we just log an image of the page instead.

For heaps, I refactored things so that heap_create() is no longer used
for indexes. Instead, index_create() calls RelationBuildLocalRelation
directly. This required moving a little bit of logic from
heap_create() into RelationBuildLocalRelation(), but it seems like it
may fit better there anyway. That means that heap_create() can now
assume that it's creating a heap and not an index. This refactoring
might be worth pulling out of the patch and committing separately,
since I think the result is actually simpler and cleaner than what
we're doing now; but it's a minor point in any case.

I put the new metapage code in src/backend/access/common/metapage.c,
but I don't have a lot of confidence that that's the appropriate
location for it. Suggestions are appreciated.

I am pretty sure that clustering a relation will cause it to end up
with the wrong relation ID in its metapage afterwards. Since nothing
relies on that information at this point, this shouldn't break
anything, but it needs to be fixed eventually.

I think the thing I'm most worried about is the plan changes that
result from adding heap metapages. Suggestions on what to do about
that from a costing perspective would be particularly appreciated.

Thanks,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
metapage-v1.patch application/octet-stream 80.8 KB

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP: relation metapages
Date: 2012-06-14 21:34:05
Message-ID: 4FDA58CD.9050801@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 14.06.2012 18:01, Robert Haas wrote:
> What I'm really looking for at this stage of the game is feedback on
> the design decisions I made. The intention here is that it should be
> possible to read old-format heaps and indexes transparently, but that
> when we create or rewrite a relation, we add a new-style metapage.

That dodges the problem of pg_upgrade, but eventually, we will want to
use the metapage for something important, and it can't be optional at
that point anymore. So we will eventually need to deal with pg_upgrade
somehow.

> I put the new metapage code in src/backend/access/common/metapage.c,
> but I don't have a lot of confidence that that's the appropriate
> location for it. Suggestions are appreciated.

That seems good to me.

It would be nice to have the oid of the access method in the metapage
(or some other way to identify it).

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP: relation metapages
Date: 2012-06-15 00:20:04
Message-ID: CA+TgmoazFS8kvpYBCgLpcUSSC+4cC-Ezv9mbjea2ne7RHvL1SA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Jun 14, 2012 at 5:34 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> On 14.06.2012 18:01, Robert Haas wrote:
>> What I'm really looking for at this stage of the game is feedback on
>> the design decisions I made.  The intention here is that it should be
>> possible to read old-format heaps and indexes transparently, but that
>> when we create or rewrite a relation, we add a new-style metapage.
>
> That dodges the problem of pg_upgrade, but eventually, we will want to use
> the metapage for something important, and it can't be optional at that point
> anymore. So we will eventually need to deal with pg_upgrade somehow.

Well, the code as I've written deals with pg_upgrade just fine: you
can move your old relation files over and they still work. What's
missing at present is an efficient way to convert them to the new
format. If you don't mind the inefficiency, you can use VACUUM FULL
or CLUSTER, and you're there, but obviously we'll want something
better before we start relying on this for anything too critical. For
indexes, it should be pretty trivial to reduce the requirement from
"rewrite while holding AccessExclusiveLock" to "brief
AccessExclusiveLock". Everything except GiST already has a metapage,
so you just need to rewrite that page in the new format, which is a
SMOP. GiST requires moving the existing metapage out to a free page
(or a new page) and writing a metapage pointing to it into block 0,
which is again pretty simple. Heaps are a bit harder. We could adopt
your idea of storing a block number in the metablock; any index TIDs
that point to block 0 will be interpreted as pointing to that block
instead. Then we can basically just relocate block to any free block,
or a new one, as with GiST. Alternatively, we can read the tuples in
block 0 and reinsert them; vacuum; and then repurpose block 0. Taking
it even further, we could try to do better than "brief
AccessExclusiveLock". That might be possible too, especially for
indexes, but I'm not sure that it's necessary, and I haven't thought
through the details.

Even if we just get it down to "brief AccessExclusiveLock", I think we
might also be able to improve the experience by making autovacuum do
conversions automatically. So, if you pg_upgrade, the first
autovacuum worker that spins through the database will
ConditionalLockAcquire an AccessExclusiveLock on each relation in turn
and try to do the conversion. If it can't get the lock it'll keep
retrying until it succeeds. Odds are good that for most people this
would make the addition of the metapage completely transparent. On
the other hand, if metapages exist mostly to support operations like
making an unlogged table logged or visca versa, that's not really
necessary: we can add the metapage when someone performs a DDL
operation that requires it. There is a lot of optimization possible
on top of the basic mechanism, but how much of it makes sense to do,
and which parts, depends on exactly which of the many things we could
do with this we end up deciding to actually do. I'm trying to start
with the basics, which means getting the basic infrastructure in place
and working.

> It would be nice to have the oid of the access method in the metapage (or
> some other way to identify it).

Yeah, I was thinking about that. I think a magic number might be
preferable to an OID, and we actually already have that as the first
4-bytes of the access method metadata for all index types except GIN.
I'm thinking about trying to fix up GIN so that it adds one as well;
the trick is to do it in a way that is backward-compatible, which I
have an idea how to do but haven't tackled yet. We can add a magic
number for heaps as well.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company