WAL format and API changes (9.5)

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject: WAL format and API changes (9.5)
Date: 2014-04-03 14:14:23
Message-ID: 533D6CBF.6080203@vmware.com
Lists: pgsql-hackers

I'd like to make some changes to the WAL format in 9.5. I want to annotate
each WAL record with the blocks that it modifies. Every WAL record
already includes that information, but it's done in an ad hoc way,
differently in every rmgr. The RelFileNode and block number are
currently part of the WAL payload, and it's the REDO routine's
responsibility to extract it. I want to include that information in a
common format for every WAL record type.

That makes life a lot easier for tools that are interested in knowing
which blocks a WAL record modifies. One such tool is pg_rewind; it
currently has to understand every WAL record the backend writes. There's
also a tool out there called pg_readahead, which does prefetching of
blocks accessed by WAL records, to speed up PITR. I don't think that
tool has been actively maintained, but at least part of the reason for
that is probably that it's a pain to maintain when it has to understand
the details of every WAL record type.

It'd also be nice for contrib/pg_xlogdump and backend code itself. The
boilerplate code in all WAL redo routines, and writing WAL records,
could be simplified.

So, here's my proposal:

Insertion
---------

The big change in creating WAL records is that the buffers involved in
the WAL-logged operation are explicitly registered, by calling a new
XLogRegisterBuffer function. Currently, buffers that need full-page
images are registered by including them in the XLogRecData chain, but
with the new system, you call the XLogRegisterBuffer() function instead.
And you call that function for every buffer involved, even if no
full-page image needs to be taken, e.g. because the page is going to be
recreated from scratch at replay.

It is no longer necessary to include the RelFileNode and BlockNumber of
the modified pages in the WAL payload. That information is automatically
included in the WAL record, when XLogRegisterBuffer is called.

Currently, the backup blocks are implicitly numbered, in the order the
buffers appear in XLogRecData entries. With the new API, the blocks are
numbered explicitly. This is more convenient when a WAL record sometimes
modifies a buffer and sometimes not. For example, a B-tree split needs
to modify four pages: the original page, the new page, the right sibling
(unless it's the rightmost page) and if it's an internal page, the page
at the lower level whose split the insertion completes. So there are two
pages that are sometimes missing from the record. With the new API, you
can nevertheless always register e.g. the original page as buffer 0, the
new page as 1, and the right sibling as 2, even if some of them are
actually missing. SP-GiST contains even more complicated examples of that.
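
To illustrate the explicit numbering, here is a minimal standalone sketch. It is
not backend code: MockBuffer, MAX_BLOCK_REFS, the SPLIT_* IDs and the mock_*
functions are all hypothetical stand-ins, and only the idea of giving each page
role a fixed ID comes from the proposal above.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins; the real backend types differ. */
typedef int MockBuffer;
#define MOCK_INVALID_BUFFER 0
#define MAX_BLOCK_REFS 4

/* Fixed IDs for the pages a B-tree split may touch. */
enum
{
    SPLIT_ORIG = 0,
    SPLIT_NEW = 1,
    SPLIT_RIGHT_SIBLING = 2,    /* absent when splitting the rightmost page */
    SPLIT_CHILD = 3             /* only present when splitting an internal page */
};

typedef struct
{
    bool        in_use;
    MockBuffer  buffer;
} MockBlockRef;

static MockBlockRef block_refs[MAX_BLOCK_REFS];

/* Forget all registrations, as would happen between two WAL records. */
static void
mock_reset_block_refs(void)
{
    memset(block_refs, 0, sizeof(block_refs));
}

/* Remember a buffer under an explicit, caller-chosen ID. */
static void
mock_register_buffer(int blockref_id, MockBuffer buffer)
{
    assert(blockref_id >= 0 && blockref_id < MAX_BLOCK_REFS);
    assert(!block_refs[blockref_id].in_use);
    block_refs[blockref_id].in_use = true;
    block_refs[blockref_id].buffer = buffer;
}

/*
 * Register the pages of a split. Optional pages are simply skipped; the
 * IDs of the remaining pages do not shift. Returns the number registered.
 */
static int
mock_register_split(MockBuffer orig, MockBuffer newbuf,
                    MockBuffer rsibling, MockBuffer child)
{
    int         nregistered = 2;

    mock_reset_block_refs();
    mock_register_buffer(SPLIT_ORIG, orig);
    mock_register_buffer(SPLIT_NEW, newbuf);
    if (rsibling != MOCK_INVALID_BUFFER)
    {
        mock_register_buffer(SPLIT_RIGHT_SIBLING, rsibling);
        nregistered++;
    }
    if (child != MOCK_INVALID_BUFFER)
    {
        mock_register_buffer(SPLIT_CHILD, child);
        nregistered++;
    }
    return nregistered;
}
```

The point of the sketch is that a redo routine could always look for the right
sibling under ID 2, whether or not the child page was part of the record.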

The new XLogRegisterBuffer would look like this:

    void XLogRegisterBuffer(int blockref_id, Buffer buffer, bool buffer_std)

    blockref_id: An arbitrary ID given to this block reference. It is used
                 in the redo routine to open/restore the same block.
    buffer:      the buffer involved
    buffer_std:  is the page in "standard" page layout?

That's for the normal cases. We'll need a couple of variants for also
registering buffers that don't need full-page images, and perhaps also a
function for registering a page that *always* needs a full-page image,
regardless of the LSN. A few existing WAL record types just WAL-log the
whole page, so those ad-hoc full-page images could be replaced with this.

With these changes, a typical WAL insertion would look like this:

    /* register the buffer with the WAL record, with ID 0 */
    XLogRegisterBuffer(0, buf, true);

    rdata[0].data = (char *) &xlrec;
    rdata[0].len = sizeof(BlahRecord);
    rdata[0].buffer_id = -1; /* -1 means the data is always included */
    rdata[0].next = &(rdata[1]);

    rdata[1].data = (char *) mydata;
    rdata[1].len = mydatalen;
    rdata[1].buffer_id = 0; /* 0 here refers to the buffer registered above */
    rdata[1].next = NULL;

    ...
    recptr = XLogInsert(RM_BLAH_ID, xlinfo, rdata);

    PageSetLSN(buf, recptr);

(While we're at it, perhaps we should let XLogInsert set the LSN of all
the registered buffers, to reduce the amount of boilerplate code).
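
As a rough illustration of that idea, here is a standalone sketch in which the
insert call itself stamps the LSN on every registered page. Everything in it
(MockLSN, MockPage, mock_register_page, mock_insert_and_stamp, the starting LSN
of 100) is a hypothetical stand-in, not the backend API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t MockLSN;           /* hypothetical stand-in for an LSN */

typedef struct
{
    MockLSN     lsn;                /* stand-in for the LSN in the page header */
} MockPage;

#define MOCK_MAX_REGISTERED 4

static MockPage *registered[MOCK_MAX_REGISTERED];
static MockLSN  next_lsn = 100;     /* arbitrary starting position */

/* Remember a page under an explicit block reference ID. */
static void
mock_register_page(int blockref_id, MockPage *page)
{
    assert(blockref_id >= 0 && blockref_id < MOCK_MAX_REGISTERED);
    registered[blockref_id] = page;
}

/*
 * Pretend to insert a record, then stamp every registered page with the
 * record's LSN and clear the registrations for the next record.
 */
static MockLSN
mock_insert_and_stamp(void)
{
    MockLSN     recptr = next_lsn++;
    int         i;

    for (i = 0; i < MOCK_MAX_REGISTERED; i++)
    {
        if (registered[i] != NULL)
        {
            registered[i]->lsn = recptr;
            registered[i] = NULL;
        }
    }
    return recptr;
}
```

With something along these lines, the caller would no longer need a PageSetLSN
call per buffer after the insert.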

(Instead of using a new XLogRegisterBuffer() function to register the
buffers, perhaps they should be passed to XLogInsert as a separate list
or array. I'm not wedded to the details...)

Redo
----

There are four different states a block referenced by a typical WAL
record can be in:

1. The old page does not exist at all (because the relation was
truncated later).
2. The old page exists, but its LSN is >= the LSN of the current WAL
record, so the record doesn't need replaying.
3. The page's LSN is < the LSN of the current WAL record, so the record
needs to be replayed.
4. The WAL record contains a full-page image, which needs to be restored.

With the current API, that leads to a lot of boilerplate:

    /* If we have a full-page image, restore it and we're done */
    if (HasBackupBlock(record, 0))
    {
        (void) RestoreBackupBlock(lsn, record, 0, false, false);
        return;
    }
    buffer = XLogReadBuffer(xlrec->node, xlrec->block, false);
    /* If the page was truncated away, we're done */
    if (!BufferIsValid(buffer))
        return;

    page = (Page) BufferGetPage(buffer);

    /* Has this record already been replayed? */
    if (lsn <= PageGetLSN(page))
    {
        UnlockReleaseBuffer(buffer);
        return;
    }

    /* Modify the page */
    ...

    PageSetLSN(page, lsn);
    MarkBufferDirty(buffer);
    UnlockReleaseBuffer(buffer);

Let's simplify that, and have one new function, XLogOpenBuffer, which
returns a code that indicates which of the four cases we're dealing
with. A typical redo function would then look like this:

    if (XLogOpenBuffer(0, &buffer) == BLK_REPLAY)
    {
        page = (Page) BufferGetPage(buffer);

        /* Modify the page */
        ...

        PageSetLSN(page, lsn);
        MarkBufferDirty(buffer);
    }
    if (BufferIsValid(buffer))
        UnlockReleaseBuffer(buffer);

The '0' in the XLogOpenBuffer call is the ID of the block reference
specified in the XLogRegisterBuffer call, when the WAL record was created.
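
To make the four cases concrete, here is a standalone sketch of the decision
such a function would have to make, following the same order of checks as the
existing boilerplate. Only BLK_REPLAY is named in this proposal; the other
three result names, MockLSN, MockOpenResult and mock_classify_block are
hypothetical and chosen purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t MockLSN;       /* hypothetical stand-in for an LSN */

typedef enum
{
    BLK_NOTFOUND,   /* hypothetical name: the page no longer exists */
    BLK_DONE,       /* hypothetical name: page LSN shows the change is applied */
    BLK_REPLAY,     /* named in the proposal: caller must apply the change */
    BLK_RESTORED    /* hypothetical name: a full-page image was restored */
} MockOpenResult;

/*
 * Decide which of the four states a referenced block is in. The inputs are
 * passed in directly here; a real implementation would derive them from
 * the WAL record and the buffer manager.
 */
static MockOpenResult
mock_classify_block(bool has_full_page_image, bool page_exists,
                    MockLSN page_lsn, MockLSN record_lsn)
{
    /* A full-page image overrides whatever is on disk. */
    if (has_full_page_image)
        return BLK_RESTORED;

    /* The relation was truncated later, nothing to do. */
    if (!page_exists)
        return BLK_NOTFOUND;

    /* Same test as "lsn <= PageGetLSN(page)" in the current boilerplate. */
    if (record_lsn <= page_lsn)
        return BLK_DONE;

    return BLK_REPLAY;
}
```

Only the BLK_REPLAY case requires the redo routine to touch the page, which is
why the typical caller needs a single comparison.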

WAL format
----------

The registered block references need to be included in the WAL record.
We already do that for backup blocks, so a naive implementation would be
to just include a BkpBlock struct for all the block references, even
those that don't need a full-page image. That would be rather bulky,
though, so it needs some optimization. It shouldn't be difficult to omit
duplicated/unnecessary information, and add a flags field indicating
which fields are present. Overall, I don't expect there to be any big
difference in the amount of WAL generated by a typical application.
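
As one possible shape for that optimization, here is a standalone sketch that
encodes a list of block references and omits the relation identifier whenever
it is the same as in the previous reference. The layout, the
MOCK_BLKREF_SAME_REL flag, MockRelFileNode, MockWalBlockRef and
mock_encode_block_refs are all assumptions made up for this example; the actual
format is not decided here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for RelFileNode: three 32-bit identifiers. */
typedef struct
{
    uint32_t    spc;
    uint32_t    db;
    uint32_t    rel;
} MockRelFileNode;

typedef struct
{
    uint8_t         id;         /* block reference ID */
    MockRelFileNode rnode;
    uint32_t        blkno;
} MockWalBlockRef;

/* Hypothetical flag bit: relation is the same as in the previous reference. */
#define MOCK_BLKREF_SAME_REL 0x01

/*
 * Encode the references into 'out' as: id (1 byte), flags (1 byte),
 * relation (12 bytes, omitted if SAME_REL is set), block number (4 bytes).
 * The caller must supply a large enough buffer. Returns bytes written.
 */
static size_t
mock_encode_block_refs(const MockWalBlockRef *refs, int nrefs, uint8_t *out)
{
    size_t      off = 0;
    int         i;

    for (i = 0; i < nrefs; i++)
    {
        uint8_t     flags = 0;

        if (i > 0 && memcmp(&refs[i].rnode, &refs[i - 1].rnode,
                            sizeof(MockRelFileNode)) == 0)
            flags |= MOCK_BLKREF_SAME_REL;

        out[off++] = refs[i].id;
        out[off++] = flags;
        if (!(flags & MOCK_BLKREF_SAME_REL))
        {
            memcpy(out + off, &refs[i].rnode, sizeof(MockRelFileNode));
            off += sizeof(MockRelFileNode);
        }
        memcpy(out + off, &refs[i].blkno, sizeof(uint32_t));
        off += sizeof(uint32_t);
    }
    return off;
}
```

In this sketch, a record touching two blocks of the same relation takes 18 bytes
for the first reference and 6 for the second, since the second one skips the
relation identifier.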

- Heikki
