Re: WAL format and API changes (9.5)

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL format and API changes (9.5)
Date: 2014-04-03 15:58:23
Message-ID: 533D851F.3070608@vmware.com
Lists: pgsql-hackers

On 04/03/2014 06:37 PM, Tom Lane wrote:
> Also, IIRC there are places that WAL-log full pages that aren't in a
> shared buffer at all (btree build does this I think). How will that fit
> into this model?

Hmm. We could provide a function for registering a block with given
content, without a Buffer. Something like:

XLogRegisterPage(int id, RelFileNode, BlockNumber, Page)
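To make the idea concrete, here is a minimal standalone sketch of registering page content by block-reference ID without any shared buffer. It is a toy model, not PostgreSQL code: every name prefixed with Toy/toy_ is hypothetical, the 8192-byte page size and the limit of four block references per record are assumptions, and the real function, if it gets written, may look quite different.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLCKSZ 8192          /* assumed page size */
#define TOY_MAX_BLOCK_REFS 4     /* assumed per-record limit on block references */

/* Hypothetical stand-ins for RelFileNode and BlockNumber. */
typedef struct
{
    uint32_t spcNode;
    uint32_t dbNode;
    uint32_t relNode;
} ToyRelFileNode;

typedef uint32_t ToyBlockNumber;

/* One registered block reference: where the page lives, plus its content. */
typedef struct
{
    int            in_use;
    ToyRelFileNode rnode;
    ToyBlockNumber blkno;
    unsigned char  image[TOY_BLCKSZ];   /* page content copied at registration */
} ToyBlockRef;

static ToyBlockRef toy_refs[TOY_MAX_BLOCK_REFS];

/*
 * Register a page under block-reference 'id'. No Buffer is involved; the
 * caller hands over the page content directly. Returns 0 on success, -1 if
 * the id is out of range or the page pointer is NULL.
 */
static int
toy_register_page(int id, ToyRelFileNode rnode, ToyBlockNumber blkno,
                  const unsigned char *page)
{
    if (id < 0 || id >= TOY_MAX_BLOCK_REFS || page == NULL)
        return -1;

    toy_refs[id].in_use = 1;
    toy_refs[id].rnode = rnode;
    toy_refs[id].blkno = blkno;
    memcpy(toy_refs[id].image, page, TOY_BLCKSZ);
    return 0;
}

/* Look up a registered block reference by id; NULL if nothing was registered. */
static const ToyBlockRef *
toy_lookup_block(int id)
{
    if (id < 0 || id >= TOY_MAX_BLOCK_REFS || !toy_refs[id].in_use)
        return NULL;
    return &toy_refs[id];
}
```

The point of the sketch is only that the block-reference ID, not a Buffer, is the key, so a page built outside shared buffers (as in btree build) can be referenced the same way at replay.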

>> Let's simplify that, and have one new function, XLogOpenBuffer, which
>> returns a return code that indicates which of the four cases we're
>> dealing with. A typical redo function looks like this:
>
>> if (XLogOpenBuffer(0, &buffer) == BLK_REPLAY)
>> {
>> /* Modify the page */
>> ...
>
>> PageSetLSN(page, lsn);
>> MarkBufferDirty(buffer);
>> }
>> if (BufferIsValid(buffer))
>> UnlockReleaseBuffer(buffer);
>
>> The '0' in the XLogOpenBuffer call is the ID of the block reference
>> specified in the XLogRegisterBuffer call, when the WAL record was created.
>
> +1, but one important step here is finding the data to be replayed.
> That is, a large part of the complexity of replay routines has to do
> with figuring out which parts of the WAL record were elided due to
> full-page-images, and locating the remaining parts. What can we do
> to make that simpler?

We can certainly add more structure to the WAL records, but any extra
information you add will make the records larger. It might be worth it,
and would be lost in the noise for more complex records like page
splits, but we should keep frequently-used records like heap insertions
as lean as possible.

> Ideally, if XLogOpenBuffer (bad name BTW) returns BLK_REPLAY, it would
> also calculate and hand back the address/size of the logged data that
> had been pointed to by the associated XLogRecData chain item. The
> trouble here is that there might've been multiple XLogRecData items
> pointing to the same buffer. Perhaps the magic ID number you give to
> XLogOpenBuffer should be thought of as identifying an XLogRecData chain
> item, not so much a buffer? It's fairly easy to see what to do when
> there's just one chain item per buffer, but I'm not sure what to do
> if there's more than one.

Hmm. You could register a separate XLogRecData chain for each buffer.
Along the lines of:

rdata[0].data = data for buffer
rdata[0].len = ...
rdata[0].next = &rdata[1];
rdata[1].data = more data for same buffer
rdata[1].len = ...
rdata[1].next = NULL;

XLogRegisterBuffer(0, buffer, &rdata[0]);

At replay:

if (XLogOpenBuffer(0, &buffer, &xldata, &len) == BLK_REPLAY)
{
/* xldata points to the data registered for this buffer */
}

Plus one more chain for the data not associated with a buffer.
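As a rough standalone illustration of the per-buffer chain idea, the sketch below flattens a chain into one contiguous payload keyed by block ID, and hands back its address and length at "replay". It is a toy model under stated assumptions, not the proposed implementation: all Toy/toy_ names are hypothetical, the fixed 256-byte payload limit is arbitrary, and TOY_BLK_NO_DATA is an invented result code standing in for the cases other than BLK_REPLAY.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_BLOCKS 4      /* assumed number of block IDs per record */
#define TOY_MAX_DATA   256    /* arbitrary payload limit for this sketch */

/* Hypothetical result codes; only the REPLAY case comes from the thread. */
typedef enum
{
    TOY_BLK_REPLAY,
    TOY_BLK_NO_DATA
} ToyBlkResult;

/* Stand-in for an XLogRecData chain item: a pointer, a length, a link. */
typedef struct ToyRecData
{
    const char        *data;
    size_t             len;
    struct ToyRecData *next;
} ToyRecData;

/* Flattened payload stored for one block ID. */
typedef struct
{
    int    in_use;
    size_t len;
    char   data[TOY_MAX_DATA];
} ToyBlockData;

static ToyBlockData toy_blocks[TOY_MAX_BLOCKS];

/*
 * Walk the chain and copy every item, in order, into the payload for block
 * 'id'. Returns 0 on success, -1 on a bad id or if the chain does not fit.
 */
static int
toy_register_buffer_data(int id, const ToyRecData *chain)
{
    size_t off = 0;

    if (id < 0 || id >= TOY_MAX_BLOCKS)
        return -1;

    for (const ToyRecData *rd = chain; rd != NULL; rd = rd->next)
    {
        if (rd->len > TOY_MAX_DATA - off)
            return -1;
        memcpy(toy_blocks[id].data + off, rd->data, rd->len);
        off += rd->len;
    }

    toy_blocks[id].len = off;
    toy_blocks[id].in_use = 1;
    return 0;
}

/*
 * Replay side: hand back the address and size of the data registered for
 * block 'id', so the redo routine does not have to locate it by hand.
 */
static ToyBlkResult
toy_open_buffer_data(int id, const char **xldata, size_t *len)
{
    if (id < 0 || id >= TOY_MAX_BLOCKS || !toy_blocks[id].in_use)
    {
        *xldata = NULL;
        *len = 0;
        return TOY_BLK_NO_DATA;
    }

    *xldata = toy_blocks[id].data;
    *len = toy_blocks[id].len;
    return TOY_BLK_REPLAY;
}
```

With one chain per block ID, several chain items for the same buffer simply concatenate, which sidesteps the question of what the ID should identify when a buffer has more than one item.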

- Heikki
