Re: Spreading full-page writes

From: Greg Stark <stark(at)mit(dot)edu>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Spreading full-page writes
Date: 2014-05-27 11:42:52
Message-ID: CAM-w4HPnbzEP0QZrc7ELkAWUEyEmYfGrE0164dEqmt7KhP4a9A@mail.gmail.com
Lists: pgsql-hackers

On Tue, May 27, 2014 at 10:07 AM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
>
> On 05/26/2014 02:26 PM, Greg Stark wrote:
>>
>> Another idea would be to have separate checkpoints for each buffer
>> partition. You would have to start recovery from the oldest checkpoint of
>> any of the partitions.
>
> Yeah. Simon suggested that when we talked about this, but I didn't understand how that works at the time. I think I do now. The key to making it work is distinguishing, when starting recovery from the latest checkpoint, whether a record for a given page can be replayed safely. I used flags on WAL records in my proposal to achieve this, but using buffer partitions is simpler.

Interesting. I just thought of it independently.

Incidentally, you wouldn't actually want to use the buffer partitions
per se, since the new server might start up with a different number of
partitions. You would want an algorithm for partitioning the block
space that xlog replay can reliably reproduce regardless of the size
of the buffer lock partition table. It might make sense to set it up
so that all the buffers being flushed also fall in the same buffer
lock partition, or maybe the reverse would be better. Probably it
doesn't actually matter.
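
A minimal sketch of such a partitioning function, assuming the partition
is derived only from a block's on-disk identity and a fixed partition
count. The names checkpoint_partition and N_CHECKPOINT_PARTITIONS, and
the use of CRC32 as the hash, are illustrative assumptions, not anything
that exists in PostgreSQL:

```python
import zlib

# Hypothetical fixed constant. It is deliberately unrelated to the number
# of buffer lock partitions, which can differ between server starts.
N_CHECKPOINT_PARTITIONS = 8


def checkpoint_partition(relfilenode: int, block_no: int,
                         n_partitions: int = N_CHECKPOINT_PARTITIONS) -> int:
    """Map a block to a checkpoint partition using only its identity.

    The result depends on nothing but the arguments, so WAL replay on a
    differently configured server computes the same partition.
    """
    key = relfilenode.to_bytes(4, "big") + block_no.to_bytes(4, "big")
    return zlib.crc32(key) % n_partitions
```

Plain block_no % n_partitions would work just as well for the even/odd
case; hashing only spreads neighbouring blocks across partitions.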

> For simplicity, let's imagine that we have two Redo-pointers for each checkpoint record: one for even-numbered pages, and another for odd-numbered pages. When checkpoint begins, we first update the Even-redo pointer to the current WAL insert location, and then flush all the even-numbered buffers in the buffer cache. Then we do the same for Odd.

Hm, I had convinced myself that the LSN on the pages would mean you
skip the replay anyway, but I think I was wrong. You would need to
keep a bitmap of which partitions are in recovery mode as you replay,
adding partitions as you reach their redo pointers until they're all
in recovery mode, and then keep going until you've seen the checkpoint
record for all of them.
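
A rough sketch of that replay loop, assuming one redo pointer per
partition and WAL records that already carry the partition of the page
they touch. WalRecord and replay are hypothetical names for
illustration, and the stopping condition (having seen the checkpoint
record for every partition) is left out:

```python
from typing import Iterable, List, NamedTuple


class WalRecord(NamedTuple):
    lsn: int        # position of the record in the WAL
    partition: int  # partition of the page the record modifies


def replay(records: Iterable[WalRecord],
           redo_ptrs: List[int]) -> List[WalRecord]:
    """Return the records that would be applied during recovery."""
    start = min(redo_ptrs)               # recovery begins at the oldest pointer
    active = [False] * len(redo_ptrs)    # the "bitmap" of partitions in recovery
    applied = []
    for rec in sorted(records, key=lambda r: r.lsn):
        if rec.lsn < start:
            continue
        # Switch on every partition whose redo pointer has been reached.
        for p, ptr in enumerate(redo_ptrs):
            if rec.lsn >= ptr:
                active[p] = True
        # Changes to partitions not yet in recovery mode are skipped.
        if active[rec.partition]:
            applied.append(rec)
    return applied
```

With redo_ptrs = [100, 200], a record at LSN 120 for partition 1 is
skipped, while one at LSN 210 for the same partition is applied.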

I'm assuming you would keep N checkpoint positions in the control
file. That also means we can double the checkpoint timeout with only a
marginal increase in the worst-case recovery time, since the worst
case will be (1 + 1/n)*timeout's worth of WAL to replay rather than
2*timeout. The amount of time for recovery would be much more predictable.
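
To put numbers on that, here is a small worked example. It assumes WAL
volume is measured in seconds of WAL generation, that a single
checkpoint is spread over the whole interval (so a crash just before it
completes replays about two intervals), and that each of the n
partitions takes timeout/n to flush. The function name and the sample
values are illustrative only:

```python
def worst_case_replay(timeout: float, n_partitions: int) -> float:
    """Worst-case WAL to replay, in seconds of WAL generation.

    A partition's previous redo pointer is at most one full interval
    plus its own flush time (timeout / n) behind the crash point.
    """
    return (1 + 1 / n_partitions) * timeout


# Single checkpoint with a 300 s timeout: about 2 * timeout.
single_checkpoint = 2 * 300

# Eight partitions with the timeout doubled to 600 s.
partitioned = worst_case_replay(600, 8)
```

Under these assumptions the single checkpoint gives 600 s of WAL in the
worst case, and eight partitions with the doubled timeout give 675 s.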

> Recovery begins at the Even-redo pointer. Replay works as normal, but until you reach the Odd-pointer, you refrain from replaying any changes to Odd-numbered pages. After reaching the odd-pointer, you replay everything as normal.
>
> Hmm, that seems actually doable...

--
greg
