Re: Spreading full-page writes

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Greg Stark <stark(at)mit(dot)edu>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject:	Re: Spreading full-page writes
Date:	2014-05-27 12:15:54
Message-ID:	538481FA.6040707@vmware.com
Lists:	pgsql-hackers

On 05/27/2014 02:42 PM, Greg Stark wrote:
> On Tue, May 27, 2014 at 10:07 AM, Heikki Linnakangas
> <hlinnakangas(at)vmware(dot)com> wrote:
>>
>> On 05/26/2014 02:26 PM, Greg Stark wrote:
>>>
>>> Another idea would be to have separate checkpoints for each buffer
>>> partition. You would have to start recovery from the oldest checkpoint of
>>> any of the partitions.
>>
>> Yeah. Simon suggested that when we talked about this, but I didn't understand how that works at the time. I think I do now. The key to making it work is distinguishing, when starting recovery from the latest checkpoint, whether a record for a given page can be replayed safely. I used flags on WAL records in my proposal to achieve this, but using buffer partitions is simpler.
>
> Interesting. I just thought of it independently.
>
> Incidentally you wouldn't actually want to use the buffer partitions
> per se since the new server might start up with a different number of
> partitions. You would want an algorithm for partitioning the block
> space that xlog replay can reliably reproduce regardless of the size
> of the buffer lock partition table. It might make sense to set it up
> so it coincidentally ensures all the buffers being flushed are in the
> same partition or maybe the reverse would be better. Probably it
> doesn't actually matter.

Since you will be flushing the buffers one "redo partition" at a time,
you would want to allow the OS to merge the writes within a partition
as much as possible. So my even-odd split would in fact be pretty bad.
Some sort of striping, e.g. mapping each contiguous 1 MB chunk to the
same partition, would be better.
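
To make the difference concrete, here is a rough sketch in Python of the two
mappings. It is only an illustration, not proposed code: the 8 kB block size,
the function names and the run-counting helper are all assumptions made for
the example. It counts how many separate contiguous runs of blocks one
partition owns, which is a crude proxy for how well the OS can merge the
writes when that partition is flushed.

```python
# Illustrative sketch only; all names and constants here are assumptions.
BLOCK_SIZE = 8192                            # assumed block size in bytes
CHUNK_SIZE = 1024 * 1024                     # 1 MB stripe, as suggested above
BLOCKS_PER_CHUNK = CHUNK_SIZE // BLOCK_SIZE  # 128 blocks per stripe


def striped_partition(block_number, n_partitions):
    """Map every contiguous 1 MB chunk of blocks to the same partition."""
    return (block_number // BLOCKS_PER_CHUNK) % n_partitions


def even_odd_partition(block_number):
    """The even-odd split: neighbouring blocks always land in different partitions."""
    return block_number % 2


def contiguous_runs(blocks):
    """Count runs of consecutive block numbers in an ascending list."""
    runs = 0
    prev = None
    for b in blocks:
        if prev is None or b != prev + 1:
            runs += 1
        prev = b
    return runs


def runs_for_partition(partition_of, n_blocks, target):
    """Number of contiguous runs that partition `target` owns among the first n_blocks."""
    blocks = [b for b in range(n_blocks) if partition_of(b) == target]
    return contiguous_runs(blocks)


if __name__ == "__main__":
    n_blocks = 1024  # an 8 MB relation segment under the assumed block size
    striped = runs_for_partition(lambda b: striped_partition(b, 2), n_blocks, 0)
    even_odd = runs_for_partition(even_odd_partition, n_blocks, 0)
    print("striped runs:", striped)
    print("even-odd runs:", even_odd)
```

With striping, partition 0 owns a handful of 128-block runs, while with the
even-odd split every block it owns is isolated from its neighbours. Note that
the mapping depends only on the block number and the number of redo
partitions, not on the size of the buffer lock partition table.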

> I'm assuming you would keep N checkpoint positions in the control
> file. That also means we can double the checkpoint timeout with only a
> marginal increase in the worst case recovery time. Since the worst
> case will be (1 + 1/n)*timeout's worth of wal to replay rather than
> 2*n. The amount of time for recovery would be much more predictable.

Good point.
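
For the record, the arithmetic behind that estimate can be sketched as
follows. This simply evaluates the (1 + 1/n)*timeout figure quoted above
against a 2*timeout worst case for a single checkpoint position; the function
names and the example numbers (a 300 second timeout, 16 positions) are made up
for illustration, and "WAL to replay" is measured in seconds' worth of WAL.

```python
# Illustrative sketch only; function names and example values are assumptions.

def worst_case_single(timeout):
    """Worst case with one checkpoint position: two timeouts' worth of WAL."""
    return 2 * timeout


def worst_case_partitioned(timeout, n):
    """Worst case with n checkpoint positions, per the estimate quoted above."""
    return (1 + 1 / n) * timeout


if __name__ == "__main__":
    timeout = 300  # seconds, hypothetical setting
    n = 16         # hypothetical number of checkpoint positions

    today = worst_case_single(timeout)
    doubled = worst_case_partitioned(2 * timeout, n)

    print("single position, timeout T:", today)
    print("n positions, timeout 2T:   ", doubled)
    print("relative increase:         ", doubled / today - 1)
```

So doubling the timeout costs only an extra 1/n of the old worst case under
these assumptions.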

- Heikki

In response to

Re: Spreading full-page writes at 2014-05-27 11:42:52 from Greg Stark

Responses

Re: Spreading full-page writes at 2014-05-30 03:39:46 from Robert Haas
