Re: Freezing without write I/O

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Freezing without write I/O
Date: 2013-08-27 15:56:15
Message-ID: 521CCC1F.6040904@vmware.com
Lists: pgsql-hackers

On 10.06.2013 21:58, Heikki Linnakangas wrote:
> On 01.06.2013 23:21, Robert Haas wrote:
>> On Sat, Jun 1, 2013 at 2:48 PM, Heikki Linnakangas
>> <hlinnakangas(at)vmware(dot)com> wrote:
>>>> We define a new page-level bit, something like PD_RECENTLY_FROZEN.
>>>> When this bit is set, it means there are no unfrozen tuples on the
>>>> page with XIDs that predate the current half-epoch. Whenever we know
>>>> this to be true, we set the bit. If the page LSN crosses more than
>>>> one half-epoch boundary at a time, we freeze the page and set the bit.
>>>> If the page LSN crosses exactly one half-epoch boundary, then (1) if
>>>> the bit is set, we clear it and (2) if the bit is not set, we freeze
>>>> the page and set the bit.
>>>
>>> Yep, I think that would work. Want to write the patch, or should I? ;-)
>>
>> Have at it.
>
> Here's a first draft. A lot of stuff missing and broken, but "make
> check" passes :-).
>
> In the patch, instead of working with "half-epochs", there are "XID-LSN
> ranges", which can be of arbitrary size. An XID-LSN range consists of
> three values:
>
> minlsn: The point in WAL where this range begins.
> minxid - maxxid: The range of XIDs allowed in this range.
>
> Every point in WAL belongs to exactly one range. The minxid-maxxid of
> the ranges can overlap. For example:
>
> 1. XIDs 25000942 - 27000003 LSN 0/3BB9938
> 2. XIDs 23000742 - 26000003 LSN 0/2AB9288
> 3. XIDs 22000721 - 25000003 LSN 0/1AB8BE0
> 4. XIDs 22000002 - 24000003 LSN 0/10B1550
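
(To make the shared bookkeeping concrete, here's a rough C sketch of a
tracked range, populated with the example values above. The type and
field names are illustrative stand-ins, not necessarily what the patch
actually uses.)

#include <stdint.h>

/* Simplified stand-ins for the real PostgreSQL types. */
typedef uint64_t XLogRecPtr;      /* WAL location */
typedef uint32_t TransactionId;   /* XID */

/*
 * One tracked XID-LSN range.  A page whose LSN falls at or after minlsn
 * (and before the next range's minlsn) may only contain XIDs in
 * [minxid, maxxid].
 */
typedef struct XidLSNRange
{
    XLogRecPtr    minlsn;   /* WAL position where this range begins */
    TransactionId minxid;   /* lowest XID allowed under this range */
    TransactionId maxxid;   /* highest XID allowed under this range */
} XidLSNRange;

/* The four ranges from the example, newest first. */
static const XidLSNRange ranges[] = {
    { 0x0000000003BB9938ULL, 25000942, 27000003 },   /* 1. LSN 0/3BB9938 */
    { 0x0000000002AB9288ULL, 23000742, 26000003 },   /* 2. LSN 0/2AB9288 */
    { 0x0000000001AB8BE0ULL, 22000721, 25000003 },   /* 3. LSN 0/1AB8BE0 */
    { 0x00000000010B1550ULL, 22000002, 24000003 },   /* 4. LSN 0/10B1550 */
};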
>
> The invariant with the ranges is that a page with a certain LSN is only
> allowed to contain XIDs that belong to the range specified by that LSN.
> For example, if a page has LSN 0/3500000, it belongs to the 2nd range,
> and can only contain XIDs between 23000742 - 26000003. If a backend
> updates the page, so that its LSN is updated to, say, 0/3D12345, all
> XIDs on the page older than 25000942 must be frozen first, to avoid
> violating the rule.
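
(Building on the sketch above, the lookup and the freeze-before-update
check could look roughly like this. The helper names are made up, and
the plain "<" XID comparison glosses over wraparound, which the real
code has to handle with wraparound-aware comparisons.)

#include <stdbool.h>
#include <stddef.h>

/* Find the tracked range a given page LSN falls into (newest first). */
static const XidLSNRange *
RangeForLSN(XLogRecPtr pagelsn)
{
    for (size_t i = 0; i < sizeof(ranges) / sizeof(ranges[0]); i++)
    {
        if (pagelsn >= ranges[i].minlsn)
            return &ranges[i];
    }
    return NULL;    /* older than anything we track: a "mature" page */
}

/*
 * Before an update moves the page's LSN into the latest range, any XID
 * on the page older than the latest range's minxid must be frozen.
 */
static bool
PageNeedsFreezing(XLogRecPtr pagelsn, TransactionId oldest_xid_on_page)
{
    const XidLSNRange *cur = RangeForLSN(pagelsn);
    const XidLSNRange *latest = &ranges[0];

    /* Already written within the latest range: older XIDs were frozen then. */
    if (cur == latest)
        return false;

    /* Freeze if anything on the page predates the latest range's minxid. */
    return oldest_xid_on_page < latest->minxid;
}

With the example ranges, updating the page with LSN 0/3500000 would
require freezing any XIDs on it below 25000942 first, as described
above.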
>
> The system keeps track of a small number of these XID-LSN ranges. Where
> we currently truncate clog, we can also truncate the ranges with maxxid
> < the clog truncation point. Vacuum removes any dead tuples and updates
> relfrozenxid as usual, to make sure that there are no dead tuples or
> aborted XIDs older than the minxid of the oldest tracked XID-LSN range.
> It no longer needs to freeze old committed XIDs, however - that's the
> gain from this patch (except when it has to remove some dead tuples on
> the page and update its LSN, in which case it must still freeze to
> uphold the invariant).
>
> A new range is created whenever we reach the maxxid on the current one.
> The new range's minxid is set to the current global oldest xmin value,
> and maxxid is just the old maxxid plus a fixed number (1 million in the
> patch, but we probably want a higher value in reality). This ensures
> that if you modify a page and update its LSN, all the existing XIDs on
> the page that cannot be frozen yet are greater than the minxid of the
> latest range. In other words, you can always freeze old XIDs on a page,
> so that any remaining non-frozen XIDs are within the minxid-maxxid of
> the latest range.
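
(Roughly, the range-creation step then looks like this; the helper
functions are hypothetical stand-ins for however the patch actually
obtains the current insert LSN and the global oldest xmin, and
wraparound is again ignored.)

#define RANGE_XID_STEP 1000000              /* fixed step; 1 million in the patch */

extern XLogRecPtr    GetCurrentInsertLSN(void);    /* hypothetical stand-in */
extern TransactionId GetGlobalOldestXmin(void);    /* hypothetical stand-in */

/* Called when the next XID to assign would exceed the latest range's maxxid. */
static XidLSNRange
CreateNewRange(const XidLSNRange *latest)
{
    XidLSNRange newrange;

    newrange.minlsn = GetCurrentInsertLSN();    /* the new range starts here in WAL */
    newrange.minxid = GetGlobalOldestXmin();    /* any XID older than this can
                                                 * always be frozen */
    newrange.maxxid = latest->maxxid + RANGE_XID_STEP;

    return newrange;
}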
>
> The HeapTupleSatisfies functions are modified to look at the page's LSN
> first. If it's old enough, it doesn't look at the XIDs on the page level
> at all, and just considers everything on the page to be visible to everyone
> (I'm calling this state a "mature page").
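
(Schematically, the shortcut in the visibility checks is just an LSN
comparison against the oldest tracked range, done before any hint bit
or clog lookup; again building on the sketch above.)

/* A page is "mature" if its LSN predates the oldest range we still track. */
static bool
PageIsMature(XLogRecPtr pagelsn)
{
    const XidLSNRange *oldest = &ranges[sizeof(ranges) / sizeof(ranges[0]) - 1];

    return pagelsn < oldest->minlsn;
}

/*
 * Conceptually, each HeapTupleSatisfies* variant then begins with:
 *
 *     if (PageIsMature(PageGetLSN(page)))
 *         return true;        -- everything on the page is visible to everyone
 *
 * and only falls through to the usual XID / clog checks otherwise.
 */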
>
>> I think the tricky part is going to be figuring out the
>> synchronization around half-epoch boundaries.
>
> Yep. I skipped all those difficult parts in this first version. There
> are two race conditions that need to be fixed:
>
> 1. When a page is updated, we check if it needs to be frozen, by
> checking whether its LSN is already greater than the latest range's
> minlsn - IOW, whether we've already modified the page, and thus frozen
> all older tuples, within the current range - in which case no freezing
> is needed.
> However, it's possible that a new range is created immediately after
> we've checked that. When we then proceed to do the actual update on the
> page and WAL-log that, the new LSN falls within the next range, and we
> should've frozen the page. I'm planning to fix that by adding a "parity
> bit" on the page header. Each XID-LSN range is assigned a parity bit, 0
> or 1. When we check if a page needs to be frozen on update, we make note
> of the latest range's parity bit, and write it in the page header.
> Later, when we look at the page's LSN to determine which XID-LSN range
> it belongs to, we compare the parity. If the parity doesn't match, we
> know that the race condition happened, so we treat the page as
> belonging to the previous range, not the one its LSN alone would place
> it in.
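
(A sketch of that parity check, with a made-up page-header flag bit and
hypothetical helpers for a range's parity and its predecessor:)

#define PD_XIDLSN_PARITY 0x0008                 /* hypothetical pd_flags bit */

extern int RangeParity(const XidLSNRange *range);                  /* hypothetical */
extern const XidLSNRange *PreviousRange(const XidLSNRange *range); /* hypothetical */

/*
 * Map a page to its XID-LSN range, correcting for the race: if the parity
 * stamped in the page header doesn't match the parity of the range the LSN
 * points at, a new range was created between the freeze check and the WAL
 * insertion, so the page really belongs to the previous range.
 */
static const XidLSNRange *
EffectiveRangeForPage(XLogRecPtr pagelsn, uint16_t pd_flags)
{
    const XidLSNRange *range = RangeForLSN(pagelsn);
    int page_parity = (pd_flags & PD_XIDLSN_PARITY) ? 1 : 0;

    if (range != NULL && page_parity != RangeParity(range))
        range = PreviousRange(range);

    return range;
}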
>
> 2. When we look at a page, and determine that it's not old enough to be
> "matured", we then check the clog as usual. A page is considered mature,
> if the XID-LSN range (and corresponding clog pages) has already been
> truncated away. It's possible that between those steps, the XID-LSN
> range and clog is truncated away, so that the backend tries to access a
> clog page that doesn't exist anymore. To fix that, the XID-LSN range and
> clog truncation needs to be done in two steps. First, mark the
> truncation point in shared memory. Then somehow wait until all backends
> see the new value, and go ahead with actually truncating the clog only
> after that.
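
(In outline, the two-step truncation might look like this; all of the
helpers here are hypothetical stand-ins for the actual mechanism, in
particular whatever "wait until all backends see the new value" ends up
being.)

extern void SetSharedClogTruncationPoint(TransactionId truncxid);  /* hypothetical */
extern void WaitForAllBackendsToSeeTruncationPoint(void);          /* hypothetical */
extern void TruncateOldXidLSNRanges(TransactionId truncxid);       /* hypothetical */
extern void TruncateClogBefore(TransactionId truncxid);            /* hypothetical */

static void
TruncateRangesAndClog(TransactionId truncxid)
{
    /* Step 1: advertise the new truncation point in shared memory. */
    SetSharedClogTruncationPoint(truncxid);

    /*
     * Step 2: wait until every backend has seen the new value, so nobody is
     * still between "this page isn't mature" and the clog lookup while
     * assuming the old truncation point.
     */
    WaitForAllBackendsToSeeTruncationPoint();

    /* Step 3: only now actually drop the old ranges and clog segments. */
    TruncateOldXidLSNRanges(truncxid);
    TruncateClogBefore(truncxid);
}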
>
> Aside from those two race conditions, there are plenty of scalability
> issues remaining. Currently, the shared XID-LSN range array is checked
> every time a page is accessed, so it could quickly become a bottleneck.
> Need to cache that information in each backend. Oh, and I didn't
> implement the PD_RECENTLY_FROZEN bit in the page header yet, so you will
> get a freezing frenzy right after a new XID-LSN range is created.
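
(The per-backend cache could be little more than a local copy of the
shared array plus a change counter that the shared side bumps whenever
ranges are added or truncated - a rough sketch, with hypothetical
helpers for the shared-memory access:)

#define MAX_TRACKED_RANGES 64               /* made-up bound for the sketch */

extern uint64_t ReadSharedRangeCounter(void);                      /* hypothetical */
extern int CopySharedRanges(XidLSNRange *dst, int maxranges);      /* hypothetical,
                                                                    * takes the lock */

typedef struct LocalRangeCache
{
    uint64_t    counter_seen;               /* shared counter value at last refresh */
    int         nranges;
    XidLSNRange ranges[MAX_TRACKED_RANGES];
} LocalRangeCache;

static LocalRangeCache local_cache;

/* Only touch shared memory (and its lock) when the ranges have changed. */
static void
RefreshLocalRangeCacheIfNeeded(void)
{
    uint64_t counter = ReadSharedRangeCounter();

    if (counter != local_cache.counter_seen)
    {
        local_cache.nranges = CopySharedRanges(local_cache.ranges,
                                               MAX_TRACKED_RANGES);
        local_cache.counter_seen = counter;
    }
}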
>
> I'll keep hacking away on those things, but please let me know if you
> see some fatal flaw with this plan.

Here's an updated patch. The race conditions I mentioned above have been
fixed.

This is still definitely work-in-progress, but overall I think this is
quite promising. The patch basically works, although there are a bunch
of TODO items like handling unlogged tables.

This patch is also available in my git repository at
git://git.postgresql.org/git/users/heikki/postgres.git, branch
"freeze-by-xid-lsn-ranges".

- Heikki

Attachment: xid-lsn-ranges-2.patch.gz (application/x-gzip, 28.1 KB)
