Re: Post-mortem: final 2PC patch

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)surnet(dot)cl>, pgsql-patches(at)postgreSQL(dot)org
Subject: Re: Post-mortem: final 2PC patch
Date: 2005-06-18 23:00:25
Message-ID: Pine.OSF.4.61.0506190139490.262699@kosh.hut.fi
Lists: pgsql-hackers pgsql-patches

On Sat, 18 Jun 2005, Tom Lane wrote:

> Heikki Linnakangas <hlinnaka(at)iki(dot)fi> writes:
>> Can we figure out another way to solve the race condition? Would it
>> in fact be ok for the checkpointer to hold the TwoPhaseStateLock,
>> considering that it usually wouldn't be held for long, since usually the
>> checkpoint would have very little work to do?
>
> If you're concerned about throughput of 2PC xacts then we can't sit on
> the TwoPhaseStateLock while doing I/O; that will block both preparation
> and committal of all 2PC xacts for a pretty long period in CPU terms.
>
> Here's a sketch of an idea inspired by your comment above:
>
> 1. In each gxact in shared memory, store the WAL offset of the PREPARE
> record, which we will know before we are ready to mark the gxact
> "valid".
>
> 2. When CheckPointTwoPhase runs (which we'll put near the end of the
> checkpoint sequence), the only gxacts that need to be fsync'd are those
> that are marked valid and have a PREPARE WAL location older than the
> checkpoint's redo horizon (anything newer will be replayed from WAL on
> crash, so it doesn't need fsync to complete the checkpoint). If you're
> right that the lifespan of a state file is often shorter than the time
> needed for a checkpoint, this wins big. In any case we'll never have to
> fsync state files that disappear before the next checkpoint.
>
> 3. One way to handle CheckPointTwoPhase is:
>
> * At start, take TwoPhaseStateLock (can be in shared mode) for just long
> enough to scan the gxact list and make a list of the XID of things that
> need fsync per above rule.
>
> * Without the lock, try to open and fsync each item in the list.
> Success: remove from list
> ENOENT failure on open: add to list of not-there failures
> Any other failure: ereport(ERROR)
>
> * If the failure list is not empty, again take TwoPhaseStateLock in
> shared mode, and check that each of the failures is now gone (or at
> least marked invalid); if so it's OK, otherwise ereport the ENOENT
> error.

In step 3.1, is it safe to skip gxacts not marked as valid? The gxact is
marked as valid after the prepare record is written to WAL. If a checkpoint
runs after the WAL record is written but before the gxact is marked as
valid, the state file doesn't get fsynced. Right?

Otherwise, looks good to me.
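
To make the question about step 3.1 concrete, here is a minimal sketch of
the selection rule. The struct, the function names and the "not written
yet" marker are all made up for illustration; none of them are taken from
the patch, and the second function is only one conceivable variant, not a
proposal for what the code should do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins, not the real PostgreSQL definitions. */
typedef uint64_t WalPos;              /* byte offset into WAL */
#define INVALID_WAL_POS ((WalPos) 0)  /* assumed "PREPARE record not written yet" marker */

typedef struct GXact {
    uint32_t xid;
    bool     valid;        /* set some time after the PREPARE record is in WAL */
    WalPos   prepare_lsn;  /* WAL offset of the PREPARE record (step 1) */
} GXact;

/* Rule from step 2 as written: valid, and PREPARE is older than the redo horizon. */
bool needs_fsync_as_proposed(const GXact *g, WalPos redo)
{
    assert(g != NULL);
    return g->valid && g->prepare_lsn < redo;
}

/* Illustrative variant: decide on the WAL position alone, ignoring the valid flag. */
bool needs_fsync_by_lsn(const GXact *g, WalPos redo)
{
    assert(g != NULL);
    return g->prepare_lsn != INVALID_WAL_POS && g->prepare_lsn < redo;
}
```

With a gxact whose PREPARE record sits at offset 100, a redo horizon of 150
and the valid flag not yet set, the first rule skips the gxact while the
second one selects it. That window is the case I'm asking about.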

> Another possibility is to further extend the locking protocol for gxacts
> so that the checkpointer can lock just the item it is fsyncing (which is
> not possible at the moment because the checkpointer hasn't got an XID,
> but probably we could think of another approach). But that would
> certainly delay attempts to commit the item being fsync'd, whereas the
> above approach might not have to do so, depending on the filesystem
> implementation.

The above sketch is much better.
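
For reference, the three phases of step 3 could look roughly like the
following. This is a sketch under assumptions, not the implementation: the
gxact table is a plain array, the lock is only indicated in comments, the
"open and fsync" step is a caller-supplied hook, and a return value of -1
stands in for ereport(ERROR). All type and function names, and the demo
fixtures at the bottom, are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins, not the real PostgreSQL definitions. */
typedef uint64_t WalPos;
typedef struct GXact { uint32_t xid; bool valid; WalPos prepare_lsn; } GXact;
typedef enum { FSYNC_OK, FSYNC_ENOENT, FSYNC_FAILED } FsyncResult;

/* Caller-supplied "open and fsync the state file for this XID" hook. */
typedef FsyncResult (*FsyncFn)(uint32_t xid);

#define MAX_GXACTS 64

/* Returns 0 if the checkpoint may proceed, -1 where real code would ereport(ERROR). */
int checkpoint_two_phase(const GXact *table, size_t n, WalPos redo, FsyncFn do_fsync)
{
    uint32_t todo[MAX_GXACTS], missing[MAX_GXACTS];
    size_t   ntodo = 0, nmissing = 0;

    assert(n <= MAX_GXACTS);

    /* Phase 1: with TwoPhaseStateLock held in shared mode, list the XIDs to fsync. */
    for (size_t i = 0; i < n; i++)
        if (table[i].valid && table[i].prepare_lsn < redo)
            todo[ntodo++] = table[i].xid;

    /* Phase 2: lock released; all I/O happens here. */
    for (size_t i = 0; i < ntodo; i++) {
        FsyncResult r = do_fsync(todo[i]);
        if (r == FSYNC_ENOENT)
            missing[nmissing++] = todo[i];   /* decide about these in phase 3 */
        else if (r == FSYNC_FAILED)
            return -1;
    }

    /* Phase 3: retake the lock in shared mode; a missing file is acceptable
     * only if its gxact is gone or no longer marked valid. */
    for (size_t i = 0; i < nmissing; i++)
        for (size_t j = 0; j < n; j++)
            if (table[j].xid == missing[i] && table[j].valid)
                return -1;
    return 0;
}

/* Demo fixtures, for illustration only. */
GXact demo_table[] = {
    { 10, true, 100 },   /* older than the redo horizon, file present */
    { 11, true, 120 },   /* older than the redo horizon, committed during phase 2 */
    { 12, true, 900 },   /* newer than the redo horizon: WAL replay covers it */
};

/* Simulates a COMMIT PREPARED of xid 11 racing with phase 2. */
FsyncResult demo_fsync_commit_race(uint32_t xid)
{
    if (xid == 11) {
        demo_table[1].valid = false;
        return FSYNC_ENOENT;
    }
    return FSYNC_OK;
}

/* Simulates a state file that is missing although its gxact is still valid. */
FsyncResult demo_fsync_lost_file(uint32_t xid)
{
    return xid == 10 ? FSYNC_ENOENT : FSYNC_OK;
}
```

With a redo horizon of 500, the commit-race hook leaves the checkpoint free
to proceed, because the only missing file belongs to a gxact that is no
longer valid. The lost-file hook makes the function report an error, since
the gxact is still valid but its state file cannot be opened.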

> Now there's a small problem with this approach, which is that we cannot
> store the PREPARE WAL record location in the state files, since the
> state file has to be completely computed before writing the WAL record.
> However, we don't really need to do that: during recovery of a prepared
> xact we know the thing has been fsynced (either originally, or when we
> rewrote it during the WAL recovery sequence --- we can force an
> immediate fsync in that one case). So we can just put zero, or maybe
> better the current end-of-WAL location, into the reconstructed gxact in
> memory.

This reminds me of something. What should we do about XID wraparounds and
prepared transactions? Should we have some mechanism to freeze prepared
transactions, like heap tuples? At a minimum, I think we should issue a
warning if the XID counter approaches the oldest prepared transaction.

A transaction shouldn't live that long in normal use, but I can imagine an
orphaned transaction sitting there for years if it doesn't hold any locks
etc. that bother other applications.
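
A rough sketch of the kind of check I have in mind for the warning. XIDs are
32-bit and compared modulo 2^32, so the distance between the next XID and
the oldest prepared one has to be computed with wrapping arithmetic. The
threshold value and the function names are assumptions picked for
illustration, and the sketch ignores the reserved special XIDs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Xid;

/* Hypothetical threshold: warn once the oldest prepared xact is this far behind. */
#define WARN_XID_AGE 1500000000u

/* Distance from 'oldest' forward to 'next', modulo 2^32. */
uint32_t xid_age(Xid next, Xid oldest)
{
    return (uint32_t) (next - oldest);   /* unsigned subtraction wraps around */
}

/* True if the oldest prepared transaction is old enough to deserve a warning. */
bool prepared_xact_near_wraparound(Xid next_xid, Xid oldest_prepared)
{
    /* The threshold must stay below 2^31, where modulo comparison flips. */
    assert(WARN_XID_AGE < 0x80000000u);
    return xid_age(next_xid, oldest_prepared) >= WARN_XID_AGE;
}
```

The wrapping subtraction keeps the age correct even after the counter has
wrapped: a next XID of 5 and an oldest prepared XID of 4294967290 are only
11 apart.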

I don't think we should implement heuristic commit/rollback, though.
That creates a whole new class of problems.

- Heikki
