cheaper snapshots

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: cheaper snapshots
Date: 2011-07-28 02:51:49
Message-ID: CA+TgmoaAjiq=d=kYt3qNj+Uvi+MB-aRovCwr75Ca9egx-Ks9Ag@mail.gmail.com
Lists: pgsql-hackers

On Wed, Oct 20, 2010 at 10:07 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I wonder whether we could do something involving WAL properties --- the
> current tuple visibility logic was designed before WAL existed, so it's
> not exploiting that resource at all.  I'm imagining that the kernel of a
> snapshot is just a WAL position, ie the end of WAL as of the time you
> take the snapshot (easy to get in O(1) time).  Visibility tests then
> reduce to "did this transaction commit with a WAL record located before
> the specified position?".  You'd need some index datastructure that made
> it reasonably cheap to find out the commit locations of recently
> committed transactions, where "recent" means "back to recentGlobalXmin".
> That seems possibly do-able, though I don't have a concrete design in
> mind.

I was mulling this idea over some more (the same ideas keep floating
back to the top...). I don't think an LSN can actually work, because
there's no guarantee that the order in which the WAL records are
emitted is the same order in which the effects of the transactions
become visible to new snapshots. For example:

1. Transaction A inserts its commit record, flushes WAL, and begins
waiting for sync rep.
2. A moment later, transaction B sets synchronous_commit=off, inserts
its commit record, requests a background WAL flush, and removes itself
from the ProcArray.
3. Transaction C takes a snapshot.

Sync rep doesn't create this problem by itself; the race exists
regardless, because the order in which backends acquire WALInsertLock
needn't match the order in which they acquire ProcArrayLock. This has
the more-than-slightly-odd characteristic that you could end up with a
snapshot on the master that can see B but not A and a snapshot on the
slave that can see A but not B.

But having said that an LSN can't work, I don't see why we can't just
use a 64-bit counter. In fact, the predicate locking code already
does something much like this, using an SLRU, for serializable
transactions only. In more detail, what I'm imagining is an array
with 4 billion entries, one per XID, probably broken up into files of
say 16MB each with 2 million entries per file. Each entry is a 64-bit
value. It is 0 if the XID has not yet started, is still running, or
has aborted. Otherwise, it is the commit sequence number of the
transaction. For reasons I'll explain below, I'm imagining starting
the commit sequence number counter at some very large value and having
it count down from there. So the basic operations are:

- To take a snapshot, you just read the counter.
- To commit a transaction which has an XID, you read the counter,
stamp all your XIDs with that value, and decrement the counter.
- To find out whether an XID is visible to your snapshot, you look up
the XID in the array and read the stored value. If that value is
greater than your snapshot value, the transaction is visible; if it's
less than or equal (including the zero that marks a not-yet-committed
or aborted XID), it isn't.
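
To make that concrete, here's a minimal sketch of the three operations
in C. The names (csn_counter, csn_array, TakeSnapshot, and so on) are
made up, the array is a flat in-memory table rather than a set of
mapped files, and the locking is collapsed to a single mutex - it's
only meant to show the shape of the idea, not a real implementation:

#include <stdbool.h>
#include <stdint.h>
#include <pthread.h>

typedef uint32_t TransactionId;

#define CSN_START       UINT64_MAX   /* counter starts very high, counts down */
#define XID_ARRAY_SIZE  (1u << 20)   /* toy size; real thing is one slot per XID */

static uint64_t csn_counter = CSN_START;
static uint64_t csn_array[XID_ARRAY_SIZE];  /* 0 = not started, running, or aborted */
static pthread_mutex_t csn_lock = PTHREAD_MUTEX_INITIALIZER;

/* Taking a snapshot is just reading the counter. */
static uint64_t
TakeSnapshot(void)
{
    pthread_mutex_lock(&csn_lock);      /* could be lock-free; see below */
    uint64_t snap = csn_counter;
    pthread_mutex_unlock(&csn_lock);
    return snap;
}

/* Committing stamps every XID the transaction used, then decrements the counter. */
static void
CommitTransaction(const TransactionId *xids, int nxids)
{
    pthread_mutex_lock(&csn_lock);
    uint64_t csn = csn_counter;
    for (int i = 0; i < nxids; i++)
        csn_array[xids[i] % XID_ARRAY_SIZE] = csn;
    csn_counter = csn - 1;
    pthread_mutex_unlock(&csn_lock);
}

/* Visible iff the stamp is greater than the snapshot's counter value. */
static bool
XidVisibleToSnapshot(TransactionId xid, uint64_t snap)
{
    return csn_array[xid % XID_ARRAY_SIZE] > snap;  /* 0 is never visible */
}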

Now, is this algorithm any good, and how little locking can we get away with?

It seems to me that if we used an SLRU to store the array, the lock
contention would be even worse than it is under our current system,
wherein everybody fights over ProcArrayLock. A system like this is
going to involve lots and lots of probes into the array (even if we
build a per-backend cache of some kind) and an SLRU will require at
least one LWLock acquire and release per probe. Some kind of locking
is pretty much unavoidable, because you have to worry about pages
getting evicted from shared memory. However, what if we used a set of
files (like SLRU) but mapped them separately into each backend's
address space? I think this would allow both loads and stores from
the array to be done unlocked. One fly in the ointment is that 8-byte
stores are apparently done as two 4-byte stores on some platforms.
But if the counter runs backward, I think even that is OK. If you
happen to read an 8-byte value as it's being written, you'll get 4
bytes of the intended value and 4 bytes of zeros. The value will
therefore appear to be less than what it should be. However, if the
value was in the midst of being written, then it's still in the midst
of committing, which means that that XID wasn't going to be visible
anyway. Accidentally reading a smaller value doesn't change the
answer.
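
Just to make the mapping idea concrete, something along these lines -
the file layout, the function names, and the absence of any error
handling are all hand-waving on my part:

#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define SEGMENT_BYTES   (16 * 1024 * 1024)                   /* 16MB per file */
#define ENTRIES_PER_SEG (SEGMENT_BYTES / sizeof(uint64_t))   /* ~2 million entries */

/*
 * Map one segment file into this backend's address space.  Every backend
 * maps the same file, so loads and stores hit the shared pages directly
 * with no LWLock traffic; only creating or extending a segment needs
 * coordination.
 */
static uint64_t *
MapCsnSegment(const char *path)
{
    int         fd = open(path, O_RDWR);
    void       *p;

    if (fd < 0)
        return NULL;
    p = mmap(NULL, SEGMENT_BYTES, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping survives the close */
    return (p == MAP_FAILED) ? NULL : (uint64_t *) p;
}

/*
 * Unlocked read of one entry.  If the platform splits the 8-byte store
 * into two 4-byte stores, the worst we can observe is half of the new
 * value with the other half still zero, i.e. something smaller than the
 * real stamp - which errs toward "not visible", and that's the right
 * answer for a transaction whose commit is still in progress.
 */
static uint64_t
ReadCsnEntry(const uint64_t *seg, uint32_t offset_in_segment)
{
    return seg[offset_in_segment];
}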

Committing will require a lock on the counter.

Taking a snapshot can be done unlocked if (1) 8-byte reads are atomic
and either (2a) the architecture has strong memory ordering (no
store/store reordering) or (2b) you insert a memory fence between
stamping the XIDs and decrementing the counter. Otherwise, taking a
snapshot will also require a lock on the counter.
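
Spelled out with C11 atomics, just to pin down what (2b) actually
requires (the real code would use our own barrier primitives, and the
names here are again invented):

#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

extern _Atomic uint64_t csn_counter;        /* shared, counting-down counter */
extern _Atomic uint64_t csn_array[];        /* one entry per XID */
extern pthread_mutex_t csn_commit_lock;     /* serializes committers only */

/* Commit path: still serialized on the counter, as above. */
static void
StampAndDecrement(const uint32_t *xids, int nxids)
{
    pthread_mutex_lock(&csn_commit_lock);
    uint64_t csn = atomic_load_explicit(&csn_counter, memory_order_relaxed);
    for (int i = 0; i < nxids; i++)
        atomic_store_explicit(&csn_array[xids[i]], csn, memory_order_relaxed);

    /*
     * The release store is the fence between stamping the XIDs and
     * decrementing the counter: any snapshot that reads csn - 1 (or less)
     * is guaranteed to also see the stamps written above.
     */
    atomic_store_explicit(&csn_counter, csn - 1, memory_order_release);
    pthread_mutex_unlock(&csn_commit_lock);
}

/* Snapshot path: a single atomic read, no lock at all. */
static uint64_t
TakeSnapshotUnlocked(void)
{
    return atomic_load_explicit(&csn_counter, memory_order_acquire);
}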

Once a particular XID precedes RecentGlobalXmin, you no longer care
about the associated counter value. You just need to know that it
committed; the order no longer matters. So after a crash, assuming
that you have the CLOG bits available, you can just throw away all the
array contents and start the counter over at the highest possible
value. And, as RecentGlobalXmin advances, you can prune away (or
recycle) files that are no longer needed. In fact, you could put all
of these files on a ramfs, or maybe use shared memory segments for
them.
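
For instance (the segment naming and the unlink-based recycling here
are pure invention, just to show the arithmetic): with 16MB segments
of 2 million entries each, an XID maps to a segment and offset like
this, and every segment entirely below RecentGlobalXmin can be thrown
away:

#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define ENTRIES_PER_SEG (16 * 1024 * 1024 / sizeof(uint64_t))  /* 2 million */

static inline uint32_t
CsnSegmentNumber(uint32_t xid)
{
    return xid / ENTRIES_PER_SEG;
}

static inline uint32_t
CsnSegmentOffset(uint32_t xid)
{
    return xid % ENTRIES_PER_SEG;
}

/*
 * Once RecentGlobalXmin has moved past an entire segment, nobody will
 * ever ask for a counter value in it again - CLOG answers "did it
 * commit?" - so the file can simply be unlinked or recycled.
 */
static void
PruneCsnSegments(uint32_t recent_global_xmin)
{
    for (uint32_t seg = 0; seg < CsnSegmentNumber(recent_global_xmin); seg++)
    {
        char path[64];

        snprintf(path, sizeof(path), "pg_csn/%08X", (unsigned) seg);  /* made-up layout */
        unlink(path);           /* ignore failures; it may already be gone */
    }
}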

All that having been said, even if I haven't made any severe
conceptual errors in the above, I'm not sure how well it will work in
practice. On the plus side, taking a snapshot becomes O(1) rather
than O(MaxBackends) - that's good. On the further plus side, you can
check both whether an XID has committed and whether it's visible to
your snapshot in a single, atomic action with no lock - that seems
really good. On the minus side, checking an XID against your snapshot
now has less locality of reference. And, rolling over into a new
segment of the array is going to require everyone to map it, and maybe
cause some disk I/O as a new file gets created.

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
