Lazy Snapshots

From: "simon(at)2ndquadrant(dot)com" <simon(at)2ndquadrant(dot)com>
To: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Lazy Snapshots
Date: 2009-08-18 10:13:08
Message-ID: 1278436501.36016.1250590388532.JavaMail.open-xchange@oxltgw02.schlund.de
Lists: pgsql-hackers

One of the problems with Hot Standby is that a long running query on the standby
can conflict with VACUUMed rows on the primary, causing queries to be cancelled.

I've been looking at this problem for about a year now from various angles. Jeff
Janes's recent thoughts on procarray scalability have led me down an interesting
path, described here. Taken together, this has led me to rethink completely the
strategy used for avoiding conflicts in the Hot Standby patch.

Currently, we take eager snapshots, meaning we take a snapshot at the start of
each statement whether or not it is necessary. Snapshots exist to disambiguate
the running state of recent transactions, so if a statement never sees data
written by recent transactions then we will never actually use the snapshot.

Another way of doing this is to utilize lazy snapshots: do not take a snapshot
when a statement starts and only take one at the point that we need one. No
other changes to the MVCC mechanisms are proposed.

Is that possible?

The time the snapshot is taken is the time of the consistent viewpoint from
which all data access during a statement is judged. Taking the snapshot later,
at an undefined point in the future, means that the consistent viewpoint is
actually floating. When we execute the statement we won't actually know which
viewpoint will be used to derive the answers to a query.

A floating, yet consistent viewpoint is in my opinion a good thing, since it
includes a more recent database state in the answer to a query than we would
otherwise have used. Consider the case where a very large table has a "select
count(*)" executed on it. The scan begins at block 0 and continues through the
table until the end, which for purposes of an example we will say takes 1 hour.
Rows are added to the table at a constant rate of 100/sec and immediately
committed. So by the time the scan has finished it will deliver an answer that
is out of date by 360,000 rows (100 rows/sec for 3,600 seconds). Using a lazy
snapshot would give us an answer almost exactly correct, though of course
Heisenbuggers may dispute the existence of a "correct" answer in this case.
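
The arithmetic behind that example can be checked with a few lines of Python.
This is only a sketch of the numbers quoted above: the scan length and insert
rate come from the example, while the 3590-second figure for when a lazy
snapshot might be taken is purely an illustrative assumption, since the actual
moment depends on where the scan first meets a recent row.

```python
SCAN_SECONDS = 60 * 60   # the scan takes 1 hour
INSERT_RATE = 100        # rows inserted and committed per second


def rows_missed(snapshot_taken_at: int) -> int:
    """Rows committed after the snapshot was taken and before the scan ends."""
    return INSERT_RATE * (SCAN_SECONDS - snapshot_taken_at)


eager_error = rows_missed(0)      # snapshot taken at statement start
lazy_error = rows_missed(3590)    # hypothetical: snapshot taken 10 s before the end

print(eager_error)
print(lazy_error)
```

The later the snapshot is taken, the fewer committed rows the final count
leaves out.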

So let's look at some theory details:

* Scan begins, no snapshot set. A row is inserted and its transaction commits.
Scan progresses until it sees a recent row. Scan takes a snapshot; the row is
now visible to it and the scan progresses. Another row is inserted and its
transaction commits. When we later come to the second new row, we already have a
snapshot, so that row is invisible to us. Results of the query are consistent to
the point we took the snapshot, which happened when we saw the first row. Are
the results consistent only to the end of the transaction that created that row?
No, other transactions can also have committed after it and yet before we take
the snapshot. The recent transaction is the catalyst for us to take a snapshot,
though the snapshot is not dependent upon the xid of the new row we have seen.

* Scan begins, no snapshot set. Ahead of the scan, a row that would have been
visible at the start of the scan is deleted, the deletion commits, and the row
is removed by VACUUM/HOT. The scan has no evidence that a removal has taken
place, never sees contention and thus never takes a snapshot. This isn't a
problem; the row removal created an implicit xmin for our scan. If we later took
a snapshot, the xmin of the snapshot would be equal to or later than our
previous implicit xmin and so MVCC would be working. This shows that it is wrong
to presume that taking no snapshot at all means that the time-consistent point
of the scan was at the start of the statement; it may not be.

* We open a cursor, then start issuing UPDATE ... WHERE CURRENT OF. Does the
cursor need a snapshot? I don't think it does, since we have special visibility
rules for rows produced by our own transaction and we do not need a snapshot to
disambiguate them. ISTM there may be a corner case where we need cursors to take
snapshots, but I haven't seen it yet.
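
The first case above can be played out in a toy model. This is a sketch under
heavy simplifying assumptions, not PostgreSQL code: a snapshot is modelled as
the set of xids committed at the moment it is taken, every xid below a
hypothetical `horizon` is assumed old and committed, and the names `lazy_scan`
and `commits_before` are invented for illustration.

```python
def lazy_scan(rows, committed, horizon, commits_before):
    """Scan rows in order, taking a snapshot only when a recent xid is met.

    rows            creating xid of each row, in scan order
    committed       set of xids committed so far (mutated as commits arrive)
    horizon         xids below this are assumed old and committed
    commits_before  {row index: [xids that commit just before that row is read]}
    """
    snapshot = None          # no snapshot at statement start
    taken_at = None
    visible = []
    for i, xmin in enumerate(rows):
        committed.update(commits_before.get(i, []))
        if xmin < horizon:
            visible.append(i)              # old row: no snapshot needed
            continue
        if snapshot is None:               # first recent row triggers the snapshot
            snapshot = frozenset(committed)
            taken_at = i
        if xmin in snapshot:
            visible.append(i)
    return visible, taken_at


rows = [5, 6, 101, 7, 102, 103]
visible, taken_at = lazy_scan(
    rows,
    committed={5, 6, 7},
    horizon=100,
    commits_before={1: [101, 103], 3: [102]},
)
print(visible, taken_at)
```

In this run the snapshot is taken at the row created by xid 101. The row from
xid 103 is visible because it committed before the snapshot was taken, even
though it committed after 101; the row from xid 102 committed afterwards and
stays invisible.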

Does that cover all the cases? It covers some of the main ones, but let's see
whether other problems emerge. Anyone?

OK, in theory it seems to work, so how will it work in practice and does that
cause other difficulties?

* We will hold a new global variable LastGlobalXmin, which is maintained by
GetSnapshotData(). We can access it atomically, without locks.

* In XidInMVCCSnapshot() if our snapshot is NULL then we update RecentXmin from
LastGlobalXmin and test using that because we don't have a snapshot xmin. If
this is sufficient to return false then that's all we do. Otherwise, we now get
a full snapshot and then continue as normal. (There may be some API rework to
allow this to happen, so I think I have papered over a few difficulties here,
but in broad terms this appears to work.)
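
The control flow of the two points above can be sketched as a toy model. It is
written in Python rather than backend C, and it is not the real
XidInMVCCSnapshot() or GetSnapshotData(): the classes `Shared`, `Snapshot` and
`LazyBackend` are hypothetical, and the global xmin is simplified to "oldest
running xid", which is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    xmin: int            # every xid below this had finished
    xmax: int            # every xid at or above this counts as running
    running: frozenset   # xids in [xmin, xmax) still running


class Shared:
    """Stands in for shared memory (assumption: no locking modelled)."""

    def __init__(self, next_xid, running):
        self.next_xid = next_xid
        self.running = set(running)
        self.last_global_xmin = 0

    def get_snapshot_data(self):
        xmin = min(self.running, default=self.next_xid)
        self.last_global_xmin = xmin     # maintained as a side effect, as proposed
        return Snapshot(xmin, self.next_xid, frozenset(self.running))


class LazyBackend:
    def __init__(self, shared):
        self.shared = shared
        self.snapshot = None             # lazy: nothing taken at statement start
        self.recent_xmin = 0
        self.snapshots_taken = 0

    def xid_in_mvcc_snapshot(self, xid):
        """True if xid must be treated as still in progress."""
        if self.snapshot is None:
            self.recent_xmin = self.shared.last_global_xmin
            if xid < self.recent_xmin:
                return False             # old enough: no snapshot required
            self.snapshot = self.shared.get_snapshot_data()
            self.snapshots_taken += 1
        snap = self.snapshot
        if xid < snap.xmin:
            return False
        if xid >= snap.xmax:
            return True
        return xid in snap.running


shared = Shared(next_xid=110, running={105, 108})
shared.get_snapshot_data()               # some other backend's snapshot sets the horizon
backend = LazyBackend(shared)

old_result = backend.xid_in_mvcc_snapshot(50)
taken_after_old = backend.snapshots_taken
running_result = backend.xid_in_mvcc_snapshot(108)
finished_result = backend.xid_in_mvcc_snapshot(106)
```

Testing an old xid costs no snapshot at all; only the first xid at or above the
horizon forces one, and later tests reuse it.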

Lazy snapshots mean that some things normally updated during snapshot taking
will fall behind somewhat. This has a couple of effects that we can mitigate in
various ways:

* In TransactionIdIsInProgress() if xid < RecentXmin we update RecentXmin from
globalxmin and retry the test.

* We probably need to do something with HOT page cleaning as well, but that is a
fairly subtle bit of tuning that I expect to see a range of viewpoints on.
Various options exist, from do-nothing through to re-checking xmin prior to each
cleaning check, or somewhere in between.
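
A refresh-and-retry of the cached horizon might look like the toy model below.
It rests on an interpretation, not on anything stated above: the bullet as
written tests xid < RecentXmin, whereas this sketch assumes the refresh is
useful in the opposite case, where a possibly stale RecentXmin cannot rule the
xid out. The class `ProgressChecker` and the dictionary standing in for shared
memory are hypothetical.

```python
class ProgressChecker:
    def __init__(self, shared):
        self.shared = shared                       # dict standing in for shared memory
        self.recent_xmin = shared["global_xmin"]   # cached, may fall behind
        self.slow_checks = 0

    def transaction_id_is_in_progress(self, xid):
        if xid < self.recent_xmin:
            return False                           # cheap test settles it
        # The cached horizon may be stale: refresh it and retry the cheap test.
        self.recent_xmin = self.shared["global_xmin"]
        if xid < self.recent_xmin:
            return False
        self.slow_checks += 1                      # fall back to the full check
        return xid in self.shared["running"]


shared = {"global_xmin": 100, "running": {100, 104}}
checker = ProgressChecker(shared)

# Time passes without this backend taking a snapshot; the horizon moves on.
shared["global_xmin"] = 120
shared["running"] = {120, 125}

stale_case = checker.transaction_id_is_in_progress(110)
slow_after_stale = checker.slow_checks
live_case = checker.transaction_id_is_in_progress(125)
```

Here xid 110 is resolved by the retry alone, without the full check, while xid
125 still needs it.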

I have no idea whether this idea is patented and I would appreciate some help in
researching whether PGDG can legally implement it, so that I can remain
untainted.

Benefits

* Scalability: ProcArrayLock requests from snapshots will drop away considerably
as a result of these changes. (It may prove feasible to provide an option to
lightly partition the procarray to increase commit rate, but that would be
later.)

* Hot Standby: Implementing this will likely significantly reduce the number of
queries cancelled during Hot Standby. This will be because many queries will not
have snapshots at all and the queries that do will typically have much younger
snapshots.

* Accuracy: More accurate answers to long database queries.

I will be removing various parts of code from the Hot Standby patch while this
is discussed. I'm not very available at the moment, so my replies are likely to
be considerably delayed.

Best Regards, Simon Riggs
