Re: SPGiST versus hot standby - question about conflict resolution rules

Lists: pgsql-hackers
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: SPGiST versus hot standby - question about conflict resolution rules
Date: 2012-03-13 02:50:36
Message-ID: 17129.1331607036@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

There is one more (known) stop-ship problem in SPGiST, which I'd kind of
like to get out of the way now before I let my knowledge of that code
get swapped out again. This is that SPGiST is unsafe for use by hot
standby slaves.

The problem comes from "redirect" tuples, which are short-lifespan
objects that replace a tuple that's been moved to another page.
A redirect tuple can be recycled as soon as no active indexscan could
be "in flight" from the parent index page to the moved tuple. SPGiST
implements this by marking each redirect tuple with the XID of the
creating transaction, and assuming that the tuple can be recycled once
that XID is below the OldestXmin horizon (implying that all active
transactions started after it ended). This is fine as far as
transactions on the master are concerned, but there is no guarantee that
the recycling WAL record couldn't be replayed on a hot standby slave
while there are still HS transactions that saw the old state of the
parent index tuple.

Now, btree has a very similar problem with deciding when it's safe to
recycle a deleted index page: it has to wait out transactions that could
be in flight to the page, and it does that by marking deleted pages with
XIDs. I see that the problem has been patched for btree by emitting a
special WAL record just before a page is recycled. However, I'm a bit
nervous about copying that solution, because the details are a bit
different. In particular, I see that btree marks deleted pages with
ReadNewTransactionId() --- that is, the next-to-be-assigned XID ---
rather than the XID of the originating transaction, and then it
subtracts one from the XID before sending it to the WAL stream.
The comments about this are not clear enough for me, and so I'm
wondering whether it's okay to use the originating transaction XID
in a similar way, or if we need to modify SPGiST's rule for how to
mark redirection tuples. I think that the use of ReadNewTransactionId
is because btree page deletion happens in VACUUM, which does not have
its own XID; this is unlike the situation for SPGiST where creation of
redirects is caused by index tuple insertion, so there is a surrounding
transaction with a real XID. But it's not clear to me how
GetConflictingVirtualXIDs makes use of the limitXmin and whether a live
XID is okay to pass to it, or whether we actually need "next XID - 1".

Info appreciated.

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: SPGiST versus hot standby - question about conflict resolution rules
Date: 2012-03-13 07:04:26
Message-ID: CA+U5nMJaD-kmwQ=QPeMWBjeAgqs8b9jtooqdGGsAiW6wagsiUA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Mar 13, 2012 at 2:50 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Info appreciated.

Email seen, will reply when I can later today.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Noah Misch <noah(at)leadboat(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: SPGiST versus hot standby - question about conflict resolution rules
Date: 2012-04-19 06:55:16
Message-ID: 20120419065516.GC12337@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Mar 12, 2012 at 10:50:36PM -0400, Tom Lane wrote:
> There is one more (known) stop-ship problem in SPGiST, which I'd kind of
> like to get out of the way now before I let my knowledge of that code
> get swapped out again. This is that SPGiST is unsafe for use by hot
> standby slaves.

I suspect that swap-out has passed, but ...

> The problem comes from "redirect" tuples, which are short-lifespan
> objects that replace a tuple that's been moved to another page.
> A redirect tuple can be recycled as soon as no active indexscan could
> be "in flight" from the parent index page to the moved tuple. SPGiST
> implements this by marking each redirect tuple with the XID of the
> creating transaction, and assuming that the tuple can be recycled once
> that XID is below the OldestXmin horizon (implying that all active
> transactions started after it ended). This is fine as far as
> transactions on the master are concerned, but there is no guarantee that
> the recycling WAL record couldn't be replayed on a hot standby slave
> while there are still HS transactions that saw the old state of the
> parent index tuple.
>
> Now, btree has a very similar problem with deciding when it's safe to
> recycle a deleted index page: it has to wait out transactions that could
> be in flight to the page, and it does that by marking deleted pages with
> XIDs. I see that the problem has been patched for btree by emitting a
> special WAL record just before a page is recycled. However, I'm a bit
> nervous about copying that solution, because the details are a bit
> different. In particular, I see that btree marks deleted pages with
> ReadNewTransactionId() --- that is, the next-to-be-assigned XID ---
> rather than the XID of the originating transaction, and then it
> subtracts one from the XID before sending it to the WAL stream.
> The comments about this are not clear enough for me, and so I'm

Attempting to write an explanation for that btree code led me conclude that
the code is incorrect. (FWIW, I caused that wrongness.) I will start a
separate thread to fix it.

> wondering whether it's okay to use the originating transaction XID
> in a similar way, or if we need to modify SPGiST's rule for how to
> mark redirection tuples. I think that the use of ReadNewTransactionId
> is because btree page deletion happens in VACUUM, which does not have
> its own XID; this is unlike the situation for SPGiST where creation of
> redirects is caused by index tuple insertion, so there is a surrounding
> transaction with a real XID. But it's not clear to me how
> GetConflictingVirtualXIDs makes use of the limitXmin and whether a live
> XID is okay to pass to it, or whether we actually need "next XID - 1".
>
> Info appreciated.

GetConflictingVirtualXIDs() selects transactions with pgaxt->xmin <=
limitXmin. The prototype use case was VACUUM, where limitXmin is the xmax of
a dead tuple cleaned from a page. Transactions with xmin <= limitXmin might
still regard the limitXmin XID as uncommitted; we conflict so they cannot fail
to see the tuple we're about to purge.

All hot standby transactions holding snapshots taken before the startup
process applies the tuple-mover transaction's commit record will have xmin <=
its XID. Therefore, passing that XID to ResolveRecoveryConflictWithSnapshot()
meets the need here precisely. vacuum_defer_cleanup_age and
hot_standby_feedback inform GetOldestXmin(), so the master will delay cleanup
long enough to prevent conflicts when so configured.

Thanks,
nm


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Re: SPGiST versus hot standby - question about conflict resolution rules
Date: 2012-04-19 07:12:16
Message-ID: CA+U5nMKVu2BRug_ORh1FGYEiQ5jKPfkUecCzUUp=VAwU0VTg8w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Apr 19, 2012 at 7:55 AM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> On Mon, Mar 12, 2012 at 10:50:36PM -0400, Tom Lane wrote:
>> There is one more (known) stop-ship problem in SPGiST, which I'd kind of
>> like to get out of the way now before I let my knowledge of that code
>> get swapped out again.  This is that SPGiST is unsafe for use by hot
>> standby slaves.
>
> I suspect that swap-out has passed, but ...
>
>> The problem comes from "redirect" tuples, which are short-lifespan
>> objects that replace a tuple that's been moved to another page.
>> A redirect tuple can be recycled as soon as no active indexscan could
>> be "in flight" from the parent index page to the moved tuple.  SPGiST
>> implements this by marking each redirect tuple with the XID of the
>> creating transaction, and assuming that the tuple can be recycled once
>> that XID is below the OldestXmin horizon (implying that all active
>> transactions started after it ended).  This is fine as far as
>> transactions on the master are concerned, but there is no guarantee that
>> the recycling WAL record couldn't be replayed on a hot standby slave
>> while there are still HS transactions that saw the old state of the
>> parent index tuple.
>>
>> Now, btree has a very similar problem with deciding when it's safe to
>> recycle a deleted index page: it has to wait out transactions that could
>> be in flight to the page, and it does that by marking deleted pages with
>> XIDs.  I see that the problem has been patched for btree by emitting a
>> special WAL record just before a page is recycled.  However, I'm a bit
>> nervous about copying that solution, because the details are a bit
>> different.  In particular, I see that btree marks deleted pages with
>> ReadNewTransactionId() --- that is, the next-to-be-assigned XID ---
>> rather than the XID of the originating transaction, and then it
>> subtracts one from the XID before sending it to the WAL stream.
>> The comments about this are not clear enough for me, and so I'm
>
> Attempting to write an explanation for that btree code led me conclude that
> the code is incorrect.  (FWIW, I caused that wrongness.)  I will start a
> separate thread to fix it.

Wrong or not, we need to better document why we picked
ReadNewTransactionID(), rather than OldestXmin which seems the more
obvious and cheaper choice.

> All hot standby transactions holding snapshots taken before the startup
> process applies the tuple-mover transaction's commit record will have xmin <=
> its XID.  Therefore, passing that XID to ResolveRecoveryConflictWithSnapshot()
> meets the need here precisely.

Yes, agreed. i.e. don't subtract 1.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Re: SPGiST versus hot standby - question about conflict resolution rules
Date: 2012-08-02 18:49:45
Message-ID: 27573.1343933385@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Noah Misch <noah(at)leadboat(dot)com> writes:
> On Mon, Mar 12, 2012 at 10:50:36PM -0400, Tom Lane wrote:
>> There is one more (known) stop-ship problem in SPGiST, which I'd kind of
>> like to get out of the way now before I let my knowledge of that code
>> get swapped out again. This is that SPGiST is unsafe for use by hot
>> standby slaves.
>> The problem comes from "redirect" tuples, which are short-lifespan
>> objects that replace a tuple that's been moved to another page.
>> A redirect tuple can be recycled as soon as no active indexscan could
>> be "in flight" from the parent index page to the moved tuple. SPGiST
>> implements this by marking each redirect tuple with the XID of the
>> creating transaction, and assuming that the tuple can be recycled once
>> that XID is below the OldestXmin horizon (implying that all active
>> transactions started after it ended). This is fine as far as
>> transactions on the master are concerned, but there is no guarantee that
>> the recycling WAL record couldn't be replayed on a hot standby slave
>> while there are still HS transactions that saw the old state of the
>> parent index tuple.
>>
>> Now, btree has a very similar problem with deciding when it's safe to
>> recycle a deleted index page: it has to wait out transactions that could
>> be in flight to the page, and it does that by marking deleted pages with
>> XIDs.

> Attempting to write an explanation for that btree code led me conclude that
> the code is incorrect. (FWIW, I caused that wrongness.) I will start a
> separate thread to fix it.

After reviewing this thread and the one at
http://archives.postgresql.org/message-id/20120421165202.GA13458@tornado.leadboat.com
I believe that the SPGiST issue should be fixed as follows:

* It's okay for transactions inserting redirection tuples to use their
own XID as the marker.

* In spgvacuum.c, the test in vacuumRedirectAndPlaceholder to decide if
a redirection tuple is recyclable should use
TransactionIdPrecedes(dt->xid, RecentGlobalXmin)
rather than comparing against OldestXmin as it does now. (This allows
some consequent simplification since the module need not pass around
an OldestXmin parameter.)

* vacuumRedirectAndPlaceholder must also compute the newest XID among
the redirection tuples it removes from the page, and store that in
a new field of XLOG_SPGIST_VACUUM_REDIRECT WAL records.

* In spgRedoVacuumRedirect, call ResolveRecoveryConflictWithSnapshot
with the newest-redirect XID.

I had first thought that we could avoid a change in WAL contents by
having spgRedoVacuumRedirect compute the cutoff XID for itself by
examining the removed tuples, but that doesn't work: XLogInsert might
for instance have decided to store a full-page image, in which case the
information isn't available by inspection of the old page contents.
But we still have to enforce the interlock against hot standby xacts.

In principle we should bump the XLOG page magic number for this change
in WAL contents, but I'm inclined not to because it'd cause pain for
beta testers, and there are probably very few people who'd have live
XLOG_SPGIST_VACUUM_REDIRECT records in play when they switch to beta3.

Comments?

regards, tom lane


From: Noah Misch <noah(at)leadboat(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Re: SPGiST versus hot standby - question about conflict resolution rules
Date: 2012-08-03 12:37:54
Message-ID: 20120803123754.GC9683@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Aug 02, 2012 at 02:49:45PM -0400, Tom Lane wrote:
> I believe that the SPGiST issue should be fixed as follows:
>
> * It's okay for transactions inserting redirection tuples to use their
> own XID as the marker.
>
> * In spgvacuum.c, the test in vacuumRedirectAndPlaceholder to decide if
> a redirection tuple is recyclable should use
> TransactionIdPrecedes(dt->xid, RecentGlobalXmin)
> rather than comparing against OldestXmin as it does now. (This allows
> some consequent simplification since the module need not pass around
> an OldestXmin parameter.)
>
> * vacuumRedirectAndPlaceholder must also compute the newest XID among
> the redirection tuples it removes from the page, and store that in
> a new field of XLOG_SPGIST_VACUUM_REDIRECT WAL records.
>
> * In spgRedoVacuumRedirect, call ResolveRecoveryConflictWithSnapshot
> with the newest-redirect XID.

All the above looks reasonable, as does the committed patch.

There's an obsolete comment in spg_redo().

> I had first thought that we could avoid a change in WAL contents by
> having spgRedoVacuumRedirect compute the cutoff XID for itself by
> examining the removed tuples, but that doesn't work: XLogInsert might
> for instance have decided to store a full-page image, in which case the
> information isn't available by inspection of the old page contents.
> But we still have to enforce the interlock against hot standby xacts.

We achieve that division of labor for XLOG_BTREE_DELETE by examining the old
contents before RestoreBkpBlocks(). This is safe, I think, because we only
examine the page when the system has running hot standby backends, and we only
allow hot standby connections when recovery has proceeded far enough to fix
all torn and ahead-of-EndRecPtr pages. Note that this decision is a larger
win for XLOG_BTREE_DELETE than it would be for XLOG_SPGIST_VACUUM_REDIRECT.
For XLOG_BTREE_DELETE, finding the cutoff XID could require fetching and
examining hundreds of heap pages. For XLOG_SPGIST_VACUUM_REDIRECT, doing so
just adds a few cycles to an existing page scan. I think the simpler way
you've implemented it is better for the long term.

> In principle we should bump the XLOG page magic number for this change
> in WAL contents, but I'm inclined not to because it'd cause pain for
> beta testers, and there are probably very few people who'd have live
> XLOG_SPGIST_VACUUM_REDIRECT records in play when they switch to beta3.

Seems fair.

Thanks,
nm


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Re: SPGiST versus hot standby - question about conflict resolution rules
Date: 2012-08-03 17:14:31
Message-ID: 20141.1344014071@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Noah Misch <noah(at)leadboat(dot)com> writes:
> On Thu, Aug 02, 2012 at 02:49:45PM -0400, Tom Lane wrote:
>> * In spgRedoVacuumRedirect, call ResolveRecoveryConflictWithSnapshot
>> with the newest-redirect XID.

> There's an obsolete comment in spg_redo().

[ squint ... ] Comparing that to btree_redo, I realize that there's a
bug in what I did yesterday: the ResolveRecoveryConflictWithSnapshot
call has to happen before we call RestoreBkpBlocks, else the new state
of the index page could be exposed to processes that need the old one.

Will fix. I think the code in btree_redo could use a better (or any)
comment about this point, too.

>> But we still have to enforce the interlock against hot standby xacts.

> We achieve that division of labor for XLOG_BTREE_DELETE by examining the old
> contents before RestoreBkpBlocks(). This is safe, I think, because we only
> examine the page when the system has running hot standby backends, and we only
> allow hot standby connections when recovery has proceeded far enough to fix
> all torn and ahead-of-EndRecPtr pages.

Egad. That's seriously underdocumented as well. And I think it needs
an explicit test that the page is actually older than the current WAL
record, because it would otherwise be doing the wrong thing.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Re: SPGiST versus hot standby - question about conflict resolution rules
Date: 2012-08-03 18:02:35
Message-ID: 29837.1344016955@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> Noah Misch <noah(at)leadboat(dot)com> writes:
>> We achieve that division of labor for XLOG_BTREE_DELETE by examining the old
>> contents before RestoreBkpBlocks(). This is safe, I think, because we only
>> examine the page when the system has running hot standby backends, and we only
>> allow hot standby connections when recovery has proceeded far enough to fix
>> all torn and ahead-of-EndRecPtr pages.

> Egad. That's seriously underdocumented as well.

And buggy, now that I look at it. Suppose a new backend starts
immediately after CountDBBackends runs? That's okay, because it can't
possibly have seen any page links that we would need to conflict for
... but because the code is lazy and returns InvalidTransactionId,
we'll throw a conflict on it anyway.

The case where the loop doesn't find any live tuples seems problematic
as well. There's a comment asserting that this will happen "very
rarely", which is neither comforting nor backed up by any evidence or
argument. ISTM that a page that VACUUM has removed tuples from might
very well have a preimage consisting only of dead tuples.

regards, tom lane