SPGiST versus hot standby - question about conflict resolution rules

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: SPGiST versus hot standby - question about conflict resolution rules
Date: 2012-03-13 02:50:36
Message-ID: 17129.1331607036@sss.pgh.pa.us
Lists: pgsql-hackers

There is one more (known) stop-ship problem in SPGiST, which I'd kind of
like to get out of the way now before I let my knowledge of that code
get swapped out again. This is that SPGiST is unsafe for use by hot
standby slaves.

The problem comes from "redirect" tuples, which are short-lifespan
objects that replace a tuple that's been moved to another page.
A redirect tuple can be recycled as soon as no active indexscan could
be "in flight" from the parent index page to the moved tuple. SPGiST
implements this by marking each redirect tuple with the XID of the
creating transaction, and assuming that the tuple can be recycled once
that XID is below the OldestXmin horizon (implying that all active
transactions started after it ended). This is fine as far as
transactions on the master are concerned, but there is no guarantee that
the recycling WAL record couldn't be replayed on a hot standby slave
while there are still HS transactions that saw the old state of the
parent index tuple.
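
For illustration, the master-side recycling rule described above can be sketched as a small self-contained C model. This is not the SPGiST source: `RedirectTuple`, `creator_xid`, `xid_precedes` and `redirect_is_recyclable` are hypothetical names, and the comparison is assumed to be the usual modulo-2^32 ordering of normal XIDs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;               /* XIDs are 32-bit counters */

#define FIRST_NORMAL_XID ((TransactionId) 3)  /* assumption: 0..2 are reserved */

/* Wraparound-aware ordering: true if a is logically older than b. */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
    assert(a >= FIRST_NORMAL_XID && b >= FIRST_NORMAL_XID);
    return (int32_t) (a - b) < 0;
}

/* Hypothetical stand-in for a redirect tuple; only the relevant field. */
typedef struct RedirectTuple
{
    TransactionId creator_xid;  /* XID of the transaction that moved the tuple */
} RedirectTuple;

/*
 * Master-side rule from the text: the redirect may be recycled once its
 * creating XID is below the OldestXmin horizon.  Nothing here accounts for
 * snapshots held on a hot standby, which is the gap being discussed.
 */
static bool
redirect_is_recyclable(const RedirectTuple *rt, TransactionId oldest_xmin)
{
    return xid_precedes(rt->creator_xid, oldest_xmin);
}
```

In this model a redirect created by XID 1000 becomes recyclable when OldestXmin reaches 1001, whatever snapshots a standby may still be holding.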

Now, btree has a very similar problem with deciding when it's safe to
recycle a deleted index page: it has to wait out transactions that could
be in flight to the page, and it does that by marking deleted pages with
XIDs. I see that the problem has been patched for btree by emitting a
special WAL record just before a page is recycled. However, I'm a bit
nervous about copying that solution, because the details are a bit
different. In particular, I see that btree marks deleted pages with
ReadNewTransactionId() --- that is, the next-to-be-assigned XID ---
rather than the XID of the originating transaction, and then it
subtracts one from the XID before sending it to the WAL stream.
The comments about this are not clear enough for me, and so I'm
wondering whether it's okay to use the originating transaction XID
in a similar way, or if we need to modify SPGiST's rule for how to
mark redirection tuples. I think that the use of ReadNewTransactionId
is because btree page deletion happens in VACUUM, which does not have
its own XID; this is unlike the situation for SPGiST where creation of
redirects is caused by index tuple insertion, so there is a surrounding
transaction with a real XID. But it's not clear to me how
GetConflictingVirtualXIDs makes use of the limitXmin and whether a live
XID is okay to pass to it, or whether we actually need "next XID - 1".
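
To make the question concrete, here is a rough C model of the two pieces involved: computing "next XID - 1", and testing standby snapshots against a limitXmin. It is a sketch under stated assumptions, not the server code. In particular, the conflict rule used here (a snapshot conflicts when its xmin is valid and does not follow limitXmin) is an assumption about how GetConflictingVirtualXIDs behaves, and `latest_possible_xid` and `count_conflicts` are hypothetical helpers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define INVALID_XID      ((TransactionId) 0)
#define FIRST_NORMAL_XID ((TransactionId) 3)  /* assumption: 0..2 are reserved */

/* Wraparound-aware ordering: true if a is logically newer than b. */
static bool
xid_follows(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) > 0;
}

/* "next XID - 1", stepping backwards over the reserved XIDs on wraparound. */
static TransactionId
latest_possible_xid(TransactionId next_xid)
{
    TransactionId xid = next_xid - 1;

    while (xid < FIRST_NORMAL_XID)
        xid--;
    return xid;
}

/*
 * ASSUMED conflict rule: a standby snapshot conflicts with the recycling
 * record if its xmin is valid and is not newer than limit_xmin.
 * Returns how many of the given snapshot xmins would conflict.
 */
static size_t
count_conflicts(const TransactionId *snapshot_xmins, size_t n,
                TransactionId limit_xmin)
{
    size_t conflicts = 0;

    for (size_t i = 0; i < n; i++)
    {
        TransactionId xmin = snapshot_xmins[i];

        if (xmin != INVALID_XID && !xid_follows(xmin, limit_xmin))
            conflicts++;
    }
    return conflicts;
}
```

Under that assumed rule, passing the originating XID and passing "next XID - 1" differ only in how many standby snapshots are treated as conflicting; whether the smaller limit is safe is exactly the open question.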

Info appreciated.

regards, tom lane
