Re: On-the-fly index tuple deletion vs. hot_standby

From: Noah Misch <noah(at)leadboat(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: On-the-fly index tuple deletion vs. hot_standby
Date: 2011-04-22 15:10:34
Message-ID: 20110422151034.GA8150@tornado.gateway.2wire.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Mar 15, 2011 at 10:22:59PM -0400, Noah Misch wrote:
> On Mon, Mar 14, 2011 at 01:56:22PM +0200, Heikki Linnakangas wrote:
> > On 12.03.2011 12:40, Noah Misch wrote:
> >> The installation that inspired my original report recently upgraded from 9.0.1
> >> to 9.0.3, and your fix did significantly decrease its conflict frequency. The
> >> last several conflicts I have captured involve XLOG_BTREE_REUSE_PAGE records.
> >> (FWIW, the index has generally been pg_attribute_relid_attnam_index.) I've
> >> attached a test script demonstrating the behavior. _bt_page_recyclable approves
> >> any page deleted no more recently than RecentXmin, because we need only ensure
> >> that every ongoing scan has witnessed the page as dead. For the hot standby
> >> case, we need to account for possibly-ongoing standby transactions. Using
> >> RecentGlobalXmin covers that, albeit with some pessimism: we really only need
> >> LEAST(RecentXmin, PGPROC->xmin of walsender_1, .., PGPROC->xmin of walsender_N)
> >> - vacuum_defer_cleanup_age. Not sure the accounting to achieve that would pay
> >> off, though. Thoughts?
> >
> > Hmm, instead of bloating the master, I wonder if we could detect more
> > accurately if there are any on-going scans, in the standby. For example,
> > you must hold a lock on the index to scan it, so only transactions
> > holding the lock need to be checked for conflict.
>
> That would be nice. Do you have an outline of an implementation in mind?

In an attempt to resuscitate this thread, here's my own shot at that. Apologies
in advance if it's just an already-burning straw man.

I didn't see any way to take advantage of checking for the heavyweight lock that
any index scan would need to hold. Such a check becomes obsolete the moment
it's done -- nothing stops another locker from arriving between the completion
of your check and whatever you do based on its outcome. When a standby is in
the picture, the answer needs to be decided at xlog generation time but valid
for xlog apply time. Therefore, I instead looked for what could be done with
buffer locks.

Regarding the underlying requirement, here is the relevant paragraph from
src/backend/access/nbtree/README:

A deleted page can only be reclaimed once there is no scan or search that
has a reference to it; until then, it must stay in place with its
right-link undisturbed. We implement this by waiting until all
transactions that were running at the time of deletion are dead; which is
overly strong, but is simple to implement within Postgres. When marked
dead, a deleted page is labeled with the next-transaction counter value.
VACUUM can reclaim the page for re-use when this transaction number is
older than the oldest open transaction.

These threads also have useful background as concerns hot standby:
http://archives.postgresql.org/message-id/23761.1265596434@sss.pgh.pa.us
http://archives.postgresql.org/message-id/4B7D3135.3090204@enterprisedb.com
http://archives.postgresql.org/message-id/1268123581.10620.76.camel@ebony

This is implemented in two passes. After updating the adjacent page links, the
first pass stores ReadNewTransactionId() in the page. The second pass finishes
the reuse if the stored xid is older than RecentXmin. This has the benefits of
simplicity and low contention. However, any long-running transaction delays
reuse. Also, since RecentXmin's witness of "all transactions that were running
at the time of deletion" only includes master transactions, our tools for
preventing recovery conflicts (vacuum_defer_cleanup_age, feedback) do not defend
against conflicts arising from b-tree page reuse. Fixing that means either
choosing different page reuse criteria or letting standby transactions also
delay reuse on the master.

If I understand correctly, "has a reference to it" implies a pin on the left,
right or parent page. By the time _bt_pagedel is ready to mark the page
BTP_DELETED, we already hold exclusive content locks on all of those buffers.
Suppose we then check the pin count on each. If they have only single local
pins, we have serendipitously acquired cleanup locks: set btpo.xact to
FrozenTransactionId, allowing reclamation at any time. Otherwise, continue as
the code stands today. Also change _bt_recycleable to use RecentGlobalXmin
instead of RecentXmin; this avoids the recovery conflicts. Since, we hope, the
vast majority of deleted pages will get FrozenTransactionId, the marginal bloat
added by using RecentGlobalXmin will not be significant. This also saves a
LW_SHARED acquisition of XidGenLock in the uncontended case -- probably not
significant, but it can't hurt.

When the standby replays the XLOG_BTREE_DELETE_PAGE with btpo_xact ==
FrozenTransactionId, it must wait for a cleanup lock on all the buffers it
updates. We could actually do this anytime between the XLOG_BTREE_DELETE_PAGE
and the later XLOG_BTREE_REUSE_PAGE, but we only have the necessary information
handy when applying the first record.

Something this intrusive is clearly 9.2 material. It would be nice to have a
backpatchable strategy to prop up vacuum_defer_cleanup_age and
hot_standby_feedback in 9.0 and 9.1. For that, I haven't come up with anything
better than my original pair of proposals.

Comments?

Thanks,
nm

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2011-04-22 15:10:43 Re: "stored procedures"
Previous Message Leonardo Francalanci 2011-04-22 15:08:36 Re: What Index Access Method Functions are really needed?