Re: On-the-fly index tuple deletion vs. hot_standby

From: Noah Misch <noah(at)leadboat(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: On-the-fly index tuple deletion vs. hot_standby
Date: 2011-06-12 03:40:59
Message-ID: 20110612034059.GD21098@tornado.leadboat.com
Lists: pgsql-hackers

Hi Robert,

On Sat, Jun 11, 2011 at 08:55:28PM -0400, Robert Haas wrote:
> On Fri, Apr 22, 2011 at 11:10 AM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> > On Tue, Mar 15, 2011 at 10:22:59PM -0400, Noah Misch wrote:
> >> On Mon, Mar 14, 2011 at 01:56:22PM +0200, Heikki Linnakangas wrote:
> >> > On 12.03.2011 12:40, Noah Misch wrote:
> >> >> The installation that inspired my original report recently upgraded from 9.0.1
> >> >> to 9.0.3, and your fix did significantly decrease its conflict frequency. The
> >> >> last several conflicts I have captured involve XLOG_BTREE_REUSE_PAGE records.
> >> >> (FWIW, the index has generally been pg_attribute_relid_attnam_index.) I've
> >> >> attached a test script demonstrating the behavior. _bt_page_recyclable approves
> >> >> any page deleted no more recently than RecentXmin, because we need only ensure
> >> >> that every ongoing scan has witnessed the page as dead. For the hot standby
> >> >> case, we need to account for possibly-ongoing standby transactions. Using
> >> >> RecentGlobalXmin covers that, albeit with some pessimism: we really only need
> >> >> LEAST(RecentXmin, PGPROC->xmin of walsender_1, .., PGPROC->xmin of walsender_N)
> >> >> - vacuum_defer_cleanup_age. Not sure the accounting to achieve that would pay
> >> >> off, though. Thoughts?
> >> >
> >> > Hmm, instead of bloating the master, I wonder if we could detect more
> >> > accurately if there are any on-going scans, in the standby. For example,
> >> > you must hold a lock on the index to scan it, so only transactions
> >> > holding the lock need to be checked for conflict.
> >>
> >> That would be nice. Do you have an outline of an implementation in mind?
> >
> > In an attempt to resuscitate this thread, here's my own shot at that. Apologies
> > in advance if it's just an already-burning straw man.
> >
> > I didn't see any way to take advantage of checking for the heavyweight lock that
> > any index scan would need to hold.
>
> Have you looked at the logic in ResolveRecoveryConflictWithLock(), and
> at GetLockConflicts()?
>
> I am a little fuzzy on how the btree stuff works, but it seems to me
> that you are looking for transactions that both have an xmin before
> some threshold and also hold an AccessShareLock on some relation.
> GetLockConflicts() will provide the latter, at least.

Thanks for taking a look.

For the purpose of B-tree page reuse, we don't directly care about the xmin of
any active snapshot. We need only prove that no active scan is paused
adjacent to the page, holding a right-link to it.

We currently achieve that wait-free by first marking the page with the next
available xid and then reusing it when that mark (btpo.xact) predates the
oldest running xid (RecentXmin). (At the moment, I'm failing to work out why
this is OK with scans from transactions that haven't allocated an xid, but I
vaguely recall convincing myself it was fine at one point.) It would indeed
also be enough to call GetLockConflicts(locktag-of-index, AccessExclusiveLock)
and check whether any of the returned transactions have PGPROC.xmin below the
mark. That's notably more expensive than just comparing RecentXmin, so I'm
not sure how well it would pay off overall. However, it could only help us on
the master. (Not strictly true, but any way I see to extend it to the standby
has critical flaws.) On the master, we can see a conflicting transaction and
put off reusing the page. By the time the record hits the standby, we have to
apply it, and we might have a running transaction that will hold a lock on the
index for the next, say, 72 hours. At such times, vacuum_defer_cleanup_age or
hot_standby_feedback ought to prevent the recovery stall.
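
To make the two master-side tests concrete, here is a minimal standalone
sketch. It is a toy model, not PostgreSQL source: Xid, LockHolder and every
function name below are hypothetical stand-ins, the lock-holder array stands
in for whatever GetLockConflicts() would return, and the special xids
(including the case of a backend with no xmin set) are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for TransactionId; special xids are not modeled. */
typedef uint32_t Xid;

/* Circular comparison: true if a is logically older than b. */
bool
xid_precedes(Xid a, Xid b)
{
	return (int32_t) (a - b) < 0;
}

/*
 * Cheap test: the deleted page may be reused once its mark is no newer
 * than the oldest running xid.
 */
bool
recyclable_by_recent_xmin(Xid page_mark, Xid recent_xmin)
{
	return !xid_precedes(recent_xmin, page_mark);
}

/* Hypothetical stand-in for one transaction holding a lock on the index. */
typedef struct LockHolder
{
	Xid			xmin;
} LockHolder;

/*
 * More expensive alternative: consider only the transactions that hold a
 * lock on the index, and refuse reuse if any of them has an xmin below
 * the page's mark.
 */
bool
recyclable_by_lock_holders(Xid page_mark, const LockHolder *holders, size_t n)
{
	assert(holders != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
	{
		if (xid_precedes(holders[i].xmin, page_mark))
			return false;		/* this holder might still reach the page */
	}
	return true;
}
```

The second test can approve a page that the first would hold back, since
transactions that never touched the index no longer count against it. The
price is a walk over the lock holders for each candidate page.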

This did lead me to realize that what we do in this regard on the standby can
be considerably independent from what we do on the master. If fruitful, the
standby can prove the absence of a scan holding a right-link in a completely
different fashion. So, we *could* take the cleanup-lock approach on the
standby without changing very much on the master.

nm
