Bugs in b-tree dead page removal

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Bugs in b-tree dead page removal
Date: 2010-02-08 02:33:54
Message-ID: 23761.1265596434@sss.pgh.pa.us
Lists: pgsql-hackers

Whilst looking around for stuff that could be deleted thanks to removing
old-style VACUUM FULL, I came across some code in btree that seems
rather seriously buggy. For reasons explained in nbtree/README, we
can't physically recycle a "deleted" btree index page until all
transactions open at the time of deletion are gone --- otherwise we
might re-use a page that an existing scan is about to land on, and
confuse that scan. (This condition is overly strong, of course, but
it's what's designed in at the moment.) The way this is implemented
is to label a freshly-deleted page with the current value of
ReadNewTransactionId(). Once that value is older than RecentXmin,
the page is presumed recyclable.
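To make that concrete, here is a minimal, self-contained sketch of the presumed-recyclable test. The names (xid_precedes, page_recyclable) and the plain modulo-2^32 comparison are simplified stand-ins for illustration, not the actual nbtree code, which also special-cases permanent XIDs:

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint32_t TransactionId;

/* Simplified stand-in for circular XID comparison: id1 precedes id2
 * if the signed 32-bit difference is negative, so the test keeps
 * working across XID wraparound. */
static bool
xid_precedes(TransactionId id1, TransactionId id2)
{
    return (int32_t) (id1 - id2) < 0;
}

/* A deleted page is stamped with ReadNewTransactionId() at deletion
 * time; it is presumed recyclable once that stamp is older than
 * RecentXmin, i.e. once all transactions open at deletion are gone. */
static bool
page_recyclable(TransactionId page_deletion_xid, TransactionId recent_xmin)
{
    return xid_precedes(page_deletion_xid, recent_xmin);
}
```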

I think this was all right when it was designed, but isn't it rather
badly broken by our subsequent changes to have transactions not take
out an XID until/unless they write something? A read-only transaction
could easily be much older than RecentXmin, no?

The odds of an actual problem seem not very high, since to be affected
a scan would have to be already "in flight" to the problem page when
the deletion occurs. By the time RecentXmin advances and we feed the
page to the FSM and get it back, the scan's almost surely going to have
arrived. And I think the logic is such that this would not happen
before the next VACUUM in any case. Still, it seems pretty bogus.

Another issue is that it's not clear what happens in a Hot Standby
slave --- it doesn't look like Simon put any interlocking in this
area to protect slave queries against having the page disappear
from under them. The odds of an actual problem are probably a
good bit higher in an HS slave.

And there's another problem: _bt_pagedel is designed to recurse
in certain improbable cases, but I think this is flat out wrong
when doing WAL replay --- if the original process did recurse
then it will have emitted a WAL record for each deleted page,
meaning replay would try to delete twice.
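To illustrate the double-delete hazard, here is a toy model (all names hypothetical, not the real nbtree routines). During normal operation each deletion, including cascaded ones, emits its own WAL record; a replay that also recurses therefore redoes the downstream deletions once per record:

```c
#define NPAGES 4

/* Toy bookkeeping: how often each page has been "deleted", plus a
 * flat log standing in for the WAL stream. */
static int times_deleted[NPAGES];
static int wal_log[NPAGES];
static int wal_len = 0;

/* Normal operation: deleting a page may cascade (modeled here as
 * always cascading to page + 1), and every deleted page gets its
 * own WAL record. */
static void pagedel(int page)
{
    times_deleted[page]++;
    wal_log[wal_len++] = page;
    if (page + 1 < NPAGES)
        pagedel(page + 1);      /* the recursion in question */
}

/* Buggy replay, recursing like the original did: each record's redo
 * also deletes the cascaded pages, which already have records of
 * their own, so downstream pages are deleted more than once. */
static void replay_with_recursion(void)
{
    for (int i = 0; i < wal_len; i++)
        for (int p = wal_log[i]; p < NPAGES; p++)
            times_deleted[p]++;
}

/* Correct replay: redo exactly one page per WAL record. */
static void replay_flat(void)
{
    for (int i = 0; i < wal_len; i++)
        times_deleted[wal_log[i]]++;
}
```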

That last problem is easy to fix, but I'm not at all sure what to do
about the scan interlock problem. Thoughts?

regards, tom lane
