Re: Race condition in b-tree page deletion

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Kevin Grittner <kgrittn(at)ymail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Race condition in b-tree page deletion
Date: 2014-01-27 14:34:44
Message-ID: 52E66E84.2050109@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thanks for the review!

On 01/18/2014 09:45 PM, Kevin Grittner wrote:
> The patch still applies with a few offsets. It compiles with these
> warnings (which were also reported by Peter Eisentraut and should
> be fixed):
>
> nbtpage.c: In function ‘_bt_pagedel’:
> nbtpage.c:1695:24: warning: ‘metad’ may be used uninitialized in this function [-Wmaybe-uninitialized]
> nbtpage.c:1414:18: note: ‘metad’ was declared here
> nbtpage.c:1730:4: warning: ‘metapg’ may be used uninitialized in this function [-Wmaybe-uninitialized]
> nbtpage.c:1413:8: note: ‘metapg’ was declared here

Fixed.

> I'm not a fan of the new retry label introduced in _bt_pagedel()
> and the goto statement which references it. I think a loop using
> while() or for() would be cleaner. The whole idea of describing
> this logic recursively and then making a big deal about using tail
> recursion as an optimization seems pointless and confusing. It
> could just as easily be described as an iterative approach and save
> the extra logical step in explaining the code. I think that would
> be clearer and easier to understand. It would also eliminate the
> last parameter, which is always passed as NULL and only set to
> something else internally. I think it would be cleaner to make
> that a local variable initialized to NULL as part of eliminating
> the fiction of recursion here.

Ok, changed it to a loop, and moved the responsibility to construct the
'stack' into _bt_pagedel().

> There is a point in this loop where we return 0, but I think that
> is wrong; I think we should break so that the pages already deleted
> will be counted.

Fixed.

> On a related note. the caller doesn't actually
> use the count, but assumes that any non-zero value should be
> treated as 1.

Hmm. The caller does this:

> ndel = _bt_pagedel(rel, buf);
>
> /* count only this page, else may double-count parent */
> if (ndel)
> stats->pages_deleted++;

The problem here is that btvacuumpage() might visit the parent page
again, and count it again. The code is erring on the side of
undercounting; I'm not sure which is better.

> Similar issues with faux recursion existing the calling function,
> btvacuumpage(). In that case the unnecessary parameter which the
> caller must get right is orig_blkno, which must be the same as
> blkno on an actual call. That could be a local variable
> initialized to blkno within the function. I think this would be
> better described as iteration directly rather than claiming
> recursions and whining about how the compiler is too dumb to turn
> it into iteration for us. It's not the job of this patch to fix
> that in the caller, but let's not emulate it.

Agreed.

> Other than that, I have not found any problems.

While fixing the above issues, and reviewing this again in general, I
noticed that btree_xlog_unlink_page() was a few bricks shy of a load.
When unlinking an internal page, it didn't update the pointer to the new
topmost page in the to-be-deleted chain. Fixed that.

Attached is a new version. I also re-worded the README changes; I hope
it reads better now.

- Heikki

Attachment Content-Type Size
fix-btree-page-deletion-3.patch text/x-diff 64.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2014-01-27 14:39:19 Re: Race condition in b-tree page deletion
Previous Message Amit Langote 2014-01-27 14:24:40 Re: Retain dynamic shared memory segments for postmaster lifetime