Re: lock on object is already held

From: Daniel Wood <dwood(at)salesforce(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: lock on object is already held
Date: 2013-11-27 03:41:44
Message-ID: CAPweHKfsGjSzry1y-A4dE-OUWyMqNUFYB0fmJa5APWGdabKiEQ@mail.gmail.com
Lists: pgsql-hackers

Does the original version of my stress test not repro the problem on 9.2?

My primary concern with the fix is that I simply didn't understand what
might happen after a failed lock attempt has called CleanUpLock, freeing
the PROCLOCK but leaving some LOCALLOCK still pointing at it. As long as
"nLocks == 0" is guaranteed, I guess we are OK, although LockRelease will
elog a WARNING and LockReleaseAll believes "/* ... we must've run out of
shared memory ... */".
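
For reference, here is the backend-local structure in question, abridged
from src/include/storage/lock.h (9.3-era; reproduced from memory, so
treat the exact field list as approximate):

typedef struct LOCALLOCK
{
    LOCALLOCKTAG tag;           /* unique identifier of the locked object */

    LOCK       *lock;           /* associated LOCK object in shared memory */
    PROCLOCK   *proclock;       /* associated PROCLOCK object in shared memory */
    uint32      hashcode;       /* copy of LOCKTAG's hash value */
    int64       nLocks;         /* total number of times lock is held locally */
    int         numLockOwners;  /* # of relevant ResourceOwners */
    int         maxLockOwners;  /* allocated size of lockOwners array */
    bool        holdsStrongLockCount;   /* bumped FastPathStrongRelationLocks? */
    LOCALLOCKOWNER *lockOwners; /* dynamically resizable array */
} LOCALLOCK;

The hazard is exactly the lock and proclock pointers above outliving the
shared-memory entries they name while nLocks stays 0.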

Why does LockAcquireExtended() test for "nLocks == 0" in the "if
(dontWait)" block before calling RemoveLocalLock()? Can nLocks be anything
other than 0 if we've just freed the PROCLOCK in this block of code? If
nLocks is > 0 AND still pointing at a freed PROCLOCK, can't we still have
a problem? Given that "dontWait == true" seems to be associated with DDL
and other rare operations, it might be hard to stress-test this case.
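
The code I'm asking about is the tail of that block (a sketch abridged
from 9.3's src/backend/storage/lmgr/lock.c, quoted from memory, so the
details may be slightly off):

if (dontWait)
{
    AbortStrongLockAcquire();
    /* ... if proclock->holdMask == 0, the PROCLOCK was just deleted
     * from the shared hash table, and the LOCK's request counts were
     * backed out ... */
    LWLockRelease(partitionLock);
    if (locallock->nLocks == 0)
        RemoveLocalLock(locallock);
    return LOCKACQUIRE_NOT_AVAIL;
}

As far as I can tell, the "nLocks > 0" case exits much earlier with
LOCKACQUIRE_ALREADY_HELD, which would make that test purely defensive;
but if nLocks somehow were > 0, we would return with the LOCALLOCK
still pointing at freed shared memory.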

On Tue, Nov 26, 2013 at 5:30 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Daniel Wood <dwood(at)salesforce(dot)com> writes:
> > Sorry but I don't know how to respond to an old thread I found on
> > postgresql.org:
> > http://www.postgresql.org/message-id/20766.1357318482@sss.pgh.pa.us
>
> > I presume this is still an open issue. While working on a new feature I
> > wrote a stress test for it. After fixing my bugs, I couldn't get rid of:
> > ERROR: lock RowExclusiveLock on object 16384/16393/0 is already held
> > In addition, with asserts enabled in lock.c I would see Asserts in
> > UnGrantLock, LockRelease, LockReleaseAll, etc. In one run everything
> > hung waiting on SQL locks.
>
> > The asserts or hang occurs within seconds of running the stress test.
> > Before I get into the details I want to see if this is something already
> > being worked on. I can repro it easily in vanilla 9.3.
>
> Dan sent me this test case off-list. Here's a somewhat cleaned-up
> version:
>
> 1. Create a table:
> create table lock_bug(f1 int, f2 int);
>
> 2. Compile the attached lockbug.c.
>
> 3. Run the attached lockbug.sh, with adjustment of parameters to taste.
>
> This spits up in a remarkable variety of ways in 9.3 and HEAD,
> especially if you have assertions on.
>
> After some investigation I found the cause: LockRelease() expects that
> if a LOCALLOCK object represents a valid lock (nLocks > 0), then
> either its lock and proclock fields point to associated shared-memory
> entries, or they're NULL. However, it's possible for that to not be
> true, because if we fail to acquire a lock, the code leaves around a
> LOCALLOCK object pointing at the shared objects we couldn't get lock
> on. *These objects are subject to reclamation, because no lock is
> actually held*. So if we make another attempt to get the same lock
> later in the same transaction, LockAcquire finds and re-uses that
> LOCALLOCK object. Pre fast-path locks, it would always recompute the
> lock and proclock pointers, so the fact that they might have been stale
> wasn't a problem. But the fast-path patch added a code path in which
> we could succeed in acquiring a fast-path lock, and then exit without
> having done anything with the pointer fields. Now we have something
> that looks like a valid locallock but could be pointing at entirely
> unrelated shared objects. This leads to all sorts of fun at release
> time, with the symptoms depending on exactly what those shared
> hashtable entries are being used for at the instant we stomp on them.
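
To restate that sequence with the actual lock.c entry points (step
numbering mine; a sketch of my reading, not necessarily the exact path
the stress test exercises):

/*
 * 1. LockAcquire(..., dontWait = true) fails.  No lock is actually
 *    held, so the shared LOCK/PROCLOCK entries are reclaimable, but
 *    the backend-local LOCALLOCK survives with nLocks == 0 and its
 *    lock/proclock fields still set.
 *
 * 2. Other activity recycles those shared hashtable entries for
 *    unrelated locks.
 *
 * 3. The same transaction requests the same lock again.  LockAcquire
 *    reuses the stale LOCALLOCK, and the request now succeeds via
 *    FastPathGrantRelationLock(), a path that never touches the
 *    lock/proclock fields.
 *
 * 4. LockRelease()/LockReleaseAll() later follow the stale pointers
 *    and clobber whatever those entries now describe.
 */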
>
> So the fix is pretty simple:
>
> diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
> index f8dc951..37c605f 100644
> *************** LockAcquireExtended(const LOCKTAG *lockt
> *** 837,842 ****
> --- 844,851 ----
>       LWLockRelease(MyProc->backendLock);
>       if (acquired)
>       {
> +         locallock->lock = NULL;
> +         locallock->proclock = NULL;
>           GrantLockLocal(locallock, owner);
>           return LOCKACQUIRE_OK;
>       }
> although this needs to be backed up with a lot more comments of course.
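
For what it's worth, the new assignments might carry a comment along
these lines (my wording, purely illustrative):

if (acquired)
{
    /*
     * We got the lock via the fast path.  locallock->lock and
     * locallock->proclock may still point at shared-memory objects
     * left over from an earlier failed acquisition attempt, and those
     * objects may since have been recycled for unrelated locks, so
     * they must not be trusted.  Reset them to NULL; the release
     * paths already cope with NULL here, since that's the normal
     * state for a lock taken via the fast path.
     */
    locallock->lock = NULL;
    locallock->proclock = NULL;
    GrantLockLocal(locallock, owner);
    return LOCKACQUIRE_OK;
}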
>
> When I showed this to Dan, he objected that it'd be better if
> the failing lock operation had cleaned up the LOCALLOCK instead.
> That would be a significantly bigger change though, and I think
> it'd net out being less robust. The data structure was designed
> to not need such cleanup, because the more stuff you have to do
> to clean up after a failure, the more likely it is that you'll
> have additional problems during the cleanup, leaving you hosed.
> In particular, we'd have to be certain that we could remove the
> useless shared objects during the cleanup step, since once the
> LOCALLOCK is gone there is nothing else that will remind us to
> garbage-collect them at end of transaction.
>
> BTW, although I'm certain that 9.2 has this bug as well, this
> test case fails to show a problem in that branch. I've not
> looked into why not --- it's probably a timing issue or something.
>
> Comments?
>
> regards, tom lane
>
>
