Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: daveg <daveg(at)sonic(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: FATAL: lock AccessShareLock on object 0/1260/0 is already held
Date: 2011-08-23 16:15:23
Message-ID: CA+Tgmob-sXsNg9zrBPW2h0eETmy0L5EHV_0wfZAH8GJ6xa9SrQ@mail.gmail.com
Lists: pgsql-hackers

On Mon, Aug 22, 2011 at 3:31 AM, daveg <daveg(at)sonic(dot)net> wrote:
> So far I've got:
>
>  - affects system tables
>  - happens very soon after process startup
>  - in 8.4.7 and 9.0.4
>  - not likely to be hardware or OS related
>  - happens in clusters for periods of a few seconds to many minutes
>
> I'll work on printing the LOCK and LOCALLOCK when it happens, but it's
> hard to get downtime to pick up new builds. Any other ideas on getting to
> the bottom of this?

I've been thinking this one over, and doing a little testing. I'm
still stumped, but I have a few thoughts. What that error message is
really saying is that the LOCALLOCK bookkeeping doesn't match the
PROCLOCK bookkeeping; it doesn't tell us which one is to blame.

My first thought was that there might be some situation where
LockAcquireExtended() gets an interrupt between the time it does the
LOCALLOCK lookup and the time it acquires the partition lock. If the
interrupt handler were to acquire (but not release) a lock in the
meantime, then we'd get confused. However, I can't see how that's
possible. I inserted some debugging code to fail an assertion if
CHECK_FOR_INTERRUPTS() gets invoked in between those two points or if
ImmediateInterruptOK is set on entering the function, and the system
still passes regression tests.
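
For illustration only, here is a self-contained toy model of that kind of
debugging check. It is not the code that was actually tested: the names
in_lock_window, immediate_interrupt_ok, check_for_interrupts_model() and
lock_acquire_model() are all hypothetical stand-ins, and violations are
counted rather than asserted so the behavior can be observed.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; none of these are real PostgreSQL symbols. */
static bool in_lock_window = false;          /* true between the LOCALLOCK
                                              * lookup and the partition lock */
static bool immediate_interrupt_ok = false;  /* models ImmediateInterruptOK */
static int  window_violations = 0;           /* a real debug build would
                                              * Assert() instead of counting */

/* Models CHECK_FOR_INTERRUPTS(): flags any call made inside the window. */
static void
check_for_interrupts_model(void)
{
    if (in_lock_window)
        window_violations++;
}

/*
 * Models only the skeleton of a lock acquisition.  Returns the running
 * count of violations seen so far.
 */
static int
lock_acquire_model(bool poll_inside_window)
{
    /* Interrupts must not be serviceable immediately on entry. */
    if (immediate_interrupt_ok)
        window_violations++;

    in_lock_window = true;          /* LOCALLOCK lookup would happen here */
    if (poll_inside_window)
        check_for_interrupts_model();
    in_lock_window = false;         /* partition lock would be taken here */

    return window_violations;
}
```

In this model, a run that never polls for interrupts inside the window
leaves the counter at zero, which corresponds to the regression tests
passing with the real assertions in place.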

My second thought is that perhaps a process is occasionally managing
to exit without fully cleaning up the associated PROCLOCK entry. At
first glance, it appears that this would explain the observed
symptoms. A new backend gets the PGPROC belonging to the guy who
didn't clean up after himself, hits the error, and disconnects,
sticking himself right back on to the head of the SHM_QUEUE where the
next connection will inherit the same PGPROC and hit the same problem.
But it's not clear to me what could cause the system to get into this
state in the first place, or how it would eventually right itself.
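
A minimal sketch of why that scenario would repeat, assuming the free
PGPROC list behaves as a simple last-in, first-out stack. ProcSlot,
free_head and the helper functions are hypothetical names for this
model, not PostgreSQL's actual data structures.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a recyclable per-backend slot. */
typedef struct ProcSlot
{
    int              id;
    bool             has_stale_proclock;  /* leftover from a prior owner */
    struct ProcSlot *next;
} ProcSlot;

static ProcSlot *free_head = NULL;

/* Exiting backend: push the slot back on the head of the free list. */
static void
slot_release(ProcSlot *s)
{
    s->next = free_head;
    free_head = s;
}

/* New backend: pop the slot at the head of the free list. */
static ProcSlot *
slot_acquire(void)
{
    ProcSlot *s = free_head;

    if (s != NULL)
        free_head = s->next;
    return s;
}

/*
 * Simulate n sequential connections.  Each takes the head slot, fails if
 * the slot carries a stale entry, and then disconnects without clearing
 * it.  Returns the number of failed connections.
 */
static int
simulate_connections(int n)
{
    int failures = 0;

    for (int i = 0; i < n; i++)
    {
        ProcSlot *s = slot_acquire();

        if (s == NULL)
            break;
        if (s->has_stale_proclock)
            failures++;
        slot_release(s);
    }
    return failures;
}
```

With a dirty slot at the head, every sequential connection in this model
fails; with a clean slot at the head, none do. The model does not explain
how a slot gets dirty or how the condition clears, which is the open
question above.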

It might be worth kludging up your system to add a test to
InitProcess() to verify that all of the myProcLocks SHM_QUEUEs are
either NULL or empty, along the lines of the attached patch (which
assumes that assertions are enabled; otherwise, put in an elog() of
some sort). Actually, I wonder if we shouldn't move all the
SHMQueueInit() calls for myProcLocks to InitProcGlobal() rather than
doing it over again every time someone calls InitProcess(). Besides
being a waste of cycles, re-initializing them on every call is probably
less robust. If
there somehow are leftovers in one of those queues, the next
successful call to LockReleaseAll() ought to clean up the mess, but of
course there's no chance of that working if we've nuked the queue
pointers.
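
To make the "NULL or empty" check and the re-initialization hazard
concrete, here is a small stand-alone model. It is not the attached patch:
Queue is a simplified stand-in for SHM_QUEUE, N_PARTITIONS is an assumed
count, and the function names are invented for this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

#define N_PARTITIONS 16     /* assumed partition count, for illustration */

/* Simplified circular doubly linked queue header or element. */
typedef struct Queue
{
    struct Queue *prev;
    struct Queue *next;
} Queue;

/* An initialized, empty queue points at itself in both directions. */
static void
queue_init(Queue *q)
{
    q->prev = q->next = q;
}

/* Link elem in right after head; head must already be initialized. */
static void
queue_insert_after(Queue *head, Queue *elem)
{
    elem->prev = head;
    elem->next = head->next;
    head->next->prev = elem;
    head->next = elem;
}

/* "NULL or empty": never initialized, or initialized with no members. */
static bool
queue_is_clean(const Queue *q)
{
    return q->next == NULL || (q->next == q && q->prev == q);
}

/* The check a new backend could run before re-initializing anything. */
static int
count_dirty_queues(const Queue *queues, int n)
{
    int dirty = 0;

    for (int i = 0; i < n; i++)
        if (!queue_is_clean(&queues[i]))
            dirty++;
    return dirty;
}
```

In this model, calling queue_init() on a header that still has a member
makes the header look clean again, while the member keeps pointing at the
header. The leftover entry is then unreachable from the queue, so a later
walk of the queue can no longer find it and clean it up.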

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment: initprocess-assert.patch (application/octet-stream, 608 bytes)
