VACUUM FULL deadlock with backend startup

Lists: pgsql-hackers
From: Nikhil Sontakke <nikhil(dot)sontakke(at)enterprisedb(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: VACUUM FULL deadlock with backend startup
Date: 2011-03-19 03:25:36
Message-ID: AANLkTi=zQZz7rcqyxzgz00MgUDxK6U=876SVoqDUO7h-@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

We encountered a deadlock involving VACUUM FULL (surprise surprise!
:)) in PG 8.3.13 (and still not fixed in 9.0 AFAICS although the
window appears much smaller). The call spits out the following
deadlock info:

ERROR: SQLSTATE 40P01: deadlock detected
DETAIL: Process 12479 waits for AccessExclusiveLock on relation 2663
of database 16384; blocked by process 14827.
Process 14827 waits for AccessShareLock on relation 1259 of database
16384; blocked by process 12479.
LOCATION: DeadLockReport, deadlock.c:918

It looked familiar, so I dug up the archives and found that Tom had
committed a fix for a similar deadlock via git commitid: 715120e7

However this current deadlock involved an index with oid 2663, which
is ClassNameNspIndexId. Clearly this was another case of locking the
index directly without taking a lock on the parent catalog. Further
sleuthing revealed that the culprit function was InitCatCachePhase2,
which directly calls index_open in the process startup phase.

Reproducing this was easy once you know the culprit, (excruciatingly
difficult if you do not know the exact race window). I added a sleep
inside the InitCatCachePhase2 function before calling index_open. Then
I invoked a "VACUUM FULL pg_class" from another session, halting it in
gdb just before taking the exclusive lock via try_relation_open. When
a new PG process sleeps inside InitCatCachePhase2, we then take the
lock in the VF process, waiting just after it.

When the startup continues after the sleep, it will take the
ClassNameNspIndexId share lock, but hang to take a share lock on
pg_class in RelationReloadIndexInfo. Simply continue the VF process in
gdb which will try to take the exclusive lock to vacuum the index.
This will reproduce the deadlock in all its glory.

The fix is similar to the earlier commit by Tom. I tested this fix
against 8.3.13. We lock the parent catalog now before calling
index_open. Patch against git HEAD attached with this mail. I guess we
will backpatch this? Tom's last commit was backpatched till 8.2 I
think.

Regards,
Nikhils

Attachment Content-Type Size
vacuum_full_deadlock_head.patch text/x-diff 1009 bytes

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Nikhil Sontakke <nikhil(dot)sontakke(at)enterprisedb(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: VACUUM FULL deadlock with backend startup
Date: 2011-03-19 13:45:10
Message-ID: AANLkTinUbjbA5rB3Qb0k87DAJyaq08fGV9W=4Nv8DMRj@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Mar 18, 2011 at 11:25 PM, Nikhil Sontakke
<nikhil(dot)sontakke(at)enterprisedb(dot)com> wrote:
> The fix is similar to the earlier commit by Tom. I tested this fix
> against 8.3.13. We lock the parent catalog now before calling
> index_open. Patch against git HEAD attached with this mail. I guess we
> will backpatch this? Tom's last commit was backpatched till 8.2 I
> think.

Is it really worth slowing down the startup sequence for every
connection to avoid deadlocking against a very rare maintenance
operation?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Nikhil Sontakke <nikhil(dot)sontakke(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: VACUUM FULL deadlock with backend startup
Date: 2011-03-19 14:46:48
Message-ID: AANLkTikjUFoTcCvFpfS4okptBKepfhMryYgMF=8VmYXT@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

> The fix is similar to the earlier commit by Tom. I tested this fix

> > against 8.3.13. We lock the parent catalog now before calling
> > index_open. Patch against git HEAD attached with this mail. I guess we
> > will backpatch this? Tom's last commit was backpatched till 8.2 I
> > think.
>
> Is it really worth slowing down the startup sequence for every
> connection to avoid deadlocking against a very rare maintenance
> operation?
>
>
Not really a performance issue AFAICS. If the relcache init file exists,
then the phase2 of the catalog cache which eventually calls the above code
path is avoided.

Regards,
Nikhils


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Nikhil Sontakke <nikhil(dot)sontakke(at)enterprisedb(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: VACUUM FULL deadlock with backend startup
Date: 2011-03-20 03:15:10
Message-ID: AANLkTikTwk+dAQ7oVngPJVNwZQZbM8=PPXzZr5jCx4QL@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Mar 19, 2011 at 10:46 AM, Nikhil Sontakke
<nikhil(dot)sontakke(at)enterprisedb(dot)com> wrote:
> Not really a performance issue AFAICS. If the relcache init file exists,
> then the phase2 of the catalog cache which eventually calls the above code
> path is avoided.

Oh, that doesn't sound so bad, then.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Nikhil Sontakke <nikhil(dot)sontakke(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: VACUUM FULL deadlock with backend startup
Date: 2011-03-20 06:24:36
Message-ID: 16023.1300602276@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Sat, Mar 19, 2011 at 10:46 AM, Nikhil Sontakke
> <nikhil(dot)sontakke(at)enterprisedb(dot)com> wrote:
>> Not really a performance issue AFAICS. If the relcache init file exists,
>> then the phase2 of the catalog cache which eventually calls the above code
>> path is avoided.

> Oh, that doesn't sound so bad, then.

I want to take a closer look at this one because I thought I'd covered
all those issues in the last go-round. But right offhand Nikhil's
analysis seems sane.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Nikhil Sontakke <nikhil(dot)sontakke(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: VACUUM FULL deadlock with backend startup
Date: 2011-03-20 11:58:32
Message-ID: 07143368-BC71-4ACB-8B58-CF98BA515517@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mar 20, 2011, at 2:24 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Sat, Mar 19, 2011 at 10:46 AM, Nikhil Sontakke
>> <nikhil(dot)sontakke(at)enterprisedb(dot)com> wrote:
>>> Not really a performance issue AFAICS. If the relcache init file exists,
>>> then the phase2 of the catalog cache which eventually calls the above code
>>> path is avoided.
>
>> Oh, that doesn't sound so bad, then.
>
> I want to take a closer look at this one because I thought I'd covered
> all those issues in the last go-round.

Go for it.

...Robert


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Nikhil Sontakke <nikhil(dot)sontakke(at)enterprisedb(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: VACUUM FULL deadlock with backend startup
Date: 2011-03-22 17:03:30
Message-ID: 1155.1300813410@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Nikhil Sontakke <nikhil(dot)sontakke(at)enterprisedb(dot)com> writes:
> It looked familiar, so I dug up the archives and found that Tom had
> committed a fix for a similar deadlock via git commitid: 715120e7

> However this current deadlock involved an index with oid 2663, which
> is ClassNameNspIndexId. Clearly this was another case of locking the
> index directly without taking a lock on the parent catalog. Further
> sleuthing revealed that the culprit function was InitCatCachePhase2,
> which directly calls index_open in the process startup phase.

Patch applied, thanks!

I did a bit of extra digging around the cache modules and could not find
any other instances of the same problem, though I did find some places
that seemed worthy of a comment about how they avoid it.

regards, tom lane


From: Nikhil Sontakke <nikhil(dot)sontakke(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: VACUUM FULL deadlock with backend startup
Date: 2011-03-22 17:55:31
Message-ID: AANLkTi=siAM5AxC0Zt0ntDxssM6n+R2ZZ8Ve7ihRBW6T@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Patch applied, thanks!
>
>
Thanks Tom!

Regards,
Nikhils