From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Deadlock in vacuum (check fails)
Date: 2010-01-13 10:11:27
Message-ID: 4B4D9C4F.1030604@sun.com
Lists: pgsql-hackers

I found the following strange error on gothic_moth:

================== pgsql.22658/src/test/regress/regression.diffs
===================
***
/zfs_data/home/postgres/buildfarm/gothic_moth/HEAD/pgsql.22658/src/test/regress/expected/vacuum.out
Tue Jan 12 22:06:13 2010
---
/zfs_data/home/postgres/buildfarm/gothic_moth/HEAD/pgsql.22658/src/test/regress/results/vacuum.out
Tue Jan 12 22:35:25 2010
***************
*** 94,99 ****
--- 94,103 ----
-- only non-system tables should be changed
VACUUM FULL pg_am;
VACUUM FULL pg_class;
+ ERROR: deadlock detected
+ DETAIL: Process 5913 waits for AccessExclusiveLock on relation 2662
of database 16384; blocked by process 5915.
+ Process 5915 waits for AccessShareLock on relation 1259 of database
16384; blocked by process 5913.
+ HINT: See server log for query details.
VACUUM FULL pg_database;
VACUUM FULL vaccluster;
VACUUM FULL vacid;

Details here:

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=gothic_moth&dt=2010-01-12%2021:06:01

Zdenek


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Deadlock in vacuum (check fails)
Date: 2010-01-13 19:27:58
Message-ID: 1582.1263410878@sss.pgh.pa.us
Lists: pgsql-hackers

Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM> writes:
> I found the following strange error on gothic_moth:

> VACUUM FULL pg_class;
> + ERROR: deadlock detected
> + DETAIL: Process 5913 waits for AccessExclusiveLock on relation 2662
> of database 16384; blocked by process 5915.
> + Process 5915 waits for AccessShareLock on relation 1259 of database
> 16384; blocked by process 5913.
> + HINT: See server log for query details.

The server log shows that 5913 was trying to VACUUM FULL pg_class, and
5915 was in the midst of backend startup. I believe what is happening is
that 5915 was doing a full startup without a relcache init file (which
had likely been deleted by a previous vacuum) and was in the middle of
load_critical_index() for index 2662 = pg_class_oid_index. It would have
needed to read pg_class for that. Meanwhile the VACUUM FULL had an
exclusive lock on pg_class and needed to lock its indexes.

So basically what this boils down to is that load_critical_index is
locking an index before locking its underlying relation, which generally
speaking is against our coding rules. The relcache code has been like
that since 8.2, so I'm a bit surprised that we have not seen this
reported before. It could only happen when the relcache init file is
missing, which isn't the normal state, but it's not exactly unusual
either.
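
For reference, load_critical_index() in relcache.c currently looks roughly
like the following -- a paraphrase from memory, not an exact quote of the
tree -- and you can see that only the index OID ever gets locked:

    /* Rough paraphrase of the current relcache.c code, not exact source.
     * The index is locked first; pg_class is then read, with no lock
     * held on pg_class itself, to build the index's descriptor.
     */
    static void
    load_critical_index(Oid indexoid)
    {
        Relation    ird;

        LockRelationOid(indexoid, AccessShareLock);
        ird = RelationBuildDesc(indexoid, true);    /* reads pg_class */
        if (ird == NULL)
            elog(PANIC, "could not open critical system index %u", indexoid);
        ird->rd_isnailed = true;
        ird->rd_refcnt = 1;
        UnlockRelationOid(indexoid, AccessShareLock);
    }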

Probably the appropriate fix is to make load_critical_index get
AccessShare lock on the underlying catalog, not only the index it's
after. This would slow things down a tad, but since it's not in the
normal startup path I don't think that matters.
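
A minimal sketch of that change (illustrative only, not a finished patch;
the extra heapoid parameter and its name are just my assumption here):

    /* Hypothetical sketch of the fix, not a committed patch.  The caller
     * passes the OID of the underlying catalog (e.g. pg_class for
     * pg_class_oid_index) so the heap is locked before its index,
     * matching the lock ordering used elsewhere.
     */
    static void
    load_critical_index(Oid indexoid, Oid heapoid)
    {
        Relation    ird;

        LockRelationOid(heapoid, AccessShareLock);      /* catalog first */
        LockRelationOid(indexoid, AccessShareLock);     /* then its index */
        ird = RelationBuildDesc(indexoid, true);
        if (ird == NULL)
            elog(PANIC, "could not open critical system index %u", indexoid);
        ird->rd_isnailed = true;
        ird->rd_refcnt = 1;
        UnlockRelationOid(indexoid, AccessShareLock);
        UnlockRelationOid(heapoid, AccessShareLock);
    }

The call sites would then pass the catalog OID as well, along the lines of
load_critical_index(ClassOidIndexId, RelationRelationId).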

Should we back-patch this? The bug appears to be real back to 8.2,
but if we've not noticed before, I'm not sure how important it is.

regards, tom lane


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Zdenek Kotala <Zdenek(dot)Kotala(at)sun(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Deadlock in vacuum (check fails)
Date: 2010-02-01 14:45:32
Message-ID: 407d949e1002010645w7e55da38q9c5c1736478d014e@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jan 13, 2010 at 7:27 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> So basically what this boils down to is that load_critical_index is
> locking an index before locking its underlying relation, which generally
> speaking is against our coding rules.  The relcache code has been like
> that since 8.2, so I'm a bit surprised that we have not seen this
> reported before.  It could only happen when the relcache init file is
> missing, which isn't the normal state, but it's not exactly unusual
> either.
>
> Probably the appropriate fix is to make load_critical_index get
> AccessShare lock on the underlying catalog, not only the index it's
> after.  This would slow things down a tad, but since it's not in the
> normal startup path I don't think that matters.
>
> Should we back-patch this?  The bug appears to be real back to 8.2,
> but if we've not noticed before, I'm not sure how important it is.

Does this create a problem combined with the plan to allow the new
VACUUM FULL to rewrite system tables? From my brief scan it sounds
like the only reason there's no race condition here is that previously
the OID of system tables couldn't change out from under
load_critical_index. If instead it's possible for them to change, would
it be possible for it to try to load this index, find the OID of the
underlying table, then go to lock the underlying table, only to find it
had been vacuumed out from under it and no longer exists?

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Zdenek Kotala <Zdenek(dot)Kotala(at)sun(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Deadlock in vacuum (check fails)
Date: 2010-02-01 15:11:29
Message-ID: 19782.1265037089@sss.pgh.pa.us
Lists: pgsql-hackers

Greg Stark <gsstark(at)mit(dot)edu> writes:
> Does this create a problem combined with the plan to allow the new
> VACUUM FULL to rewrite system tables? From my brief scan it sounds
> like the only reason there's no race condition here is that previously
> the OID of system tables couldn't change out from under
> load_critical_index.

Their OIDs will never change, so I don't see an issue. Locks are taken
on OIDs, not relfilenodes, so a relocation doesn't break locking. If
it did, we'd already have a problem with relocating user tables.
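
For illustration (my sketch of the locking primitives, not a quote from
the tree): the lock tag for a relation lock carries only the database OID
and the relation OID, so a rewrite that assigns a new relfilenode doesn't
affect existing locks at all:

    /* Sketch: how a relation lock is identified in the lock manager.
     * The tag is (database OID, relation OID); the physical relfilenode
     * is not part of it, so rewriting the table into a new file neither
     * invalidates nor bypasses anyone's lock.
     */
    LOCKTAG     tag;

    /* relid = OID of the relation being locked (illustrative variable) */
    SET_LOCKTAG_RELATION(tag, MyDatabaseId, relid);
    (void) LockAcquire(&tag, AccessShareLock, false, false);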

regards, tom lane