Re: Deadlock when updating table partitions (and presumed solution)

Lists: pgsql-general
From: Paul Boddie <paul(at)boddie(dot)org(dot)uk>
To: pgsql-general(at)postgresql(dot)org
Subject: Deadlock when updating table partitions (and presumed solution)
Date: 2007-12-03 15:15:59
Message-ID: 713b913c-e599-446a-9f5b-c60bb874a523@s8g2000prg.googlegroups.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

I recently encountered an interesting situation with regard to
partitioned tables, concurrent updates and deadlocks which probably
has an obvious explanation, although I can't seem to find one in the
manual. Below, I explain the situation and provide some of my own
naive reasoning about what seems to be happening. Since I think I now
know how to avoid such matters, this message is mostly for the
purposes of sharing my recent experiences with those who may one day
encounter similar problems. I'd be grateful if anyone can explain what
must really be occurring and correct any erroneous conclusions,
however.

I have one process querying a table P with partitions P0, P1, P2, ...
Pn joined with table R as follows:

select * from R inner join P on R.id = P.id and P.section = 5

...where the column "section" determines which partition shall be
searched utilising the constraint exclusion support in PostgreSQL.
Here, I use the specific value of 5 to indicate that the involvement
of a specific partition is envisaged.

Now, each partition of P is created inheriting from P, and I also
include a rule which "redirects" inserts from P to the specific
partition of P depending on the value of "section". This latter detail
is, I believe, the principal contributing factor to the problems
subsequently experienced.

I have another process performing updates to individual partitions of
P - specifically "alter table" operations adding foreign key
constraints referencing R as follows:

alter table Pm add constraint Pm_fk_id foreign key(id) references
R(id)

...where "m" is the specific partition number, starting at 0,
increasing by 1, ending at n.

What seems to happen, by looking at pg_lock (and pg_class) is that the
following sequence of events occurs:

1. The query process acquires an AccessShareLock on R and P.
2. The update process acquires an AccessExclusiveLock on Pm and seeks
an AccessExclusiveLock on R.
3. The query process seeks an AccessShareLock on P0 ... Pn.
4. Deadlock is declared.

Since the query should only involve a single partition of P, one might
expect that the query process might immediately obtain an
AccessShareLock on P5, but what seems to occur is a race condition:
the update process is sometimes able to acquire a lock on P5 before
the query process is able to realise the involvement of P5 in the
query operation. Moreover, a deadlock occurs even when the update
process is adding the foreign key constraint to tables other than P5,
suggesting as I note above that all child tables are involved in the
query operation.

My initial conclusions were as follows:

1. A query on a partitioned table only initially causes lock
acquisition on the parent table.
2. Subsequent attempts to acquire locks on child tables conflict with
the locking done by the "alter table" operation.
3. The classic solution (ensure consistent lock acquisition order)
may not be readily applicable.

Intuitively, I understood that PostgreSQL may only resolve the child
tables involved in a query by using a mechanism specific to the
partitioning infrastructure. I then considered the role of the rules
(collectively redirecting inserts from P to P0 ... Pn), even though
they are concerned with insert statements. By dropping the rule
associated with a given child table before attempting the "alter
table" operation on that table, then recreating the rule, it would
appear that the issues with lock acquisition disappear.

It makes sense that, if operating on a specific child table, the links
to the parent should be broken temporarily in order to isolate it from
the parent and any operations which may involve all children (or even
the checking of the involvement of all children), and to not realise
this may have been an oversight on my part. Can anyone help me to
refine my thinking further on this matter?

Paul


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Paul Boddie <paul(at)boddie(dot)org(dot)uk>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Deadlock when updating table partitions (and presumed solution)
Date: 2007-12-05 04:00:15
Message-ID: 11549.1196827215@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Paul Boddie <paul(at)boddie(dot)org(dot)uk> writes:
> I have one process querying a table P with partitions P0, P1, P2, ...
> Pn joined with table R as follows:
> select * from R inner join P on R.id = P.id and P.section = 5
> ...
> I have another process performing updates to individual partitions of
> P - specifically "alter table" operations adding foreign key
> constraints referencing R as follows:
> alter table Pm add constraint Pm_fk_id foreign key(id) references
> R(id)

Yeah, this is a problem. The SELECT will acquire AccessShareLock
on R and P, and subsequently try to acquire AccessShareLock on all
the inheritance children of P (and I don't think the order in which
these locks are acquired is very clear). Meanwhile the ALTER acquires
AccessExclusiveLock on Pm and R --- probably in that order, though
I'd not really want to promise that ordering either. So the potential
for deadlock is obvious.

You seem to be hoping that the SELECT would avoid acquiring lock
on child tables Pn that it didn't need to access, but this cannot be:
it has to get at least AccessShareLock on those tables before it can
even examine their constraints to find out that they don't need to be
scanned. And even if it could magically not take those locks, the
deadlock condition still exists with regard to the child table that
it *does* need to access.

I guess I'm wondering why you need to be adding foreign key constraints
during live operations.

regards, tom lane


From: Paul Boddie <paul(at)boddie(dot)org(dot)uk>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Deadlock when updating table partitions (and presumed solution)
Date: 2007-12-05 14:15:46
Message-ID: ef83f898-0912-480c-b63a-b968aa81e1ab@y5g2000hsf.googlegroups.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On 5 Des, 05:00, t(dot)(dot)(dot)(at)sss(dot)pgh(dot)pa(dot)us (Tom Lane) wrote:
>
> Yeah, this is a problem. The SELECT will acquire AccessShareLock
> on R and P, and subsequently try to acquire AccessShareLock on all
> the inheritance children of P (and I don't think the order in which
> these locks are acquired is very clear). Meanwhile the ALTER acquires
> AccessExclusiveLock on Pm and R --- probably in that order, though
> I'd not really want to promise that ordering either. So the potential
> for deadlock is obvious.

Indeed.

> You seem to be hoping that the SELECT would avoid acquiring lock
> on child tables Pn that it didn't need to access, but this cannot be:
> it has to get at least AccessShareLock on those tables before it can
> even examine their constraints to find out that they don't need to be
> scanned. And even if it could magically not take those locks, the
> deadlock condition still exists with regard to the child table that
> it *does* need to access.

Understood. I was really wondering whether the SELECT would be able to
acquire locks on child tables at the same time as it acquired the lock
on the parent table, but I suppose this isn't an atomic operation: it
first has to acquire a lock to be able to see the constraints; then it
finds referenced tables and attempts to acquire locks on them.

> I guess I'm wondering why you need to be adding foreign key constraints
> during live operations.

This was just some impatience on my part while updating my database: I
was merely inspecting some data which I knew resided in some
partitions whilst some other partitions were being altered. Obviously,
the database system cannot know that some data of interest isn't going
to be found in some partition without checking the properties of that
partition. Consequently, it made sense for me to exclude such
partitions from consideration by the SELECT in order to help it reach
the requested data whilst keeping it out of the way of the alteration
activities.

I suppose the lingering question is this: what constraints should I
drop in order to avoid such problems? Dropping the insert rule from
the parent table for each child table being altered *seems* to
diminish the possibility of deadlock, in that my tests produced no
deadlock situations when I adopted this approach (whereas such
situations were unavoidable before adopting this approach), but
shouldn't I actually be removing the check constraints from the child
tables instead? The manual for 8.1 says that "constraint exclusion is
driven only by CHECK constraints", but my intuition tells me that the
SELECT should initially be driven by the mere existence of tables
inheriting from the parent table and that the insert rules should have
little or nothing to do with it.

Paul