postmaster 8.2 eternally hangs in sempaphore lock acquiring

Lists: pgsql-bugs
From: Martin Pitt <martin(at)piware(dot)de>
To: PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: postmaster 8.2 eternally hangs in sempaphore lock acquiring
Date: 2007-03-29 14:08:10
Message-ID: 20070329140810.GE13287@piware.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Hi,

Since our Launchpad developers switched from 8.1 to 8.2.3, they often
encounter a situation when the postmaster gets stuck and needs to be
restarted. This happens on various CREATE commands (FUNCTION,
DATABASE, not consistently).

The backtraces show that the process doing the CREATION, another
client connection that InitPostgres(), and the autovacuum daemon all
hang eternally in PGSemaphoreLock(), so I suspect that there is a race
condition of some sort?

I tried to reproduce this with an hour-long run of CREATE/DROP
DATABASE, but without success.

It occurred to me that the process which does the CREATE is in
_bt_vacuum_cycleid(), is that also related to the autovacuum daemon?
I asked the original reporter (Mark Shuttleworth, in CC) to disable
the autovacuum daemon for testing.

https://launchpad.net/bugs/93042 has symbolic gdb backtraces of all
three processes that are involved.

At this point I am not sure which further information I could provide.
Mark would be happy provide more information and/or directly talk with
you.

Thanks in advance,

Martin

--
Martin Pitt http://www.piware.de
Ubuntu Developer http://www.ubuntu.com
Debian Developer http://www.debian.org


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: postmaster 8.2 eternally hangs in sempaphore lock acquiring
Date: 2007-03-29 15:36:41
Message-ID: 460BDD09.2010701@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Martin Pitt wrote:
> Since our Launchpad developers switched from 8.1 to 8.2.3, they often
> encounter a situation when the postmaster gets stuck and needs to be
> restarted. This happens on various CREATE commands (FUNCTION,
> DATABASE, not consistently).
>
> The backtraces show that the process doing the CREATION, another
> client connection that InitPostgres(), and the autovacuum daemon all
> hang eternally in PGSemaphoreLock(), so I suspect that there is a race
> condition of some sort?

So it seems.. :(

> I tried to reproduce this with an hour-long run of CREATE/DROP
> DATABASE, but without success.
>
> It occurred to me that the process which does the CREATE is in
> _bt_vacuum_cycleid(), is that also related to the autovacuum daemon?
> I asked the original reporter (Mark Shuttleworth, in CC) to disable
> the autovacuum daemon for testing.

_bt_vacuum_cycleid is always called on page splits, but it is related to
vacuums. All vacuums, not to autovacuum in particular. If you're going
to run more tests, I'd suggest running VACUUM in a loop in the
background, to see if that makes the bug occur.

> https://launchpad.net/bugs/93042 has symbolic gdb backtraces of all
> three processes that are involved.
>
> At this point I am not sure which further information I could provide.
> Mark would be happy provide more information and/or directly talk with
> you.

Thanks.

I can't see how a BtreeVacuumLock-related deadlock could occur, we never
acquire any other lock while holding it. I'll try to reproduce this on
my laptop..

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: martin(at)piware(dot)de
Cc: PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: postmaster 8.2 eternally hangs in sempaphore lock acquiring
Date: 2007-03-29 17:32:06
Message-ID: 460BF816.5020205@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Martin Pitt wrote:
> Since our Launchpad developers switched from 8.1 to 8.2.3, they often
> encounter a situation when the postmaster gets stuck and needs to be
> restarted. This happens on various CREATE commands (FUNCTION,
> DATABASE, not consistently).
>
> The backtraces show that the process doing the CREATION, another
> client connection that InitPostgres(), and the autovacuum daemon all
> hang eternally in PGSemaphoreLock(), so I suspect that there is a race
> condition of some sort?
>
> I tried to reproduce this with an hour-long run of CREATE/DROP
> DATABASE, but without success.
>
> It occurred to me that the process which does the CREATE is in
> _bt_vacuum_cycleid(), is that also related to the autovacuum daemon?
> I asked the original reporter (Mark Shuttleworth, in CC) to disable
> the autovacuum daemon for testing.

Ok, I think I know what's happening. In btbulkdelete we have a
PG_TRY-CATCH block. In the try-block, we call _bt_start_vacuum which
acquires and releases the BtreeVacuumLock. Under certain error
conditions, _bt_start_vacuum calls elog(ERROR) while holding the
BtreeVacuumLock. The PG_CATCH block calls _bt_end_vacuum which also
tries to acquire BtreeVacuumLock.

If those error conditions in _bt_start_vacuum is triggered, we'll enter
the PG_CATCH-block and _bt_end_vacuum while already holding
BtreeVacuumLock -> deadlock.

The two error conditions in _bt_start_vacuum that throw elog(ERROR) are
the case when there's multiple vacuums for a single index, and running
out of vacuum slots. Neither scenario should ever happen, but maybe
there's a bug related to CREATE DATABASE that triggers it.

I'll investigate this more this evening or tomorrow...

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Martin Pitt <martin(at)piware(dot)de>
Cc: PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: postmaster 8.2 eternally hangs in sempaphore lock acquiring
Date: 2007-03-29 17:49:23
Message-ID: 25275.1175190563@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Martin Pitt <martin(at)piware(dot)de> writes:
> https://launchpad.net/bugs/93042 has symbolic gdb backtraces of all
> three processes that are involved.

Are these really all the processes involved? The createdb process and
the autovac process are both waiting for someone else to give up the
BtreeVacuumLock, but that is never held for any long period, and it's
certainly not held by the guy trying to do InitPostgres.

I believe that the guy trying to do InitPostgres is blocked by the
createdb process --- it looks like he's trying to attach to the same
DB being used as a template for the createdb, and as of 8.2 we lock out
new entries to a template DB until the copy is complete.

It's possible that this is not a deadlock per se, but the aftermath of
someone having errored out without releasing the BtreeVacuumLock --- but
I don't entirely see how that could happen either, at least not without
a core dump scenario.

Is there anything in the postmaster log when this happens? Errors out
of _bt_start_vacuum would be particularly interesting...

regards, tom lane


From: Martin Pitt <martin(dot)pitt(at)ubuntu(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Cc: Mark Shuttleworth <mark(at)ubuntu(dot)com>
Subject: Re: postmaster 8.2 eternally hangs in sempaphore lock acquiring
Date: 2007-03-29 18:12:53
Message-ID: 20070329181253.GF13287@piware.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Hi Tom, hi Mark,

Tom, thank you for having a look into this!

Tom Lane [2007-03-29 13:49 -0400]:
> Martin Pitt <martin(at)piware(dot)de> writes:
> > https://launchpad.net/bugs/93042 has symbolic gdb backtraces of all
> > three processes that are involved.
>
> Are these really all the processes involved? The createdb process and
> the autovac process are both waiting for someone else to give up the
> BtreeVacuumLock, but that is never held for any long period, and it's
> certainly not held by the guy trying to do InitPostgres.

There are more processes, unfortunately I don't have backtraces of
them. I got this from my IRC log:

15928 ? Ss 0:00 postgres: mark launchpad_ftest_template [local] CREATE DATABASE
15956 ? Ss 0:00 postgres: session session_dev [local] idle
15957 ? Ss 0:00 postgres: launchpad launchpad_dev [local] idle
15958 ? Ss 0:00 postgres: session session_dev [local] idle
15969 ? Ss 0:00 postgres: launchpad launchpad_dev [local] idle
16014 ? Ss 0:00 postgres: launchpad launchpad_dev [local] idle
16273 ? Ss 0:00 postgres: mark launchpad_ftest_template [local] startup waiting

> I believe that the guy trying to do InitPostgres is blocked by the
> createdb process --- it looks like he's trying to attach to the same
> DB being used as a template for the createdb, and as of 8.2 we lock out
> new entries to a template DB until the copy is complete.
>
> It's possible that this is not a deadlock per se, but the aftermath of
> someone having errored out without releasing the BtreeVacuumLock --- but
> I don't entirely see how that could happen either, at least not without
> a core dump scenario.
>
> Is there anything in the postmaster log when this happens? Errors out
> of _bt_start_vacuum would be particularly interesting...

I believe Mark's postgres runs with fully verbose logging. Mark, can you please
have a look?

Thanks,

Martin
--
Martin Pitt http://www.piware.de
Ubuntu Developer http://www.ubuntu.com
Debian Developer http://www.debian.org


From: Mark Shuttleworth <mark(at)ubuntu(dot)com>
To: pgsql-bugs(at)postgresql(dot)org, Mark Shuttleworth <mark(at)ubuntu(dot)com>
Subject: Re: postmaster 8.2 eternally hangs in sempaphore lock acquiring
Date: 2007-03-29 18:20:58
Message-ID: 460C038A.60806@ubuntu.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Martin Pitt wrote:
> Hi Tom, hi Mark,
>
> Tom, thank you for having a look into this!
>
> Tom Lane [2007-03-29 13:49 -0400]:
>
>> Martin Pitt <martin(at)piware(dot)de> writes:
>>
>>> https://launchpad.net/bugs/93042 has symbolic gdb backtraces of all
>>> three processes that are involved.
>>>
>> Are these really all the processes involved? The createdb process and
>> the autovac process are both waiting for someone else to give up the
>> BtreeVacuumLock, but that is never held for any long period, and it's
>> certainly not held by the guy trying to do InitPostgres.
>>
>
> There are more processes, unfortunately I don't have backtraces of
> them. I got this from my IRC log:
>
> 15928 ? Ss 0:00 postgres: mark launchpad_ftest_template [local] CREATE DATABASE
> 15956 ? Ss 0:00 postgres: session session_dev [local] idle
> 15957 ? Ss 0:00 postgres: launchpad launchpad_dev [local] idle
> 15958 ? Ss 0:00 postgres: session session_dev [local] idle
> 15969 ? Ss 0:00 postgres: launchpad launchpad_dev [local] idle
> 16014 ? Ss 0:00 postgres: launchpad launchpad_dev [local] idle
> 16273 ? Ss 0:00 postgres: mark launchpad_ftest_template [local] startup waiting
>
>
>> I believe that the guy trying to do InitPostgres is blocked by the
>> createdb process --- it looks like he's trying to attach to the same
>> DB being used as a template for the createdb, and as of 8.2 we lock out
>> new entries to a template DB until the copy is complete.
>>
>> It's possible that this is not a deadlock per se, but the aftermath of
>> someone having errored out without releasing the BtreeVacuumLock --- but
>> I don't entirely see how that could happen either, at least not without
>> a core dump scenario.
>>
>> Is there anything in the postmaster log when this happens? Errors out
>> of _bt_start_vacuum would be particularly interesting...
>>
>
> I believe Mark's postgres runs with fully verbose logging. Mark, can you please
> have a look?
>

Yes, we run with lots of logging, what can I send you that would help?
In this case I think I've probably drowned the relevant stuff in noise,
but assuming the autovaccum=off change does not stop it from happening
again, I will know what to send you when it re-occurs.

This sort of "getting stuck" has been happening pretty consistently for
me with out test suite and pg8.2. The test suite does a lot of DB
creation, hammering, tearing down and then creation again.

Mark


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Martin Pitt <martin(at)piware(dot)de>
Cc: PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: postmaster 8.2 eternally hangs in sempaphore lock acquiring
Date: 2007-03-29 18:38:07
Message-ID: 26770.1175193487@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

I wrote:
> It's possible that this is not a deadlock per se, but the aftermath of
> someone having errored out without releasing the BtreeVacuumLock --- but
> I don't entirely see how that could happen either, at least not without
> a core dump scenario.

On closer inspection, the autovac stack trace

#4 0x080abe38 in _bt_end_vacuum (rel=0xb5f0b298) at nbtutils.c:1028
#5 0x080a9c68 in btbulkdelete (fcinfo=0xbfc58cd8) at nbtree.c:552

suggests that _bt_end_vacuum is called from the CATCH part of
btbulkdelete, and that provides an idea: if either of the elog(ERROR)
calls in _bt_start_vacuum were to actually fire, it would throw control
without having released BtreeVacuumLock, and then _bt_end_vacuum would
hang up. _bt_start_vacuum is coded on the assumption that the LWLock
would get released by transaction abort cleanup, but we'd fail before
getting there. So this is definitely a bug, but the next question is
what's triggering it --- both of those elogs should be "can't happen"
conditions.

> Is there anything in the postmaster log when this happens?

I repeat that with more urgency. Do you see any
"multiple active vacuums for index \"%s\"" or "out of btvacinfo slots"
log messages when these hangups occur?

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: martin(at)piware(dot)de, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: postmaster 8.2 eternally hangs in sempaphore lock acquiring
Date: 2007-03-30 00:19:23
Message-ID: 7118.1175213963@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> Ok, I think I know what's happening. In btbulkdelete we have a
> PG_TRY-CATCH block. In the try-block, we call _bt_start_vacuum which
> acquires and releases the BtreeVacuumLock. Under certain error
> conditions, _bt_start_vacuum calls elog(ERROR) while holding the
> BtreeVacuumLock. The PG_CATCH block calls _bt_end_vacuum which also
> tries to acquire BtreeVacuumLock.

This is definitely a bug (I unfortunately didn't see your message until
after I'd replicated your reasoning...) but the word from Shuttleworth
is that he doesn't see either of those messages in his postmaster log.
So it seems we need another theory. I haven't a clue at the moment though.

regards, tom lane


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: martin(at)piware(dot)de, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: postmaster 8.2 eternally hangs in sempaphore lock acquiring
Date: 2007-03-30 07:57:04
Message-ID: 460CC2D0.7040806@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Tom Lane wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
>> Ok, I think I know what's happening. In btbulkdelete we have a
>> PG_TRY-CATCH block. In the try-block, we call _bt_start_vacuum which
>> acquires and releases the BtreeVacuumLock. Under certain error
>> conditions, _bt_start_vacuum calls elog(ERROR) while holding the
>> BtreeVacuumLock. The PG_CATCH block calls _bt_end_vacuum which also
>> tries to acquire BtreeVacuumLock.
>
> This is definitely a bug (I unfortunately didn't see your message until
> after I'd replicated your reasoning...) but the word from Shuttleworth
> is that he doesn't see either of those messages in his postmaster log.
> So it seems we need another theory. I haven't a clue at the moment though.

The error message never makes it to the log. The deadlock occurs in the
PG_CATCH-block, before rethrowing and printing the error. I added an
unconditional elog(ERROR) in _bt_start_vacuum to test it, and I'm
getting the same hang with no message in the log.

The unsafe elog while holding a lwlock pattern in _bt_vacuum_start needs
to be fixed, patch attached. We still need to figure out what's causing
the error in the first place. With the patch, we should at least get a
proper error message and not hang when the error occurs.

Martin: Would it be possible for you to reproduce the problem with a
patched version?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
vacuum_cycle_hang.patch text/x-diff 1.1 KB

From: Martin Pitt <martin(at)piware(dot)de>
To: PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: postmaster 8.2 eternally hangs in sempaphore lock acquiring
Date: 2007-03-30 08:40:28
Message-ID: 20070330084028.GE5605@piware.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Hi Heikki,

Heikki Linnakangas [2007-03-30 8:57 +0100]:
> Martin: Would it be possible for you to reproduce the problem with a
> patched version?

I cannot reproduce the problem myself, but I can easily build a
package with this patch, hand it to Mark, and ask him to test it.

Thanks a lot!

Martin

--
Martin Pitt http://www.piware.de
Ubuntu Developer http://www.ubuntu.com
Debian Developer http://www.debian.org


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: martin(at)piware(dot)de, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: postmaster 8.2 eternally hangs in sempaphore lock acquiring
Date: 2007-03-30 14:26:13
Message-ID: 18580.1175264773@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> Tom Lane wrote:
>> This is definitely a bug (I unfortunately didn't see your message until
>> after I'd replicated your reasoning...) but the word from Shuttleworth
>> is that he doesn't see either of those messages in his postmaster log.
>> So it seems we need another theory. I haven't a clue at the moment though.

> The error message never makes it to the log. The deadlock occurs in the
> PG_CATCH-block, before rethrowing and printing the error.

Oh, of course, I was thinking of the old behavior (still used for fatal
errors), but you're quite right.

> The unsafe elog while holding a lwlock pattern in _bt_vacuum_start needs
> to be fixed, patch attached. We still need to figure out what's causing
> the error in the first place. With the patch, we should at least get a
> proper error message and not hang when the error occurs.

Agreed, that is the next step to take. I spent quite some time last
night trying to reproduce the failure with no luck (I was using one
process doing create/drop database, one transiently attaching to the
template database, and one doing vacuums). So I suspect there is an
additional contributing factor in Mark's test case that we haven't
identified yet.

regards, tom lane