Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Dave Page <dave(dot)page(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, CM Team <cm(at)enterprisedb(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, bernd(dot)helmle(at)credativ(dot)de
Subject: Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)
Date: 2014-10-03 17:09:28
Message-ID: CA+TgmoaGF1_BQNUW-nyytS1fb1w-5cscor-RBuGogH_hguNmXQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Oct 1, 2014 at 11:10 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> As far as I can tell, it's configured to run everything. I just
> checked, though, and found it wedged again. I'm not sure whether it
> was the same problem, though; I ended up just killing all of the
> postgres processes to fix it. We may be just at the beginning of an
> exciting debugging journey.

After much more time spent on this, and some help from two people
whose names are both six letters starting with "Andre", I discovered
the following:

- anole apparently had not tried to build master in a long time - I
think since May - because of a leftover "inst" directory instead
HEAD's buildroot.
- anole still refused to build either 9.4 or master until I removed a
couple of leftover socket directories in /tmp.
- after that it hung again trying to run the test_shm_mq regression
tests, in pretty much the same way as before.

The postmaster state looks like this:

0xc00000000044a350:0 in _select_sys+0x30 () from /usr/lib/hpux64/libc.so.1
(gdb) bt
#0 0xc00000000044a350:0 in _select_sys+0x30 () from /usr/lib/hpux64/libc.so.1
#1 0xc00000000045f950:0 in select+0x90 () from /usr/lib/hpux64/libc.so.1
#2 0x4000000000a4e510:0 in ServerLoop () at postmaster.c:1540
#3 0x4000000000a48b20:0 in PostmasterMain (argc=3, argv=0x87ffffffffffd178)
at postmaster.c:1223
#4 0x40000000008d0a90:0 in main (argc=3, argv=0x87ffffffffffd178)
at main.c:227
(gdb) p BackgroundWorkerList
$1 = {head = {next = 0x60000000000e8a58}}
(gdb) p BackgroundWorkerList.head
$2 = {next = 0x60000000000e8a58}
(gdb) p ((char *) BackgroundWorkerList.head) - 264
$3 = 0x60000000000e8950 "test_shm_mq"
(gdb) p (RegisteredBgWorker *) (((char *) BackgroundWorkerList.head) - 264)
$4 = (struct RegisteredBgWorker *) 0x60000000000e8950
(gdb) p ((RegisteredBgWorker *) (((char *) BackgroundWorkerList.head) - 264))[0]
$5 = {rw_worker = {
bgw_name = "test_shm_mq\000\000\001?\000 \000\00511985", '\000'
<repeats 13 times>, "\005T\000\000\000\000\000\001?\000
\000\00512038\000\000\000\000\000\000", bgw_flags = 1, bgw_start_time
= BgWorkerStart_ConsistentState,
bgw_restart_time = -1, bgw_main = 0,
bgw_library_name =
"test_shm_mq\000\000\000\005t\000\000\000\000\000\001?\000
\000\00511917", '\000' <repeats 13 times>,
"\005?\000\000\000\000\000\001?\000 \000\0051209",
bgw_function_name = "test_shm_mq_main\000\000\000\000\000\001?\000
\000\00511903", '\000' <repeats 13 times>,
"\005?\000\000\000\000\000\001?\000 \000\0051199", bgw_main_arg =
1069289198, bgw_notify_pid = 12361}, rw_backend = 0x0,
rw_pid = 12369, rw_child_slot = 35, rw_crashed_at = 0, rw_shmem_slot = 6,
rw_terminate = 0 '\000', rw_lnode = {next = 0x60000000000e8938}}
(gdb) p $5.rw_lnode
$6 = {next = 0x60000000000e8938}
(gdb) p ((RegisteredBgWorker *) (((char *) $5.rw_lnode) - 264))[0]
$7 = {rw_worker = {
bgw_name = "test_shm_mq\000\000\001?x\000 \000\00511931", '\000'
<repeats 13 times>, "\004?\000\000\000\000\000\001?}\000
\000\00511939\000\000\000\000\000\000", bgw_flags = 1, bgw_start_time
= BgWorkerStart_ConsistentState,
bgw_restart_time = -1, bgw_main = 0,
bgw_library_name =
"test_shm_mq\000\000\000\004?\000\000\000\000\000\001?\000
\000\00511957", '\000' <repeats 13 times>,
"\004?\000\000\000\000\000\001?\000 \000\0051209",
bgw_function_name = "test_shm_mq_main\000\000\000\000\000\001?\000
\000\00511956", '\000' <repeats 13 times>,
"\005\024\000\000\000\000\000\001?\000 \000\0051192", bgw_main_arg =
1069289198, bgw_notify_pid = 12361},
rw_backend = 0x0, rw_pid = 0, rw_child_slot = 0, rw_crashed_at = 0,
rw_shmem_slot = 5, rw_terminate = 0 '\000', rw_lnode = {
next = 0x60000000000e8818}}
(gdb) p ((RegisteredBgWorker *) (((char *) $7.rw_lnode) - 264))[0]
$8 = {rw_worker = {
bgw_name = "test_shm_mq\000\000\001?f\000 \000\00512051", '\000'
<repeats 13 times>, "\0044\000\000\000\000\000\001?g\000
\000\00512047\000\000\000\000\000\000", bgw_flags = 1, bgw_start_time
= BgWorkerStart_ConsistentState,
bgw_restart_time = -1, bgw_main = 0,
bgw_library_name =
"test_shm_mq\000\000\000\004T\000\000\000\000\000\001?j\000
\000\00511979", '\000' <repeats 13 times>,
"\004d\000\000\000\000\000\001?k\000 \000\0051197",
bgw_function_name =
"test_shm_mq_main\000\000\000\000\000\001?m\000 \000\00511873", '\000'
<repeats 13 times>, "\004?\000\000\000\000\000\001?u\000
\000\0051196", bgw_main_arg = 1069289198, bgw_notify_pid = 12361},
rw_backend = 0x0,
rw_pid = 0, rw_child_slot = 0, rw_crashed_at = 0, rw_shmem_slot = 4,
rw_terminate = 0 '\000', rw_lnode = {next = 0x60000000000e86f8}}
(gdb) p ((RegisteredBgWorker *) (((char *) $8.rw_lnode) - 264))[0]
$9 = {rw_worker = {
bgw_name = "test_shm_mq\000\000\001?\\\000 \000\00512085", '\000'
<repeats 13 times>, "\003?\000\000\000\000\000\001?]\000
\000\00512076\000\000\000\000\000\000", bgw_flags = 1, bgw_start_time
= BgWorkerStart_ConsistentState,
bgw_restart_time = -1, bgw_main = 0,
bgw_library_name =
"test_shm_mq\000\000\000\003?\000\000\000\000\000\001?_\000
\000\00511885", '\000' <repeats 13 times>,
"\003?\000\000\000\000\000\001?`\000 \000\0051206",
bgw_function_name =
"test_shm_mq_main\000\000\000\000\000\001?a\000 \000\00512065", '\000'
<repeats 13 times>, "\003?\000\000\000\000\000\001?c\000
\000\0051188", bgw_main_arg = 695156898, bgw_notify_pid = 12361},
rw_backend = 0x0,
rw_pid = 0, rw_child_slot = 0, rw_crashed_at = 465665468967706,
rw_shmem_slot = 3, rw_terminate = 0 '\000', rw_lnode = {
next = 0x60000000000e85d8}}
(gdb) p ((RegisteredBgWorker *) (((char *) $9.rw_lnode) - 264))[0]
$10 = {rw_worker = {
bgw_name = "test_shm_mq\000\000\001?\f\000 \000\00512045", '\000'
<repeats 13 times>, "\003\024\000\000\000\000\000\001?\016\000
\000\00512049\000\000\000\000\000\000", bgw_flags = 1, bgw_start_time
= BgWorkerStart_ConsistentState,
bgw_restart_time = -1, bgw_main = 0,
bgw_library_name =
"test_shm_mq\000\000\000\0034\000\000\000\000\000\001?=\000
\000\00512057", '\000' <repeats 13 times>,
"\003D\000\000\000\000\000\001?A\000 \000\0051208",
bgw_function_name =
"test_shm_mq_main\000\000\000\000\000\001?B\000 \000\00512092", '\000'
<repeats 13 times>, "\003d\000\000\000\000\000\001?C\000
\000\0051192", bgw_main_arg = 507386864, bgw_notify_pid = 12361},
rw_backend = 0x0,
rw_pid = 0, rw_child_slot = 0, rw_crashed_at = 465665463586145,
rw_shmem_slot = 2, rw_terminate = 0 '\000', rw_lnode = {
next = 0x60000000000e84b8}}
(gdb) p ((RegisteredBgWorker *) (((char *) $10.rw_lnode) - 264))[0]
$11 = {rw_worker = {
bgw_name = "test_shm_mq\000\000\001??\000 \000\00511937", '\000'
<repeats 13 times>, "\002?\000\000\000\000\000\001??\000
\000\00511941\000\000\000\000\000\000", bgw_flags = 1, bgw_start_time
= BgWorkerStart_ConsistentState,
bgw_restart_time = -1, bgw_main = 0,
bgw_library_name =
"test_shm_mq\000\000\000\002?\000\000\000\000\000\001?\000\000
\000\00511950", '\000' <repeats 13 times>,
"\002?\000\000\000\000\000\001?\002\000 \000\0051195",
bgw_function_name =
"test_shm_mq_main\000\000\000\000\000\001?\a\000 \000\00511960",
'\000' <repeats 13 times>, "\002?\000\000\000\000\000\001?\b\000
\000\0051198", bgw_main_arg = 1747912200, bgw_notify_pid = 12361},
rw_backend = 0x0,
rw_pid = 0, rw_child_slot = 0, rw_crashed_at = 465665462962221,
rw_shmem_slot = 1, rw_terminate = 0 '\000', rw_lnode = {
next = 0x60000000000e8398}}
(gdb) p ((RegisteredBgWorker *) (((char *) $11.rw_lnode) - 264))[0]
$12 = {rw_worker = {
bgw_name = "test_shm_mq", '\000' <repeats 11 times>, "
\000\000\000\000\000\000\000 ", '\000' <repeats 23 times>,
"\021?\000\000\000\000\000\001??",
bgw_flags = 1, bgw_start_time = BgWorkerStart_ConsistentState,
bgw_restart_time = -1, bgw_main = 0,
bgw_library_name =
"test_shm_mq\000\000\001??\000(\000\t12115_fsm", '\000' <repeats 17
times>, "\022(\000\000\000\000\000\001??\0000\000\020pg_i",
bgw_function_name = "test_shm_mq_main", '\000' <repeats 14 times>,
"\0228\000\000\000\000\000\001?\003\000 \000\00552236", '\000'
<repeats 13 times>, "\022H", bgw_main_arg = 502019512, bgw_notify_pid
= 12361}, rw_backend = 0x0,
rw_pid = 0, rw_child_slot = 0, rw_crashed_at = 465665462860980,
rw_shmem_slot = 0, rw_terminate = 0 '\000', rw_lnode = {next = 0x0}}
(gdb) fr 3
#3 0x4000000000a48b20:0 in PostmasterMain (argc=3, argv=0x87ffffffffffd178)
at postmaster.c:1223
1223 status = ServerLoop();
(gdb) p StartWorkerNeeded
$13 = 1 '\001'

So apparently commit a39e78b710eb588e102aedd2828611d7bc74714b, a good
idea though it certainly was, wasn't the root cause of this issue -
because the symptoms now seem identical to what I saw debugging
earlier hangs, but I verified that those changes are present here.

Further debugging reveals that sigusr1_handler() gets called
repeatedly, to start autovacuum workers, and it keeps waking up and
starting them. But that doesn't cause the background workers to get
started either, because although sigusr1_handler() contains a call to
maybe_start_bgworker, it only does that if start_bgworker = true,
which only happens if
CheckPostmasterSignal(PMSIGNAL_BACKGROUND_WORKER_CHANGE) is true,
which on these calls it isn't.
And I think this might also be the missing ingredient answering
Andres's question from before: why doesn't the 60-second
select()-timeout cause the background worker to eventually start even
if the SELECT doesn't get interrupted? There seems to be a SIGUSR1
arriving about every 3 seconds, and I bet that's resetting the timer
every time.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2014-10-03 17:15:00 Re: replicating DROP commands across servers
Previous Message Stephen Frost 2014-10-03 17:08:36 Re: replicating DROP commands across servers