Hung postmaster (8.3.9)

Lists: pgsql-general
From: "Ed L(dot)" <pgsql(at)bluepolka(dot)net>
To: pgsql-general(at)postgresql(dot)org
Subject: Hung postmaster (8.3.9)
Date: 2010-02-28 03:37:04
Message-ID: 201002272037.04844.pgsql@bluepolka.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Need some help.

My PostgreSQL 8.3.6 and now 8.3.9 postmaster is hanging.
No idea why, been running like a top for a year.
Can't do "select version()", even hung after system reboot.
SIGINT/QUIT/TERM have no effect, only SIGKILL can stop it.

This is Linux 2.6.18-92.1.22.el5 SMP x86_64

Here's the log after restarting from SIGKILL:

2010-02-27 20:11:10.426 CST [23134] LOG: database system was interrupted; last known up at 2010-02-27 20:08:46 CST
2010-02-27 20:11:10.426 CST [23134] DEBUG: checkpoint record is at 8BC/63F5FC0
2010-02-27 20:11:10.426 CST [23134] DEBUG: redo record is at 8BC/63F5FC0; shutdown TRUE
2010-02-27 20:11:10.426 CST [23134] DEBUG: next transaction ID: 0/488483782; next OID: 1619873999
2010-02-27 20:11:10.426 CST [23134] DEBUG: next MultiXactId: 130401682; next MultiXactOffset: 830996786
2010-02-27 20:11:10.426 CST [23134] LOG: database system was not properly shut down; automatic recovery in progress
2010-02-27 20:11:10.497 CST [23134] LOG: record with zero length at 8BC/63F6020
2010-02-27 20:11:10.497 CST [23134] LOG: redo is not required
2010-02-27 20:11:10.503 CST [23134] LOG: checkpoint starting: shutdown immediate
2010-02-27 20:11:10.581 CST [23134] LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.045 s, sync=0.000 s, total=0.083 s
2010-02-27 20:11:10.587 CST [23134] DEBUG: transaction ID wrap limit is 2435542396, limited by database "ohsdba"
2010-02-27 20:11:10.589 CST [23134] DEBUG: proc_exit(0)
2010-02-27 20:11:10.589 CST [23134] DEBUG: shmem_exit(0)
2010-02-27 20:11:10.592 CST [23134] DEBUG: exit(0)

Here's the backtrace:

$ gdb `which postgres`
GNU gdb Fedora (6.8-37.el5)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...
(gdb) attach 3776
Attaching to program: /opt/pgsql/installs/postgresql-8.3.9/bin/postgres, process 3776
Reading symbols from /lib64/libcrypt.so.1...done.
Loaded symbols for /lib64/libcrypt.so.1
Reading symbols from /lib64/libdl.so.2...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libm.so.6...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libc.so.6...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2...done.
Loaded symbols for /lib64/libnss_files.so.2
0x000000346f8c43a0 in __read_nocancel () from /lib64/libc.so.6
(gdb) p debug_query_string
$1 = 0x0
(gdb) bt
#0 0x000000346f8c43a0 in __read_nocancel () from /lib64/libc.so.6
#1 0x000000346f86c747 in _IO_new_file_underflow () from /lib64/libc.so.6
#2 0x000000346f86d10e in _IO_default_uflow_internal () from /lib64/libc.so.6
#3 0x000000346f8689cb in getc () from /lib64/libc.so.6
#4 0x0000000000531ee8 in next_token (fp=0x5b90f20, buf=0x7fff59bef330 "", bufsz=4096) at hba.c:128
#5 0x0000000000532233 in tokenize_file (filename=0x5b8f3f0 "global", file=0x5b90f20, lines=0x7fff59bef5c8,
line_nums=0x7fff59bef5c0) at hba.c:232
#6 0x00000000005322e9 in tokenize_file (filename=0x5b8f3d0 "global/pg_auth", file=0x5b90ce0, lines=0x98b168, line_nums=0x98b170)
at hba.c:358
#7 0x00000000005327ff in load_role () at hba.c:959
#8 0x000000000057f300 in reaper (postgres_signal_arg=<value optimized out>) at postmaster.c:2145
#9 <signal handler called>
#10 0x000000346f8cb323 in __select_nocancel () from /lib64/libc.so.6
#11 0x000000000057cc33 in ServerLoop () at postmaster.c:1236
#12 0x000000000057dfdf in PostmasterMain (argc=6, argv=0x5b73fe0) at postmaster.c:1031
#13 0x00000000005373de in main (argc=6, argv=<value optimized out>) at main.c:188
(gdb) quit
The program is running. Quit anyway (and detach it)? (y or n) y
Detaching from program: /opt/pgsql/installs/postgresql-8.3.9/bin/postgres, process 3776

Thanks in advance for any help.

Ed


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: "Ed L(dot)" <pgsql(at)bluepolka(dot)net>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Hung postmaster (8.3.9)
Date: 2010-02-28 05:18:41
Message-ID: 4B89FCB1.6050904@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Ed L. wrote:
> 2010-02-27 20:11:10.426 CST [23134] LOG: database system was not properly shut down; automatic recovery in progress
> 2010-02-27 20:11:10.497 CST [23134] LOG: record with zero length at 8BC/63F6020
> 2010-02-27 20:11:10.497 CST [23134] LOG: redo is not required
>

If the server is hung at this point, it was doing something: recovery
from a crash, and during that period it will not respond to statements
until it's finished as you noted. That can take some time, could run
into the minutes. And if you kill it, next time the process starts all
over again having made no progress.

I'd try starting the server again, confirm it's only after this point it
pauses at, watching if it's doing something with "top -c", and grab some
backtraces of what it's doing then. In this sample you gave, the
shutdown was too fast for the server have done very recovery before it
was killed off. It may just be you need to let it sit there for a while
to finish recovery and then it will be fine again, can't tell from the
data you've provided so far.

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg(at)2ndQuadrant(dot)com www.2ndQuadrant.us


From: "Ed L(dot)" <pgsql(at)bluepolka(dot)net>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Hung postmaster (8.3.9)
Date: 2010-02-28 06:15:23
Message-ID: 201002272315.23472.pgsql@bluepolka.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Saturday 27 February 2010 @ 22:18, Greg Smith wrote:
> Ed L. wrote:
> > 2010-02-27 20:11:10.426 CST [23134] LOG: database system
> > was not properly shut down; automatic recovery in progress
> > 2010-02-27 20:11:10.497 CST [23134] LOG: record with
> > zero length at 8BC/63F6020 2010-02-27 20:11:10.497 CST
> > [23134] LOG: redo is not required
>
> If the server is hung at this point, it was doing something:
> recovery from a crash, and during that period it will not
> respond to statements until it's finished as you noted. That
> can take some time, could run into the minutes. And if you
> kill it, next time the process starts all over again having
> made no progress.
>
> I'd try starting the server again, confirm it's only after
> this point it pauses at, watching if it's doing something
> with "top -c", and grab some backtraces of what it's doing
> then. In this sample you gave, the shutdown was too fast for
> the server have done very recovery before it was killed off.
> It may just be you need to let it sit there for a while to
> finish recovery and then it will be fine again, can't tell
> from the data you've provided so far.
>

Good point from my example, but it hung like that for 8 hours,
and seems to be ready to do so again.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Ed L(dot)" <pgsql(at)bluepolka(dot)net>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Hung postmaster (8.3.9)
Date: 2010-02-28 15:58:13
Message-ID: 8705.1267372693@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

"Ed L." <pgsql(at)bluepolka(dot)net> writes:
> (gdb) bt
> #0 0x000000346f8c43a0 in __read_nocancel () from /lib64/libc.so.6
> #1 0x000000346f86c747 in _IO_new_file_underflow () from /lib64/libc.so.6
> #2 0x000000346f86d10e in _IO_default_uflow_internal () from /lib64/libc.so.6
> #3 0x000000346f8689cb in getc () from /lib64/libc.so.6
> #4 0x0000000000531ee8 in next_token (fp=0x5b90f20, buf=0x7fff59bef330 "", bufsz=4096) at hba.c:128
> #5 0x0000000000532233 in tokenize_file (filename=0x5b8f3f0 "global", file=0x5b90f20, lines=0x7fff59bef5c8,
> line_nums=0x7fff59bef5c0) at hba.c:232
> #6 0x00000000005322e9 in tokenize_file (filename=0x5b8f3d0 "global/pg_auth", file=0x5b90ce0, lines=0x98b168, line_nums=0x98b170)
> at hba.c:358
> #7 0x00000000005327ff in load_role () at hba.c:959
> #8 0x000000000057f300 in reaper (postgres_signal_arg=<value optimized out>) at postmaster.c:2145
> #9 <signal handler called>
> #10 0x000000346f8cb323 in __select_nocancel () from /lib64/libc.so.6
> #11 0x000000000057cc33 in ServerLoop () at postmaster.c:1236
> #12 0x000000000057dfdf in PostmasterMain (argc=6, argv=0x5b73fe0) at postmaster.c:1031
> #13 0x00000000005373de in main (argc=6, argv=<value optimized out>) at main.c:188

The postmaster seems to be stuck trying to read $PGDATA/global/pg_auth
(which would be an expected thing for it to do at this point in the
startup sequence). Does that file exist? Is it an ordinary file?
Do its contents look sane (a list of your userids and their passwords
and group memberships)?

regards, tom lane