9.1.3 backends getting stuck in 'startup'

From: Jeff Frost <jeff(at)pgexperts(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Subject: 9.1.3 backends getting stuck in 'startup'
Date: 2012-04-27 16:07:46
Message-ID: 4F9AC452.3050102@pgexperts.com
Lists: pgsql-bugs

A few times today, we've seen postgresql 9.1.3 backends on Ubuntu 11.10 x86_64
get stuck in 'startup' mode. By that I mean the set_ps_output mode. Postgres
is installed via Martin Pitt's packages.

It manifests like this:

Server has been humming along fine, then suddenly many backends get stuck in
'startup' mode and the ps output looks like so:

postgres 2425 3.5 0.3 8695020 483868 ? Ss 08:56 0:41 postgres: consprod consprod 192.168.1.150(41606) startup
postgres 2427 4.1 0.3 8695020 511756 ? Ss 08:56 0:48 postgres: consprod consprod 192.168.1.152(60876) startup
postgres 2429 3.5 0.3 8695020 497252 ? Ss 08:56 0:40 postgres: consprod consprod 192.168.1.151(41930) startup
postgres 2431 3.7 0.3 8695020 502004 ? Ss 08:56 0:43 postgres: consprod consprod 192.168.1.151(41933) startup
postgres 2435 3.8 0.3 8695020 439080 ? Ss 08:56 0:44 postgres: consprod consprod 192.168.1.157(37656) startup
postgres 2437 3.5 0.3 8695020 482304 ? Ss 08:56 0:41 postgres: consprod consprod 192.168.1.153(45071) startup
postgres 2438 3.4 0.3 8695020 506776 ? Ss 08:56 0:39 postgres: consprod consprod 192.168.1.153(45072) startup
postgres 2441 3.7 0.3 8695020 495652 ? Ss 08:56 0:43 postgres: consprod consprod 192.168.1.153(45075) startup
postgres 2442 3.7 0.3 8695020 503336 ? Ss 08:56 0:43 postgres: consprod consprod 192.168.1.153(45076) startup
postgres 2443 4.0 0.3 8695020 511760 ? Ss 08:56 0:46 postgres: consprod consprod 192.168.1.153(45077) startup
postgres 2445 3.4 0.3 8695020 478632 ? Ss 08:56 0:39 postgres: consprod consprod 192.168.1.155(54413) startup
postgres 2446 3.7 0.3 8695020 464380 ? Ss 08:56 0:43 postgres: consprod consprod 192.168.1.154(46186) startup
postgres 2448 3.8 0.3 8695024 474400 ? Ss 08:56 0:44 postgres: consprod consprod 192.168.1.154(46187) startup
postgres 2449 3.8 0.3 8695020 464932 ? Rs 08:56 0:45 postgres: consprod consprod 192.168.1.154(46216) startup
postgres 2450 3.3 0.3 8695020 515200 ? Ss 08:56 0:38 postgres: consprod consprod 192.168.1.152(60890) startup
postgres 2452 3.3 0.3 8695020 493384 ? Ss 08:56 0:38 postgres: consprod consprod 192.168.1.152(60892) startup
postgres 2453 3.7 0.3 8695020 490168 ? Ss 08:56 0:44 postgres: consprod consprod 192.168.1.152(60893) startup
postgres 2456 3.4 0.3 8695020 449348 ? Ss 08:56 0:40 postgres: consprod consprod 192.168.1.156(50829) startup

Restarting required '-m immediate'.
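
For reference, an immediate-mode restart with pg_ctl looks roughly like the sketch below; the data directory path is an assumption, and the Ubuntu packaging normally wraps this in pg_ctlcluster:

# Force an immediate shutdown and restart; in-progress transactions are aborted
# and crash recovery runs on the way back up.
pg_ctl restart -m immediate -D /var/lib/postgresql/9.1/main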

Stracing reveals lots of semops:

semop(589837, {{3, 1, 0}}, 1) = 0
semop(524299, {{11, 1, 0}}, 1) = 0
semop(557068, {{6, 1, 0}}, 1) = 0
semop(458761, {{2, 1, 0}}, 1) = 0
semop(557068, {{9, 1, 0}}, 1) = 0
read(8, ":\n\0\0\210\261\224\316\4\0\1\0\364\0\0 \0 \4\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
read(8, ":\n\0\0\320\262\224\316\4\0\1\0\364\0\0 \0 \4\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192

backtrace looks like this:
0x00007f7dd1d1f497 in semop () at ../sysdeps/unix/syscall-template.S:82
82 in ../sysdeps/unix/syscall-template.S
#0 0x00007f7dd1d1f497 in semop () at ../sysdeps/unix/syscall-template.S:82
#1 0x00007f7dd3d67b18 in PGSemaphoreLock (sema=0x7f7dce0e8210, interruptOK=0
'\000') at pg_sema.c:418
#2 0x00007f7dd3da67c5 in LWLockAcquire (lockid=FirstBufMappingLock,
mode=LW_SHARED) at lwlock.c:464
#3 0x00007f7dd3d91289 in BufferAlloc (foundPtr=0x7fff84d3b45e "",
strategy=0x7f7dd5f359e0, blockNum=201362, forkNum=MAIN_FORKNUM,
relpersistence=112 'p', smgr=<optimized out>) at bufmgr.c:531
#4 ReadBuffer_common (smgr=0x7f7dd5f39f00, relpersistence=112 'p',
forkNum=MAIN_FORKNUM, blockNum=201362, mode=RBM_NORMAL,
strategy=0x7f7dd5f359e0, hit=0x7fff84d3b4cf "") at bufmgr.c:325
#5 0x00007f7dd3d91ca4 in ReadBufferExtended (reln=0x7f7dd3b17540,
forkNum=MAIN_FORKNUM, blockNum=201362, mode=RBM_NORMAL, strategy=<optimized
out>) at bufmgr.c:246
#6 0x00007f7dd3be3d46 in heapgetpage (scan=0x7f7dd5f351c0, page=201362) at
heapam.c:223
#7 0x00007f7dd3be4732 in heapgettup (scan=0x7f7dd5f351c0, dir=<optimized
out>, nkeys=2, key=0x7f7dd5f358a0) at heapam.c:556
#8 0x00007f7dd3be52f5 in heap_getnext (scan=0x7f7dd5f351c0,
direction=<optimized out>) at heapam.c:1345
#9 0x00007f7dd3e66167 in RelationBuildTupleDesc (relation=0x7f7dd3b23ee8) at
relcache.c:468
#10 RelationBuildDesc (targetRelId=<optimized out>, insertIt=1 '\001') at
relcache.c:882
#11 0x00007f7dd3e66bdd in load_critical_index (indexoid=2655, heapoid=2603) at
relcache.c:3080
#12 0x00007f7dd3e6839b in RelationCacheInitializePhase3 () at relcache.c:2890
#13 0x00007f7dd3e7f931 in InitPostgres (in_dbname=<optimized out>,
dboid=2228467824, username=<optimized out>, out_dbname=0x0) at postinit.c:816
#14 0x00007f7dd3db26d8 in PostgresMain (argc=<optimized out>, argv=<optimized
out>, username=0x7f7dd5e4bc90 "consprod") at postgres.c:3650
#15 0x00007f7dd3d75923 in BackendRun (port=0x7f7dd5e96ea0) at postmaster.c:3606
#16 BackendStartup (port=0x7f7dd5e96ea0) at postmaster.c:3291
#17 ServerLoop () at postmaster.c:1455
#18 0x00007f7dd3d7637c in PostmasterMain (argc=<optimized out>,
argv=<optimized out>) at postmaster.c:1116
#19 0x00007f7dd3bce5b0 in main (argc=5, argv=0x7f7dd5e4b170) at main.c:199
Continuing.

Program received signal SIGUSR1, User defined signal 1.
0x00007f7dd1d1f497 in semop () at ../sysdeps/unix/syscall-template.S:82
82 in ../sysdeps/unix/syscall-template.S
#0 0x00007f7dd1d1f497 in semop () at ../sysdeps/unix/syscall-template.S:82
#1 0x00007f7dd3d67b18 in PGSemaphoreLock (sema=0x7f7dce0e8210, interruptOK=0
'\000') at pg_sema.c:418
#2 0x00007f7dd3da67c5 in LWLockAcquire (lockid=FirstBufMappingLock,
mode=LW_SHARED) at lwlock.c:464
#3 0x00007f7dd3d91289 in BufferAlloc (foundPtr=0x7fff84d3b45e "",
strategy=0x7f7dd5f359e0, blockNum=201362, forkNum=MAIN_FORKNUM,
relpersistence=112 'p', smgr=<optimized out>) at bufmgr.c:531
#4 ReadBuffer_common (smgr=0x7f7dd5f39f00, relpersistence=112 'p',
forkNum=MAIN_FORKNUM, blockNum=201362, mode=RBM_NORMAL,
strategy=0x7f7dd5f359e0, hit=0x7fff84d3b4cf "") at bufmgr.c:325
#5 0x00007f7dd3d91ca4 in ReadBufferExtended (reln=0x7f7dd3b17540,
forkNum=MAIN_FORKNUM, blockNum=201362, mode=RBM_NORMAL, strategy=<optimized
out>) at bufmgr.c:246
#6 0x00007f7dd3be3d46 in heapgetpage (scan=0x7f7dd5f351c0, page=201362) at
heapam.c:223
#7 0x00007f7dd3be4732 in heapgettup (scan=0x7f7dd5f351c0, dir=<optimized
out>, nkeys=2, key=0x7f7dd5f358a0) at heapam.c:556
#8 0x00007f7dd3be52f5 in heap_getnext (scan=0x7f7dd5f351c0,
direction=<optimized out>) at heapam.c:1345
#9 0x00007f7dd3e66167 in RelationBuildTupleDesc (relation=0x7f7dd3b23ee8) at
relcache.c:468
#10 RelationBuildDesc (targetRelId=<optimized out>, insertIt=1 '\001') at
relcache.c:882
#11 0x00007f7dd3e66bdd in load_critical_index (indexoid=2655, heapoid=2603) at
relcache.c:3080
#12 0x00007f7dd3e6839b in RelationCacheInitializePhase3 () at relcache.c:2890
#13 0x00007f7dd3e7f931 in InitPostgres (in_dbname=<optimized out>,
dboid=2228467824, username=<optimized out>, out_dbname=0x0) at postinit.c:816
#14 0x00007f7dd3db26d8 in PostgresMain (argc=<optimized out>, argv=<optimized
out>, username=0x7f7dd5e4bc90 "consprod") at postgres.c:3650
#15 0x00007f7dd3d75923 in BackendRun (port=0x7f7dd5e96ea0) at postmaster.c:3606
#16 BackendStartup (port=0x7f7dd5e96ea0) at postmaster.c:3291
#17 ServerLoop () at postmaster.c:1455
#18 0x00007f7dd3d7637c in PostmasterMain (argc=<optimized out>,
argv=<optimized out>) at postmaster.c:1116
#19 0x00007f7dd3bce5b0 in main (argc=5, argv=0x7f7dd5e4b170) at main.c:199

pg_config output:
BINDIR = /usr/lib/postgresql/9.1/bin
DOCDIR = /usr/share/doc/postgresql
HTMLDIR = /usr/share/doc/postgresql
INCLUDEDIR = /usr/include/postgresql
PKGINCLUDEDIR = /usr/include/postgresql
INCLUDEDIR-SERVER = /usr/include/postgresql/9.1/server
LIBDIR = /usr/lib
PKGLIBDIR = /usr/lib/postgresql/9.1/lib
LOCALEDIR = /usr/share/locale
MANDIR = /usr/share/postgresql/9.1/man
SHAREDIR = /usr/share/postgresql/9.1
SYSCONFDIR = /etc/postgresql-common
PGXS = /usr/lib/postgresql/9.1/lib/pgxs/src/makefiles/pgxs.mk
CONFIGURE = '--build=x86_64-linux-gnu' '--prefix=/usr'
'--includedir=/usr/include' '--mandir=/usr/share/man'
'--infodir=/usr/share/info' '--sysconfdir=/etc' '--localstatedir=/var'
'--libexecdir=/usr/lib/postgresql-9.1' '--disable-maintainer-mode'
'--disable-dependency-tracking' '--mandir=/usr/share/postgresql/9.1/man'
'--with-docdir=/usr/share/doc/postgresql-doc-9.1'
'--sysconfdir=/etc/postgresql-common' '--datadir=/usr/share/postgresql/9.1'
'--bindir=/usr/lib/postgresql/9.1/bin' '--includedir=/usr/include/postgresql/'
'--enable-nls' '--enable-integer-datetimes' '--enable-thread-safety'
'--enable-debug' '--disable-rpath' '--with-tcl' '--with-perl' '--with-python'
'--with-pam' '--with-krb5' '--with-gssapi' '--with-openssl' '--with-libxml'
'--with-libxslt' '--with-ldap' '--with-ossp-uuid' '--with-gnu-ld'
'--with-tclconfig=/usr/lib/tcl8.5' '--with-tkconfig=/usr/lib/tk8.5'
'--with-includes=/usr/include/tcl8.5'
'--with-system-tzdata=/usr/share/zoneinfo' '--with-pgport=5432' 'CFLAGS=-g -O2
-fPIC -fPIC' 'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,--as-needed
-Wl,--as-needed' 'build_alias=x86_64-linux-gnu' 'CPPFLAGS='
CC = gcc
CPPFLAGS = -D_GNU_SOURCE -I/usr/include/libxml2 -I/usr/include/tcl8.5
CFLAGS = -g -O2 -fPIC -fPIC -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels -Wformat-security
-fno-strict-aliasing -fwrapv -fexcess-precision=standard -g
CFLAGS_SL = -fpic
LDFLAGS = -Wl,-Bsymbolic-functions -Wl,--as-needed -Wl,--as-needed -Wl,--as-needed
LDFLAGS_EX =
LDFLAGS_SL =
LIBS = -lpgport -lxslt -lxml2 -lpam -lssl -lcrypto -lkrb5 -lcom_err
-lgssapi_krb5 -lz -ledit -lcrypt -ldl -lm
VERSION = PostgreSQL 9.1.3


From: Jeff Frost <jeff(at)pgexperts(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Subject: Re: 9.1.3 backends getting stuck in 'startup'
Date: 2012-04-27 16:18:41
Message-ID: 4F9AC6E1.6000102@pgexperts.com
Lists: pgsql-bugs

On 04/27/12 09:07, Jeff Frost wrote:
> A few times today, we've seen postgresql 9.1.3 backends on Ubuntu 11.10 x86_64
> get stuck in 'startup' mode. By that I mean the set_ps_output mode. Postgres
> is installed via Martin Pitt's packages.

Quick followup on this: when it happens, you can connect successfully to the
postgres database, but as soon as you \c consprod, you get stuck.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Frost <jeff(at)pgexperts(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: 9.1.3 backends getting stuck in 'startup'
Date: 2012-04-27 17:14:39
Message-ID: 23016.1335546879@sss.pgh.pa.us
Lists: pgsql-bugs

Jeff Frost <jeff(at)pgexperts(dot)com> writes:
> A few times today, we've seen postgresql 9.1.3 backends on Ubuntu 11.10 x86_64
> get stuck in 'startup' mode.

Well, the one you backtraced seems to be waiting for somebody else to
release one of the buffer mapping locks ... which is not a lock I'd
expect to get held long, ever. Could you check all the Postgres
processes and see which of them have backtraces different from that?
We need to figure out what's sitting on that lock and why.

regards, tom lane
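
One way to gather those backtraces in bulk is a small shell loop over the stuck
backends. This is only a sketch: it assumes gdb and the postgresql debug symbols
are installed, and the pgrep filter is illustrative.

# Grab a backtrace from every backend whose ps title currently shows 'startup',
# writing one file per PID so they can be compared afterwards.
for pid in $(pgrep -u postgres -f 'postgres: .* startup'); do
    gdb --batch -ex 'bt' -p "$pid" > "/tmp/bt.$pid" 2>&1
done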


From: Jeff Frost <jeff(at)pgexperts(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: 9.1.3 backends getting stuck in 'startup'
Date: 2012-04-27 18:54:34
Message-ID: 4F9AEB6A.9060706@pgexperts.com
Lists: pgsql-bugs

On 04/27/12 10:14, Tom Lane wrote:
> Jeff Frost <jeff(at)pgexperts(dot)com> writes:
>> A few times today, we've seen postgresql 9.1.3 backends on Ubuntu 11.10 x86_64
>> get stuck in 'startup' mode.
> Well, the one you backtraced seems to be waiting for somebody else to
> release one of the buffer mapping locks ... which is not a lock I'd
> expect to get held long, ever. Could you check all the Postgres
> processes and see which of them have backtraces different from that?
> We need to figure out what's sitting on that lock and why.
>

Thanks Tom. We've failed over to the replica which seems to have the issue
happen less often, so I'll try and get the backtraces as soon as it's reproduced.


From: Jeff Frost <jeff(at)pgexperts(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: 9.1.3 backends getting stuck in 'startup'
Date: 2012-04-27 19:14:06
Message-ID: 4F9AEFFE.6090600@pgexperts.com
Lists: pgsql-bugs

On 04/27/12 11:54, Jeff Frost wrote:
> On 04/27/12 10:14, Tom Lane wrote:
>> Jeff Frost <jeff(at)pgexperts(dot)com> writes:
>>> A few times today, we've seen postgresql 9.1.3 backends on Ubuntu 11.10 x86_64
>>> get stuck in 'startup' mode.
>> Well, the one you backtraced seems to be waiting for somebody else to
>> release one of the buffer mapping locks ... which is not a lock I'd
>> expect to get held long, ever. Could you check all the Postgres
>> processes and see which of them have backtraces different from that?
>> We need to figure out what's sitting on that lock and why.
>>
> Thanks Tom. We've failed over to the replica which seems to have the issue
> happen less often, so I'll try and get the backtraces as soon as it's reproduced.
Alright, found one that's a little different (at least it wasn't in InitPostgres):

#0 0x00007f252f612297 in semop () at ../sysdeps/unix/syscall-template.S:82
#1 0x00007f253160ccc0 in PGSemaphoreLock (sema=0x7f252ba65c00, interruptOK=0
'\000') at pg_sema.c:418
#2 0x00007f253164cea5 in LWLockAcquire (lockid=<value optimized out>,
mode=LW_SHARED) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/storage/lmgr/lwlock.c:464
#3 0x00007f253163889a in BufferAlloc (smgr=0x7f2532c79f60,
relpersistence=<value optimized out>, forkNum=<value optimized out>,
blockNum=<value optimized out>, mode=<value optimized out>, strategy=<value
optimized out>, hit=0x7fffea5f9ecf "") at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/storage/buffer/bufmgr.c:531
#4 ReadBuffer_common (smgr=0x7f2532c79f60, relpersistence=<value optimized
out>, forkNum=<value optimized out>, blockNum=<value optimized out>,
mode=<value optimized out>, strategy=<value optimized out>, hit=0x7fffea5f9ecf
"") at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/storage/buffer/bufmgr.c:325
#5 0x00007f25316391e4 in ReadBufferExtended (reln=0x7f231ee95438,
forkNum=MAIN_FORKNUM, blockNum=166364, mode=RBM_NORMAL, strategy=0xecaf9628)
at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/storage/buffer/bufmgr.c:246
#6 0x00007f2531490fa3 in _bt_relandgetbuf (rel=0x7f231ee95438, obuf=77863,
blkno=1, access=1) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/access/nbtree/nbtpage.c:651
#7 0x00007f2531494cda in _bt_search (rel=0x7f231ee95438, keysz=1,
scankey=0x7fffea5f9ff0, nextkey=<value optimized out>, bufP=0x7fffea5fa9fc,
access=<value optimized out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/access/nbtree/nbtsearch.c:121
#8 0x00007f25314951bf in _bt_first (scan=<value optimized out>, dir=<value
optimized out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/access/nbtree/nbtsearch.c:841
#9 0x00007f2531493954 in btgettuple (fcinfo=<value optimized out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/access/nbtree/nbtree.c:315
#10 0x00007f2531721003 in FunctionCall2Coll (flinfo=0x0, collation=3932134720,
arg1=1, arg2=18446744073709551615) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/utils/fmgr/fmgr.c:1319
#11 0x00007f253148d0d5 in index_getnext (scan=0x7f2532eff820, direction=<value
optimized out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/access/index/indexam.c:499
#12 0x00007f25316c392e in get_actual_variable_range (root=<value optimized
out>, vardata=<value optimized out>, sortop=<value optimized out>, min=<value
optimized out>, max=0x7f2532efc8e0) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/utils/adt/selfuncs.c:4748
#13 0x00007f25316c7969 in ineq_histogram_selectivity (root=<value optimized
out>, vardata=0x7fffea5fb250, opproc=<value optimized out>, isgt=0 '\000',
constval=<value optimized out>, consttype=<value optimized out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/utils/adt/selfuncs.c:805
#14 0x00007f25316c810c in scalarineqsel (root=0x7f2532d18c00, operator=<value
optimized out>, isgt=-1 '\377', vardata=0x7fffea5fb250, constval=47095179,
consttype=<value optimized out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/utils/adt/selfuncs.c:535
#15 0x00007f25316c85fc in mergejoinscansel (root=0x7f2532d18c00, clause=<value
optimized out>, opfamily=<value optimized out>, strategy=<value optimized
out>, nulls_first=0 '\000', leftstart=0x7fffea5fb468, leftend=0x7fffea5fb460,
rightstart=0x7fffea5fb458, rightend=0x7fffea5fb450) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/utils/adt/selfuncs.c:2843
#16 0x00007f25315e11d3 in cached_scansel (path=0x7f2532efbac0,
root=0x7f2532d18c00, sjinfo=<value optimized out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/optimizer/path/costsize.c:2099
#17 cost_mergejoin (path=0x7f2532efbac0, root=0x7f2532d18c00, sjinfo=<value
optimized out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/optimizer/path/costsize.c:1847
#18 0x00007f2531605e40 in create_mergejoin_path (root=0x7f2532d18c00,
joinrel=0x7f2532efbbf0, jointype=JOIN_LEFT, sjinfo=0x7f2532ef3180,
outer_path=0x7f2532ef6e50, inner_path=<value optimized out>,
restrict_clauses=0x7f2532efc310, pathkeys=0x0, mergeclauses=0x7f2532efc540,
outersortkeys=0x7f2532efc4f0, innersortkeys=0x7f2532efc590) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/optimizer/util/pathnode.c:1554
#19 0x00007f25315e8559 in sort_inner_and_outer (root=0x7f2532d18c00,
joinrel=0x7f2532efbbf0, outerrel=<value optimized out>, innerrel=<value
optimized out>, jointype=<value optimized out>, sjinfo=<value optimized out>,
restrictlist=0x7f2532efc310) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/optimizer/path/joinpath.c:305
#20 add_paths_to_joinrel (root=0x7f2532d18c00, joinrel=0x7f2532efbbf0,
outerrel=<value optimized out>, innerrel=<value optimized out>,
jointype=<value optimized out>, sjinfo=<value optimized out>,
restrictlist=0x7f2532efc310) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/optimizer/path/joinpath.c:103
#21 0x00007f25315ea08b in make_join_rel (root=0x7f2532d18c00,
rel1=0x7f2532ee29d0, rel2=0x7f2532ee3b00) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/optimizer/path/joinrels.c:642
#22 0x00007f25315ea339 in make_rels_by_clause_joins (root=0x7f2532d18c00,
level=<value optimized out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/optimizer/path/joinrels.c:268
#23 join_search_one_level (root=0x7f2532d18c00, level=<value optimized out>)
at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/optimizer/path/joinrels.c:99
#24 0x00007f25315dc40b in standard_join_search (root=0x7f2532d18c00,
levels_needed=8, initial_rels=<value optimized out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/optimizer/path/allpaths.c:1108
#25 0x00007f25315dd7cb in make_rel_from_joinlist (root=0x7f2532d18c00,
joinlist=<value optimized out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/optimizer/path/allpaths.c:1004
#26 0x00007f25315f3e76 in query_planner (root=0x7f2532d18c00, tlist=<value
optimized out>, tuple_fraction=0, limit_tuples=<value optimized out>,
cheapest_path=0x7fffea5fb9e0, sorted_path=<value optimized out>,
num_groups=0x7fffea5fb9e8) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/optimizer/plan/planmain.c:271
#27 0x00007f25315f5933 in grouping_planner (root=0x7f2532d18c00,
tuple_fraction=<value optimized out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/optimizer/plan/planner.c:1149
#28 0x00007f25315f7409 in subquery_planner (glob=0x7f2532d18b70,
parse=0x7f2532d97180, parent_root=0x7f2532d1d380, hasRecursion=<value
optimized out>, tuple_fraction=<value optimized out>, subroot=<value optimized
out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/optimizer/plan/planner.c:539
#29 0x00007f25315f76f6 in standard_planner (parse=0x7f2532d97180,
cursorOptions=0, boundParams=0x7f2532d970e0) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/optimizer/plan/planner.c:202
#30 0x00007f2531658bde in pg_plan_query (querytree=0xa0005,
cursorOptions=-362832576, boundParams=0x1) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/tcop/postgres.c:723
#31 0x00007f2531658cd4 in pg_plan_queries (querytrees=<value optimized out>,
cursorOptions=0, boundParams=0x7f2532d970e0) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/tcop/postgres.c:782
#32 0x00007f253165949c in exec_bind_message (input_message=<value optimized
out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/tcop/postgres.c:1742
#33 0x00007f253165b275 in PostgresMain (argc=<value optimized out>,
argv=<value optimized out>, username=<value optimized out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/tcop/postgres.c:3968
#34 0x00007f2531619ad3 in BackendRun () at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/postmaster/postmaster.c:3606
#35 BackendStartup () at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/postmaster/postmaster.c:3291
#36 ServerLoop () at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/postmaster/postmaster.c:1455
#37 0x00007f253161c57c in PostmasterMain (argc=851231024, argv=0x7f2532b8c8f0)
at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/postmaster/postmaster.c:1116
#38 0x00007f25315baec3 in main (argc=3, argv=0x7f2532b8c8e0) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/main/main.c:199

--
Jeff Frost <jeff(at)pgexperts(dot)com>
CTO, PostgreSQL Experts, Inc.
Phone: 1-888-PG-EXPRT x506
FAX: 415-762-5122
http://www.pgexperts.com/


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Frost <jeff(at)pgexperts(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: 9.1.3 backends getting stuck in 'startup'
Date: 2012-04-27 19:17:44
Message-ID: 25630.1335554264@sss.pgh.pa.us
Lists: pgsql-bugs

Jeff Frost <jeff(at)pgexperts(dot)com> writes:
> Alright, found one that's a little different (at least it wasn't in InitPostgres):

It's still blocking at bufmgr.c:531 though ... so AFAICS this is another
victim of somebody monopolizing a buffer mapping lock.

regards, tom lane


From: Jeff Frost <jeff(at)pgexperts(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: 9.1.3 backends getting stuck in 'startup'
Date: 2012-04-27 19:23:48
Message-ID: 4F9AF244.80203@pgexperts.com
Lists: pgsql-bugs

On 04/27/12 12:17, Tom Lane wrote:
> Jeff Frost <jeff(at)pgexperts(dot)com> writes:
>> Alright, found one that's a little different (at least it wasn't in InitPostgres):
> It's still blocking at bufmgr.c:531 though ... so AFAICS this is another
> victim of somebody monopolizing a buffer mapping lock.
>
>
Any idea what I should be looking for in the backtraces?

I would imagine I can ignore any that are in InitPostgres, but that still
leaves quite a few to look through.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Frost <jeff(at)pgexperts(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: 9.1.3 backends getting stuck in 'startup'
Date: 2012-04-27 19:27:11
Message-ID: 25861.1335554831@sss.pgh.pa.us
Lists: pgsql-bugs

Jeff Frost <jeff(at)pgexperts(dot)com> writes:
> Any idea what I should be looking for in the backtraces?

> I would imagine I can ignore any that are in InitPostgres, but that still
> leaves quite a few to look through.

I think you can probably skip all that are blocked in LWLockAcquire
called from bufmgr.c:531, at least for a first pass. Calls from
elsewhere in bufmgr.c might be more interesting, and anything that's not
blocked at an LWLockAcquire at all might be even more interesting.

regards, tom lane
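
Assuming the backtraces were saved one file per PID as in the earlier sketch,
that filtering can be done with grep (file names are illustrative):

# Backtraces that are NOT waiting on the buffer mapping lock at bufmgr.c:531.
grep -L 'bufmgr.c:531' /tmp/bt.*
# Backtraces that are not sitting in LWLockAcquire at all.
grep -L 'LWLockAcquire' /tmp/bt.*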


From: Jeff Frost <jeff(at)pgexperts(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: 9.1.3 backends getting stuck in 'startup'
Date: 2012-04-27 19:45:10
Message-ID: 4F9AF746.9060202@pgexperts.com
Lists: pgsql-bugs

On 04/27/12 12:27, Tom Lane wrote:
> Jeff Frost <jeff(at)pgexperts(dot)com> writes:
>> Any idea what I should be looking for in the backtraces?
>> I would imagine I can ignore any that are in InitPostgres, but that still
>> leaves quite a few to look through.
> I think you can probably skip all that are blocked in LWLockAcquire
> called from bufmgr.c:531, at least for a first pass. Calls from
> elsewhere in bufmgr.c might be more interesting, and anything that's not
> blocked at an LWLockAcquire at all might be even more interesting.
>
> regards, tom lane

Here we go. One that's not blocked on acquiring the lock, but does have
bufmgr.c in it:

#0 0x00007f252f601a50 in __read_nocancel () at
../sysdeps/unix/syscall-template.S:82
#1 0x00007f253163bced in read (file=<value optimized out>,
buffer=0x7f251f3a1fa0 ":\n", amount=<value optimized out>) at
/usr/include/bits/unistd.h:45
#2 FileRead (file=<value optimized out>, buffer=0x7f251f3a1fa0 ":\n",
amount=<value optimized out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/storage/file/fd.c:1191
#3 0x00007f25316547a6 in mdread (reln=<value optimized out>, forknum=<value
optimized out>, blocknum=5825, buffer=0x7f251f3a1fa0 ":\n") at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/storage/smgr/md.c:640
#4 0x00007f25316387f1 in ReadBuffer_common (smgr=0x7f2532d38310,
relpersistence=<value optimized out>, forkNum=<value optimized out>,
blockNum=<value optimized out>, mode=<value optimized out>, strategy=<value
optimized out>, hit=0x7fffea5fb47f "") at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/storage/buffer/bufmgr.c:438
#5 0x00007f25316391e4 in ReadBufferExtended (reln=0x7f2532cf6990,
forkNum=MAIN_FORKNUM, blockNum=5825, mode=RBM_NORMAL, strategy=0x4ff74188) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/storage/buffer/bufmgr.c:246
#6 0x00007f25314817bf in heapgetpage (scan=0x7f2532d335d0, page=5825) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/access/heap/heapam.c:223
#7 0x00007f25314821ba in heapgettup (scan=0x7f2532d335d0, dir=<value
optimized out>, nkeys=<value optimized out>, key=<value optimized out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/access/heap/heapam.c:556
#8 0x00007f25314827bc in heap_getnext (scan=0x7f2532d335d0,
direction=523902880) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/access/heap/heapam.c:1345
#9 0x00007f2531711570 in RelationBuildTupleDesc (targetRelId=<value optimized
out>, insertIt=<value optimized out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/utils/cache/relcache.c:468
#10 RelationBuildDesc (targetRelId=<value optimized out>, insertIt=<value
optimized out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/utils/cache/relcache.c:882
#11 0x00007f2531712f26 in RelationIdGetRelation (relationId=2616) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/utils/cache/relcache.c:1568
#12 0x00007f2531485aab in relation_open (relationId=2616, lockmode=523902880)
at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/access/heap/heapam.c:907
#13 0x00007f2531485b33 in heap_open (relationId=6, lockmode=523902880) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/access/heap/heapam.c:1077
#14 0x00007f25317106d7 in LookupOpclassInfo (operatorClassOid=10035,
numSupport=1) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/utils/cache/relcache.c:1278
#15 0x00007f2531710c48 in IndexSupportInitialize (relation=<value optimized
out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/utils/cache/relcache.c:1159
#16 RelationInitIndexAccessInfo (relation=<value optimized out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/utils/cache/relcache.c:1102
#17 0x00007f2531711d91 in RelationBuildDesc (targetRelId=<value optimized
out>, insertIt=<value optimized out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/utils/cache/relcache.c:904
#18 0x00007f253171201d in load_critical_index (indexoid=2693, heapoid=2618) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/utils/cache/relcache.c:3080
#19 0x00007f25317124d2 in RelationCacheInitializePhase3 () at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/utils/cache/relcache.c:2892
#20 0x00007f25317293e9 in InitPostgres (in_dbname=<value optimized out>,
dboid=<value optimized out>, username=<value optimized out>, out_dbname=<value
optimized out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/utils/init/postinit.c:816
#21 0x00007f253165a908 in PostgresMain (argc=828737552, argv=<value optimized
out>, username=0x7f2532b90820 "monitoring") at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/tcop/postgres.c:3650
#22 0x00007f2531619ad3 in BackendRun () at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/postmaster/postmaster.c:3606
#23 BackendStartup () at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/postmaster/postmaster.c:3291
#24 ServerLoop () at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/postmaster/postmaster.c:1455
#25 0x00007f253161c57c in PostmasterMain (argc=851231024, argv=0x7f2532b8c8f0)
at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/postmaster/postmaster.c:1116
#26 0x00007f25315baec3 in main (argc=3, argv=0x7f2532b8c8e0) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/main/main.c:199

--
Jeff Frost <jeff(at)pgexperts(dot)com>
CTO, PostgreSQL Experts, Inc.
Phone: 1-888-PG-EXPRT x506
FAX: 415-762-5122
http://www.pgexperts.com/


From: Jeff Frost <jeff(at)pgexperts(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: 9.1.3 backends getting stuck in 'startup'
Date: 2012-04-27 20:36:23
Message-ID: 4F9B0347.8020302@pgexperts.com
Lists: pgsql-bugs

> I think you can probably skip all that are blocked in LWLockAcquire
> called from bufmgr.c:531, at least for a first pass. Calls from
> elsewhere in bufmgr.c might be more interesting, and anything that's not
> blocked at an LWLockAcquire at all might be even more interesting.

A few more that include bufmgr.c:

#0 0x00007f62b6150583 in __select_nocancel () at
../sysdeps/unix/syscall-template.S:82
#1 0x00007f62b82948ca in pg_usleep (microsec=<value optimized out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/port/pgsleep.c:43
#2 0x00007f62b8159678 in CheckpointWriteDelay (flags=<value optimized out>,
progress=<value optimized out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/postmaster/bgwriter.c:722
#3 0x00007f62b81806ab in BufferSync (flags=64) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/storage/buffer/bufmgr.c:1283
#4 CheckPointBuffers (flags=64) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/storage/buffer/bufmgr.c:1764
#5 0x00007f62b7fea0e3 in CheckPointGuts (checkPointRedo=..., flags=64) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/access/transam/xlog.c:7989
#6 0x00007f62b7ff5a3c in CreateCheckPoint (flags=64) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/access/transam/xlog.c:7797
#7 0x00007f62b8159a46 in BackgroundWriterMain () at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/postmaster/bgwriter.c:502
#8 0x00007f62b80115c2 in AuxiliaryProcessMain (argc=<value optimized out>,
argv=0x7fff9045db00) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/bootstrap/bootstrap.c:422
#9 0x00007f62b815fe8b in StartChildProcess (type=BgWriterProcess) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/postmaster/postmaster.c:4504
#10 0x00007f62b8162ec2 in reaper (postgres_signal_arg=<value optimized out>)
at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/postmaster/postmaster.c:2424
#11 <signal handler called>
#12 0x00007f62b6150583 in __select_nocancel () at
../sysdeps/unix/syscall-template.S:82
#13 0x00007f62b81614d3 in ServerLoop () at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/postmaster/postmaster.c:1415
#14 0x00007f62b816457c in PostmasterMain (argc=-1186270896,
argv=0x7f62b9471190) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/postmaster/postmaster.c:1116
#15 0x00007f62b8102ec3 in main (argc=5, argv=0x7f62b9471170) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/main/main.c:199

#0 0x00007f62b6149a50 in __read_nocancel () at
../sysdeps/unix/syscall-template.S:82
#1 0x00007f62b8183ced in read (file=<value optimized out>,
buffer=0x7f613ff51fa0 ":\n", amount=<value optimized out>) at
/usr/include/bits/unistd.h:45
#2 FileRead (file=<value optimized out>, buffer=0x7f613ff51fa0 ":\n",
amount=<value optimized out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/storage/file/fd.c:1191
#3 0x00007f62b819c7a6 in mdread (reln=<value optimized out>, forknum=<value
optimized out>, blocknum=39946, buffer=0x7f613ff51fa0 ":\n") at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/storage/smgr/md.c:640
#4 0x00007f62b81807f1 in ReadBuffer_common (smgr=0x7f62b955d090,
relpersistence=<value optimized out>, forkNum=<value optimized out>,
blockNum=<value optimized out>, mode=<value optimized out>, strategy=<value
optimized out>, hit=0x7fff9045d5cf "") at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/storage/buffer/bufmgr.c:438
#5 0x00007f62b81811e4 in ReadBufferExtended (reln=0x7f62b7eef540,
forkNum=MAIN_FORKNUM, blockNum=39946, mode=RBM_NORMAL, strategy=0x4ff74188) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/storage/buffer/bufmgr.c:246
#6 0x00007f62b7fc97bf in heapgetpage (scan=0x7f62b9558350, page=39946) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/access/heap/heapam.c:223
#7 0x00007f62b7fca1ba in heapgettup (scan=0x7f62b9558350, dir=<value
optimized out>, nkeys=<value optimized out>, key=<value optimized out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/access/heap/heapam.c:556
#8 0x00007f62b7fca7bc in heap_getnext (scan=0x7f62b9558350,
direction=1073029024) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/access/heap/heapam.c:1345
#9 0x00007f62b8259570 in RelationBuildTupleDesc (targetRelId=<value optimized
out>, insertIt=<value optimized out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/utils/cache/relcache.c:468
#10 RelationBuildDesc (targetRelId=<value optimized out>, insertIt=<value
optimized out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/utils/cache/relcache.c:882
#11 0x00007f62b825af26 in RelationIdGetRelation (relationId=2601) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/utils/cache/relcache.c:1568
#12 0x00007f62b7fcdaab in relation_open (relationId=2601, lockmode=1073029024)
at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/access/heap/heapam.c:907
#13 0x00007f62b7fcdb33 in heap_open (relationId=8, lockmode=1073029024) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/access/heap/heapam.c:1077
#14 0x00007f62b8252d82 in SearchCatCache (cache=0x7f62b94c9e70, v1=403,
v2=<value optimized out>, v3=0, v4=<value optimized out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/utils/cache/catcache.c:1184
#15 0x00007f62b8258a30 in RelationInitIndexAccessInfo
(relation=0x7f62b7efa828) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/utils/cache/relcache.c:1010
#16 0x00007f62b8259d91 in RelationBuildDesc (targetRelId=<value optimized
out>, insertIt=<value optimized out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/utils/cache/relcache.c:904
#17 0x00007f62b825a01d in load_critical_index (indexoid=2655, heapoid=2603) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/utils/cache/relcache.c:3080
#18 0x00007f62b825a4c3 in RelationCacheInitializePhase3 () at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/utils/cache/relcache.c:2890
#19 0x00007f62b82713e9 in InitPostgres (in_dbname=<value optimized out>,
dboid=<value optimized out>, username=<value optimized out>, out_dbname=<value
optimized out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/utils/init/postinit.c:816
#20 0x00007f62b81a2908 in PostgresMain (argc=-1206253552, argv=<value
optimized out>, username=0x7f62b9471c90 "consprod") at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/tcop/postgres.c:3650
#21 0x00007f62b8161ad3 in BackendRun () at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/postmaster/postmaster.c:3606
#22 BackendStartup () at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/postmaster/postmaster.c:3291
#23 ServerLoop () at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/postmaster/postmaster.c:1455
#24 0x00007f62b816457c in PostmasterMain (argc=-1186270896,
argv=0x7f62b9471190) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/postmaster/postmaster.c:1116
#25 0x00007f62b8102ec3 in main (argc=5, argv=0x7f62b9471170) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/main/main.c:199

and I've got 81 more that do not contain bufmgr.c and are also not blocked on
LWLockAcquire.


From: Jeff Frost <jeff(at)pgexperts(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: 9.1.3 backends getting stuck in 'startup'
Date: 2012-04-27 21:57:36
Message-ID: 4F9B1650.9080906@pgexperts.com
Lists: pgsql-bugs

It's tough to catch the thing in the act on the new server, as the stuck
backends always eventually clear... though they still sit in startup for up to
30 seconds on occasion.

A few here that look interesting:

#0 0x00007f62b6150583 in __select_nocancel () at
../sysdeps/unix/syscall-template.S:82
#1 0x00007f62b82948ca in pg_usleep (microsec=<value optimized out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/port/pgsleep.c:43
#2 0x00007f62b8195119 in s_lock (lock=0x7f60a5ff3540 "", file=0x7f62b83a7408
"/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/storage/lmgr/lwlock.c",
line=<value optimized out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/storage/lmgr/s_lock.c:113
#3 0x00007f62b8194cb7 in LWLockRelease (lockid=38) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/storage/lmgr/lwlock.c:593
#4 0x00007f62b81808dd in BufferAlloc (smgr=0x7f62b95a2910,
relpersistence=<value optimized out>, forkNum=<value optimized out>,
blockNum=<value optimized out>, mode=<value optimized out>, strategy=<value
optimized out>, hit=0x7fff9045d92f "") at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/storage/buffer/bufmgr.c:745
#5 ReadBuffer_common (smgr=0x7f62b95a2910, relpersistence=<value optimized
out>, forkNum=<value optimized out>, blockNum=<value optimized out>,
mode=<value optimized out>, strategy=<value optimized out>, hit=0x7fff9045d92f
"") at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/storage/buffer/bufmgr.c:325
#6 0x00007f62b81811e4 in ReadBufferExtended (reln=0x7f60a58bea30,
forkNum=MAIN_FORKNUM, blockNum=14895, mode=RBM_NORMAL, strategy=0x18a994d6) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/storage/buffer/bufmgr.c:246
#7 0x00007f62b7fd5110 in index_getnext (scan=0x7f62b978dc90, direction=<value
optimized out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/access/index/indexam.c:514
#8 0x00007f62b80e2946 in IndexNext (node=0x7f62b978cb08) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/executor/nodeIndexscan.c:78
#9 0x00007f62b80d77b7 in ExecScanFetch (node=0x7f62b978cb08, accessMtd=<value
optimized out>, recheckMtd=<value optimized out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/executor/execScan.c:82
#10 ExecScan (node=0x7f62b978cb08, accessMtd=<value optimized out>,
recheckMtd=<value optimized out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/executor/execScan.c:132
#11 0x00007f62b80d0068 in ExecProcNode (node=0x7f62b978cb08) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/executor/execProcnode.c:398
#12 0x00007f62b80e51c7 in ExecMergeJoin (node=0x7f62b978c6b8) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/executor/nodeMergejoin.c:1285
#13 0x00007f62b80cffc8 in ExecProcNode (node=0x7f62b978c6b8) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/executor/execProcnode.c:443
#14 0x00007f62b80dba99 in ExecAppend (node=0x7f62b9783ee0) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/executor/nodeAppend.c:209
#15 0x00007f62b80d00a8 in ExecProcNode (node=0x7f62b9783ee0) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/executor/execProcnode.c:375
#16 0x00007f62b80dda58 in agg_fill_hash_table (node=0x7f62b97837d0) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/executor/nodeAgg.c:1237
#17 ExecAgg (node=0x7f62b97837d0) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/executor/nodeAgg.c:1007
#18 0x00007f62b80cff78 in ExecProcNode (node=0x7f62b97837d0) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/executor/execProcnode.c:466
#19 0x00007f62b80ceed2 in ExecutePlan (queryDesc=0x7f62b9783130,
direction=NoMovementScanDirection, count=0) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/executor/execMain.c:1439
#20 standard_ExecutorRun (queryDesc=0x7f62b9783130,
direction=NoMovementScanDirection, count=0) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/executor/execMain.c:313
#21 0x00007f62b81a45c7 in PortalRunSelect (portal=0x7f62b94afdf0,
forward=<value optimized out>, count=0, dest=0x7f62b9543080) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/tcop/pquery.c:943
#22 0x00007f62b81a5a30 in PortalRun (portal=<value optimized out>,
count=<value optimized out>, isTopLevel=<value optimized out>, dest=<value
optimized out>, altdest=<value optimized out>, completionTag=<value optimized
out>) at /build/buildd/postgresql-9.1-9.1.3/build/../src/backend/tcop/pquery.c:787
#23 0x00007f62b81a310c in exec_execute_message (argc=<value optimized out>,
argv=<value optimized out>, username=<value optimized out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/tcop/postgres.c:1963
#24 PostgresMain (argc=<value optimized out>, argv=<value optimized out>,
username=<value optimized out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/tcop/postgres.c:3983
#25 0x00007f62b8161ad3 in BackendRun () at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/postmaster/postmaster.c:3606
#26 BackendStartup () at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/postmaster/postmaster.c:3291
#27 ServerLoop () at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/postmaster/postmaster.c:1455
#28 0x00007f62b816457c in PostmasterMain (argc=-1186270896,
argv=0x7f62b9471190) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/postmaster/postmaster.c:1116
#29 0x00007f62b8102ec3 in main (argc=5, argv=0x7f62b9471170) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/main/main.c:199
=================
[Thread debugging using libthread_db enabled]
PinBuffer (buf=0x7f60ac8748f0, strategy=0x0)
at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/storage/buffer/bufmgr.c:1060
1060
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/storage/buffer/bufmgr.c:
No such file or directory.
in
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/storage/buffer/bufmgr.c
#0 PinBuffer (buf=0x7f60ac8748f0, strategy=0x0) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/storage/buffer/bufmgr.c:1060
#1 0x00007f62b81808d2 in BufferAlloc (smgr=0x7f62b961be00,
relpersistence=<value optimized out>, forkNum=<value optimized out>,
blockNum=<value optimized out>, mode=<value optimized out>, strategy=<value
optimized out>, hit=0x7fff9045d98f "") at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/storage/buffer/bufmgr.c:742
#2 ReadBuffer_common (smgr=0x7f62b961be00, relpersistence=<value optimized
out>, forkNum=<value optimized out>, blockNum=<value optimized out>,
mode=<value optimized out>, strategy=<value optimized out>, hit=0x7fff9045d98f
"") at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/storage/buffer/bufmgr.c:325
#3 0x00007f62b81811e4 in ReadBufferExtended (reln=0x7f62b9787d78,
forkNum=MAIN_FORKNUM, blockNum=865, mode=RBM_NORMAL, strategy=0xa01962db) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/storage/buffer/bufmgr.c:246
#4 0x00007f62b7fc97bf in heapgetpage (scan=0x7f62b9874730, page=865) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/access/heap/heapam.c:223
#5 0x00007f62b7fc9cea in heapgettup_pagemode (scan=0x7f62b9874730, dir=<value
optimized out>, nkeys=<value optimized out>, key=0x0) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/access/heap/heapam.c:817
#6 0x00007f62b7fca786 in heap_getnext (scan=0x7f62b9874730,
direction=NoMovementScanDirection) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/access/heap/heapam.c:1342
#7 0x00007f62b80e7f8b in SeqNext (node=<value optimized out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/executor/nodeSeqscan.c:66
#8 0x00007f62b80d769a in ExecScanFetch (node=0x7f62b9873800, accessMtd=<value
optimized out>, recheckMtd=<value optimized out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/executor/execScan.c:82
#9 ExecScan (node=0x7f62b9873800, accessMtd=<value optimized out>,
recheckMtd=<value optimized out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/executor/execScan.c:167
#10 0x00007f62b80d0078 in ExecProcNode (node=0x7f62b9873800) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/executor/execProcnode.c:394
#11 0x00007f62b80e8a69 in ExecSort (node=0x7f62b9873560) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/executor/nodeSort.c:103
#12 0x00007f62b80cff98 in ExecProcNode (node=0x7f62b9873560) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/executor/execProcnode.c:458
#13 0x00007f62b80ceed2 in ExecutePlan (queryDesc=0x7f62b9871970,
direction=NoMovementScanDirection, count=0) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/executor/execMain.c:1439
#14 standard_ExecutorRun (queryDesc=0x7f62b9871970,
direction=NoMovementScanDirection, count=0) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/executor/execMain.c:313
#15 0x00007f62b81a45c7 in PortalRunSelect (portal=0x7f62b94afdf0,
forward=<value optimized out>, count=0, dest=0x7f62b9543080) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/tcop/pquery.c:943
#16 0x00007f62b81a5a30 in PortalRun (portal=<value optimized out>,
count=<value optimized out>, isTopLevel=<value optimized out>, dest=<value
optimized out>, altdest=<value optimized out>, completionTag=<value optimized
out>) at /build/buildd/postgresql-9.1-9.1.3/build/../src/backend/tcop/pquery.c:787
#17 0x00007f62b81a310c in exec_execute_message (argc=<value optimized out>,
argv=<value optimized out>, username=<value optimized out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/tcop/postgres.c:1963
#18 PostgresMain (argc=<value optimized out>, argv=<value optimized out>,
username=<value optimized out>) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/tcop/postgres.c:3983
#19 0x00007f62b8161ad3 in BackendRun () at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/postmaster/postmaster.c:3606
#20 BackendStartup () at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/postmaster/postmaster.c:3291
#21 ServerLoop () at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/postmaster/postmaster.c:1455
#22 0x00007f62b816457c in PostmasterMain (argc=-1186270896,
argv=0x7f62b9471190) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/postmaster/postmaster.c:1116
#23 0x00007f62b8102ec3 in main (argc=5, argv=0x7f62b9471170) at
/build/buildd/postgresql-9.1-9.1.3/build/../src/backend/main/main.c:199

--
Jeff Frost <jeff(at)pgexperts(dot)com>
CTO, PostgreSQL Experts, Inc.
Phone: 1-888-PG-EXPRT x506
FAX: 415-762-5122
http://www.pgexperts.com/


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Frost <jeff(at)pgexperts(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: 9.1.3 backends getting stuck in 'startup'
Date: 2012-04-28 00:30:58
Message-ID: 9495.1335573058@sss.pgh.pa.us
Lists: pgsql-bugs

Jeff Frost <jeff(at)pgexperts(dot)com> writes:
> and I've got 81 more that do not contain bufmgr.c and are also not blocked on
> LWLockAcquire.

Hm ... no smoking gun in what you showed so far. I also took another
look through 9.1 bufmgr.c, and I'm darned if I can see any code path
there that holds a buffer mapping lock for any long interval.

One possible theory is that you're using pg_buffercache_pages(), which
does take all those locks. It tries its best to not hold them long,
but with a sufficiently large buffer cache there could be an issue.
(What is the shared_buffers setting on this installation, anyway?)

regards, tom lane
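
A quick way to check both points, as a sketch; the database name is an
assumption, and this only finds pg_buffercache if it was installed as a
9.1-style extension (an old contrib-script install would not appear in
pg_extension):

# Is pg_buffercache installed as an extension in this database?
psql -d consprod -c "SELECT extname, extversion FROM pg_extension WHERE extname = 'pg_buffercache';"
# What is shared_buffers set to?
psql -d consprod -c "SHOW shared_buffers;"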


From: Jeff Frost <jeff(at)pgexperts(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: 9.1.3 backends getting stuck in 'startup'
Date: 2012-04-28 00:45:43
Message-ID: 4F9B3DB7.5020308@pgexperts.com
Lists: pgsql-bugs

On 04/27/12 17:30, Tom Lane wrote:
> Jeff Frost <jeff(at)pgexperts(dot)com> writes:
>> and I've got 81 more that do not contain bufmgr.c and are also not blocked on
>> LWLockAcquire.
> Hm ... no smoking gun in what you showed so far. I also took another
> look through 9.1 bufmgr.c, and I'm darned if I can see any code path
> there that holds a buffer mapping lock for any long interval.
>
> One possible theory is that you're using pg_buffercache_pages(), which
> does take all those locks. It tries its best to not hold them long,
> but with a sufficiently large buffer cache there could be an issue.
> (What is the shared_buffers setting on this installation, anyway?)
>

Tom, what does the startup code path read? The list of relations or
something? Just wondering what the contention would be.

Oh, good idea! Looks like pg_buffercache is installed in this DB. Customer
reports that it has been installed since the server has existed (and on the
previous server) but is not currently being used, though the issue with the
hanging startups did not start until this morning. Could it still cause
contention even if it's not being executed?

shared_buffers is 8GB on this machine which has 128GB of RAM.

It seems the pooling software that's being used has a max connection lifetime
of 1 hour, which is what makes this so painful.

--
Jeff Frost <jeff(at)pgexperts(dot)com>
CTO, PostgreSQL Experts, Inc.
Phone: 1-888-PG-EXPRT x506
FAX: 415-762-5122
http://www.pgexperts.com/


From: Jeff Frost <jeff(at)pgexperts(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: 9.1.3 backends getting stuck in 'startup'
Date: 2012-04-28 01:05:28
Message-ID: 4F9B4258.6050602@pgexperts.com
Lists: pgsql-bugs

On 04/27/12 17:45, Jeff Frost wrote:
> Oh, good idea! Looks like pg_buffercache is installed in this DB. Customer
> reports that it has been installed since the server has existed (and on the
> previous server) but is not currently being used, though the issue with the
> hanging startups did not start until this morning. Could it still cause
> contention even if it's not being executed?

BTW, just to be safe, it is now removed.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Frost <jeff(at)pgexperts(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: 9.1.3 backends getting stuck in 'startup'
Date: 2012-04-28 01:15:23
Message-ID: 10406.1335575723@sss.pgh.pa.us
Lists: pgsql-bugs

Jeff Frost <jeff(at)pgexperts(dot)com> writes:
> Oh, good idea! Looks like pg_buffercache is installed in this DB. Customer
> reports that it has been installed since the server has existed (and on the
> previous server) but is not currently being used, though the issue with the
> hanging startups did not start until this morning. Could it still cause
> contention even if it's not being executed?

No, it shouldn't. Perhaps we should start asking questions around the
idea of "so what changed this morning?". There has to have been some
triggering cause, a configuration change or application change or
something.

regards, tom lane


From: Jeff Frost <jeff(at)pgexperts(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: 9.1.3 backends getting stuck in 'startup'
Date: 2012-04-28 01:27:34
Message-ID: 4F9B4786.4000104@pgexperts.com
Lists: pgsql-bugs

On 04/27/12 18:15, Tom Lane wrote:
> No, it shouldn't. Perhaps we should start asking questions around the
> idea of "so what changed this morning?". There has to have been some
> triggering cause, a configuration change or application change or
> something.
Oh, we've been asking those questions all day long!

No packages were installed; postgresql.conf was last edited on 3/29/2012, and
the last application update was on 4/24/2012. There were some DDL changes, but
those were reverted and the problem persists. To make it more interesting, today is a
slow day.

And we just saw the issue recur with pg_buffercache uninstalled. :-/

--
Jeff Frost <jeff(at)pgexperts(dot)com>
CTO, PostgreSQL Experts, Inc.
Phone: 1-888-PG-EXPRT x506
FAX: 415-762-5122
http://www.pgexperts.com/


From: Jeff Frost <jeff(at)pgexperts(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: 9.1.3 backends getting stuck in 'startup'
Date: 2012-04-28 01:34:43
Message-ID: 4F9B4933.5080604@pgexperts.com
Lists: pgsql-bugs

On 04/27/12 18:27, Jeff Frost wrote:
> To make it more interesting, today is a
> slow day.

And since it's a slow day... one more question: any further logging we could do
to help find the culprit?


From: Jeff Frost <jeff(at)pgexperts(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: 9.1.3 backends getting stuck in 'startup'
Date: 2012-04-28 03:02:30
Message-ID: 3D4E57DB-F408-4FAD-B563-97D57CED9820@pgexperts.com
Lists: pgsql-bugs


On Apr 27, 2012, at 6:34 PM, Jeff Frost wrote:

> On 04/27/12 18:27, Jeff Frost wrote:
>> To make it more interesting, today is a
>> slow day.
>
> And since it's a slow day... one more question: any further logging we could do
> to help find the culprit?

FYI, reindexed the system tables and reduced shared_buffers to 2GB. Still seeing the issue with regularity.
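
Roughly what that maintenance amounts to, as a sketch (the exact commands are
not shown in the thread; REINDEX SYSTEM must be run by a superuser connected to
the database being reindexed and cannot run inside a transaction block):

# Rebuild all system catalog indexes in the consprod database.
psql -d consprod -c "REINDEX SYSTEM consprod;"
# shared_buffers was then lowered to 2GB in postgresql.conf and the server restarted.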


From: Jeff Frost <jeff(at)pgexperts(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: 9.1.3 backends getting stuck in 'startup'
Date: 2012-04-29 00:17:25
Message-ID: 343438C7-5C1A-47B9-BBE8-AE0E6DC57804@pgexperts.com
Lists: pgsql-bugs


On Apr 27, 2012, at 6:15 PM, Tom Lane wrote:

> Jeff Frost <jeff(at)pgexperts(dot)com> writes:
>> Oh, good idea! Looks like pg_buffercache is installed in this DB. Customer
>> reports that it has been installed since the server has existed (and on the
>> previous server) but is not currently being used, though the issue with the
>> hanging startups did not start until this morning. Could it still cause
>> contention even if it's not being executed?
>
> No, it shouldn't. Perhaps we should start asking questions around the
> idea of "so what changed this morning?". There has to have been some
> triggering cause, a configuration change or application change or
> something.

Since I had a theory that it's probably stalling on pg_catalog access, one of the guys wrote a test harness that makes several connections and creates and drops lots of temp tables. That does seem to allow us to reproduce the issue in the production DB, but not in a newly created DB. So, we're about to initdb a new data directory and slony replicate the data over to it on the original server.
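
The harness itself is not posted in the thread; a rough shell approximation of
the idea (several concurrent sessions, each creating and dropping many
temporary tables) might look like the following. The database name, session
count, and iteration count are invented for illustration:

# Spawn 8 concurrent sessions, each churning 5000 temporary tables.
# Sustained temp-table churn like this generates large numbers of dead
# pg_class/pg_attribute rows, which is how the catalogs bloat.
for s in 1 2 3 4 5 6 7 8; do
    psql -X -q -d consprod <<'SQL' &
DO $$
BEGIN
    FOR i IN 1 .. 5000 LOOP
        EXECUTE 'CREATE TEMP TABLE tmp_churn (id int, payload text)';
        EXECUTE 'DROP TABLE tmp_churn';
    END LOOP;
END
$$;
SQL
done
wait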


From: Jeff Frost <jeff(at)pgexperts(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: 9.1.3 backends getting stuck in 'startup'
Date: 2012-04-30 16:41:09
Message-ID: 4F9EC0A5.2060009@pgexperts.com
Lists: pgsql-bugs

On 04/28/12 17:17, Jeff Frost wrote:
> Since I had a theory that it's probably stalling on pg_catalog access, one of the guys wrote a test harness that makes several connections and creates and drops lots of temp tables. That does seem to allow us to reproduce the issue in the production DB, but not in a newly created DB. So, we're about to initdb a new data directory and slony replicate the data over to it on the original server.

After using Slony to replicate into the freshly initdb'd data directory on the
original server, the problem has not recurred. Maybe the problem was catalog
corruption?


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Frost <jeff(at)pgexperts(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: 9.1.3 backends getting stuck in 'startup'
Date: 2012-05-24 19:21:37
Message-ID: 28484.1337887297@sss.pgh.pa.us
Lists: pgsql-bugs

Jeff Frost <jeff(at)pgexperts(dot)com> writes:
> A few times today, we've seen postgresql 9.1.3 backends on Ubuntu 11.10 x86_64
> get stuck in 'startup' mode. By that I mean the set_ps_output mode. Postgres
> is installed via Martin Pitt's packages.

I took another look at your original report here, and noticed something
that had escaped me the first time: the backtrace shows that what that
backend is doing is loading critical relcache entries the hard way, that
is, via seqscans on the system catalogs. This happens when the relcache
cache file (pg_internal.init) is missing or stale. A backend that has
to go through that will write a new pg_internal.init file when it's
done; but if many sessions are connecting at the same time, they'll each
have to do it until someone succeeds in creating a new cache file.

Another piece of evidence is here:

> #5 0x00007f7dd3d91ca4 in ReadBufferExtended (reln=0x7f7dd3b17540,
> forkNum=MAIN_FORKNUM, blockNum=201362, mode=RBM_NORMAL, strategy=<optimized
> out>) at bufmgr.c:246

It's possible that gdb is lying to us, but it sure looks like this read
request is for block 201362 of pg_attribute (I know it's pg_attribute
from the location of the heap_getnext call). That's 1.5GB into the
relation, and who knows if it's anywhere near the end yet.
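
At the default 8 kB block size, block 201362 is roughly 201362 x 8192 bytes
into the relation, which can be checked with, for example:

select pg_size_pretty(201362::bigint * 8192);  -- about 1573 MB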

So in other words, we've got a boatload of backends all seqscanning a
pretty darn large relation, and doing so multiple times. That's
probably enough to explain why it took a long time to finish
initializing; the locking business is likely a red herring.

How big is pg_attribute normally in this database? I'm suspicious that
what triggered this is some major bloating of pg_attribute (and maybe
some of the other catalogs too). Once you had that, relcache cache file
invalidations are a fairly routine occurrence --- I think an autovacuum
on one of the system catalogs would trigger that, for example.
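
One sketch of a way to check that on the affected database, assuming the
statistics collector has been tracking the system catalogs:

select relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
from pg_stat_sys_tables
where relname in ('pg_class', 'pg_attribute', 'pg_index');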

regards, tom lane


From: Jeff Frost <jeff(at)pgexperts(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: 9.1.3 backends getting stuck in 'startup'
Date: 2012-05-24 20:40:59
Message-ID: 4FBE9CDB.3030009@pgexperts.com
Lists: pgsql-bugs

On 05/24/12 12:21, Tom Lane wrote:
> Jeff Frost <jeff(at)pgexperts(dot)com> writes:
>> A few times today, we've seen postgresql 9.1.3 backends on Ubuntu 11.10 x86_64
>> get stuck in 'startup' mode. By that I mean the set_ps_output mode. Postgres
>> is installed via Martin Pitt's packages.
> I took another look at your original report here, and noticed something
> that had escaped me the first time: the backtrace shows that what that
> backend is doing is loading critical relcache entries the hard way, that
> is, via seqscans on the system catalogs. This happens when the relcache
> cache file (pg_internal.init) is missing or stale. A backend that has
> to go through that will write a new pg_internal.init file when it's
> done; but if many sessions are connecting at the same time, they'll each
> have to do it until someone succeeds in creating a new cache file.
>
> Another piece of evidence is here:
>
>> #5 0x00007f7dd3d91ca4 in ReadBufferExtended (reln=0x7f7dd3b17540,
>> forkNum=MAIN_FORKNUM, blockNum=201362, mode=RBM_NORMAL, strategy=<optimized
>> out>) at bufmgr.c:246
> It's possible that gdb is lying to us, but it sure looks like this read
> request is for block 201362 of pg_attribute (I know it's pg_attribute
> from the location of the heap_getnext call). That's 1.5GB into the
> relation, and who knows if it's anywhere near the end yet.
>
> So in other words, we've got a boatload of backends all seqscanning a
> pretty darn large relation, and doing so multiple times. That's
> probably enough to explain why it took a long time to finish
> initializing; the locking business is likely a red herring.
>
> How big is pg_attribute normally in this database? I'm suspicious that
> what triggered this is some major bloating of pg_attribute (and maybe
> some of the other catalogs too). Once you had that, relcache cache file
> invalidations are a fairly routine occurrence --- I think an autovacuum
> on one of the system catalogs would trigger that, for example.
>
> regards, tom lane

Interesting! Fortunately, that data directory is still lying around even
though we initdb'd to get a working system.

So, the current working system's pg_attribute looks like:

select pg_size_pretty(pg_relation_size('pg_catalog.pg_attribute'));
pg_size_pretty
----------------
1671 MB
(1 row)

select pg_size_pretty(pg_total_relation_size('pg_catalog.pg_attribute'));
pg_size_pretty
----------------
2439 MB

and on the old broken system:

select pg_size_pretty(pg_relation_size('pg_catalog.pg_attribute'));
pg_size_pretty
----------------
1858 MB
(1 row)

select pg_size_pretty(pg_total_relation_size('pg_catalog.pg_attribute'));
pg_size_pretty
----------------
1907 MB
(1 row)

So, interestingly, they're both quite large, but the old broken system is
quite a bit larger. Would any other data points be helpful?

--
Jeff Frost <jeff(at)pgexperts(dot)com>
CTO, PostgreSQL Experts, Inc.
Phone: 1-888-PG-EXPRT x506
FAX: 415-762-5122
http://www.pgexperts.com/


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Frost <jeff(at)pgexperts(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: 9.1.3 backends getting stuck in 'startup'
Date: 2012-05-24 22:13:06
Message-ID: 2196.1337897586@sss.pgh.pa.us
Lists: pgsql-bugs

Jeff Frost <jeff(at)pgexperts(dot)com> writes:
> On 05/24/12 12:21, Tom Lane wrote:
>> How big is pg_attribute normally in this database? I'm suspicious that
>> what triggered this is some major bloating of pg_attribute (and maybe
>> some of the other catalogs too).

> So, the current working system's pg_attribute looks like:
> select pg_size_pretty(pg_relation_size('pg_catalog.pg_attribute'));
> pg_size_pretty
> ----------------
> 1671 MB
> (1 row)

> and on the old broken system:
> select pg_size_pretty(pg_relation_size('pg_catalog.pg_attribute'));
> pg_size_pretty
> ----------------
> 1858 MB
> (1 row)

Huh. A bit bigger, but not by that much. It doesn't seem like this
would be enough to make seqscan performance fall off a cliff, as it
apparently did. Unless maybe the slightly-bloated catalogs were just a
bit too large to fit in RAM, and so the repeated seqscans went from
finding all their data in kernel disk buffers to finding none of it?

> So, interestingly, they're both quite large, but the old broken system is
> quite a bit larger. Would any other data points be helpful?

I think it would be interesting to get the pg_relation_size for pg_class
plus pg_attribute plus pg_index (which I think is everything that gets
seqscanned in this way) on both systems, and see how those numbers
match up to total RAM on the box.
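
For example, a single query along those lines, run in the affected database,
might be:

select pg_size_pretty(
         pg_relation_size('pg_catalog.pg_class')
       + pg_relation_size('pg_catalog.pg_attribute')
       + pg_relation_size('pg_catalog.pg_index'));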

regards, tom lane


From: Jeff Frost <jeff(at)pgexperts(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: 9.1.3 backends getting stuck in 'startup'
Date: 2012-05-24 22:25:37
Message-ID: 7A939C5B-ABA4-4CF4-9473-F48989399813@pgexperts.com
Lists: pgsql-bugs


On May 24, 2012, at 3:13 PM, Tom Lane wrote:

> Jeff Frost <jeff(at)pgexperts(dot)com> writes:
>> On 05/24/12 12:21, Tom Lane wrote:
>
> Huh. A bit bigger, but not by that much. It doesn't seem like this
> would be enough to make seqscan performance fall off a cliff, as it
> apparently did. Unless maybe the slightly-bloated catalogs were just a
> bit too large to fit in RAM, and so the repeated seqscans went from
> finding all their data in kernel disk buffers to finding none of it?

Seems unlikely.

>
>> So, interestingly, they're both quite large, but the old broken system is
> quite a bit larger. Would any other data points be helpful?
>
> I think it would be interesting to get the pg_relation_size for pg_class
> plus pg_attribute plus pg_index (which I think is everything that gets
> seqscanned in this way) on both systems, and see how those numbers
> match up to total RAM on the box.

Server has 128GB of RAM.

Currently running system:

select pg_size_pretty(pg_relation_size('pg_class'));
pg_size_pretty
----------------
181 MB
(1 row)

select pg_size_pretty(pg_relation_size('pg_index'));
pg_size_pretty
----------------
23 MB
(1 row)

Old broken system:

select pg_size_pretty(pg_relation_size('pg_class'));
pg_size_pretty
----------------
221 MB
(1 row)

select pg_size_pretty(pg_relation_size('pg_index'));
pg_size_pretty
----------------
44 MB
(1 row)

BTW, when I connected to it this time, it took a really long time before my psql was able to send a query, so it seems to still be broken at least.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Frost <jeff(at)pgexperts(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: 9.1.3 backends getting stuck in 'startup'
Date: 2012-05-24 22:35:46
Message-ID: 2784.1337898946@sss.pgh.pa.us
Lists: pgsql-bugs

Jeff Frost <jeff(at)pgexperts(dot)com> writes:
> On May 24, 2012, at 3:13 PM, Tom Lane wrote:
>> Huh. A bit bigger, but not by that much. It doesn't seem like this
>> would be enough to make seqscan performance fall off a cliff, as it
>> apparently did. Unless maybe the slightly-bloated catalogs were just a
>> bit too large to fit in RAM, and so the repeated seqscans went from
>> finding all their data in kernel disk buffers to finding none of it?

> Seems unlikely.
> Server has 128GB of RAM.

Hm ... sure seems like that ought to be enough ;-), even with a near-2GB
pg_attribute table. What's your shared_buffers setting?

> BTW, when I connected to it this time, it took a really long time before my psql was able to send a query, so it seems to still be broken at least.

Yeah, I was afraid that re-initdb would be at best a temporary solution.

It would probably be useful to confirm the theory that relcache rebuild
is what makes it stall. You should be able to manually remove the
pg_internal.init file in the database's directory; then connect with
psql, and time how long it takes before the pg_internal.init file
reappears.
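
The per-database copy of that file lives under the data directory at
base/<database oid>/pg_internal.init; one quick way to find the right
subdirectory, as a sketch:

select oid, datname from pg_database where datname = current_database();
-- the file is then $PGDATA/base/<oid>/pg_internal.init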

regards, tom lane


From: Jeff Frost <jeff(at)pgexperts(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: 9.1.3 backends getting stuck in 'startup'
Date: 2012-05-24 22:39:34
Message-ID: 3B46C63B-A770-45C5-8889-2796C5E3AB84@pgexperts.com
Lists: pgsql-bugs


On May 24, 2012, at 3:35 PM, Tom Lane wrote:

> Jeff Frost <jeff(at)pgexperts(dot)com> writes:
>> On May 24, 2012, at 3:13 PM, Tom Lane wrote:
>>> Huh. A bit bigger, but not by that much. It doesn't seem like this
>>> would be enough to make seqscan performance fall off a cliff, as it
>>> apparently did. Unless maybe the slightly-bloated catalogs were just a
>>> bit too large to fit in RAM, and so the repeated seqscans went from
>>> finding all their data in kernel disk buffers to finding none of it?
>
>> Seems unlikely.
>> Server has 128GB of RAM.
>
> Hm ... sure seems like that ought to be enough ;-), even with a near-2GB
> pg_attribute table. What's your shared_buffers setting?

It's 8GB.

>
>> BTW, when I connected to it this time, it took a really long time before my psql was able to send a query, so it seems to still be broken at least.
>
> Yeah, I was afraid that re-initdb would be at best a temporary solution.

Oh, sorry, I wasn't clear on that. The currently running system is still happy, but the old data directory got stuck in 'startup' for a while when I connected via psql.

>
> It would probably be useful to confirm the theory that relcache rebuild
> is what makes it stall. You should be able to manually remove the
> pg_internal.init file in the database's directory; then connect with
> psql, and time how long it takes before the pg_internal.init file
> reappears.

So, you're thinking autovac invalidates the cache and causes it to be rebuilt, then a bunch of new connections get stalled as they all wait for the rebuild?

I'll see if I can get the customer to move the data directory to a test system so I can futz with it on a non production system.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Frost <jeff(at)pgexperts(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: 9.1.3 backends getting stuck in 'startup'
Date: 2012-05-25 14:51:20
Message-ID: 21111.1337957480@sss.pgh.pa.us
Lists: pgsql-bugs

Jeff Frost <jeff(at)pgexperts(dot)com> writes:
> On May 24, 2012, at 3:35 PM, Tom Lane wrote:
>> Jeff Frost <jeff(at)pgexperts(dot)com> writes:
>>> BTW, when I connected to it this time, it took a really long time before my psql was able to send a query, so it seems to still be broken at least.

>> Yeah, I was afraid that re-initdb would be at best a temporary solution.

> Oh, sorry, I wasn't clear on that. The currently running system is still happy, but the old data directory got stuck in 'startup' for a while when I connected via psql.

Hm. That means there is some contributing factor we haven't identified,
but what?

>> It would probably be useful to confirm the theory that relcache rebuild
>> is what makes it stall. You should be able to manually remove the
>> pg_internal.init file in the database's directory; then connect with
>> psql, and time how long it takes before the pg_internal.init file
>> reappears.

> So, you're thinking autovac invalidates the cache and causes it to be rebuilt, then a bunch of new connections get stalled as they all wait for the rebuild?

Well, it's worse than that: all the incoming sessions each try to
rebuild the cache file themselves. So I suspect there's some
"thundering herd" effect on your production system, in addition to
the fact that it's just plain not cheap. However, if you're seeing
significant stall with just a single incoming connection (the old
system isn't getting actively used, right?) then there is possibly
yet another effect that we need to identify.

As the next step, I'd suggest verifying that the stall is reproducible
if you remove pg_internal.init (and that it's not there otherwise), and
then strace'ing the single incoming connection to see what it's doing.

regards, tom lane