Re: Autovacuum daemon terminated by signal 11

Lists: pgsql-generalpgsql-hackers
From: "Justin Pasher" <justinp(at)newmediagateway(dot)com>
To: <pgsql-general(at)postgresql(dot)org>
Subject: Autovacuum daemon terminated by signal 11
Date: 2009-01-15 03:36:41
Message-ID: 1A40D51C8E364DECBF90C045C043FA82@taz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Hello,

I have a server running PostgreSQL 8.1.15-0etch1 (Debian etch) that was
recently put into production. Last week a developer started having a problem
with his psql connection being terminated every couple of minutes when he
was running a query. When I look through the logs, I noticed this message.

2009-01-09 08:09:46 CST LOG: autovacuum process (PID 15012) was terminated
by signal 11

I looked through the logs some more and I noticed that this was occurring
every minute or so. The database is a pretty heavily utilized system
(judging by the age(datfrozenxid) from pg_database, the system had run
approximately 500 million queries in less than a week). I noticed that right
before every autovacuum termination, it tried to autovacuum a database.

2009-01-09 08:09:46 CST LOG: transaction ID wrap limit is 4563352, limited
by database "database_name"

It was always showing the same database, so I decided to manually vacuum the
database. Once that was done (it was successful the first time without
errors), the problem seemed to go away. I went ahead and manually vacuumed
the remaining databases just to take care of the potential xid wraparound
issue.

I figured it was just an isolated incident, until it started happening again
this week. Same scenario as before: over 500 million queries since the
beginning of this week and autovacuum dying on the same database every time.
However, the problematic database was different than last time, so it
doesn't seem to be specific to one particular database.

Looking through the archives I've seen this (exact?) same problem crop up
before, but it was addressed in Postgres 8.1.1

http://archives.postgresql.org/pgsql-bugs/2006-01/msg00014.php

This article also mentioned the previous bug was related to triggers on the
table, but the second time this happened to me, the database in question
only has two simple tables (no triggers, one foreign key linking them, a few
btree indices).

What else can I do to determine the cause of this? For the time being, I
should be able to setup a cron job to run a manual vacuum every other day to
ensure that age(datfrozenxid) stays low, but I'd like to understand what
would be causing this.

Justin Pasher


From: Yi Zhao <yi(dot)zhao(at)alibaba-inc(dot)com>
To: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: how can I returns a set of integer in a plpgsql function?
Date: 2009-01-15 07:37:57
Message-ID: 1232005077.3109.9.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

hi, all:
there is a function in postgresql contrib "int_arrgreagte":

CREATE OR REPLACE FUNCTION int_array_enum(int4[])
RETURNS setof integer
AS '$libdir/int_aggregate','int_enum'
LANGUAGE C IMMUTABLE STRICT;

I can use this function like this:
chry=# SELECT int_array_enum('{1,2}');
or
chry=# SELECT * from int_array_enum('{1,2}');
result:
int_array_enum
----------------
1
2

also, I can use it like this:
SELECT int_array_enum('{1,2}'), int_array_enum('{1,3}');
result:
int_array_enum | int_array_enum
----------------+----------------
1 | 1
2 | 3

I try to write my own function with the same return type in plpgsql:
create or replace function my_int_array_enum(state integer[]) returns
setof integer as $$
declare
i integer;
begin
for i in array_lower(state,1)..array_upper(state,1) loop
return next state[i];
end loop;
return;
end;
$$ language 'plpgsql' immutable;

but, when I use my function like this: test=# SELECT
my_int_array_enum('{1,2}');
I got the error said as below:
ERROR: set-valued function called in context that cannot accept a set
(I know, SELECT * from my_int_array_enum('{1,2}') is ok)

can anybody tell me the reason, thanks for any help:D

regards,


From: Richard Huxton <dev(at)archonet(dot)com>
To: Justin Pasher <justinp(at)newmediagateway(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Autovacuum daemon terminated by signal 11
Date: 2009-01-15 09:47:41
Message-ID: 496F063D.1020308@archonet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Justin Pasher wrote:
> Hello,
>
> I have a server running PostgreSQL 8.1.15-0etch1 (Debian etch) that was
> recently put into production. Last week a developer started having a problem
> with his psql connection being terminated every couple of minutes when he
> was running a query. When I look through the logs, I noticed this message.
>
> 2009-01-09 08:09:46 CST LOG: autovacuum process (PID 15012) was terminated
> by signal 11

Segmentation fault - probably a bug or bad RAM.

> I looked through the logs some more and I noticed that this was occurring
> every minute or so. The database is a pretty heavily utilized system
> (judging by the age(datfrozenxid) from pg_database, the system had run
> approximately 500 million queries in less than a week). I noticed that right
> before every autovacuum termination, it tried to autovacuum a database.
>
> 2009-01-09 08:09:46 CST LOG: transaction ID wrap limit is 4563352, limited
> by database "database_name"
>
> It was always showing the same database, so I decided to manually vacuum the
> database. Once that was done (it was successful the first time without
> errors), the problem seemed to go away. I went ahead and manually vacuumed
> the remaining databases just to take care of the potential xid wraparound
> issue.

I'd be suspicious of possible corruption in autovacuum's internal data.
Can you trace these problems back to a power-outage or system crash? It
doesn't look like "database_name" itself since you vacuumed that
successfully. If autovacuum is running normally now, that might indicate
it was something in the way autovacuum was keeping track of "database_name".

It's also probably worth running some memory tests on the server -
(memtest86 or similar) to see if that shows anything. Was it *always*
the autovacuum process getting sig11? If not then it might just be a
pattern of usage that makes it more likely to use some bad RAM.

--
Richard Huxton
Archonet Ltd


From: Justin Pasher <justinp(at)newmediagateway(dot)com>
To: Richard Huxton <dev(at)archonet(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Autovacuum daemon terminated by signal 11
Date: 2009-01-15 18:35:24
Message-ID: 496F81EC.1050408@newmediagateway.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Richard Huxton wrote:
> Justin Pasher wrote:
>
>> Hello,
>>
>> I have a server running PostgreSQL 8.1.15-0etch1 (Debian etch) that was
>> recently put into production. Last week a developer started having a problem
>> with his psql connection being terminated every couple of minutes when he
>> was running a query. When I look through the logs, I noticed this message.
>>
>> 2009-01-09 08:09:46 CST LOG: autovacuum process (PID 15012) was terminated
>> by signal 11
>>
>
> Segmentation fault - probably a bug or bad RAM.
>

It's a relatively new machine, but that's obviously a possibility with
any hardware. I haven't seen any other programs experiencing problems on
the box, but the Postgres daemon is the one that is primarily utilized,
so it's a little biased toward that.

>> I looked through the logs some more and I noticed that this was occurring
>> every minute or so. The database is a pretty heavily utilized system
>> (judging by the age(datfrozenxid) from pg_database, the system had run
>> approximately 500 million queries in less than a week). I noticed that right
>> before every autovacuum termination, it tried to autovacuum a database.
>>
>> 2009-01-09 08:09:46 CST LOG: transaction ID wrap limit is 4563352, limited
>> by database "database_name"
>>
>> It was always showing the same database, so I decided to manually vacuum the
>> database. Once that was done (it was successful the first time without
>> errors), the problem seemed to go away. I went ahead and manually vacuumed
>> the remaining databases just to take care of the potential xid wraparound
>> issue.
>>
>
> I'd be suspicious of possible corruption in autovacuum's internal data.
> Can you trace these problems back to a power-outage or system crash? It
> doesn't look like "database_name" itself since you vacuumed that
> successfully. If autovacuum is running normally now, that might indicate
> it was something in the way autovacuum was keeping track of "database_name".
>

The server hasn't been rebooted since it was installed (about 9 months
ago, but only being utilized within the past month), so there haven't
been any crashes or power outages. The only abnormal things I can find
in the Postgres logs are the autovacuum segfaults. Looking in the logs
today, it looks like it's still happening (once again on a different
database). I manually vacuumed that one database and the problem went
away (for now).

Are there any internal Postgres tables I can look at that may shed some
light on this? Any particular maintenance commands that could be run for
repair?

> It's also probably worth running some memory tests on the server -
> (memtest86 or similar) to see if that shows anything. Was it *always*
> the autovacuum process getting sig11? If not then it might just be a
> pattern of usage that makes it more likely to use some bad RAM

I might try the memtest if we can actually get the databases off of the
server to allow some downtime. None of the logs indicate anything else
acting abnormally or being terminated abnormally, just the autovacuum
daemon. From what I can tell, the segfaults only when the databases pass
the half way point (when age(datfrozenxid) exceeds around 1500000000).
When this is not the case, the segfaults do not occur according to the logs.

Justin Pasher


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Justin Pasher <justinp(at)newmediagateway(dot)com>
Cc: Richard Huxton <dev(at)archonet(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Autovacuum daemon terminated by signal 11
Date: 2009-01-15 18:42:01
Message-ID: 20090115184201.GH6440@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

>> Justin Pasher wrote:

> Are there any internal Postgres tables I can look at that may shed some
> light on this? Any particular maintenance commands that could be run for
> repair?

Please obtain a backtrace from the core file. If there's no core file,
please set "ulimit -c unlimited" in the Postgres start script and
restart it so that it is allowed to produce one.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Justin Pasher <justinp(at)newmediagateway(dot)com>
Cc: Richard Huxton <dev(at)archonet(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Autovacuum daemon terminated by signal 11
Date: 2009-01-15 18:43:29
Message-ID: 9507.1232045009@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Justin Pasher <justinp(at)newmediagateway(dot)com> writes:
> Richard Huxton wrote:
>> Segmentation fault - probably a bug or bad RAM.

> It's a relatively new machine, but that's obviously a possibility with
> any hardware. I haven't seen any other programs experiencing problems on
> the box, but the Postgres daemon is the one that is primarily utilized,
> so it's a little biased toward that.

I agree that the behavior seems a bit too specific to be a hardware
issue.

Can you get a stack trace from the crash? You might need to restart the
postmaster under "ulimit -c unlimited" to get a core dump from the
crashing autovacuum process.

regards, tom lane


From: Justin Pasher <justinp(at)newmediagateway(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Richard Huxton <dev(at)archonet(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Autovacuum daemon terminated by signal 11
Date: 2009-01-15 22:02:09
Message-ID: 496FB261.8040303@newmediagateway.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Tom Lane wrote:
> Justin Pasher <justinp(at)newmediagateway(dot)com> writes:
>
>> Richard Huxton wrote:
>>
>>> Segmentation fault - probably a bug or bad RAM.
>>>
>> It's a relatively new machine, but that's obviously a possibility with
>> any hardware. I haven't seen any other programs experiencing problems on
>> the box, but the Postgres daemon is the one that is primarily utilized,
>> so it's a little biased toward that.
>>
>
> I agree that the behavior seems a bit too specific to be a hardware
> issue.
>
> Can you get a stack trace from the crash? You might need to restart the
> postmaster under "ulimit -c unlimited" to get a core dump from the
> crashing autovacuum process.
>
> regards, tom lane
>

I'm working on getting the database running on another server so I can
perform more tests. So far I was able to get a copy of the cluster up
and running. Once the autovacuum process kicked in, it started
experiencing the same segfault on the new box. At this point, the
hardware on the original box no longer seems to be a culprit (assuming
the data files themselves aren't corrupted and I didn't just bring the
corruption along with the cluster).

I'll let you know when I get a chance to get a core dump from the
process. I assume I will need a version of Postgres built with debug
symbols for it to be useful? I'm not seeing one in the standard Debian
repositories, so I might have to compile from source.

Justin Pasher


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Justin Pasher <justinp(at)newmediagateway(dot)com>
Cc: Richard Huxton <dev(at)archonet(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Autovacuum daemon terminated by signal 11
Date: 2009-01-15 22:16:33
Message-ID: 13201.1232057793@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Justin Pasher <justinp(at)newmediagateway(dot)com> writes:
> I'll let you know when I get a chance to get a core dump from the
> process. I assume I will need a version of Postgres built with debug
> symbols for it to be useful? I'm not seeing one in the standard Debian
> repositories, so I might have to compile from source.

Having debug symbols would be more useful, but unless the binary is
totally stripped, a backtrace might provide enough info without that.
Try it and see if you get any function names in the trace, or only
numbers.

(BTW, does Debian have anything comparable to Red Hat's debuginfo
packages? You might be able to get the debug symbols without having
to recompile for yourself. Recompile is a bit of a pain since you have
to take care to match the original compilation options exactly.)

regards, tom lane


From: Justin Pasher <justinp(at)newmediagateway(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Autovacuum daemon terminated by signal 11
Date: 2009-01-15 22:30:01
Message-ID: 496FB8E9.4010203@newmediagateway.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Tom Lane wrote:
> Having debug symbols would be more useful, but unless the binary is
> totally stripped, a backtrace might provide enough info without that.
> Try it and see if you get any function names in the trace, or only
> numbers.
>
> (BTW, does Debian have anything comparable to Red Hat's debuginfo
> packages? You might be able to get the debug symbols without having
> to recompile for yourself. Recompile is a bit of a pain since you have
> to take care to match the original compilation options exactly.)
>
> regards, tom lane

Alrighty. Here's what I got (assuming I did this right). My untrained
eyes see some stuff regarding memory allocation. I wonder if overly
aggressive memory related tweaks in the config file are causing the
problem? I don't recall making any changes to the config file within a
short time period of the problem starting, but let me know if I need to
post any config settings.

hostname:/var/lib/postgresql/8.1/mc-db2# gdb
/usr/lib/postgresql/8.1/bin/postmaster core
GNU gdb 6.4.90-debian
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i486-linux-gnu"...(no debugging symbols found)
Using host libthread_db library "/lib/tls/i686/cmov/libthread_db.so.1".

warning: Can't read pathname for load map: Input/output error.
Reading symbols from /lib/libpam.so.0...(no debugging symbols found)...done.
Loaded symbols for /lib/libpam.so.0
Reading symbols from /usr/lib/i686/cmov/libssl.so.0.9.8...(no debugging
symbols found)...done.
Loaded symbols for /usr/lib/i686/cmov/libssl.so.0.9.8
Reading symbols from /usr/lib/i686/cmov/libcrypto.so.0.9.8...(no
debugging symbols found)...done.
Loaded symbols for /usr/lib/i686/cmov/libcrypto.so.0.9.8
Reading symbols from /usr/lib/libkrb5.so.3...(no debugging symbols
found)...done.
Loaded symbols for /usr/lib/libkrb5.so.3
Reading symbols from /lib/libcom_err.so.2...(no debugging symbols
found)...done.
Loaded symbols for /lib/libcom_err.so.2
Reading symbols from /lib/tls/i686/cmov/libcrypt.so.1...
(no debugging symbols found)...done.
Loaded symbols for /lib/tls/i686/cmov/libcrypt.so.1
Reading symbols from /lib/tls/i686/cmov/libdl.so.2...(no debugging
symbols found)...done.
Loaded symbols for /lib/tls/i686/cmov/libdl.so.2
Reading symbols from /lib/tls/i686/cmov/libm.so.6...(no debugging
symbols found)...done.
Loaded symbols for /lib/tls/i686/cmov/libm.so.6
Reading symbols from /lib/tls/i686/cmov/libc.so.6...(no debugging
symbols found)...done.
Loaded symbols for /lib/tls/i686/cmov/libc.so.6
Reading symbols from /usr/lib/libz.so.1...(no debugging symbols
found)...done.
Loaded symbols for /usr/lib/libz.so.1
Reading symbols from /usr/lib/libk5crypto.so.3...
(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libk5crypto.so.3
Reading symbols from /lib/tls/i686/cmov/libresolv.so.2...(no debugging
symbols found)...done.
Loaded symbols for /lib/tls/i686/cmov/libresolv.so.2
Reading symbols from /lib/ld-linux.so.2...(no debugging symbols
found)...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /usr/lib/libkrb5support.so.0...(no debugging
symbols found)...done.
Loaded symbols for /usr/lib/libkrb5support.so.0
(no debugging symbols found)
Core was generated by `postgres: autovacuum process
mc_dairyqueen '.
Program terminated with signal 11, Segmentation fault.
#0 0x0827441d in MemoryContextAlloc ()
(gdb) bt
#0 0x0827441d in MemoryContextAlloc ()
#1 0x08274467 in MemoryContextStrdup ()
#2 0x0826501c in database_getflatfilename ()
#3 0x0826504e in database_getflatfilename ()
#4 0x08265ec1 in AtEOXact_UpdateFlatFiles ()
#5 0x080a9111 in RecordTransactionCommit ()
#6 0x080a93a7 in CommitTransactionCommand ()
#7 0x081a6c3b in autovac_stopped ()
#8 0x081a75cd in autovac_start ()
#9 0x081ae33c in ClosePostmasterPorts ()
#10 0x081af058 in PostmasterMain ()
#11 0x0816b3e2 in main ()


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Justin Pasher <justinp(at)newmediagateway(dot)com>
Cc: pgsql-general(at)postgresql(dot)org, Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Subject: Re: Autovacuum daemon terminated by signal 11
Date: 2009-01-15 22:55:26
Message-ID: 13734.1232060126@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Justin Pasher <justinp(at)newmediagateway(dot)com> writes:
> Program terminated with signal 11, Segmentation fault.
> #0 0x0827441d in MemoryContextAlloc ()
> (gdb) bt
> #0 0x0827441d in MemoryContextAlloc ()
> #1 0x08274467 in MemoryContextStrdup ()
> #2 0x0826501c in database_getflatfilename ()
> #3 0x0826504e in database_getflatfilename ()
> #4 0x08265ec1 in AtEOXact_UpdateFlatFiles ()
> #5 0x080a9111 in RecordTransactionCommit ()
> #6 0x080a93a7 in CommitTransactionCommand ()
> #7 0x081a6c3b in autovac_stopped ()
> #8 0x081a75cd in autovac_start ()
> #9 0x081ae33c in ClosePostmasterPorts ()
> #10 0x081af058 in PostmasterMain ()
> #11 0x0816b3e2 in main ()

Hmm. This isn't very trustworthy for lack of debug symbols (what we're
probably looking at are the nearest global function names before the
actual locations). However, it strongly suggests that something is
broken in the active memory context, and the most likely explanations
for that are either a memory clobber (eg overrunning the requested size
of a chunk) or CurrentMemoryContext pointing at a context that was
already freed. The latter theory ties into the fact that it seems to be
happening during transaction end. But any such bug of either type
should have been found years ago given that development is invariably
done with CLOBBER_FREED_MEMORY enabled.

Alvaro, any thoughts? Remember this is 8.1.15.

Justin, if you do feel like recompiling, please do so with original
configure options (run pg_config to verify) and add --enable-cassert
as well as --enable-debug. That might give us more to work with.

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Justin Pasher <justinp(at)newmediagateway(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Autovacuum daemon terminated by signal 11
Date: 2009-01-15 23:23:45
Message-ID: 20090115232345.GJ6440@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Tom Lane wrote:

> Hmm. This isn't very trustworthy for lack of debug symbols (what we're
> probably looking at are the nearest global function names before the
> actual locations). However, it strongly suggests that something is
> broken in the active memory context, and the most likely explanations
> for that are either a memory clobber (eg overrunning the requested size
> of a chunk) or CurrentMemoryContext pointing at a context that was
> already freed. The latter theory ties into the fact that it seems to be
> happening during transaction end. But any such bug of either type
> should have been found years ago given that development is invariably
> done with CLOBBER_FREED_MEMORY enabled.
>
> Alvaro, any thoughts? Remember this is 8.1.15.

Not really. It seems like this must be happening on the vicinity of
process_whole_db(), which is a less used code path than do_autovacuum(),
so it's more likely to have bugs. I don't see anything obviously wrong
though.

I note that process_whole_db is not changing to AutovacMemCxt the way
do_autovacuum() does, but I don't see any way that this could cause a
problem.

Hmm, vacuum() creates a new memory context under PortalContext, but I
don't see that one set anywhere on the autovacuum path ... is that
bogus?

The lack of debug symbols makes this all mere guesses though. The
backtrace did not make a lot of sense to me.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Justin Pasher <justinp(at)newmediagateway(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Autovacuum daemon terminated by signal 11
Date: 2009-01-15 23:44:04
Message-ID: 14418.1232063044@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Tom Lane wrote:
>> Hmm. This isn't very trustworthy for lack of debug symbols (what we're
>> probably looking at are the nearest global function names before the
>> actual locations).

> The lack of debug symbols makes this all mere guesses though. The
> backtrace did not make a lot of sense to me.

I read it like this:

#0 0x0827441d in MemoryContextAlloc () <-- real
#1 0x08274467 in MemoryContextStrdup () <-- real
#2 0x0826501c in database_getflatfilename () <-- real
#3 0x0826504e in database_getflatfilename () <-- must be write_database_file
#4 0x08265ec1 in AtEOXact_UpdateFlatFiles () <-- real
#5 0x080a9111 in RecordTransactionCommit () <-- must be CommitTransaction
#6 0x080a93a7 in CommitTransactionCommand () <-- real
#7 0x081a6c3b in autovac_stopped () <-- must be process_whole_db
#8 0x081a75cd in autovac_start () <-- real
#9 0x081ae33c in ClosePostmasterPorts () <-- must be ServerLoop
#10 0x081af058 in PostmasterMain ()
#11 0x0816b3e2 in main ()

although this requires one or two leaps of faith about single-call
static functions getting inlined so that they don't produce a callstack
entry (in particular that must have happened to AutoVacMain). In any
case, it's very hard to see how MemoryContextAlloc would dump core
unless the method pointer of the context it was pointed to was
clobbered. So I'm pretty sure that's what happened, and now we must
work backwards to how it happened,

Justin, it's entirely possible that the only way we'll figure it out
is for a developer to go poking at the entrails. Are you in a position
to give Alvaro or me ssh access to your test machine?

regards, tom lane


From: Justin Pasher <justinp(at)newmediagateway(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Autovacuum daemon terminated by signal 11
Date: 2009-01-16 00:05:39
Message-ID: 496FCF53.6040509@newmediagateway.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Tom Lane wrote:
> I read it like this:
>
> #0 0x0827441d in MemoryContextAlloc () <-- real
> #1 0x08274467 in MemoryContextStrdup () <-- real
> #2 0x0826501c in database_getflatfilename () <-- real
> #3 0x0826504e in database_getflatfilename () <-- must be write_database_file
> #4 0x08265ec1 in AtEOXact_UpdateFlatFiles () <-- real
> #5 0x080a9111 in RecordTransactionCommit () <-- must be CommitTransaction
> #6 0x080a93a7 in CommitTransactionCommand () <-- real
> #7 0x081a6c3b in autovac_stopped () <-- must be process_whole_db
> #8 0x081a75cd in autovac_start () <-- real
> #9 0x081ae33c in ClosePostmasterPorts () <-- must be ServerLoop
> #10 0x081af058 in PostmasterMain ()
> #11 0x0816b3e2 in main ()
>
> although this requires one or two leaps of faith about single-call
> static functions getting inlined so that they don't produce a callstack
> entry (in particular that must have happened to AutoVacMain). In any
> case, it's very hard to see how MemoryContextAlloc would dump core
> unless the method pointer of the context it was pointed to was
> clobbered. So I'm pretty sure that's what happened, and now we must
> work backwards to how it happened,
>
> Justin, it's entirely possible that the only way we'll figure it out
> is for a developer to go poking at the entrails. Are you in a position
> to give Alvaro or me ssh access to your test machine?
>
> regards, tom lane
>

I'm currently working on recompiling Postgres with the new configure
parameters. I'm trying to go the easier route by downloading the Debian
source package, add the new options, compile, then install the package.
Hopefully this will give the closest possible binary to the current one.
Incidentally, the --enable-debug option is already set for the Debian
package (I did have to add --enable-cassert though). I'll let you know
once I get it up if things work properly.

As far as access to the machine, I'll contact you off-list if I can work
something out for that. The data is not overly sensitive, but it's still
client data nonetheless. I'll try to make a copy of the cluster and try
to reduce the database count and see if I can still duplicate the problem.

Thanks.

Justin Pasher


From: Justin Pasher <justinp(at)newmediagateway(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Autovacuum daemon terminated by signal 11
Date: 2009-01-16 21:34:22
Message-ID: 4970FD5E.4000905@newmediagateway.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Tom Lane wrote:
> I read it like this:
>
> #0 0x0827441d in MemoryContextAlloc () <-- real
> #1 0x08274467 in MemoryContextStrdup () <-- real
> #2 0x0826501c in database_getflatfilename () <-- real
> #3 0x0826504e in database_getflatfilename () <-- must be write_database_file
> #4 0x08265ec1 in AtEOXact_UpdateFlatFiles () <-- real
> #5 0x080a9111 in RecordTransactionCommit () <-- must be CommitTransaction
> #6 0x080a93a7 in CommitTransactionCommand () <-- real
> #7 0x081a6c3b in autovac_stopped () <-- must be process_whole_db
> #8 0x081a75cd in autovac_start () <-- real
> #9 0x081ae33c in ClosePostmasterPorts () <-- must be ServerLoop
> #10 0x081af058 in PostmasterMain ()
> #11 0x0816b3e2 in main ()
>
> although this requires one or two leaps of faith about single-call
> static functions getting inlined so that they don't produce a callstack
> entry (in particular that must have happened to AutoVacMain). In any
> case, it's very hard to see how MemoryContextAlloc would dump core
> unless the method pointer of the context it was pointed to was
> clobbered. So I'm pretty sure that's what happened, and now we must
> work backwards to how it happened,
>
> Justin, it's entirely possible that the only way we'll figure it out
> is for a developer to go poking at the entrails. Are you in a position
> to give Alvaro or me ssh access to your test machine?
>
> regards, tom lane

OK. Here's an update on this.

I was able to reduce the database cluster down to just one real database
(aside from template0/1 and postgres) and I was still getting the
segfault. I was even able to delete all the data from a lot of the
sensitive tables and still get the segfault. At least this means it's
easier for me to give access to the DB now if need be.

I recompiled from the Debian source package and added --enable-cassert
(--enable-debug was already there). I replaced the Debian standard
packages with the recompiled versions and started up the cluster. Now it
is hitting a failure on one of the assert lines, and the log message is
a little different.

2009-01-16 15:24:48 CST LOG: transaction ID wrap limit is 1076038431,
limited by database "template1"
TRAP: BadArgument("!(((context) != ((void *)0) &&
(((((Node*)((context)))->type) == T_AllocSetContext))))", File:
"mcxt.c", Line: 502)
2009-01-16 15:24:52 CST LOG: autovacuum process (PID 7066) was
terminated by signal 6
2009-01-16 15:24:52 CST LOG: terminating any other active server processes

A new backtrace from the core dump is below, although it looks almost
identical to me.

------------------------------
hostname:/var/lib/postgresql/8.1# gdb
/usr/lib/postgresql/8.1/bin/postmaster mc-db2/core
GNU gdb 6.4.90-debian
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i486-linux-gnu"...(no debugging symbols found)
Using host libthread_db library "/lib/tls/i686/cmov/libthread_db.so.1".

warning: Can't read pathname for load map: Input/output error.
Reading symbols from /lib/libpam.so.0...(no debugging symbols found)...done.
Loaded symbols for /lib/libpam.so.0
Reading symbols from /usr/lib/i686/cmov/libssl.so.0.9.8...(no debugging
symbols found)...done.
Loaded symbols for /usr/lib/i686/cmov/libssl.so.0.9.8
Reading symbols from /usr/lib/i686/cmov/libcrypto.so.0.9.8...(no
debugging symbols found)...done.
Loaded symbols for /usr/lib/i686/cmov/libcrypto.so.0.9.8
Reading symbols from /usr/lib/libkrb5.so.3...(no debugging symbols
found)...done.
Loaded symbols for /usr/lib/libkrb5.so.3
Reading symbols from /lib/libcom_err.so.2...(no debugging symbols
found)...done.
Loaded symbols for /lib/libcom_err.so.2
Reading symbols from /lib/tls/i686/cmov/libcrypt.so.1...
(no debugging symbols found)...done.
Loaded symbols for /lib/tls/i686/cmov/libcrypt.so.1
Reading symbols from /lib/tls/i686/cmov/libdl.so.2...(no debugging
symbols found)...done.
Loaded symbols for /lib/tls/i686/cmov/libdl.so.2
Reading symbols from /lib/tls/i686/cmov/libm.so.6...(no debugging
symbols found)...done.
Loaded symbols for /lib/tls/i686/cmov/libm.so.6
Reading symbols from /lib/tls/i686/cmov/libc.so.6...(no debugging
symbols found)...done.
Loaded symbols for /lib/tls/i686/cmov/libc.so.6
Reading symbols from /usr/lib/libz.so.1...(no debugging symbols
found)...done.
Loaded symbols for /usr/lib/libz.so.1
Reading symbols from /usr/lib/libk5crypto.so.3...
(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libk5crypto.so.3
Reading symbols from /lib/tls/i686/cmov/libresolv.so.2...(no debugging
symbols found)...done.
Loaded symbols for /lib/tls/i686/cmov/libresolv.so.2
Reading symbols from /lib/ld-linux.so.2...(no debugging symbols
found)...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /usr/lib/libkrb5support.so.0...(no debugging
symbols found)...done.
Loaded symbols for /usr/lib/libkrb5support.so.0
(no debugging symbols found)
Core was generated by `postgres: autovacuum process
mc_itec '.
Program terminated with signal 6, Aborted.
#0 0xffffe410 in __kernel_vsyscall ()
(gdb) bt
#0 0xffffe410 in __kernel_vsyscall ()
#1 0xb7c37811 in raise () from /lib/tls/i686/cmov/libc.so.6
#2 0xb7c38fb9 in abort () from /lib/tls/i686/cmov/libc.so.6
#3 0x0828cdf3 in ExceptionalCondition ()
#4 0x082a8cd2 in MemoryContextAlloc ()
#5 0x082a8d67 in MemoryContextStrdup ()
#6 0x0829749c in database_getflatfilename ()
#7 0x082974ce in database_getflatfilename ()
#8 0x08298341 in AtEOXact_UpdateFlatFiles ()
#9 0x080bcc81 in RecordTransactionCommit ()
#10 0x080bcf8f in CommitTransactionCommand ()
#11 0x081cd1eb in autovac_stopped ()
#12 0x081cdbcd in autovac_start ()
#13 0x081d4c0c in ClosePostmasterPorts ()
#14 0x081d5968 in PostmasterMain ()
#15 0x0818bd22 in main ()

Justin Pasher


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Justin Pasher <justinp(at)newmediagateway(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Autovacuum daemon terminated by signal 11
Date: 2009-01-16 22:31:24
Message-ID: 13508.1232145084@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Justin Pasher <justinp(at)newmediagateway(dot)com> writes:
> I recompiled from the Debian source package and added --enable-cassert
> (--enable-debug was already there). I replaced the Debian standard
> packages with the recompiled versions and started up the cluster. Now it
> is hitting a failure on one of the assert lines, and the log message is
> a little different.

> TRAP: BadArgument("!(((context) != ((void *)0) &&
> (((((Node*)((context)))->type) == T_AllocSetContext))))", File:
> "mcxt.c", Line: 502)

Well, that confirms the thought that the memory context is clobbered,
but we're not any closer to understanding why.

> #1 0xb7c37811 in raise () from /lib/tls/i686/cmov/libc.so.6
> #2 0xb7c38fb9 in abort () from /lib/tls/i686/cmov/libc.so.6
> #3 0x0828cdf3 in ExceptionalCondition ()
> #4 0x082a8cd2 in MemoryContextAlloc ()
> #5 0x082a8d67 in MemoryContextStrdup ()
> #6 0x0829749c in database_getflatfilename ()
> #7 0x082974ce in database_getflatfilename ()
> #8 0x08298341 in AtEOXact_UpdateFlatFiles ()
> #9 0x080bcc81 in RecordTransactionCommit ()
> #10 0x080bcf8f in CommitTransactionCommand ()
> #11 0x081cd1eb in autovac_stopped ()
> #12 0x081cdbcd in autovac_start ()
> #13 0x081d4c0c in ClosePostmasterPorts ()
> #14 0x081d5968 in PostmasterMain ()
> #15 0x0818bd22 in main ()

... and you've seemingly not managed to install the debug symbols where
gdb can find them.

regards, tom lane


From: Justin Pasher <justinp(at)newmediagateway(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Autovacuum daemon terminated by signal 11
Date: 2009-01-16 23:13:17
Message-ID: 4971148D.9030808@newmediagateway.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Tom Lane wrote:
> Justin Pasher <justinp(at)newmediagateway(dot)com> writes:
>
>> I recompiled from the Debian source package and added --enable-cassert
>> (--enable-debug was already there). I replaced the Debian standard
>> packages with the recompiled versions and started up the cluster. Now it
>> is hitting a failure on one of the assert lines, and the log message is
>> a little different.
>>
>> TRAP: BadArgument("!(((context) != ((void *)0) &&
>> (((((Node*)((context)))->type) == T_AllocSetContext))))", File:
>> "mcxt.c", Line: 502)
>>
>
> Well, that confirms the thought that the memory context is clobbered,
> but we're not any closer to understanding why.
>
>
>> #1 0xb7c37811 in raise () from /lib/tls/i686/cmov/libc.so.6
>> #2 0xb7c38fb9 in abort () from /lib/tls/i686/cmov/libc.so.6
>> #3 0x0828cdf3 in ExceptionalCondition ()
>> #4 0x082a8cd2 in MemoryContextAlloc ()
>> #5 0x082a8d67 in MemoryContextStrdup ()
>> #6 0x0829749c in database_getflatfilename ()
>> #7 0x082974ce in database_getflatfilename ()
>> #8 0x08298341 in AtEOXact_UpdateFlatFiles ()
>> #9 0x080bcc81 in RecordTransactionCommit ()
>> #10 0x080bcf8f in CommitTransactionCommand ()
>> #11 0x081cd1eb in autovac_stopped ()
>> #12 0x081cdbcd in autovac_start ()
>> #13 0x081d4c0c in ClosePostmasterPorts ()
>> #14 0x081d5968 in PostmasterMain ()
>> #15 0x0818bd22 in main ()
>>
>
> ... and you've seemingly not managed to install the debug symbols where
> gdb can find them.
>
> regards, tom lane
>

Dang it. I wonder why the --enable-debug option doesn't seem to actually
be enabling debug. :( For reference, here is the configure command that
the package uses according to the config.log (in case you spot anything
wrong).

/usr/src/postgres-8.1.15/postgresql-8.1-8.1.15/build-tree/postgresql-8.1.15/configure
--build=i486-linux-gnu --prefix=/usr --includedir=${prefix}/include
--mandir=${prefix}/share/man --infodir=${prefix}/share/info
--sysconfdir=/etc --localstatedir=/var
--libexecdir=${prefix}/lib/postgresql-8.1 --disable-maintainer-mode
--disable-dependency-tracking --srcdir=.
--mandir=${prefix}/share/postgresql/8.1/man
--with-docdir=${prefix}/share/doc/postgresql-doc-8.1
--datadir=${prefix}/share/postgresql/8.1
--bindir=${prefix}/lib/postgresql/8.1/bin
--includedir=${prefix}/include/postgresql/ --enable-nls
--enable-integer-datetimes --enable-thread-safety --enable-debug
--enable-cassert --disable-rpath --with-tcl --with-perl --with-python
--with-pam --with-krb5 --with-openssl --with-gnu-ld
--with-tclconfig=/usr/lib/tcl8.4 --with-tkconfig=/usr/lib/tk8.4
--with-includes=/usr/include/tcl8.4 --with-pgport=5432 CFLAGS=-g -Wall
-O2 -fPIC LDFLAGS=-Wl,--as-needed

I'm going to try a compile from scratch from the 8.1.15 source tarball
and see if I can get better results. I'll let you know (probably next
week). Thanks.

Justin Pasher


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Justin Pasher <justinp(at)newmediagateway(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-general(at)postgresql(dot)org
Subject: Re: Autovacuum daemon terminated by signal 11
Date: 2009-01-16 23:25:54
Message-ID: 20090116232554.GE12449@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Justin Pasher wrote:

> Dang it. I wonder why the --enable-debug option doesn't seem to actually
> be enabling debug. :( For reference, here is the configure command that
> the package uses according to the config.log (in case you spot anything
> wrong).

Maybe the executable is getting stripped? (I think it wouldn't show the
function names in that case, but maybe this gives you an idea)

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Justin Pasher <justinp(at)newmediagateway(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Autovacuum daemon terminated by signal 11
Date: 2009-01-16 23:29:35
Message-ID: 4971185F.5010503@newmediagateway.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Tom Lane wrote:
>
>> #1 0xb7c37811 in raise () from /lib/tls/i686/cmov/libc.so.6
>> #2 0xb7c38fb9 in abort () from /lib/tls/i686/cmov/libc.so.6
>> #3 0x0828cdf3 in ExceptionalCondition ()
>> #4 0x082a8cd2 in MemoryContextAlloc ()
>> #5 0x082a8d67 in MemoryContextStrdup ()
>> #6 0x0829749c in database_getflatfilename ()
>> #7 0x082974ce in database_getflatfilename ()
>> #8 0x08298341 in AtEOXact_UpdateFlatFiles ()
>> #9 0x080bcc81 in RecordTransactionCommit ()
>> #10 0x080bcf8f in CommitTransactionCommand ()
>> #11 0x081cd1eb in autovac_stopped ()
>> #12 0x081cdbcd in autovac_start ()
>> #13 0x081d4c0c in ClosePostmasterPorts ()
>> #14 0x081d5968 in PostmasterMain ()
>> #15 0x0818bd22 in main ()
>
> ... and you've seemingly not managed to install the debug symbols where
> gdb can find them.
>
> regards, tom lane

OK. I recompiled from source (I didn't expect it to go so quickly and
smoothly) and it looks like I might have the correct info now. Here's
the new backtrace. Let me know if I still don't have it right. The
configure is below to (for reference).

./configure --disable-maintainer-mode --disable-dependency-tracking
--enable-nls --enable-integer-datetimes --enable-thread-safety
--enable-debug --enable-cassert --disable-rpath --with-tcl --with-perl
--with-python --with-pam --with-krb5 --with-openssl --with-gnu-ld
--with-tclconfig=/usr/lib/tcl8.4 --with-tkconfig=/usr/lib/tk8.4
--with-includes=/usr/include/tcl8.4

------------------------------
postgres(at)hostname:/usr/local/pgsql/bin$ gdb
/usr/local/pgsql/bin/postmaster /var/lib/postgresql/8.1/mc-db2/core
GNU gdb 6.4.90-debian
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i486-linux-gnu"...Using host libthread_db
library "/lib/tls/i686/cmov/libthread_db.so.1".

warning: Can't read pathname for load map: Input/output error.
Reading symbols from /lib/libpam.so.0...done.
Loaded symbols for /lib/libpam.so.0
Reading symbols from /usr/lib/i686/cmov/libssl.so.0.9.8...done.
Loaded symbols for /usr/lib/i686/cmov/libssl.so.0.9.8
Reading symbols from /usr/lib/i686/cmov/libcrypto.so.0.9.8...done.
Loaded symbols for /usr/lib/i686/cmov/libcrypto.so.0.9.8
Reading symbols from /usr/lib/libkrb5.so.3...done.
Loaded symbols for /usr/lib/libkrb5.so.3
Reading symbols from /usr/lib/libz.so.1...done.
Loaded symbols for /usr/lib/libz.so.1
Reading symbols from /lib/libreadline.so.5...done.
Loaded symbols for /lib/libreadline.so.5
Reading symbols from /lib/tls/i686/cmov/libcrypt.so.1...done.
Loaded symbols for /lib/tls/i686/cmov/libcrypt.so.1
Reading symbols from /lib/tls/i686/cmov/libresolv.so.2...done.
Loaded symbols for /lib/tls/i686/cmov/libresolv.so.2
Reading symbols from /lib/tls/i686/cmov/libnsl.so.1...done.
Loaded symbols for /lib/tls/i686/cmov/libnsl.so.1
Reading symbols from /lib/tls/i686/cmov/libdl.so.2...done.
Loaded symbols for /lib/tls/i686/cmov/libdl.so.2
Reading symbols from /lib/tls/i686/cmov/libm.so.6...done.
Loaded symbols for /lib/tls/i686/cmov/libm.so.6
Reading symbols from /lib/tls/i686/cmov/libc.so.6...done.
Loaded symbols for /lib/tls/i686/cmov/libc.so.6
Reading symbols from /lib/libcom_err.so.2...done.
Loaded symbols for /lib/libcom_err.so.2
Reading symbols from /usr/lib/libk5crypto.so.3...done.
Loaded symbols for /usr/lib/libk5crypto.so.3
Reading symbols from /lib/libncurses.so.5...done.
Loaded symbols for /lib/libncurses.so.5
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /usr/lib/libkrb5support.so.0...done.
Loaded symbols for /usr/lib/libkrb5support.so.0
Core was generated by `postgres: autovacuum process
mc_itec '.
Program terminated with signal 6, Aborted.
#0 0xffffe410 in __kernel_vsyscall ()
(gdb) bt
#0 0xffffe410 in __kernel_vsyscall ()
#1 0xb7b1d811 in raise () from /lib/tls/i686/cmov/libc.so.6
#2 0xb7b1efb9 in abort () from /lib/tls/i686/cmov/libc.so.6
#3 0x0825f4a1 in ExceptionalCondition (conditionName=0x8324b6c
"!(((context) != ((void *)0) && (((((Node*)((context)))->type) ==
T_AllocSetContext))))",
errorType=0x82885e0 "BadArgument", fileName=0x8319d04 "mcxt.c",
lineNumber=502) at assert.c:51
#4 0x0827869e in MemoryContextAlloc (context=0x838e42c, size=19) at
mcxt.c:502
#5 0x0827872c in MemoryContextStrdup (context=0x838e42c,
string=0x8318833 "global/pg_database") at mcxt.c:655
#6 0x08268a4c in database_getflatfilename () at flatfiles.c:112
#7 0x08268a63 in write_database_file (drel=0x0, startup=6 '\006') at
flatfiles.c:194
#8 0x082697f2 in AtEOXact_UpdateFlatFiles (isCommit=1 '\001') at
flatfiles.c:815
#9 0x080b6741 in CommitTransaction () at xact.c:1470
#10 0x080b6a29 in CommitTransactionCommand () at xact.c:2184
#11 0x081b0f42 in AutoVacMain (argc=<value optimized out>, argv=<value
optimized out>) at autovacuum.c:559
#12 0x081b1879 in autovac_start () at autovacuum.c:174
#13 0x081b7f78 in ServerLoop () at postmaster.c:1269
#14 0x081b8bad in PostmasterMain (argc=3, argv=0x836b508) at
postmaster.c:943
#15 0x08175609 in main (argc=3, argv=0x836b508) at main.c:265

Justin Pasher


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Justin Pasher <justinp(at)newmediagateway(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Autovacuum daemon terminated by signal 11
Date: 2009-01-16 23:43:09
Message-ID: 15221.1232149389@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

I wrote:
> ... and you've seemingly not managed to install the debug symbols where
> gdb can find them.

But never mind that --- it turns out to be trivial to reproduce the
crash. Just create a database, set its datfrozenxid and datvacuumxid
far in the past (via a manual update of pg_database), enable autovacuum,
and wait a bit.

What is happening is that autovacuum_do_vac_analyze contains

old_cxt = MemoryContextSwitchTo(AutovacMemCxt);
...
vacuum(vacstmt, relids);
...
MemoryContextSwitchTo(old_cxt);

and at the time it is called by process_whole_db, CurrentMemoryContext
points at TopTransactionContext. Which gets destroyed because vacuum()
internally finishes that transaction and starts a new one. When we
come out of vacuum(), CurrentMemoryContext again points at
TopTransactionContext, but *its not the same one*. The closing
MemoryContextSwitchTo is installing a stale pointer, which then remains
active into CommitTransaction. It's a wonder this code ever works.

The other path through do_autovacuum() escapes this fate because it
enters autovacuum_do_vac_analyze with CurrentMemoryContext pointing
at AutovacMemCxt, which isn't going to go away.

I argue that autovacuum_do_vac_analyze shouldn't attempt to restore the
caller's memory context at all. One possible approach is to make it
re-select AutovacMemCxt at exit, but I wonder if we shouldn't define
its entry and exit conditions as current context being
(the current instance of) TopTransactionContext.

It looks like 8.3 and HEAD take the latter approach and are therefore
safe from this bug. 8.2 seems to escape it also because it doesn't have
process_whole_db anymore, but it's certainly not
autovacuum_do_vac_analyze fault that it's not broken, because it's still
trying to restore a context that it has no right to assume still exists.

Alvaro, you want to take charge of fixing this?

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Justin Pasher <justinp(at)newmediagateway(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Autovacuum daemon terminated by signal 11
Date: 2009-01-17 04:14:33
Message-ID: 20090117041433.GI12449@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Tom Lane wrote:

> What is happening is that autovacuum_do_vac_analyze contains
>
> old_cxt = MemoryContextSwitchTo(AutovacMemCxt);
> ...
> vacuum(vacstmt, relids);
> ...
> MemoryContextSwitchTo(old_cxt);
>
> and at the time it is called by process_whole_db, CurrentMemoryContext
> points at TopTransactionContext. Which gets destroyed because vacuum()
> internally finishes that transaction and starts a new one. When we
> come out of vacuum(), CurrentMemoryContext again points at
> TopTransactionContext, but *its not the same one*. The closing
> MemoryContextSwitchTo is installing a stale pointer, which then remains
> active into CommitTransaction. It's a wonder this code ever works.

Hmm, in retrospect this is pretty obviously buggy. I can't say that
it's that easy for me to reproduce it though; I definitely can't make it
crash. Maybe by sheer luck, the new TopTransactionContext pointer
points to the same memory area that the old was stored in.

I think this patch should fix it. Justin, would you try it and report
back? I would commit it right away since it seems simple enough, but
since I can't reproduce the crash, I prefer external confirmation first
:-)

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Attachment Content-Type Size
autovacuum-mcxt-81.patch text/x-diff 1.4 KB

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: MemoryContextSwitchTo (Re: [GENERAL] Autovacuum daemon terminated by signal 11)
Date: 2009-01-17 08:54:38
Message-ID: 1232182478.31669.182.camel@ebony.2ndQuadrant
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers


On Fri, 2009-01-16 at 18:43 -0500, Tom Lane wrote:

> What is happening is that autovacuum_do_vac_analyze contains
>
> old_cxt = MemoryContextSwitchTo(AutovacMemCxt);
> ...
> vacuum(vacstmt, relids);
> ...
> MemoryContextSwitchTo(old_cxt);
>
> and at the time it is called by process_whole_db, CurrentMemoryContext
> points at TopTransactionContext. Which gets destroyed because
> vacuum()
> internally finishes that transaction and starts a new one. When we
> come out of vacuum(), CurrentMemoryContext again points at
> TopTransactionContext, but *its not the same one*. The closing
> MemoryContextSwitchTo is installing a stale pointer, which then
> remains active into CommitTransaction. It's a wonder this code ever
> works.

Can we add something to memory contexts to make this fail every time?

Seems like we should be able to detect this class of error. If we can't
it seems likely that a number of similar cases exist with other
contexts.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Justin Pasher <justinp(at)newmediagateway(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Autovacuum daemon terminated by signal 11
Date: 2009-01-17 11:26:21
Message-ID: 20090117112621.GB2619@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

On Fri, Jan 16, 2009 at 05:13:17PM -0600, Justin Pasher wrote:
> Dang it. I wonder why the --enable-debug option doesn't seem to actually
> be enabling debug. :( For reference, here is the configure command that
> the package uses according to the config.log (in case you spot anything
> wrong).

For future reference, if you want debug symbols in your debian packages
you need to build with DEB_BUILD_OPTIONS=nostrip in the environment.
You can also remove the reference to dh_strip in the build file.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Justin Pasher <justinp(at)newmediagateway(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Autovacuum daemon terminated by signal 11
Date: 2009-01-17 15:49:54
Message-ID: 12265.1232207394@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Hmm, in retrospect this is pretty obviously buggy. I can't say that
> it's that easy for me to reproduce it though; I definitely can't make it
> crash. Maybe by sheer luck, the new TopTransactionContext pointer
> points to the same memory area that the old was stored in.

Yeah, there could be some platform dependency involved. I'm guessing
different structs that happen to fall into the same palloc size category
on one platform but not another.

Anyway, it happens consistently on my HP box. I find that your proposed
patch fixes it, but makes the "normal" path crash :-( --- the loop in
do_autovacuum has to be executed in AutovacMemCxt, because it creates an
Oid List that gets passed to vacuum() and had better not be in a
transaction-lifetime context. The attached modified patch works for me.

regards, tom lane

Attachment Content-Type Size
unknown_filename text/plain 1.2 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: MemoryContextSwitchTo (Re: [GENERAL] Autovacuum daemon terminated by signal 11)
Date: 2009-01-17 16:35:18
Message-ID: 12906.1232210118@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> Can we add something to memory contexts to make this fail every time?

No, not really. AFAICS the reason for Alvaro not seeing it must be that
on his machine the new transaction happens to allocate its
TopTransactionContext control block right in the same place where the
old one was.

We could have a debugging mode in which pfree'd space is never recycled
for reuse (just filled with 0xdeadbeef or whatever and left to sit).
But it would not be practical for anything except short debugging
sessions. In fact, for a case like this, only *very* short debugging
sessions, because you couldn't even recycle after transaction commit.
So the chance of finding problems in seldom-exercised code paths, as
this one is, would be pretty small anyway.

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: MemoryContextSwitchTo (Re: [GENERAL] Autovacuum daemon terminated by signal 11)
Date: 2009-01-17 17:01:28
Message-ID: 1232211688.31669.279.camel@ebony.2ndQuadrant
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers


On Sat, 2009-01-17 at 11:35 -0500, Tom Lane wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> > Can we add something to memory contexts to make this fail every time?
>
> No, not really. AFAICS the reason for Alvaro not seeing it must be that
> on his machine the new transaction happens to allocate its
> TopTransactionContext control block right in the same place where the
> old one was.

Can we put a identifier into header of each control block, an ascending
value that is unlikely duplicate between calls? That way we'd be able to
tell immediately it wasn't the same block, even if it is in the same
place. (In assert mode only, of course).

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: MemoryContextSwitchTo (Re: [GENERAL] Autovacuum daemon terminated by signal 11)
Date: 2009-01-17 17:15:46
Message-ID: 20090117171546.GJ12449@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Tom Lane wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> > Can we add something to memory contexts to make this fail every time?
>
> No, not really. AFAICS the reason for Alvaro not seeing it must be that
> on his machine the new transaction happens to allocate its
> TopTransactionContext control block right in the same place where the
> old one was.

But freed memory is clobbered, so if we were to have an assert that
checks the node tag, it should show up. In fact, we do have such an
assert, but only for compilers other than GCC, because the inline
version of palloc() cannot have it for lack of infrastructure.

Maybe we should patch palloc.h so that it only uses the inline version
when not in assert mode. Something like the attached, except that for
some reason the test case still fails to report any error for me :-(
Maybe the cpp boolean logic is fooling me.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Attachment Content-Type Size
palloc-inline.patch text/x-diff 1.7 KB

From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: MemoryContextSwitchTo (Re: [GENERAL] Autovacuum daemon terminated by signal 11)
Date: 2009-01-17 17:55:20
Message-ID: 87ocy5ddh3.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>> Can we add something to memory contexts to make this fail every time?
>
> No, not really. AFAICS the reason for Alvaro not seeing it must be that
> on his machine the new transaction happens to allocate its
> TopTransactionContext control block right in the same place where the
> old one was.
>
> We could have a debugging mode in which pfree'd space is never recycled
> for reuse (just filled with 0xdeadbeef or whatever and left to sit).

Hm, I wonder how much more practical it would be if we recycled it but offset
the pointer by maxalign. We would waste 4/8 bytes per palloc/free cycle
instead of the whole chunk.

(Whether palloc could actually do this would be another question, there are
lots of allocator algorithms that wouldn't be able to, I think.)

If we had a more formalized save_memory_context()/restore_memory_context()
which gave you more than just a pointer we could do better for this particular
case. save_memory_context() could hand you a struct with a pointer as well as
some sanity check values about the context you're saving.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's On-Demand Production Tuning


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: MemoryContextSwitchTo (Re: [GENERAL] Autovacuum daemon terminated by signal 11)
Date: 2009-01-17 18:58:13
Message-ID: 15343.1232218693@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> Can we put a identifier into header of each control block, an ascending
> value that is unlikely duplicate between calls? That way we'd be able to
> tell immediately it wasn't the same block,

Same block than what? Unless you can somehow hide that ID number in
a MemoryContext pointer, the staleness of the pointer won't be evident.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: MemoryContextSwitchTo (Re: [GENERAL] Autovacuum daemon terminated by signal 11)
Date: 2009-01-17 19:03:55
Message-ID: 15427.1232219035@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Tom Lane wrote:
>> No, not really. AFAICS the reason for Alvaro not seeing it must be that
>> on his machine the new transaction happens to allocate its
>> TopTransactionContext control block right in the same place where the
>> old one was.

> But freed memory is clobbered, so if we were to have an assert that
> checks the node tag, it should show up. In fact, we do have such an
> assert, but only for compilers other than GCC, because the inline
> version of palloc() cannot have it for lack of infrastructure.

Well, but production installations don't have either memory clobbering
or Asserts, so fooling with that wouldn't have helped anyway. I suspect
what really happened here is that the bug was created by some late
change during 8.1 development, and nobody ever exercised the
anti-wraparound code path after that in an assert-enabled build :-(
In a non-assert build there's a fairly good chance that it'd still
work because the context header would still be there undamaged.

One thing we could try that would cost a lot less than de-inlining
palloc is to have MemoryContextDelete intentionally zero the methods
pointer. That still does nothing for the create-new-context-in-same-
spot issue, but at least it would catch palloc in a context that had
been deleted and not yet recycled.

regards, tom lane


From: "Justin Pasher" <justinp(at)newmediagateway(dot)com>
To: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "'Alvaro Herrera'" <alvherre(at)commandprompt(dot)com>
Cc: <pgsql-general(at)postgresql(dot)org>
Subject: Re: Autovacuum daemon terminated by signal 11
Date: 2009-01-17 21:41:47
Message-ID: 626B243E214B4F3483B014FD7E7AD0CF@taz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

> -----Original Message-----
> From: Tom Lane [mailto:tgl(at)sss(dot)pgh(dot)pa(dot)us]
> Sent: Saturday, January 17, 2009 9:50 AM
> To: Alvaro Herrera
> Cc: Justin Pasher; pgsql-general(at)postgresql(dot)org
> Subject: Re: [GENERAL] Autovacuum daemon terminated by signal 11
>
> Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> > Hmm, in retrospect this is pretty obviously buggy. I can't say that
> > it's that easy for me to reproduce it though; I definitely can't make it
> > crash. Maybe by sheer luck, the new TopTransactionContext pointer
> > points to the same memory area that the old was stored in.
>
> Yeah, there could be some platform dependency involved. I'm guessing
> different structs that happen to fall into the same palloc size category
> on one platform but not another.
>
> Anyway, it happens consistently on my HP box. I find that your proposed
> patch fixes it, but makes the "normal" path crash :-( --- the loop in
> do_autovacuum has to be executed in AutovacMemCxt, because it creates an
> Oid List that gets passed to vacuum() and had better not be in a
> transaction-lifetime context. The attached modified patch works for me.
>
> regards, tom lane

I tried both Alvaro's patch and your patch, and I actually got the same
results from both. I didn't experience a crash when autovacuum kicked in or
when manually performing a vacuum on the database (if that's what you meant
by the "normal path"). At any rate, everything seems to be working properly
for me with the patch. Thanks!

Justin Pasher


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Justin Pasher" <justinp(at)newmediagateway(dot)com>
Cc: "'Alvaro Herrera'" <alvherre(at)commandprompt(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Autovacuum daemon terminated by signal 11
Date: 2009-01-18 06:15:42
Message-ID: 23774.1232259342@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

"Justin Pasher" <justinp(at)newmediagateway(dot)com> writes:
>> From: Tom Lane [mailto:tgl(at)sss(dot)pgh(dot)pa(dot)us]
>> Anyway, it happens consistently on my HP box. I find that your proposed
>> patch fixes it, but makes the "normal" path crash :-( --- the loop in
>> do_autovacuum has to be executed in AutovacMemCxt, because it creates an
>> Oid List that gets passed to vacuum() and had better not be in a
>> transaction-lifetime context. The attached modified patch works for me.

> I tried both Alvaro's patch and your patch, and I actually got the same
> results from both.

Alvaro's patch would most likely not crash in a non-assert-enabled
build, since the list wouldn't be physically damaged when the containing
context got freed. But obviously that's a risky thing ...

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Justin Pasher <justinp(at)newmediagateway(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Autovacuum daemon terminated by signal 11
Date: 2009-01-20 12:18:39
Message-ID: 20090120121839.GA3940@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Tom Lane wrote:

> Anyway, it happens consistently on my HP box. I find that your proposed
> patch fixes it, but makes the "normal" path crash :-( --- the loop in
> do_autovacuum has to be executed in AutovacMemCxt, because it creates an
> Oid List that gets passed to vacuum() and had better not be in a
> transaction-lifetime context. The attached modified patch works for me.

Committed this patch to both 8.1 and 8.2. Thanks.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: MemoryContextSwitchTo (Re: [GENERAL] Autovacuum daemon terminated by signal 11)
Date: 2009-01-20 12:23:30
Message-ID: 20090120122330.GB3940@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Tom Lane wrote:
> Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:

> > But freed memory is clobbered, so if we were to have an assert that
> > checks the node tag, it should show up. In fact, we do have such an
> > assert, but only for compilers other than GCC, because the inline
> > version of palloc() cannot have it for lack of infrastructure.
>
> Well, but production installations don't have either memory clobbering
> or Asserts, so fooling with that wouldn't have helped anyway. I suspect
> what really happened here is that the bug was created by some late
> change during 8.1 development, and nobody ever exercised the
> anti-wraparound code path after that in an assert-enabled build :-(
> In a non-assert build there's a fairly good chance that it'd still
> work because the context header would still be there undamaged.

Well, my builds are all assert-enabled, and I still wasn't able to make
it crash in any way (the new context being allocated in the same
position as the old one is the only explanation I have, but I did not
investigate whether that's what happening). Maybe Greg Stark's idea of
offsetting pointers returned by palloc could have helped to find the
problem from the outset.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support