seg fault crashed the postmaster

From: Gordon Shannon <gordo169(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: seg fault crashed the postmaster
Date: 2010-12-31 00:52:45
Message-ID: 1293756765858-3323117.post@n5.nabble.com
Lists: pgsql-general


Hi,

Running Centos, just upgraded our production db from 8.4.4 to 9.0.2 last
night. About 20 hours later, an update statement seg faulted and crashed
the server. This is a typical update that has worked fine for a long time.

20898 datafeed (58628) 2010-12-30 19:28:14 EST [103]LOG: process 20898
acquired ShareLock on transaction 1286738762 after 90174.969 ms
20898 datafeed (58628) 2010-12-30 19:28:14 EST [104]STATEMENT: update
v_messages set status = 'S', updated_on = now() where id in (select id from
v_messages where author_id = 34409854 and status != 'S' limit 10000)
5802 2010-12-30 19:28:14 EST [4]LOG: server process (PID 20898) was
terminated by signal 11: Segmentation fault
5802 2010-12-30 19:28:14 EST [5]LOG: terminating any other active server
processes
15426 pipeline (36834) 2010-12-30 19:28:14 EST [1]WARNING: terminating
connection because of crash of another server process
15426 pipeline 10.10.11.54(36834) 2010-12-30 19:28:14 EST [2]DETAIL: The postmaster has
commanded this server process to roll back the current transaction and exit,
because another server process exited abnormally and possibly corrupted
shared memory.

One hint that something might be wrong was this:

18235 datafeed (44228) 2010-12-30 18:16:37 EST [11]LOG: process 18235
acquired ShareLock on transaction 1285952031 after 29966.703 ms
18235 datafeed (44228) 2010-12-30 18:16:37 EST [12]STATEMENT: update
v_messages set status = 'S', updated_on = now() where id in (select id from
v_messages where author_id = 25301995 and status != 'S' limit 10000)
18235 datafeed (44228) 2010-12-30 18:16:43 EST [13]ERROR: compressed data
is corrupt
18235 datafeed (44228) 2010-12-30 18:16:43 EST [14]STATEMENT: update
v_messages set status = 'S', updated_on = now() where id in (select id from
v_messages where author_id = 25301995 and status != 'S' limit 10000)

How concerned should I be? Thanks!

Gordon



From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gordon Shannon <gordo169(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: seg fault crashed the postmaster
Date: 2010-12-31 01:51:56
Message-ID: 15215.1293760316@sss.pgh.pa.us
Lists: pgsql-general

Gordon Shannon <gordo169(at)gmail(dot)com> writes:
> Running Centos, just upgraded our production db from 8.4.4 to 9.0.2 last
> night. About 20 hours later, an update statement seg faulted and crashed
> the server. This is a typical update that has worked fine for a long time.

Could we see a stack trace from that? Or at least a self-contained
test case?

regards, tom lane


From: Gordon Shannon <gordo169(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: seg fault crashed the postmaster
Date: 2010-12-31 02:25:21
Message-ID: AANLkTinUPwvNTiNKAF362gwUrx0e5h5FUUoNaYqE5C1W@mail.gmail.com
Lists: pgsql-general


I'd love to send you a stack trace. Any suggestions on how to get one? It
has since happened again, on the same update command, so I'm guessing I can
repeat it.

On Thu, Dec 30, 2010 at 6:52 PM, Tom Lane wrote:

> Gordon Shannon <gordo169(at)gmail(dot)com> writes:
> > Running Centos, just upgraded our production db from 8.4.4 to 9.0.2 last
> > night. About 20 hours later, an update statement seg faulted and crashed
> > the server. This is a typical update that has worked fine for a long time.
>
> Could we see a stack trace from that? Or at least a self-contained
> test case?
>
> regards, tom lane



From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gordon Shannon <gordo169(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: seg fault crashed the postmaster
Date: 2010-12-31 02:31:53
Message-ID: 15960.1293762713@sss.pgh.pa.us
Lists: pgsql-general

Gordon Shannon <gordo169(at)gmail(dot)com> writes:
> I'd love to send you a stack trace. Any suggestions on how to get one? It
> has since happened again, on the same update command, so I'm guessing I can
> repeat it.

http://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend

regards, tom lane


From: Gordon Shannon <gordo169(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: seg fault crashed the postmaster
Date: 2010-12-31 07:06:01
Message-ID: AANLkTi==UP4NbA0upKyuStAg5YamiTkY4wXdvB--48tv@mail.gmail.com
Lists: pgsql-general


Stack trace:

#0 0x00000031a147c15c in memcpy () from /lib64/libc.so.6
#1 0x0000000000450cb8 in __memcpy_ichk (tuple=0x7fffb29ac900) at
/usr/include/bits/string3.h:51
#2 heap_copytuple (tuple=0x7fffb29ac900) at heaptuple.c:592
#3 0x0000000000543d4c in EvalPlanQualFetchRowMarks (epqstate=0x3cd85ab8) at
execMain.c:1794
#4 0x00000000005440db in EvalPlanQual (estate=0x1e0db420,
epqstate=0x3cd85ab8, relation=<value optimized out>, rti=4,
tid=0x7fffb29aca20, priorXmax=<value optimized out>) at execMain.c:1401
#5 0x00000000005592eb in ExecUpdate (node=0x3cd85a28) at
nodeModifyTable.c:527
#6 ExecModifyTable (node=0x3cd85a28) at nodeModifyTable.c:748
#7 0x0000000000545953 in ExecProcNode (node=0x3cd85a28) at
execProcnode.c:359
#8 0x0000000000544881 in ExecutePlan (queryDesc=0x1e727990,
direction=1039265768, count=0) at execMain.c:1190
#9 standard_ExecutorRun (queryDesc=0x1e727990, direction=1039265768,
count=0) at execMain.c:280
#10 0x00002ab002c0f2b5 in explain_ExecutorRun (queryDesc=0x1e727990,
direction=ForwardScanDirection, count=0) at auto_explain.c:203
#11 0x00000000005f9c81 in ProcessQuery (plan=0x2112ad60,
sourceText=0x1b3e59e0 "update v_messages set status = 'S', updated_on =
now() where id in (select id from v_messages where author_id = 33138761 and
status != 'S' limit 10000)",
params=<value optimized out>, dest=0x2112ae40,
completionTag=0x7fffb29ace20 "") at pquery.c:197
#12 0x00000000005f9e99 in PortalRunMulti (portal=0x1b32aed0, isTopLevel=1
'\001', dest=0x2112ae40, altdest=0x2112ae40, completionTag=0x7fffb29ace20
"") at pquery.c:1268
#13 0x00000000005fa965 in PortalRun (portal=0x1b32aed0,
count=9223372036854775807, isTopLevel=1 '\001', dest=0x2112ae40,
altdest=0x2112ae40, completionTag=0x7fffb29ace20 "") at pquery.c:822
#14 0x00000000005f7455 in exec_simple_query (
query_string=0x1b3e59e0 "update v_messages set status = 'S', updated_on
= now() where id in (select id from v_messages where author_id = 33138761
and status != 'S' limit 10000)")
at postgres.c:1058
#15 0x00000000005f7d14 in PostgresMain (argc=<value optimized out>,
argv=<value optimized out>, username=<value optimized out>) at
postgres.c:3929
#16 0x00000000005c7ce5 in ServerLoop () at postmaster.c:3555
#17 0x00000000005c89ec in PostmasterMain (argc=5, argv=0x1b31ea00) at
postmaster.c:1092
#18 0x00000000005725fe in main (argc=5, argv=<value optimized out>) at
main.c:188

On Thu, Dec 30, 2010 at 7:32 PM, Tom Lane wrote:

> Gordon Shannon <gordo169(at)gmail(dot)com> writes:
> > I'd love to send you a stack trace. Any suggestions on how to get one? It
> > has since happened again, on the same update command, so I'm guessing I can
> > repeat it.
>
> http://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend
>
> regards, tom lane

--
If I had more time, I could have written you a shorter letter. (Blaise
Pascal)



From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gordon Shannon <gordo169(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: seg fault crashed the postmaster
Date: 2010-12-31 15:34:00
Message-ID: 806.1293809640@sss.pgh.pa.us
Lists: pgsql-general

Gordon Shannon <gordo169(at)gmail(dot)com> writes:
> Stack trace:

> #0 0x00000031a147c15c in memcpy () from /lib64/libc.so.6
> #1 0x0000000000450cb8 in __memcpy_ichk (tuple=0x7fffb29ac900) at
> /usr/include/bits/string3.h:51
> #2 heap_copytuple (tuple=0x7fffb29ac900) at heaptuple.c:592
> #3 0x0000000000543d4c in EvalPlanQualFetchRowMarks (epqstate=0x3cd85ab8) at
> execMain.c:1794
> #4 0x00000000005440db in EvalPlanQual (estate=0x1e0db420,
> epqstate=0x3cd85ab8, relation=<value optimized out>, rti=4,
> tid=0x7fffb29aca20, priorXmax=<value optimized out>) at execMain.c:1401
> #5 0x00000000005592eb in ExecUpdate (node=0x3cd85a28) at
> nodeModifyTable.c:527
> #6 ExecModifyTable (node=0x3cd85a28) at nodeModifyTable.c:748
> #7 0x0000000000545953 in ExecProcNode (node=0x3cd85a28) at
> execProcnode.c:359
> #8 0x0000000000544881 in ExecutePlan (queryDesc=0x1e727990,
> direction=1039265768, count=0) at execMain.c:1190
> #9 standard_ExecutorRun (queryDesc=0x1e727990, direction=1039265768,
> count=0) at execMain.c:280
> #10 0x00002ab002c0f2b5 in explain_ExecutorRun (queryDesc=0x1e727990,
> direction=ForwardScanDirection, count=0) at auto_explain.c:203
> #11 0x00000000005f9c81 in ProcessQuery (plan=0x2112ad60,
> sourceText=0x1b3e59e0 "update v_messages set status = 'S', updated_on =
> now() where id in (select id from v_messages where author_id = 33138761 and
> status != 'S' limit 10000)",
> params=<value optimized out>, dest=0x2112ae40,
> completionTag=0x7fffb29ace20 "") at pquery.c:197

Hmm. This suggests that there's something wrong in the EvalPlanQual
code, which gets invoked when there are concurrent updates to the same
row (ie, the row this UPDATE is trying to change is one that was changed
by some other transaction since the query started). That stuff got
rewritten rather thoroughly for 9.0, so the idea of a new bug there
isn't exactly surprising. But it's going to be hard to find without
a test case. Can you show us the full schema for this table and all
the queries that execute against it up till the point of the failure?
(Turning on log_statement across all sessions would help collect that
info, if you don't have it already.)
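
For illustration only, the kind of interleaving that exercises that code path
is roughly this, with two sessions and a made-up id value:

    -- session 1
    BEGIN;
    UPDATE v_messages SET status = 'S', updated_on = now() WHERE id = 12345;
    -- session 2: blocks on session 1's row lock
    UPDATE v_messages SET status = 'S', updated_on = now() WHERE id = 12345;
    -- session 1
    COMMIT;
    -- session 2 now wakes up, and EvalPlanQual re-checks its WHERE clause
    -- against the updated row version before applying its own update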

regards, tom lane


From: Gordon Shannon <gordo169(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: seg fault crashed the postmaster
Date: 2010-12-31 17:31:34
Message-ID: AANLkTik-YyxHu8d_VwBu8bjuWB6NqOKOvGXYrNaZdQoS@mail.gmail.com
Lists: pgsql-general


Interesting. That's exactly what we have been doing -- trying to update the
same rows in multiple txns. For us to proceed in production, I will take
steps to ensure we stop doing that, as it's just an app bug really.
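
(For instance, we could serialize the feeder threads per author with an
advisory lock -- just a sketch, the locking scheme here is made up:)

    SELECT pg_advisory_lock(34409854);          -- one lock key per author_id
    UPDATE v_messages SET status = 'S', updated_on = now()
     WHERE id IN (SELECT id FROM v_messages
                   WHERE author_id = 34409854 AND status != 'S' LIMIT 10000);
    SELECT pg_advisory_unlock(34409854);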

The table in question -- v_messages -- is an empty base table with 76
partitions, with a total of 2.8 billion rows.
Let me summarize what I see as the key facts here:

(All problems have come from the UPDATE query, all identical except for
different "author_id" values.)

1. We did a "link" upgrade Wed night, from 8.4.4 to 9.0.2, so the upgrade
happened in place and no data files were copied.
2. The first error was "compressed data is corrupt" at 18:16.
3. We got 2 seg fault crashes before turning on core dumps and getting a 3rd
crash with the stack trace.
4. We then got an "invalid memory alloc request size 18446744073449177092"
at 23:50. This was an ERROR, not a crash.

At this point, is it your suspicion that there is a code bug in 9.0.2,
rather than corrupt data?

I will post the schema and then work on a test case.

-gordon

On Fri, Dec 31, 2010 at 8:34 AM, Tom Lane wrote:

>
> Hmm. This suggests that there's something wrong in the EvalPlanQual
> code, which gets invoked when there are concurrent updates to the same
> row (ie, the row this UPDATE is trying to change is one that was changed
> by some other transaction since the query started). That stuff got
> rewritten rather thoroughly for 9.0, so the idea of a new bug there
> isn't exactly surprising. But it's going to be hard to find without
> a test case. Can you show us the full schema for this table and all
> the queries that execute against it up till the point of the failure?
> (Turning on log_statement across all sessions would help collect that
> info, if you don't have it already.)
>
>



From: Gordon Shannon <gordo169(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: seg fault crashed the postmaster
Date: 2010-12-31 17:52:19
Message-ID: 1293817939115-3323804.post@n5.nabble.com
Lists: pgsql-general


Here is the ddl for the tables in question. There are foreign keys to other
tables that I omitted.

http://postgresql.1045698.n5.nabble.com/file/n3323804/parts.sql parts.sql


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gordon Shannon <gordo169(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: seg fault crashed the postmaster
Date: 2010-12-31 19:37:32
Message-ID: 4838.1293824252@sss.pgh.pa.us
Lists: pgsql-general

Gordon Shannon <gordo169(at)gmail(dot)com> writes:
> Here is the ddl for the tables in question. There are foreign keys to other
> tables that I omitted.

> http://postgresql.1045698.n5.nabble.com/file/n3323804/parts.sql parts.sql

Hmmm ... what is "message_status_enum"? Is that an actual enum type, or
some kind of domain over text or varchar? If it's an enum, I'm a bit
baffled where the "corrupt compressed data" message could have come from.

regards, tom lane


From: Gordon Shannon <gordo169(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: seg fault crashed the postmaster
Date: 2010-12-31 20:18:29
Message-ID: AANLkTikv+RCy_8YOSeLpa4fu+nkwid1QBwy2QVBTPEWa@mail.gmail.com
Lists: pgsql-general


Sorry, I left that out. Yeah, I wondered that too, since these tables do
not use toast.

CREATE TYPE message_status_enum AS ENUM ( 'V', 'X', 'S', 'R', 'U', 'D' );

On Fri, Dec 31, 2010 at 12:38 PM, Tom Lane wrote:

> Hmmm ... what is "message_status_enum"? Is that an actual enum type, or
> some kind of domain over text or varchar? If it's an enum, I'm a bit
> baffled where the "corrupt compressed data" message could have come from.
>
> regards, tom lane



From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gordon Shannon <gordo169(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: seg fault crashed the postmaster
Date: 2010-12-31 20:44:39
Message-ID: 5834.1293828279@sss.pgh.pa.us
Lists: pgsql-general

Gordon Shannon <gordo169(at)gmail(dot)com> writes:
> Sorry, I left that out. Yeah, I wondered that too, since these tables do
> not use toast.

Hm. Well, given that the stack trace suggests we're trying to access a
tuple value that's not there (bogus pointer, or data overwritten since
the pointer was created), the "invalid memory alloc request size"
message is believable as another manifestation of the same problem.
The odds seem pretty good that the "corrupt compressed data" message
has the same origin at bottom, although the lack of any obvious data
to be compressed in this table is confusing. Maybe you could get that
from trying to copy over a garbage value of that one varchar column,
though.

I tried to replicate the problem here without success, which suggests to
me that maybe it's plan-dependent. Can you show what EXPLAIN gives you
for the troublesome query?

Another point here is that your logs suggest that the query blocked
behind the previous updater of the row for a very long time --- 90 sec
in one case, 30 sec in another. Is that the typical runtime for one of
these updates? If not, can you identify what the blocking query was?
If it's something different from another instance of the identical
update, it'd be useful to know what.

regards, tom lane


From: Gordon Shannon <gordo169(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: seg fault crashed the postmaster
Date: 2010-12-31 21:06:16
Message-ID: AANLkTimV17a4C_SUuFABErBGPihgUBUScd1uRO=UZfSj@mail.gmail.com
Lists: pgsql-general


Yes that query does take 30 or 90 secs. I'm pretty sure it was blocking on
its twin update running concurrently. However I'm not really sure how to
identify what "transaction 1283585646" was.
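
(Next time, while the blocker is still running, I suppose something like this
would show it -- untested:)

    SELECT l.transactionid, a.procpid, a.current_query
      FROM pg_locks l
      JOIN pg_stat_activity a ON a.procpid = l.pid
     WHERE l.transactionid = '1283585646'::xid
       AND l.granted;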

Enclosed is the query plan -- 21000 lines

-gordon

> I tried to replicate the problem here without success, which suggests to
> me that maybe it's plan-dependent. Can you show what EXPLAIN gives you
> for the troublesome query?
>
> Another point here is that your logs suggest that the query blocked
> behind the previous updater of the row for a very long time --- 90 sec
> in one case, 30 sec in another. Is that the typical runtime for one of
> these updates? If not, can you identify what the blocking query was?
> If it's something different from another instance of the identical
> update, it'd be useful to know what.
>
>



From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gordon Shannon <gordo169(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: seg fault crashed the postmaster
Date: 2010-12-31 21:17:03
Message-ID: 6381.1293830223@sss.pgh.pa.us
Lists: pgsql-general

Gordon Shannon <gordo169(at)gmail(dot)com> writes:
> Enclosed is the query plan -- 21000 lines

Um ... nope?

regards, tom lane


From: Gordon Shannon <gordo169(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: seg fault crashed the postmaster
Date: 2010-12-31 21:42:21
Message-ID: 1293831741629-3323933.post@n5.nabble.com
Lists: pgsql-general


Maybe it doesn't work from gmail. I'll try uploading from here.

http://postgresql.1045698.n5.nabble.com/file/n3323933/plan.txt plan.txt


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gordon Shannon <gordo169(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: seg fault crashed the postmaster
Date: 2010-12-31 21:49:03
Message-ID: 6829.1293832143@sss.pgh.pa.us
Lists: pgsql-general

I wrote:
> The odds seem pretty good that the "corrupt compressed data" message
> has the same origin at bottom, although the lack of any obvious data
> to be compressed in this table is confusing. Maybe you could get that
> from trying to copy over a garbage value of that one varchar column,
> though.

Nah, that's overthinking the problem. Just before the previously
identified crash point, we have

td = DatumGetHeapTupleHeader(datum);

That's a pg_detoast_datum call underneath the macro. So if you assume
that the tuple-datum pointer is pointing at garbage, you can get all
three of the observed symptoms very easily, depending on what the
garbage passing for its length word happens to look like.

So I'm pretty sure that what we're dealing with is a case of the plan
freeing a transient tuple datum sooner than it should. But the toy case
I tried here didn't fail, so evidently I'm not close enough to the plan
you're actually using. Still need to see that EXPLAIN output. It'd be
helpful also to know what nondefault configuration settings you're
using, especially work_mem and planner-related settings. Also, do you
have an idea of how many rows might've matched the WHERE conditions?

regards, tom lane


From: Gordon Shannon <gordo169(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: seg fault crashed the postmaster
Date: 2010-12-31 22:11:15
Message-ID: AANLkTimcFGLUOR87_G7BtCQ2PpLvGH_FKrtt8_AiVkRy@mail.gmail.com
Lists: pgsql-general


The number of matching rows on these queries is anything from 0 to 10000. I
don't think I can tell how many would have matched on the ones that
crashed, although I suspect it would have been toward the 10000 end. I've
been trying to get a reproducible test case with no luck so far.

I assume you can now see the plan? I uploaded it twice, once via gmail and
once on Nabble.

Here are all the non-default settings:

> listen_addresses = '*'
> max_connections = 450
> authentication_timeout = 20s
> shared_buffers = 18GB
> work_mem = 200MB
> maintenance_work_mem = 8GB
> shared_preload_libraries = 'auto_explain'
> wal_level = hot_standby
> synchronous_commit = off
> wal_buffers = 8MB
> commit_siblings = 1
> checkpoint_segments = 256
> checkpoint_warning = 5min
> archive_mode = on
> archive_command = 'cp %p /var/lib/pgsql/wal/%f.wrk; mv
/var/lib/pgsql/wal/%f.wrk /var/lib/pgsql/wal/%f'
> max_wal_senders = 1
> cpu_tuple_cost = 0.003
> cpu_index_tuple_cost = 0.001
> cpu_operator_cost = 0.0005
> effective_cache_size = 52GB
> default_statistics_target = 200
> log_directory = '/var/log/postgres'
> log_filename = 'pg%d.log'
> log_min_duration_statement = 7min
> log_line_prefix = '%p %u %r %t [%l]'
> log_lock_waits = on
> log_temp_files = 0
> track_functions = pl # none, pl, all
> log_autovacuum_min_duration = 5min
> autovacuum_vacuum_scale_factor = 0.1
> autovacuum_analyze_scale_factor = 0.03
> autovacuum_freeze_max_age = 1500000000 # 1,500,000,000
> autovacuum_vacuum_cost_delay = -1
> temp_tablespaces =
'ts03,ts04,ts05,ts06,ts07,ts08,ts09,ts10,ts11,ts12,ts13,ts14,ts15,ts16,ts17,ts18,ts19,ts20,ts21,ts22,ts23,ts24,ts25,ts26,ts27,ts28,ts29,ts30,ts31,ts32,ts33,ts34,ts35,ts36,ts37,ts38'
> vacuum_freeze_min_age = 500000000 # 500,000,000
> vacuum_freeze_table_age = 1300000000 # 1,300,000,000
> bytea_output = 'escape'
> deadlock_timeout = 5s
> standard_conforming_strings = on
> custom_variable_classes = 'auto_explain'
> auto_explain.log_min_duration = -1 # Remember this means for everybody!
> auto_explain.log_analyze = off ## DANGER! Don't set log_analyze to
true unless you know what you're doing!
> auto_explain.log_verbose = off
> auto_explain.log_nested_statements = on

On Fri, Dec 31, 2010 at 2:49 PM, Tom Lane wrote:

>
> So I'm pretty sure that what we're dealing with is a case of the plan
> freeing a transient tuple datum sooner than it should. But the toy case
> I tried here didn't fail, so evidently I'm not close enough to the plan
> you're actually using. Still need to see that EXPLAIN output. It'd be
> helpful also to know what nondefault configuration settings you're
> using, especially work_mem and planner-related settings. Also, do you
> have an idea of how many rows might've matched the WHERE conditions?
>
>
>



From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gordon Shannon <gordo169(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: seg fault crashed the postmaster
Date: 2010-12-31 22:42:33
Message-ID: 7579.1293835353@sss.pgh.pa.us
Lists: pgsql-general

Gordon Shannon <gordo169(at)gmail(dot)com> writes:
> I assume you can now see the plan? I uploaded it twice, once via gmail and
> once on Nabble.

Yeah, the Nabble one works. Now I'm even more confused, because the
whole-row var seems to be coming from the outside of the nestloop, which
is about the simplest possible case.

> The number of matching rows on these queries is anything from 0 to 10000. I
> don't think I can tell how many would have matched on the ones that
> crashed, although I suspect it would have been toward the 10000 end. I've
> been trying to get a reproducible test case with no luck so far.

No luck here either. A couple further questions --- do you think it's
likely that you could have been over 10000 matches, ie, the LIMIT
actually kicked in? Also, could there have been more than two
concurrent updates acting on the same rows?

regards, tom lane


From: Gordon Shannon <gordo169(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: seg fault crashed the postmaster
Date: 2011-01-01 01:55:58
Message-ID: AANLkTiktGQzo6DynsOqoK_N66U-vVDaFgtKmx1se5DDx@mail.gmail.com
Lists: pgsql-general


Unfortunately it's now impossible to say how many were updated, as they get
deleted by another process later. I may be able to restore part of a dump
from 2 days ago on another machine, and get some counts from that, assuming
I have the disk space. I'll work on that.
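
(On the restored copy I'd just run something like this for the author ids
from the crashed statements:)

    SELECT count(*) FROM v_messages
     WHERE author_id = 33138761 AND status != 'S';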

I do not believe there could have been more than 2 concurrent updates
because the app is limited to 2 threads, but I will verify that with the
developer.

-gordon

On Fri, Dec 31, 2010 at 3:43 PM, Tom Lane wrote:

>
>
> No luck here either. A couple further questions --- do you think it's
> likely that you could have been over 10000 matches, ie, the LIMIT
> actually kicked in? Also, could there have been more than two
> concurrent updates acting on the same rows?
>
>



From: Gordon Shannon <gordo169(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: seg fault crashed the postmaster
Date: 2011-01-05 01:03:48
Message-ID: 1294189428089-3328138.post@n5.nabble.com
Lists: pgsql-general


I'm putting this on this thread, since it could be related to the issue.

I'm now seeing this in the log on the HSB/SR server. It's happened about 4
times in the past 2 days.

23964 2011-01-04 05:23:00 EST [47]LOG: invalid record length at
6E53/46E8A010
23535 2011-01-04 05:23:00 EST [2]FATAL: terminating walreceiver process
due to administrator command
cp: cannot stat `/data23/wal_sync/0000000100006E5300000046': No such file or
directory
23964 2011-01-04 05:23:00 EST [48]LOG: invalid record length at
6E53/46E8A010
cp: cannot stat `/data23/wal_sync/0000000100006E5300000046': No such file or
directory
2486 2011-01-04 05:23:00 EST [1]LOG: streaming replication successfully
connected to primary
23964 2011-01-04 15:47:59 EST [49]LOG: invalid record length at
6E6B/F8222010
2486 2011-01-04 15:47:59 EST [2]FATAL: terminating walreceiver process
due to administrator command
cp: cannot stat `/data23/wal_sync/0000000100006E6B000000F8': No such file or
directory
23964 2011-01-04 15:48:00 EST [50]LOG: invalid record length at
6E6B/F8222010
cp: cannot stat `/data23/wal_sync/0000000100006E6B000000F8': No such file or
directory

The word FATAL sounds very ominous.



From: u235sentinel <u235sentinel(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: beginners autovacuum question
Date: 2011-01-06 01:08:45
Message-ID: 4D25161D.6050508@gmail.com
Lists: pgsql-general

I'm tracking a problem with our tables being bloated and was curious if
someone regularly kills autovacuum jobs, will autovacuum later reattempt
to vacuum the table it was killed under?

I've made autovacuum more aggressive and given more worker threads. Yet
for some reason we're not keeping up. The tables on this system are
very active (at least many of them are). We have a large database
(above 3.5 TB at this time) on a Sun Solaris Thumper (sunfire 4500
series I believe).

Vacuum does complete when I run it manually on a table. But I'm
suspecting a coworker working late at night may be killing autovacuum.
Reading through the logs right now to see if this is the case.

Thanks!


From: Devrim GÜNDÜZ <devrim(at)gunduz(dot)org>
To: u235sentinel <u235sentinel(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: beginners autovacuum question
Date: 2011-01-06 07:20:25
Message-ID: 1294298425.2686.59.camel@lenovo01-laptop03.gunduz.org
Lists: pgsql-general

Hi,

On Wed, 2011-01-05 at 18:08 -0700, u235sentinel wrote:
> I'm tracking a problem with our tables being bloated and was curious
> if someone regularly kills autovacuum jobs, will autovacuum later
> reattempt to vacuum the table it was killed under?

In 8.3+, autovacuum cancels itself when it conflicts with other
queries. If it is killed by a human, it will try to run again in
the next cycle (which depends on the max_workers and naptime settings).

> I've made autovacuum more aggressive and given more worker threads.

Giving more worker threads will probably not make autovacuum more aggressive
by itself. You will also want to decrease the threshold values, either
globally in postgresql.conf or with per-relation settings (stored in pg_class).
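
For example, on 8.4 or later the per-relation settings are table storage
parameters ("busy_table" below is just a placeholder; tune the numbers to
your tables):

    ALTER TABLE busy_table SET (
        autovacuum_vacuum_scale_factor = 0.02,
        autovacuum_vacuum_threshold = 1000
    );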

> Yet for some reason we're not keeping up.

Probably because of the threshold values.

> Vacuum does complete when I run it manually on a table. But I'm
> suspecting a coworker working late at night may be killing autovacuum.
> Reading through the logs right now to see if this is the case.

Set log_autovacuum_min_duration to 0 to log all autovacuum activity. It
will let you know when it cancels itself, with a message like "ERROR:
canceling autovacuum task" (that may not be the exact wording, I did not
check).

Regards,
--
Devrim GÜNDÜZ
PostgreSQL Danışmanı/Consultant, Red Hat Certified Engineer
PostgreSQL RPM Repository: http://yum.pgrpms.org
Community: devrim~PostgreSQL.org, devrim.gunduz~linux.org.tr
http://www.gunduz.org Twitter: http://twitter.com/devrimgunduz


From: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
To: u235sentinel <u235sentinel(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: beginners autovacuum question
Date: 2011-01-06 14:45:04
Message-ID: AANLkTik8JL86JisXfD0b=M=dDYoxNCFig6+p--gsSThC@mail.gmail.com
Lists: pgsql-general

On Wed, Jan 5, 2011 at 6:08 PM, u235sentinel <u235sentinel(at)gmail(dot)com> wrote:
> I'm tracking a problem with our tables being bloated and was curious if
> someone regularly kills autovacuum jobs, will autovacuum later reattempt to
> vacuum the table it was killed under?
>
> I've made autovacuum more aggressive and given more worker threads.  Yet for
> some reason we're not keeping up.

Can you show us what you've changed to make autovac more aggressive?

You might want to make sure you've lowered
autovacuum_vacuum_cost_delay (mine's at 0, as we have 28 or so disks in
a single RAID-10 and lots of spare IO). Also raise
autovacuum_vacuum_cost_limit much higher (mine's at 5000). Assuming
these tables get updated a lot, I'm gonna guess the problem isn't
thresholds set too high, but autovacuum simply not keeping up. But it's
just a guess.
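
A quick way to see whether autovac is keeping up is to check how many dead
tuples are piling up, something like:

    SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
      FROM pg_stat_user_tables
     ORDER BY n_dead_tup DESC
     LIMIT 20;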

Also, what do you get from things like iostat, vmstat, and sar as
regards your IO utilization and such? If your IO system is always at
100% then more aggressive vacuuming isn't gonna do much, because
you'll always be behind.