Quick Links

BUG #5269: postgres backend terminates with SIGSEGV

Lists:	pgsql-bugs

From:	"Justin Pitts" <justinpitts(at)gmail(dot)com>
To:	pgsql-bugs(at)postgresql(dot)org
Subject:	BUG #5269: postgres backend terminates with SIGSEGV
Date:	2010-01-09 03:50:38
Message-ID:	201001090350.o093octq014172@wwwmaster.postgresql.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-bugs

The following bug has been logged online:

Bug reference: 5269
Logged by: Justin Pitts
Email address: justinpitts(at)gmail(dot)com
PostgreSQL version: 8.4.2
Operating system: Debian Lenny 2.6.30-bpo.1-amd-64 kernel
Description: postgres backend terminates with SIGSEGV
Details:

I originally experienced this on 8.4.1.

Postgres backend terminates with SIGSEGV. I can reproduce this behavior only
about 30% of the time by manipulating my application. The application
performs a steady stream of inserts all the time, and the provoking action
is a rather complex report.

I am including a backtrace. Sanitizing a simple, clear repro case will take
us some time, and it may be easier for us to host a duplicate machine for
someone to ssh into. Please let me know if that is reasonable, or how else I
can assist.

Core was generated by `postgres: ssjiddwdbusr ssjiddwdb 192.168.20.35(46628)
S'.
Program terminated with signal 11, Segmentation fault.
[New process 9564]
#0 0x00007f1c43e0eee5 in memcpy () from /lib/libc.so.6
(gdb) bt
#0 0x00007f1c43e0eee5 in memcpy () from /lib/libc.so.6
#1 0x00000000006bf259 in CopySnapshot (snapshot=0x167db40) at
snapmgr.c:231
#2 0x00000000006bf36d in PushActiveSnapshot (snap=0x167db40) at
snapmgr.c:276
#3 0x0000000000560443 in _SPI_execute_plan (plan=0x16bfe20,
paramLI=0x16507a0, snapshot=0x0,
crosscheck_snapshot=0x0, read_only=0 '\0', fire_triggers=1 '\001',
tcount=2) at spi.c:1797
#4 0x0000000000560a96 in SPI_execute_plan (plan=0x16bfe20,
Values=0x16cfec0, Nulls=0x16cfed8 " ",
read_only=0 '\0', tcount=2) at spi.c:392
#5 0x00007f1c01dcdcda in exec_run_select (estate=0x7fffe98830e0,
expr=0x1674708, maxtuples=2, portalP=0x0)
at pl_exec.c:4149
#6 0x00007f1c01dcdfd2 in exec_eval_expr (estate=0x7fffe98830e0,
expr=0x1674708, isNull=0x7fffe9882e6f "",
rettype=0x7fffe9882e68) at pl_exec.c:4067
#7 0x00007f1c01dd0223 in exec_assign_expr (estate=0x16920f0,
target=0x16cfda0, expr=0x0) at pl_exec.c:3428
#8 0x00007f1c01dd1eb2 in exec_stmts (estate=0x7fffe98830e0, stmts=<value
optimized out>) at pl_exec.c:1345
#9 0x00007f1c01dd3216 in exec_stmt_block (estate=0x7fffe98830e0,
block=0x1675f60) at pl_exec.c:1137
#10 0x00007f1c01dd47fc in plpgsql_exec_function (func=0x16c2b60,
fcinfo=0x7fffe9883350) at pl_exec.c:315
#11 0x00007f1c01dca43e in plpgsql_call_handler (fcinfo=0x7fffe9883350) at
pl_handler.c:95
#12 0x00000000005493e6 in ExecMakeFunctionResult (fcache=0x167dd00,
econtext=0x167dbf0, isNull=0x167e9d8 "",
isDone=0x167eaf0) at execQual.c:1685
#13 0x000000000054484e in ExecProject (projInfo=<value optimized out>,
isDone=0x7fffe988383c)
at execQual.c:5007
#14 0x0000000000556e49 in ExecResult (node=0x167dae0) at nodeResult.c:155
#15 0x0000000000543c5d in ExecProcNode (node=0x167dae0) at
execProcnode.c:344
#16 0x00000000005418a2 in standard_ExecutorRun (queryDesc=0x16bed10,
direction=ForwardScanDirection, count=0)
at execMain.c:1504
#17 0x00000000005ed677 in PortalRunSelect (portal=0x164a780, forward=<value
optimized out>, count=0,
dest=0x1602cf0) at pquery.c:953
#18 0x00000000005eea29 in PortalRun (portal=0x164a780,
count=9223372036854775807, isTopLevel=1 '\001',
dest=0x1602cf0, altdest=0x1602cf0, completionTag=0x7fffe9883cd0 "") at
pquery.c:779
#19 0x00000000005eb790 in PostgresMain (argc=4, argv=<value optimized out>,

username=0x1563c40 "ssjiddwdbusr") at postgres.c:1928
#20 0x00000000005bfe08 in ServerLoop () at postmaster.c:3449
#21 0x00000000005c0b97 in PostmasterMain (argc=3, argv=0x1561240) at
postmaster.c:1040
#22 0x000000000056a558 in main (argc=3, argv=0x1561240) at main.c:188

I notice bug # 5238, but I have no idea if it is related. I'll take a
hint from it however:

(gdb) p (char *) debug_query_string
$1 = 0x166e190 "SELECT \"datawarehouse\".spr_update_device($1, $2, $3,
$4, $5, $6, $7, $8, $9, $10)"

This is somewhat unexpected. I expected the 'complex report' to be here,
but
this is a writer process which manages a data warehouse schema.

I know more information is needed. I am hoping this is helpful in the
meantime.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Justin Pitts" <justinpitts(at)gmail(dot)com>
Cc:	pgsql-bugs(at)postgresql(dot)org, Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Subject:	Re: BUG #5269: postgres backend terminates with SIGSEGV
Date:	2010-01-09 04:04:55
Message-ID:	28098.1263009895@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-bugs

"Justin Pitts" <justinpitts(at)gmail(dot)com> writes:
> [ crash in CopySnapshot ]

This is a new one I think. Alvaro fixed some snapshot management issues
back in October, but if you're seeing it on 8.4.2 then that didn't fix
it :-(

> I am including a backtrace. Sanitizing a simple, clear repro case will take
> us some time, and it may be easier for us to host a duplicate machine for
> someone to ssh into. Please let me know if that is reasonable, or how else I
> can assist.

The trace is definitely not sufficient info by itself. If you'd prefer
to make a machine available than extract a test case, we can work with
that. Contact me or Alvaro offlist to work out details.

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Justin Pitts <justinpitts(at)gmail(dot)com>
Cc:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-bugs(at)postgreSQL(dot)org
Subject:	Re: BUG #5269: postgres backend terminates with SIGSEGV
Date:	2010-01-13 04:53:48
Message-ID:	4010.1263358428@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-bugs

Justin Pitts <justinpitts(at)gmail(dot)com> writes:
> It went seriously heisenbug since I added %p. It took all day to crash it since I added that to the configuration. Except this latest crash is completely different - signal 6 ABRT, different place.

Specifically:

2010-01-13 01:50:42 UTC 5455 ERROR: snapshot reference 0xba6ac0 is not owned by resource owner TopTransaction
2010-01-13 01:50:42 UTC 5455 STATEMENT: COMMIT
2010-01-13 01:50:42 UTC 5455 WARNING: AbortTransaction while in COMMIT state
2010-01-13 01:50:42 UTC 5455 PANIC: cannot abort transaction 777554, it was already committed

This looks different, but it matches what I had concluded from the
previous dumps: snapmgr.c thinks it has a valid serializable snapshot,
but it doesn't. In your previous crashes, CurrentSnaphot was pointing
at trashed memory; here, it's still a valid-looking snapshot, but it's
not registered in the TopTransactionResourceOwner like it's supposed to
be (it wasn't in the other cases either).

After puzzling over this for many hours, I have a theory that seems to
fit the facts. What I think must be happening is that
GetTransactionSnapshot is getting called outside any transaction ---
that is, after AtEOXact_Snapshot but before CleanupTransaction (thus,
the old TopTransactionResourceOwner is still there, so its attempt
to register the snap doesn't dump core). Your build isn't
assert-enabled so the extra snapshot registration wouldn't be noticed
by the eventual ResourceOwnerDelete. Then, when we next start a
transaction and request a serializable snapshot, GetTransactionSnapshot
thinks it has nothing to do and happily hands back a pointer to garbage.

Now, how the heck did GetTransactionSnapshot get called that way?
Your log shows that you're using client software that likes to abort
failed transactions with a previously prepared statement:

2010-01-13 01:50:41 UTC 5455 LOG: execute S_2: ROLLBACK

This means that to get to CleanupTransaction, we have to look up a
cached plan while in aborted transaction state. And that will call
RevalidateCachedPlan, which does this if the plan is marked dead:

/*
* If a snapshot is already set (the normal case), we can just use
* that for parsing/planning. But if it isn't, install one. Note: no
* point in checking whether parse analysis requires a snapshot;
* utility commands don't have invalidatable plans, so we'd not get
* here for such a command.
*/
if (!ActiveSnapshotSet())
{
PushActiveSnapshot(GetTransactionSnapshot());
snapshot_set = true;
}

As the comment indicates, it should not be possible to get here for a
ROLLBACK plan. But look what got added in 8.4: ResetPlanCache(), which
marks plans dead *unconditionally*. And one of the ways that that can
get invoked is via a SI queue overrun event, which triggers a general
cache flush and in particular calls PlanCacheSysCallback().

So what is needed to make this scenario happen is a SI queue overrun,
which isn't a common event unless your application does a lot
of catalog manipulations. I suppose that the reason your reporting
query manages to tickle it is that it creates/deletes temp tables.
And even then, you'd need an actual failure on the side receiving
the overrun notification. In the cases at hand that failure is

2010-01-13 01:50:41 UTC 5455 ERROR: could not serialize access due to concurrent update

which doesn't happen real often in your app, but does happen.

So this theory explains what I found in the core dumps, which is
otherwise dang hard to explain (I spent a lot of time looking for
code paths that could bypass AtEOXact_Snapshot, but there aren't
any). It explains why the failure is uncommon. It explains why the
failures come in bunches (since a SI overrun is broadcast, and
would thereby poison all active sessions with "dead" ROLLBACK
plans; and the subsequent serialization errors could easily come
in bunches too). It explains why we haven't seen similar stuff
in testing --- preparing something as trivial as a ROLLBACK is
pretty silly so nobody does it. And it explains why we didn't
see it before 8.4 --- the bug was in fact introduced here:
http://archives.postgresql.org/pgsql-committers/2008-09/msg00084.php
So even though I haven't actually reproduced the problem, I have a
fair amount of confidence in the explanation.

We could perhaps fix this by changing the above-quoted code, but
I'm inclined to teach ResetPlanCache that utility statements should
never be marked dead. They don't have plans in the normal sense
of the word so there's no point in forcing a revalidation cycle.

I'm kinda beat right now but will work on a patch tomorrow.

regards, tom lane

From:	Kris Jurka <books(at)ejurka(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Justin Pitts <justinpitts(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-bugs(at)postgreSQL(dot)org
Subject:	Re: BUG #5269: postgres backend terminates with SIGSEGV
Date:	2010-01-13 07:07:21
Message-ID:	alpine.BSO.2.00.1001130201380.24423@leary.csoft.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-bugs

On Tue, 12 Jan 2010, Tom Lane wrote:

> preparing something as trivial as a ROLLBACK is pretty silly so nobody
> does it.

Actually the JDBC driver has been preparing BEGIN, COMMIT, and ROLLBACK
since the 8.0 release.

http://archives.postgresql.org/pgsql-jdbc/2006-10/msg00149.php

Kris Jurka

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Justin Pitts <justinpitts(at)gmail(dot)com>
Cc:	pgsql-bugs(at)postgreSQL(dot)org
Subject:	Re: BUG #5269: postgres backend terminates with SIGSEGV
Date:	2010-01-13 16:58:48
Message-ID:	14512.1263401928@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-bugs

I wrote:
> After puzzling over this for many hours, I have a theory that seems to
> fit the facts.

I think the attached patch will fix it for you --- please test.

regards, tom lane

Attachment	Content-Type	Size
resetplancache-fix.patch	text/x-patch	2.2 KB

From:	Justin Pitts <justinpitts(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-bugs(at)postgreSQL(dot)org
Subject:	Re: BUG #5269: postgres backend terminates with SIGSEGV
Date:	2010-01-14 03:22:30
Message-ID:	1871DADD-ECC1-4115-B4CF-03005AAC7641@gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-bugs

Sorry for the delay.

I am attempting to construct a JDBC test case that reproduces the problem.

I have installed the patch and have not seen the crash since.

On Jan 13, 2010, at 11:58 AM, Tom Lane wrote:

> I wrote:
>> After puzzling over this for many hours, I have a theory that seems to
>> fit the facts.
>
> I think the attached patch will fix it for you --- please test.
>
> regards, tom lane
>
> Index: src/backend/utils/cache/plancache.c
> ===================================================================
> RCS file: /cvsroot/pgsql/src/backend/utils/cache/plancache.c,v
> retrieving revision 1.27.2.1
> diff -c -r1.27.2.1 plancache.c
> *** src/backend/utils/cache/plancache.c 14 Jul 2009 15:37:55 -0000 1.27.2.1
> --- src/backend/utils/cache/plancache.c 13 Jan 2010 16:46:07 -0000
> ***************
> *** 1050,1063 ****
> void
> ResetPlanCache(void)
> {
> ! ListCell *lc;
>
> ! foreach(lc, cached_plans_list)
> {
> ! CachedPlanSource *plansource = (CachedPlanSource *) lfirst(lc);
> CachedPlan *plan = plansource->plan;
>
> ! if (plan)
> ! plan->dead = true;
> }
> }
> --- 1050,1106 ----
> void
> ResetPlanCache(void)
> {
> ! ListCell *lc1;
>
> ! foreach(lc1, cached_plans_list)
> {
> ! CachedPlanSource *plansource = (CachedPlanSource *) lfirst(lc1);
> CachedPlan *plan = plansource->plan;
> + ListCell *lc2;
>
> ! /* No work if it's already invalidated */
> ! if (!plan || plan->dead)
> ! continue;
> !
> ! /*
> ! * We *must not* mark transaction control statements as dead,
> ! * particularly not ROLLBACK, because they may need to be executed in
> ! * aborted transactions when we can't revalidate them (cf bug #5269).
> ! * In general there is no point in invalidating utility statements
> ! * since they have no plans anyway. So mark it dead only if it
> ! * contains at least one non-utility statement.
> ! */
> ! if (plan->fully_planned)
> ! {
> ! /* Search statement list for non-utility statements */
> ! foreach(lc2, plan->stmt_list)
> ! {
> ! PlannedStmt *plannedstmt = (PlannedStmt *) lfirst(lc2);
> !
> ! Assert(!IsA(plannedstmt, Query));
> ! if (IsA(plannedstmt, PlannedStmt))
> ! {
> ! /* non-utility statement, so invalidate */
> ! plan->dead = true;
> ! break; /* out of stmt_list scan */
> ! }
> ! }
> ! }
> ! else
> ! {
> ! /* Search Query list for non-utility statements */
> ! foreach(lc2, plan->stmt_list)
> ! {
> ! Query *query = (Query *) lfirst(lc2);
> !
> ! Assert(IsA(query, Query));
> ! if (query->commandType != CMD_UTILITY)
> ! {
> ! /* non-utility statement, so invalidate */
> ! plan->dead = true;
> ! break; /* out of stmt_list scan */
> ! }
> ! }
> ! }
> }
> }

From:	Justin Pitts <jpitts(at)bplglobal(dot)net>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-bugs(at)postgreSQL(dot)org
Subject:	Re: BUG #5269: postgres backend terminates with SIGSEGV
Date:	2010-01-14 15:07:36
Message-ID:	C1CAC7F4-7E90-4778-BB7B-D08ACD541EB3@bplglobal.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-bugs

Apologies for over-quoting on the previous message.

My approach to a regression test was this single-threaded test ( actual code at bottom )

Connection 1 - implicitly autocommiting every statement
create a test table and populate it with a single row.
CREATE TABLE FOO ( a int not null );
INSERT INTO FOO VALUES (1);
set the connection autocommit to false ( implied transaction begin on next statement )
read
SELECT * FROM FOO;
modify
UPDATE FOO SET a=a+1;
( transaction is still open )
Connection 2
set the connection autocommit to false
read
SELECT * FROM FOO;
Connection 1
commit;
Connection 2
modify
UPDATE FOO SET a=a+1
throws concurrent serialization error
Connection 3
do a bunch of catalog updates
for f in 1..??? ( i tried 1, 10, 100, and 4096 )
CREATE TEMP TABLE BAR$f (a int not null );
Connection 2
rollback <- I am expecting a server crash here. It does not crash.

My guess is that I am not provoking a 'SI queue overrun'
( I am assuming this is a reference to the queue implemented in sinvaladt.c. My C skills are rusty at best. )

Am I completely off base about how this should be reproducing?

Bug5269Test.java

import java.sql.Connection;
import java.sql.Driver;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Properties;

/* Regression test for PostgreSQL Bug # 5269
* http://archives.postgresql.org/pgsql-bugs/2010-01/msg00080.php
* */

public class Bug5269Test {

private Connection newConnection(Driver driver) {
Properties info = new Properties();
info.put("user", "postgres");
try {
return driver.connect("jdbc:postgresql://localhost:5433", info);
} catch (SQLException e) {
e.printStackTrace();
}
return null;
}

public void test() throws SQLException, InterruptedException {
Driver driver = new org.postgresql.Driver();
Connection a = newConnection(driver);
setupTable(a);

Connection b = newConnection(driver);
a.setAutoCommit(false);
a.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE);
b.setAutoCommit(false);
b.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE);

String select = "select * from foo";
String update = "update foo set a = a + 1";

a.prepareStatement(select).execute();
System.out.println("select a");

a.prepareStatement(update).execute();
System.out.println("update a");

b.prepareStatement(select).execute();
System.out.println("select b");

a.commit();
System.out.println("commit a");
try {
b.prepareStatement(update).execute();
b.commit();
throw new RuntimeException("Expected a serialization failure.");
} catch (SQLException sqle) {
System.out.println(sqle.getMessage());
flogCatalogs(driver);
System.out.println("flog catalogs");

// The hope is that an unpatched system would crash here.
b.rollback();
System.out.println("rollback.");
}
a.close();
b.close();
}

private void setupTable(Connection a) throws SQLException {
try {
a.prepareStatement("drop table foo").execute();
} catch (Exception e) {
e.printStackTrace();
}
a.prepareStatement("create table foo(a int not null)").executeUpdate();
a.prepareStatement("insert into foo values (1);").executeUpdate();
}

private void flogCatalogs(final Driver driver) throws SQLException {
final Connection conn = newConnection(driver);
try {
conn.setAutoCommit(false);
Statement ps = conn.createStatement();
String a = "create temp table bar";
String b = " (a int not null); ";
for (int f = 0; f < 128; f++) {
StringBuilder sb = new StringBuilder();
sb.append(a).append(f).append(b);
ps.execute(sb.toString());
}
conn.commit();
} catch (SQLException e) {
e.printStackTrace();
} finally {
try {
conn.close();
} catch (SQLException e) {
e.printStackTrace();
}
}
}

public static void main(String... args) throws SQLException,
InterruptedException {
Bug5269Test test = new Bug5269Test();
test.test();
}

}

On Jan 13, 2010, at 10:22 PM, Justin Pitts wrote:

> Sorry for the delay.
>
> I am attempting to construct a JDBC test case that reproduces the problem.
>
> I have installed the patch and have not seen the crash since.
>
> On Jan 13, 2010, at 11:58 AM, Tom Lane wrote:
>

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Justin Pitts <jpitts(at)bplglobal(dot)net>
Cc:	pgsql-bugs(at)postgreSQL(dot)org
Subject:	Re: BUG #5269: postgres backend terminates with SIGSEGV
Date:	2010-01-14 15:16:50
Message-ID:	23567.1263482210@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-bugs

Justin Pitts <jpitts(at)bplglobal(dot)net> writes:
> My guess is that I am not provoking a 'SI queue overrun'

The 100 temp table creations probably will do that just fine.

> Am I completely off base about how this should be reproducing?

Two points: the session you hope to have crash *must* be in serializable
mode, and the crash would actually happen in the transaction after the
one that's rolled back.

The error doesn't have to be a serialization error, so in principle
you should be able to make it fail with something as simple as

begin;
select 1/0;
rollback;
select * from foo;

as long as the ROLLBACK is done with a prepared statement and you've
forced a SI overrun since the ROLLBACK was prepared.

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Justin Pitts <jpitts(at)bplglobal(dot)net>
Cc:	pgsql-bugs(at)postgreSQL(dot)org
Subject:	Re: BUG #5269: postgres backend terminates with SIGSEGV
Date:	2010-01-14 15:19:22
Message-ID:	23614.1263482362@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-bugs

BTW --- from my own perspective, what is much more interesting is
testing time on your real application to make sure that you don't
see a crash any more. I am certain that I identified and fixed
*a* bug, what I'm less certain about is that it is actually what
you were hitting. The previous advice would help you reproduce
the bug that I fixed, but ...

regards, tom lane

From:	Justin Pitts <jpitts(at)bplglobal(dot)net>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-bugs(at)postgreSQL(dot)org
Subject:	Re: BUG #5269: postgres backend terminates with SIGSEGV
Date:	2010-01-14 15:25:46
Message-ID:	EA2B064A-2D8C-4948-8F72-C57EFF39863E@bplglobal.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-bugs

As difficult as it was to reproduce the bug, I'd like a clear cut regression test. The use case where it manifested is fairly rare in normal use - server start after prolonged outage, where the queue of inbound device events is built up enough to make serialization errors more likely.
On Jan 14, 2010, at 10:19 AM, Tom Lane wrote:

> BTW --- from my own perspective, what is much more interesting is
> testing time on your real application to make sure that you don't
> see a crash any more. I am certain that I identified and fixed
> *a* bug, what I'm less certain about is that it is actually what
> you were hitting. The previous advice would help you reproduce
> the bug that I fixed, but ...
>
> regards, tom lane

From:	Justin Pitts <jpitts(at)bplglobal(dot)net>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-bugs(at)postgreSQL(dot)org
Subject:	Re: BUG #5269: postgres backend terminates with SIGSEGV
Date:	2010-01-14 15:34:10
Message-ID:	02566E30-5D4D-495F-A3B2-52899610854D@bplglobal.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-bugs

On Jan 14, 2010, at 10:16 AM, Tom Lane wrote:

> Justin Pitts <jpitts(at)bplglobal(dot)net> writes:
>> My guess is that I am not provoking a 'SI queue overrun'
>
> The 100 temp table creations probably will do that just fine.
>

Is there a way to verify this?

>> Am I completely off base about how this should be reproducing?
>
> Two points: the session you hope to have crash *must* be in serializable
> mode,

The 2 competing sessions doing the read/modify sequence on foo are set to SERIALIZABLE.

> and the crash would actually happen in the transaction after the
> one that's rolled back.
>

I don't follow. Are you suggesting I begin another transaction on connection 1 with a read, and that
would provoke the crash?

> The error doesn't have to be a serialization error, so in principle
> you should be able to make it fail with something as simple as
>
> begin;
> select 1/0;
> rollback;
> select * from foo;
>
> as long as the ROLLBACK is done with a prepared statement and you've
> forced a SI overrun since the ROLLBACK was prepared.
>
> regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Justin Pitts <jpitts(at)bplglobal(dot)net>
Cc:	pgsql-bugs(at)postgreSQL(dot)org
Subject:	Re: BUG #5269: postgres backend terminates with SIGSEGV
Date:	2010-01-14 15:44:49
Message-ID:	24024.1263483889@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-bugs

Justin Pitts <jpitts(at)bplglobal(dot)net> writes:
> On Jan 14, 2010, at 10:16 AM, Tom Lane wrote:
>> The 100 temp table creations probably will do that just fine.

> Is there a way to verify this?

You could add an elog(LOG, "message") into ResetPlanCache so you could
tell when it had been called.

> I don't follow. Are you suggesting I begin another transaction on connection 1 with a read, and that
> would provoke the crash?

Yes. The rollback only sets the stage for the next transaction to try
to use a snapshot that isn't there anymore.

regards, tom lane

From:	Justin Pitts <jpitts(at)bplglobal(dot)net>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-bugs(at)postgreSQL(dot)org
Subject:	Re: BUG #5269: postgres backend terminates with SIGSEGV
Date:	2010-01-14 19:03:45
Message-ID:	F7ACA758-1E26-4413-B6E3-DFC5EA6B370E@bplglobal.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-bugs

On Jan 14, 2010, at 10:44 AM, Tom Lane wrote:

> Justin Pitts <jpitts(at)bplglobal(dot)net> writes:
>> On Jan 14, 2010, at 10:16 AM, Tom Lane wrote:
>>> The 100 temp table creations probably will do that just fine.
>
>> Is there a way to verify this?
>
> You could add an elog(LOG, "message") into ResetPlanCache so you could
> tell when it had been called.
>

Done. Sometimes I see it, sometimes not.

>> I don't follow. Are you suggesting I begin another transaction on connection 1 with a read, and that
>> would provoke the crash?
>
> Yes. The rollback only sets the stage for the next transaction to try
> to use a snapshot that isn't there anymore.
>

Oh, duh. A read from the same session that rolled-back. That didn't get it working (failing?) however.

Running concurrent instances of this test reliably provokes the crash on un-patched 8.4.2.
They do not provoke a crash with the patch. That's what i was looking for.

Thanks!

> regards, tom lane