Re: BUG #6218: TRAP: FailedAssertion( "!(owner->nsnapshots == 0)", File: "resowner.c", Line: 365)

Lists: pgsql-bugspgsql-hackers
From: "YAMAMOTO Takashi" <yamt(at)mwd(dot)biglobe(dot)ne(dot)jp>
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #6218: TRAP: FailedAssertion("!(owner->nsnapshots == 0)", File: "resowner.c", Line: 365)
Date: 2011-09-21 15:22:40
Message-ID: 201109211522.p8LFMe76017557@wwwmaster.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers


The following bug has been logged online:

Bug reference: 6218
Logged by: YAMAMOTO Takashi
Email address: yamt(at)mwd(dot)biglobe(dot)ne(dot)jp
PostgreSQL version: 9.2
Operating system: NetBSD
Description: TRAP: FailedAssertion("!(owner->nsnapshots == 0)", File:
"resowner.c", Line: 365)
Details:

9.2devel (2562dcea811eb642e1c5442e1ede9fe268278157)

ERROR: could not serialize access due to concurrent update
STATEMENT: UPDATE file SET atime = current_timestamp WHERE fileid = $1
TRAP: FailedAssertion("!(owner->nsnapshots == 0)", File: "resowner.c", Line:
365)
LOG: server process (PID 16832) was terminated by signal 6: Abort trap
LOG: terminating any other active server processes

Core was generated by `postgres'.
Program terminated with signal 6, Aborted.
#0 0xbbb9c547 in _lwp_kill () from /usr/lib/libc.so.12
(gdb) bt
#0 0xbbb9c547 in _lwp_kill () from /usr/lib/libc.so.12
#1 0xbbb9c503 in raise (s=6) at /siro/nbsd/src/lib/libc/gen/raise.c:48
#2 0xbbb9bced in abort () at /siro/nbsd/src/lib/libc/stdlib/abort.c:74
#3 0x08400d8b in ExceptionalCondition (
conditionName=0x85a28f3 "!(owner->nsnapshots == 0)",
errorType=0x85a280b "FailedAssertion", fileName=0x85a2800 "resowner.c",
lineNumber=365) at assert.c:57
#4 0x084272a7 in ResourceOwnerDelete (owner=0x99be23f0) at resowner.c:365
#5 0x080d96f3 in CleanupTransaction () at xact.c:2374
#6 0x080d98d9 in CommitTransactionCommand () at xact.c:2538
#7 0x08318802 in finish_xact_command () at postgres.c:2378
#8 0x08317c19 in exec_execute_message (portal_name=0x99b2f01c "",
max_rows=2147483647) at postgres.c:1915
#9 0x0831a6cf in PostgresMain (argc=2, argv=0xbb9117b0,
username=0xbb911704 "takashi") at postgres.c:3909
#10 0x082c47b0 in BackendRun (port=0xbb93d0f0) at postmaster.c:3512
#11 0x082c3e12 in BackendStartup (port=0xbb93d0f0) at postmaster.c:3197
#12 0x082c0e94 in ServerLoop () at postmaster.c:1333
#13 0xbb93d0f0 in ?? ()
#14 0x085f8850 in UnBlockSig ()
#15 0x00000000 in ?? ()
(gdb) fr 4
#4 0x084272a7 in ResourceOwnerDelete (owner=0x99be23f0) at resowner.c:365
365 Assert(owner->nsnapshots == 0);
(gdb) p *owner
$1 = {parent = 0x0, firstchild = 0x99b2c174, nextchild = 0x0,
name = 0x844bfb2 "TopTransaction", nbuffers = 0, buffers = 0x0,
maxbuffers = 0, ncatrefs = 0, catrefs = 0x99b85f88, maxcatrefs = 16,
ncatlistrefs = 0, catlistrefs = 0x0, maxcatlistrefs = 0, nrelrefs = 0,
relrefs = 0x0, maxrelrefs = 0, nplanrefs = 0, planrefs = 0x0,
maxplanrefs = 0, ntupdescs = 0, tupdescs = 0x0, maxtupdescs = 0,
nsnapshots = 1, snapshots = 0xbb9f4fa4, maxsnapshots = 16, nfiles = 0,
files = 0x0, maxfiles = 0}
(gdb) p *owner->snapshots[0]
$7 = {satisfies = 0x8435138 <HeapTupleSatisfiesMVCC>, xmin = 991, xmax =
991,
xcnt = 0, xip = 0x0, subxcnt = 0, subxip = 0x0, suboverflowed = 0 '\0',
takenDuringRecovery = 0 '\0', curcid = 1, active_count = 0, regd_count =
1,
copied = 1 '\001'}
(gdb) p CurrentTransactionState
$8 = (TransactionState) 0x85bea00
(gdb) p *CurrentTransactionState
$9 = {transactionId = 990, subTransactionId = 1, name = 0x0,
savepointLevel = 0, state = TRANS_ABORT, blockState = TBLOCK_ABORT_END,
nestingLevel = 1, gucNestLevel = 1, curTransactionContext = 0xbb912a40,
curTransactionOwner = 0x99be23f0, childXids = 0x0, nChildXids = 0,
maxChildXids = 0, prevUser = 10, prevSecContext = 0,
prevXactReadOnly = 0 '\0', startedInRecovery = 0 '\0', parent = 0x0}
(gdb) p debug_query_string
$12 = 0x99b2f428 "ROLLBACK"
(gdb)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "YAMAMOTO Takashi" <yamt(at)mwd(dot)biglobe(dot)ne(dot)jp>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #6218: TRAP: FailedAssertion("!(owner->nsnapshots == 0)", File: "resowner.c", Line: 365)
Date: 2011-09-21 15:58:00
Message-ID: 24084.1316620680@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

"YAMAMOTO Takashi" <yamt(at)mwd(dot)biglobe(dot)ne(dot)jp> writes:
> 9.2devel (2562dcea811eb642e1c5442e1ede9fe268278157)

> ERROR: could not serialize access due to concurrent update
> STATEMENT: UPDATE file SET atime = current_timestamp WHERE fileid = $1
> TRAP: FailedAssertion("!(owner->nsnapshots == 0)", File: "resowner.c", Line:
> 365)
> LOG: server process (PID 16832) was terminated by signal 6: Abort trap

There isn't terribly much we can do with this report unless you can
provide a complete test case to reproduce it.

regards, tom lane


From: yamt(at)mwd(dot)biglobe(dot)ne(dot)jp (YAMAMOTO Takashi)
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #6218: TRAP: FailedAssertion( "!(owner->nsnapshots == 0)", File: "resowner.c", Line: 365)
Date: 2011-09-23 17:10:26
Message-ID: 20110923171026.2962F14A1BC@mail.netbsd.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

hi,

> "YAMAMOTO Takashi" <yamt(at)mwd(dot)biglobe(dot)ne(dot)jp> writes:
>> 9.2devel (2562dcea811eb642e1c5442e1ede9fe268278157)
>
>> ERROR: could not serialize access due to concurrent update
>> STATEMENT: UPDATE file SET atime = current_timestamp WHERE fileid = $1
>> TRAP: FailedAssertion("!(owner->nsnapshots == 0)", File: "resowner.c", Line:
>> 365)
>> LOG: server process (PID 16832) was terminated by signal 6: Abort trap
>
> There isn't terribly much we can do with this report unless you can
> provide a complete test case to reproduce it.
>
> regards, tom lane

after some investigation, i think it is triggered by protocol-level prepare
of ROLLBACK in an aborted transaction. does the following patch make sense?

YAMAMOTO Takashi

diff --git a/src/backend/utils/cache/plancache.c b/src/backend/utils/cache/plancache.c
index 9cccc87..d1da7df 100644
--- a/src/backend/utils/cache/plancache.c
+++ b/src/backend/utils/cache/plancache.c
@@ -520,6 +520,7 @@ RevalidateCachedQuery(CachedPlanSource *plansource)
* don't have invalidatable plans, so we'd not get here for such a
* command.
*/
+ Assert(analyze_requires_snapshot(plansource->raw_parse_tree));
snapshot_set = false;
if (!ActiveSnapshotSet())
{
@@ -757,13 +758,11 @@ BuildCachedPlan(CachedPlanSource *plansource, List *qlist,

/*
* If a snapshot is already set (the normal case), we can just use
- * that for parsing/planning. But if it isn't, install one. Note: no
- * point in checking whether parse analysis requires a snapshot;
- * utility commands don't have invalidatable plans, so we'd not get
- * here for such a command.
+ * that for parsing/planning. But if it isn't, install one.
*/
snapshot_set = false;
- if (!ActiveSnapshotSet())
+ if (!ActiveSnapshotSet() &&
+ analyze_requires_snapshot(plansource->raw_parse_tree))
{
PushActiveSnapshot(GetTransactionSnapshot());
snapshot_set = true;


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: yamt(at)mwd(dot)biglobe(dot)ne(dot)jp (YAMAMOTO Takashi)
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #6218: TRAP: FailedAssertion( "!(owner->nsnapshots == 0)", File: "resowner.c", Line: 365)
Date: 2011-09-25 20:00:09
Message-ID: 5291.1316980809@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

yamt(at)mwd(dot)biglobe(dot)ne(dot)jp (YAMAMOTO Takashi) writes:
>> There isn't terribly much we can do with this report unless you can
>> provide a complete test case to reproduce it.

> after some investigation, i think it is triggered by protocol-level prepare
> of ROLLBACK in an aborted transaction. does the following patch make sense?

Maybe, but I'd still like to see a test case, because I can't reproduce
any such problem by preparing ROLLBACK in an aborted transaction.

regards, tom lane


From: yamt(at)mwd(dot)biglobe(dot)ne(dot)jp (YAMAMOTO Takashi)
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #6218: TRAP: FailedAssertion( "!(owner->nsnapshots == 0)", File: "resowner.c", Line: 365)
Date: 2011-09-26 10:06:53
Message-ID: 20110926100653.4D3BC14A143@mail.netbsd.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

hi,

> yamt(at)mwd(dot)biglobe(dot)ne(dot)jp (YAMAMOTO Takashi) writes:
>>> There isn't terribly much we can do with this report unless you can
>>> provide a complete test case to reproduce it.
>
>> after some investigation, i think it is triggered by protocol-level prepare
>> of ROLLBACK in an aborted transaction. does the following patch make sense?
>
> Maybe, but I'd still like to see a test case, because I can't reproduce
> any such problem by preparing ROLLBACK in an aborted transaction.

reading GetTransactionSnapshot, it seems that the problem happens
only with IsolationUsesXactSnapshot() true.
the attached program triggers the assertion failure reliably for me.

YAMAMOTO Takashi

>
> regards, tom lane

Attachment Content-Type Size
pg.c text/plain 893 bytes

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: yamt(at)mwd(dot)biglobe(dot)ne(dot)jp (YAMAMOTO Takashi)
Cc: pgsql-bugs(at)postgresql(dot)org, Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Subject: Re: BUG #6218: TRAP: FailedAssertion( "!(owner->nsnapshots == 0)", File: "resowner.c", Line: 365)
Date: 2011-09-26 16:26:37
Message-ID: 8244.1317054397@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

yamt(at)mwd(dot)biglobe(dot)ne(dot)jp (YAMAMOTO Takashi) writes:
>> Maybe, but I'd still like to see a test case, because I can't reproduce
>> any such problem by preparing ROLLBACK in an aborted transaction.

> reading GetTransactionSnapshot, it seems that the problem happens
> only with IsolationUsesXactSnapshot() true.

Hmm. I'm inclined to think that this demonstrates a bug in snapshot
management, not so much in plancache. We have plancache doing

PushActiveSnapshot(GetTransactionSnapshot());

and then later

PopActiveSnapshot();

and at this point surely it is not plancache's fault if there is any
remaining refcount for the snapshot. There is, though, because
GetTransactionSnapshot saved a refcount in TopTransactionResourceOwner.
I think it's snapmgr.c's responsibility to make sure that that's cleaned
up, and it's not doing so.

The place where that refcount normally gets dropped is
AtEarlyCommit_Snapshot, but that isn't going to be called at all in
aborted-transaction cleanup. Worse, if we just transposed it over to be
called in a place in AbortTransaction comparable to where it's called
during commit, that still wouldn't fix the problem, because when the
ROLLBACK happens, we've already aborted the transaction.

I think that AtEarlyCommit_Snapshot is misdesigned, and that far from
being done "early" in commit/abort, it needs to be done "late", like
somewhere not very long before the
ResourceOwnerDelete(TopTransactionResourceOwner) calls. There is no
very good reason to think that someone might not ask for a snapshot
during commit processing.

Alvaro, do you happen to remember why this got designed as an "early"
transaction shutdown action, rather than delaying it as long as
possible?

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: yamt <yamt(at)mwd(dot)biglobe(dot)ne(dot)jp>, Pg Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #6218: TRAP: FailedAssertion( "!(owner->nsnapshots == 0)", File: "resowner.c", Line: 365)
Date: 2011-09-26 21:29:53
Message-ID: 1317070986-sup-4844@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers


Excerpts from Tom Lane's message of lun sep 26 13:26:37 -0300 2011:
>
> yamt(at)mwd(dot)biglobe(dot)ne(dot)jp (YAMAMOTO Takashi) writes:
> >> Maybe, but I'd still like to see a test case, because I can't reproduce
> >> any such problem by preparing ROLLBACK in an aborted transaction.
>
> > reading GetTransactionSnapshot, it seems that the problem happens
> > only with IsolationUsesXactSnapshot() true.
>
> Hmm. I'm inclined to think that this demonstrates a bug in snapshot
> management, not so much in plancache. We have plancache doing
>
> PushActiveSnapshot(GetTransactionSnapshot());
>
> and then later
>
> PopActiveSnapshot();
>
> and at this point surely it is not plancache's fault if there is any
> remaining refcount for the snapshot. There is, though, because
> GetTransactionSnapshot saved a refcount in TopTransactionResourceOwner.
> I think it's snapmgr.c's responsibility to make sure that that's cleaned
> up, and it's not doing so.

Agreed.

> The place where that refcount normally gets dropped is
> AtEarlyCommit_Snapshot, but that isn't going to be called at all in
> aborted-transaction cleanup. Worse, if we just transposed it over to be
> called in a place in AbortTransaction comparable to where it's called
> during commit, that still wouldn't fix the problem, because when the
> ROLLBACK happens, we've already aborted the transaction.

... ouch.

> I think that AtEarlyCommit_Snapshot is misdesigned, and that far from
> being done "early" in commit/abort, it needs to be done "late", like
> somewhere not very long before the
> ResourceOwnerDelete(TopTransactionResourceOwner) calls. There is no
> very good reason to think that someone might not ask for a snapshot
> during commit processing.
>
> Alvaro, do you happen to remember why this got designed as an "early"
> transaction shutdown action, rather than delaying it as long as
> possible?

As far as I remember, the only principle was that it had to run before
ResourceOwner cleanup. Commit 7b640b0345dc4fbd39ff568700985b432f6afa07
introduces that "early" call; ResOwner support had been introduced 10
days before in 6bbef4e5383c99d93aa974e2c79d328cfbd1c4a9. I probably
just tried it out and noticed that resowner.c complained if I didn't
drop the refcount prior to its own cleanup.

I don't think I ever considered the scenario of calls in aborted
transactions.

Shall I work on a fix? I expect you are plenty busy with commitfest
stuff, but please let me know otherwise.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: yamt <yamt(at)mwd(dot)biglobe(dot)ne(dot)jp>, Pg Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #6218: TRAP: FailedAssertion( "!(owner->nsnapshots == 0)", File: "resowner.c", Line: 365)
Date: 2011-09-26 21:46:12
Message-ID: 2126.1317073572@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Excerpts from Tom Lane's message of lun sep 26 13:26:37 -0300 2011:
>> The place where that refcount normally gets dropped is
>> AtEarlyCommit_Snapshot, but that isn't going to be called at all in
>> aborted-transaction cleanup. Worse, if we just transposed it over to be
>> called in a place in AbortTransaction comparable to where it's called
>> during commit, that still wouldn't fix the problem, because when the
>> ROLLBACK happens, we've already aborted the transaction.

> ... ouch.

> Shall I work on a fix? I expect you are plenty busy with commitfest
> stuff, but please let me know otherwise.

I have what-I-think-is-the-fix pretty clear in my own mind, so let me
give it a try. If it doesn't work I'll bounce it back to you.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: yamt <yamt(at)mwd(dot)biglobe(dot)ne(dot)jp>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [BUGS] BUG #6218: TRAP: FailedAssertion( "!(owner->nsnapshots == 0)", File: "resowner.c", Line: 365)
Date: 2011-09-26 23:59:45
Message-ID: 6382.1317081585@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

I wrote:
> Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
>> Shall I work on a fix? I expect you are plenty busy with commitfest
>> stuff, but please let me know otherwise.

> I have what-I-think-is-the-fix pretty clear in my own mind, so let me
> give it a try. If it doesn't work I'll bounce it back to you.

Well, I soon ran into the issue that delaying the snapshot release makes
TopTransactionResourceOwner spit up. After some reflection I decided
that the real problem is a circular dependency: snapshot management must
be considered lower-level than ResourceOwners because ResourceOwners
tell snapshot management what to do, but here we have
GetTransactionSnapshot trying to use TopTransactionResourceOwner to
manage its internal reference to the transaction snapshot.

Accordingly, the attached proposed patch gets rid of the circularity
by removing snapmgr.c's dependency on TopTransactionResourceOwner,
in favor of having it track the refcount "manually". This was messier
than I'd hoped because the bogus design had propagated into the SSI
manager meanwhile, but removing the TopTransactionResourceOwner
dependency from that too seems like a good idea.

This passes the regular and isolation regression tests, and it's also
okay with Yamamoto-san's prepared-ROLLBACK test case even without the
band-aid fix in plancache.c. I can't immediately think of any other
test cases to throw at it.

Comments?

regards, tom lane

Attachment Content-Type Size
snapmgr-circularity-fix.patch text/x-patch 15.2 KB

From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: yamt <yamt(at)mwd(dot)biglobe(dot)ne(dot)jp>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [BUGS] BUG #6218: TRAP: FailedAssertion( "!(owner->nsnapshots == 0)", File: "resowner.c", Line: 365)
Date: 2011-09-27 00:33:03
Message-ID: 1317083058-sup-9596@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers


Excerpts from Tom Lane's message of lun sep 26 20:59:45 -0300 2011:

> Well, I soon ran into the issue that delaying the snapshot release makes
> TopTransactionResourceOwner spit up. After some reflection I decided
> that the real problem is a circular dependency: snapshot management must
> be considered lower-level than ResourceOwners because ResourceOwners
> tell snapshot management what to do, but here we have
> GetTransactionSnapshot trying to use TopTransactionResourceOwner to
> manage its internal reference to the transaction snapshot.

Great. I noticed the circularity but didn't reflect that it was bogus
in itself.

> Accordingly, the attached proposed patch gets rid of the circularity
> by removing snapmgr.c's dependency on TopTransactionResourceOwner,
> in favor of having it track the refcount "manually". This was messier
> than I'd hoped because the bogus design had propagated into the SSI
> manager meanwhile, but removing the TopTransactionResourceOwner
> dependency from that too seems like a good idea.

To be honest, I panicked for a second when I saw the new
SnapshotResetXmin call, before I realized that it wasn't necessary
before. The serializable case makes more sense the patched way, I
think.

> This passes the regular and isolation regression tests, and it's also
> okay with Yamamoto-san's prepared-ROLLBACK test case even without the
> band-aid fix in plancache.c. I can't immediately think of any other
> test cases to throw at it.

I added tests for all the problematic cases discovered after snapmgr was
introduced, so at least most known weird spots are covered.

Thanks

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: yamt <yamt(at)mwd(dot)biglobe(dot)ne(dot)jp>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [BUGS] BUG #6218: TRAP: FailedAssertion( "!(owner->nsnapshots == 0)", File: "resowner.c", Line: 365)
Date: 2011-09-27 00:43:45
Message-ID: 7135.1317084225@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> To be honest, I panicked for a second when I saw the new
> SnapshotResetXmin call, before I realized that it wasn't necessary
> before. The serializable case makes more sense the patched way, I
> think.

Yeah, in the old coding, SnapshotResetXmin would have happened
implicitly at the last UnregisterSnapshot call. In this arrangement,
the "last UnregisterSnapshot" is going to be the manual one in
AtEOXact_Snapshot, so we need a manual SnapshotResetXmin there too.
I chose to put it at the bottom of the routine so that it's guaranteed
to fire even if the RegisteredSnapshots count was corrupted, but that's
just paranoia.

regards, tom lane