Assertion failure in AtCleanup_Portals

Lists: pgsql-hackers
From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Assertion failure in AtCleanup_Portals
Date: 2012-02-03 15:30:14
Message-ID: CABOikdMghBjwbyf6jdb=CSgAb0qERJY0JLK=TLf+CdtN5kThMQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

If I run the following sequence of commands, I get an assertion
failure in current HEAD.

postgres=# BEGIN;
BEGIN
postgres=# SELECT 1/0;
ERROR: division by zero
postgres=# ROLLBACK TO A;
ERROR: no such savepoint
postgres=# \q

The process fails when the session is closed and aborted transaction
gets cleaned at the proc_exit time. The stack trace is as below.

(gdb) bt
#0 0x00c16422 in __kernel_vsyscall ()
#1 0x00244651 in *__GI_raise (sig=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:64
#2 0x00247a82 in *__GI_abort () at abort.c:92
#3 0x0844b800 in ExceptionalCondition (conditionName=0x86161e8
"!(portal->cleanup == ((void *)0))", errorType=0x8615e94
"FailedAssertion",
fileName=0x8615e88 "portalmem.c", lineNumber=793) at assert.c:57
#4 0x08472a5a in AtCleanup_Portals () at portalmem.c:793
#5 0x080f2535 in CleanupTransaction () at xact.c:2373
#6 0x080f40d7 in AbortOutOfAnyTransaction () at xact.c:3855
#7 0x0845cc5e in ShutdownPostgres (code=0, arg=0) at postinit.c:966
#8 0x08332ba7 in shmem_exit (code=0) at ipc.c:221
#9 0x08332ab3 in proc_exit_prepare (code=0) at ipc.c:181
#10 0x08332a15 in proc_exit (code=0) at ipc.c:96
#11 0x0835b07c in PostgresMain (argc=2, argv=0xa12fa64,
username=0xa12f958 "pavan") at postgres.c:4091

I analyzed this a bit. It seems that the first statement throws an
error, runs AbortCurrentTransaction and puts the transaction block in
TBLOCK_ABORT state. Any commands executed after this would usually
fail with error "current transaction is aborted". But the ROLLBACK TO
gets past this check because IsTransactionExitStmt() returns success
(and rightly so). So we end up creating a portal to run the ROLLBACK
TO command. This command fails because the SAVEPOINT A does not exist
and we throw an error. But since the transaction block is already in
TBLOCK_ABORT state, the AbortCurrentTransaction doesn't do any work.
The cleanup routine for the portal remains unexecuted. Later when the
backend exits and cleanup routines are called, the assertion inside
AtCleanup_Portals fails.

I am not sure what is the right fix for this. I see Tom fixed a
similar complain sometime back.
http://git.postgresql.org/pg/commitdiff/6252c4f9e201f619e5eebda12fa867acd4e4200e

But may its not enough to cover all code paths, as demonstrated by this example.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Assertion failure in AtCleanup_Portals
Date: 2012-02-06 21:33:26
Message-ID: 27845.1328564006@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com> writes:
> If I run the following sequence of commands, I get an assertion
> failure in current HEAD.

> postgres=# BEGIN;
> BEGIN
> postgres=# SELECT 1/0;
> ERROR: division by zero
> postgres=# ROLLBACK TO A;
> ERROR: no such savepoint
> postgres=# \q

> The process fails when the session is closed and aborted transaction
> gets cleaned at the proc_exit time. The stack trace is as below.

Hmm. It works fine if you issue an actual ROLLBACK command there,
so my gut reaction is that AbortOutOfAnyTransaction isn't sufficiently
duplicating the full-fledged ROLLBACK code path. No time to dig further
right now though.

regards, tom lane


From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Assertion failure in AtCleanup_Portals
Date: 2012-02-07 11:21:07
Message-ID: CABOikdORwSA4Ti+iNAD0h7BjwyV57tUvEe+JHM08SHH=wc-ZEg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Feb 7, 2012 at 3:03 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com> writes:
>> If I run the following sequence of commands, I get an assertion
>> failure in current HEAD.
>
>> postgres=# BEGIN;
>> BEGIN
>> postgres=# SELECT 1/0;
>> ERROR:  division by zero
>> postgres=# ROLLBACK TO A;
>> ERROR:  no such savepoint
>> postgres=# \q
>
>> The process fails when the session is closed and aborted transaction
>> gets cleaned at the proc_exit time. The stack trace is as below.
>
> Hmm.  It works fine if you issue an actual ROLLBACK command there,
> so my gut reaction is that AbortOutOfAnyTransaction isn't sufficiently
> duplicating the full-fledged ROLLBACK code path.  No time to dig further
> right now though.
>

It works OK for a ROLLBACK command because we create a new unnamed
portal for the ROLLBACK command, silently dropping the old one if it
already exists. Since the ROLLBACK command then runs successfully, we
don't see the same assertion. Would it be safe to drop FAILED unnamed
portals during AbortOutAnyTransaction ? May be it is if we can do that
before creating a new portal for ROLLBACK command itself.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Assertion failure in AtCleanup_Portals
Date: 2012-02-15 19:15:20
Message-ID: 555.1329333320@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com> writes:
> On Tue, Feb 7, 2012 at 3:03 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Hmm. It works fine if you issue an actual ROLLBACK command there,
>> so my gut reaction is that AbortOutOfAnyTransaction isn't sufficiently
>> duplicating the full-fledged ROLLBACK code path. No time to dig further
>> right now though.

> It works OK for a ROLLBACK command because we create a new unnamed
> portal for the ROLLBACK command, silently dropping the old one if it
> already exists. Since the ROLLBACK command then runs successfully, we
> don't see the same assertion. Would it be safe to drop FAILED unnamed
> portals during AbortOutAnyTransaction ? May be it is if we can do that
> before creating a new portal for ROLLBACK command itself.

I poked at this some more, and noticed another dangerous-seeming issue:
at the time PortalCleanup is called, if it does get called, the portal's
stmts and sourceText are already pointing at garbage. The reason is
that exec_simple_query blithely does this:

/*
* We don't have to copy anything into the portal, because everything
* we are passing here is in MessageContext, which will outlive the
* portal anyway.
*/
PortalDefineQuery(portal,
NULL,
query_string,
commandTag,
plantree_list,
NULL);

That comment is a true statement for the normal successful path of
control, wherein we reach the PortalDrop a few dozen lines below.
But it's not true if the command suffers an error. We will mark
the portal PORTAL_FAILED right away (in one of various PG_CATCH
blocks) but we don't clean it up until another command arrives,
so the current contents of MessageContext are long gone when portal
cleanup happens.

It seems to me that the most reliable fix for these issues is to
institute the same policy for transitioning a portal to FAILED state
as we already have for transitions to DONE state, ie, we should run the
cleanup hook immediately. IOW it would be a good idea to create a
function MarkPortalFailed that is just like MarkPortalDone except for
the specific portal state transition, and use that instead of merely
setting portal->status = PORTAL_FAILED in error cleanup situations.
This would ensure that the cleanup hook gets to run before we possibly
cut the portal's memory out from under it. It's more than is probably
needed to resolve the specific crash you're reporting, but it seems
likely to forestall future issues.

I haven't actually tested such a fix yet, but will go do that now.

regards, tom lane