Re: [PATCHES] Important 7.0.* fix to ensure buffers are released

Lists: pgsql-hackerspgsql-patches
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-patches(at)postgreSQL(dot)org
Subject: Important 7.0.* fix to ensure buffers are released
Date: 2000-08-30 21:29:51
Message-ID: 28223.967670991@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Hiroshi Inoue pointed out that Postgres neglects to do an explicit
transaction abort during backend shutdown. For example, in psql
begin;
declare myc cursor for select * from ..;
fetch in myc;
\q
would cause the backend to exit without having released the resources
acquired for the open transaction. This is OK from the point of view
of data integrity (other transactions will believe that the transaction
was aborted) but not OK if shared resources are left locked up. In
particular, this oversight probably accounts for the sporadic reports
we've seen of errors like

NOTICE: FlushRelationBuffers(all_flows, 500237): block 171439 is
referenced (private 0, global 1)
FATAL 1: VACUUM (vc_repair_frag): FlushRelationBuffers returned -2

since shared buffer reference counts would not be released by an
exiting backend, leading to a complaint (perhaps much later) when
VACUUM checks that there are no references to the relation it's
trying to vacuum.

I have fixed this problem in current sources and back-patched the
fix into the 7.0.* branch. But I do not know when or if we'll have
a 7.0.3 release, so for anyone who's been annoyed by this problem
and doesn't want to wait, the patch for 7.0.* is attached.

regards, tom lane

*** src/backend/tcop/postgres.c.orig Sat May 20 22:23:30 2000
--- src/backend/tcop/postgres.c Wed Aug 30 16:47:51 2000
***************
*** 1459,1465 ****
* Initialize the deferred trigger manager
*/
if (DeferredTriggerInit() != 0)
! proc_exit(0);

SetProcessingMode(NormalProcessing);

--- 1459,1465 ----
* Initialize the deferred trigger manager
*/
if (DeferredTriggerInit() != 0)
! goto normalexit;

SetProcessingMode(NormalProcessing);

***************
*** 1479,1490 ****
TPRINTF(TRACE_VERBOSE, "AbortCurrentTransaction");

AbortCurrentTransaction();
! InError = false;
if (ExitAfterAbort)
! {
! ProcReleaseLocks(); /* Just to be sure... */
! proc_exit(0);
! }
}

Warn_restart_ready = true; /* we can now handle elog(ERROR) */
--- 1479,1489 ----
TPRINTF(TRACE_VERBOSE, "AbortCurrentTransaction");

AbortCurrentTransaction();
!
if (ExitAfterAbort)
! goto errorexit;
!
! InError = false;
}

Warn_restart_ready = true; /* we can now handle elog(ERROR) */
***************
*** 1553,1560 ****
if (HandleFunctionRequest() == EOF)
{
/* lost frontend connection during F message input */
! pq_close();
! proc_exit(0);
}
break;

--- 1552,1558 ----
if (HandleFunctionRequest() == EOF)
{
/* lost frontend connection during F message input */
! goto normalexit;
}
break;

***************
*** 1608,1618 ****
*/
case 'X':
case EOF:
! if (!IsUnderPostmaster)
! ShutdownXLOG();
! pq_close();
! proc_exit(0);
! break;

default:
elog(ERROR, "unknown frontend message was received");
--- 1606,1612 ----
*/
case 'X':
case EOF:
! goto normalexit;

default:
elog(ERROR, "unknown frontend message was received");
***************
*** 1642,1651 ****
if (IsUnderPostmaster)
NullCommand(Remote);
}
! } /* infinite for-loop */

! proc_exit(0); /* shouldn't get here... */
! return 1;
}

#ifndef HAVE_GETRUSAGE
--- 1636,1655 ----
if (IsUnderPostmaster)
NullCommand(Remote);
}
! } /* end of main loop */
!
! normalexit:
! ExitAfterAbort = true; /* ensure we will exit if elog during abort */
! AbortOutOfAnyTransaction();
! if (!IsUnderPostmaster)
! ShutdownXLOG();
!
! errorexit:
! pq_close();
! ProcReleaseLocks(); /* Just to be sure... */
! proc_exit(0);

! return 1; /* keep compiler quiet */
}

#ifndef HAVE_GETRUSAGE


From: t-ishii(at)sra(dot)co(dot)jp
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: pgsql-patches(at)postgreSQL(dot)org
Subject: Re: Important 7.0.* fix to ensure buffers are released
Date: 2000-09-02 08:14:33
Message-ID: 20000902171433W.t-ishii@sra.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

> Hiroshi Inoue pointed out that Postgres neglects to do an explicit
> transaction abort during backend shutdown. For example, in psql
> begin;
> declare myc cursor for select * from ..;
> fetch in myc;
> \q
> would cause the backend to exit without having released the resources
> acquired for the open transaction. This is OK from the point of view
> of data integrity (other transactions will believe that the transaction
> was aborted) but not OK if shared resources are left locked up. In
> particular, this oversight probably accounts for the sporadic reports
> we've seen of errors like
>
> NOTICE: FlushRelationBuffers(all_flows, 500237): block 171439 is
> referenced (private 0, global 1)
> FATAL 1: VACUUM (vc_repair_frag): FlushRelationBuffers returned -2
>
> since shared buffer reference counts would not be released by an
> exiting backend, leading to a complaint (perhaps much later) when
> VACUUM checks that there are no references to the relation it's
> trying to vacuum.

Interesting thing is that 6.5.x does not have the problem. Is it new
one for 7.0.x?

I remember that you have fixed some refcount leaks in 6.5.x. Could you
tell me any examples to demonstrate the cases in 6.5.x, those are
supposed to be fixed in 7.0.x? I just want to know what kind of
refcount leak problems existing in 6.5.x and 7.0.x.
--
Tatsuo Ishii


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: t-ishii(at)sra(dot)co(dot)jp
Cc: pgsql-patches(at)postgreSQL(dot)org
Subject: Re: Important 7.0.* fix to ensure buffers are released
Date: 2000-09-02 16:37:56
Message-ID: 1174.967912676@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

t-ishii(at)sra(dot)co(dot)jp writes:
> Interesting thing is that 6.5.x does not have the problem. Is it new
> one for 7.0.x?

I think the bug has been there for a long time. It is easier to see
in 7.0.2 because VACUUM will now check for nonzero refcount on *all*
pages of the relation. Formerly, it only checked pages that it was
about to actually truncate from the relation. So it's possible for
an unreleased pin on a page to go unnoticed in 6.5 but generate a
complaint in 7.0.

Now that I look closely, I see that VACUUM still has a problem with
this in current sources: it only calls FlushRelationBuffers() if it
needs to shorten the relation. So pinned pages will not be reported
unless the file gets shortened by at least one page. This is a bug
because it means that pg_upgrade still can't trust VACUUM to ensure
that all on-row status bits are correct (see comments for
FlushRelationBuffers). I will change it to call FlushRelationBuffers
always.

> I remember that you have fixed some refcount leaks in 6.5.x. Could you
> tell me any examples to demonstrate the cases in 6.5.x, those are
> supposed to be fixed in 7.0.x?

I think the primary problems had to do with recursive calls to
ExecutorRun, which'd invoke the badly broken buffer refcount save/
restore mechanism that was present in 6.5 and earlier. This would
mainly be done by SQL and PL functions that do SELECTs. A couple
of examples:
* elog(ERROR) from inside an SQL function would mean that buffer
refcounts held by the outer scan wouldn't be released. So, eg,
SELECT sqlfunction(column1) FROM foo;
was a buffer leak risk.
* SQL functions returning sets could leak even without any elog(),
if the entire set result was not read for some reason.
There were probably some non-SQL-function cases that got fixed along the
way, but I don't have any concrete examples. See the pghacker threads
Anyone understand shared buffer refcount mechanism?
Progress report: buffer refcount bugs and SQL functions
from September 1999 for more info.

regards, tom lane


From: t-ishii(at)sra(dot)co(dot)jp
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [PATCHES] Important 7.0.* fix to ensure buffers are released
Date: 2000-09-04 01:04:16
Message-ID: 20000904100416J.t-ishii@sra.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

[ Cc: to hackers list]

> I think the primary problems had to do with recursive calls to
> ExecutorRun, which'd invoke the badly broken buffer refcount save/
> restore mechanism that was present in 6.5 and earlier. This would
> mainly be done by SQL and PL functions that do SELECTs. A couple
> of examples:
> * elog(ERROR) from inside an SQL function would mean that buffer
> refcounts held by the outer scan wouldn't be released. So, eg,
> SELECT sqlfunction(column1) FROM foo;
> was a buffer leak risk.

Following case doesn't produce notices from BlowawayRelationBuffers.

drop table t1;
create table t1(i int);
drop table t2;
create table t2(i int);
insert into t1 values(1);
drop function f1(int);
create function f1(int) returns int as '
select $1 +1;
select i from t2;
' language 'sql';
drop table t2;
select f1(i) from t1;
delete from t1;
vacuum t1;

Am I missing something?

> * SQL functions returning sets could leak even without any elog(),
> if the entire set result was not read for some reason.

However, following case produces:

NOTICE: BlowawayRelationBuffers(t1, 0): block 0 is referenced...

as expected.

drop table t1;
create table t1(i int);
insert into t1 values(1);
insert into t1 select i from t1;
insert into t1 select i from t1;
drop function f1(int);
create function f1(int) returns setof int as '
select i from t1;
' language 'sql';
select f1(i) from t1 limit 1 offset 0;
delete from t1;
vacuum analyze t1;

Interesting thing is that the select in above case produces a
notice in 7.0.2 (with or without your patches):

NOTICE: Buffer Leak: [059] (freeNext=-3, freePrev=-3, relname=t1, blockNum=0, flags=0x4, refcount=1 2)

while 6.5.3 does not. Maybe 6.5.3 failes to detect buffer leaks at
transaction commit time?
--
Tatsuo Ishii


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: t-ishii(at)sra(dot)co(dot)jp
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [PATCHES] Important 7.0.* fix to ensure buffers are released
Date: 2000-09-04 02:42:14
Message-ID: 9102.968035334@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

t-ishii(at)sra(dot)co(dot)jp writes:
> Am I missing something?

I don't have 6.5.* running anymore to check, but it looked to me like
elog out of an SQL function would result in refcount leaks. But the
elog would have to occur while inside the function's recursive call
to ExecutorRun, so your example (which will detect its error during
query plan setup) doesn't exhibit the problem. Try something like
select 1/0;
inside the function.

> Interesting thing is that the select in above case produces a
> notice in 7.0.2 (with or without your patches):

Yes, still there in current sources. The leak comes from the fact
that the function's internal SELECT is never shut down because the
function isn't run to completion. This is one of the things I think we
need to fix during querytree redesign. However, 7.0 at least detects
and recovers from the leak, which is more than can be said for 6.5.

> while 6.5.3 does not. Maybe 6.5.3 failes to detect buffer leaks at
> transaction commit time?

In fact it does fail to detect them, in cases like this where the leak
is attributable to an uncompleted query inside a function call. That's
one of the things that was broken about the refcount save/restore
mechanism...

regards, tom lane


From: t-ishii(at)sra(dot)co(dot)jp
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Re: [PATCHES] Important 7.0.* fix to ensure buffers are released
Date: 2000-09-04 11:29:36
Message-ID: 20000904202936D.t-ishii@sra.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

> I don't have 6.5.* running anymore to check, but it looked to me like
> elog out of an SQL function would result in refcount leaks. But the
> elog would have to occur while inside the function's recursive call
> to ExecutorRun, so your example (which will detect its error during
> query plan setup) doesn't exhibit the problem. Try something like
> select 1/0;
> inside the function.

Oh, I see.

> > Interesting thing is that the select in above case produces a
> > notice in 7.0.2 (with or without your patches):
>
> Yes, still there in current sources. The leak comes from the fact
> that the function's internal SELECT is never shut down because the
> function isn't run to completion. This is one of the things I think we
> need to fix during querytree redesign. However, 7.0 at least detects
> and recovers from the leak, which is more than can be said for 6.5.

Agreed.

> > while 6.5.3 does not. Maybe 6.5.3 failes to detect buffer leaks at
> > transaction commit time?
>
> In fact it does fail to detect them, in cases like this where the leak
> is attributable to an uncompleted query inside a function call. That's
> one of the things that was broken about the refcount save/restore
> mechanism...

Understood.
--
Tatsuo Ishii


From: "Hiroshi Inoue" <Inoue(at)tpf(dot)co(dot)jp>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <t-ishii(at)sra(dot)co(dot)jp>
Cc: <pgsql-patches(at)postgreSQL(dot)org>
Subject: RE: Important 7.0.* fix to ensure buffers are released
Date: 2000-09-05 08:51:15
Message-ID: 001701c01716$733beb80$2801007e@tpf.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

> -----Original Message-----
> From: Tom Lane
>
> t-ishii(at)sra(dot)co(dot)jp writes:
> > Interesting thing is that 6.5.x does not have the problem. Is it new
> > one for 7.0.x?
>
> I think the bug has been there for a long time. It is easier to see

One of the reason why we see the bug often in 7.0 seems to be
the following change which was applied to temprel.c before 7.0.
remove_all_temp_relations() always called AbortOutAnyTransaction()
before the change. remove_all_temp_relations() has been called from
shmem_exit() and accidentally(I don't think it had been intensional)
proc_exit() always called AbortOutAnyTransaction().

@@ -79,6 +79,9 @@
List *l,
*next;

+ if (temp_rels == NIL)
+ return;
+
AbortOutOfAnyTransaction();
StartTransactionCommand();

Regards.

Hiroshi Inoue


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Hiroshi Inoue" <Inoue(at)tpf(dot)co(dot)jp>
Cc: t-ishii(at)sra(dot)co(dot)jp, pgsql-patches(at)postgreSQL(dot)org
Subject: Re: Important 7.0.* fix to ensure buffers are released
Date: 2000-09-05 13:57:01
Message-ID: 9716.968162221@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

"Hiroshi Inoue" <Inoue(at)tpf(dot)co(dot)jp> writes:
> One of the reason why we see the bug often in 7.0 seems to be
> the following change which was applied to temprel.c before 7.0.
> remove_all_temp_relations() always called AbortOutAnyTransaction()
> before the change.

Bingo! So actually there was an abort-transaction call buried in the
shutdown process. I wondered why we didn't see more problems...

Anyway, I've added an AbortOutOfAnyTransaction() call to postgres.c,
so the behavior should be more straightforward now.

regards, tom lane


From: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: Inoue(at)tpf(dot)co(dot)jp, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [PATCHES] Important 7.0.* fix to ensure buffers are released
Date: 2000-09-06 08:17:06
Message-ID: 20000906171706A.t-ishii@sra.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

[Cc:ed to hackers list]

> "Hiroshi Inoue" <Inoue(at)tpf(dot)co(dot)jp> writes:
> > One of the reason why we see the bug often in 7.0 seems to be
> > the following change which was applied to temprel.c before 7.0.
> > remove_all_temp_relations() always called AbortOutAnyTransaction()
> > before the change.
>
> Bingo! So actually there was an abort-transaction call buried in the
> shutdown process. I wondered why we didn't see more problems...
>
> Anyway, I've added an AbortOutOfAnyTransaction() call to postgres.c,
> so the behavior should be more straightforward now.

Are you going to make a back patch for the 7.0 tree?
--
Tatsuo Ishii


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
Cc: Inoue(at)tpf(dot)co(dot)jp, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Re: [PATCHES] Important 7.0.* fix to ensure buffers are released
Date: 2000-09-06 14:39:12
Message-ID: 14252.968251152@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp> writes:
>> Anyway, I've added an AbortOutOfAnyTransaction() call to postgres.c,
>> so the behavior should be more straightforward now.

> Are you going to make a back patch for the 7.0 tree?

I did. See starting message of this thread ...

regards, tom lane