Re: buffer assertion tripping under repeat pgbench load

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Greg Smith <greg(at)2ndQuadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: buffer assertion tripping under repeat pgbench load
Date: 2012-12-23 09:15:33
Message-ID: 20121223091533.GA1653@alap2.fritz.box
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2012-12-23 02:36:42 -0500, Greg Smith wrote:
> I'm testing a checkout from a few days ago and trying to complete a day long
> pgbench stress test, with assertions and debugging on. I want to make sure
> the base code works as expected before moving on to testing checksums. It's
> crashing before finishing though. Here's a sample:
>
> 2012-12-20 20:36:11 EST [26613]: LOG: checkpoint complete: wrote 32 buffers
> (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=3.108
> s, sync=0.978 s, total=4.187 s; sync files=7, longest=0.832 s, average=0.139
> s
> TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line:
> 1703)
> 2012-12-20 20:36:44 EST [26611]: LOG: server process (PID 4834) was
> terminated by signal 6: Aborted
> 2012-12-20 20:36:44 EST [26611]: DETAIL: Failed process was running: END;
>
> Running the same test set again gave the same crash, so this looks
> repeatable. It's not trivial to trigger given the average runtime to crash
> I'm seeing. Second sample:
>
> 2012-12-22 21:27:17 EST [6006]: LOG: checkpoint complete: wrote 34 buffers
> (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=3.310
> s, sync=2.906 s, total=6.424 s; sync files=7, longest=2.648 s, average=0.415
> s
> TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line:
> 1703)
> 2012-12-22 21:27:23 EST [26611]: LOG: server process (PID 12143) was
> terminated by signal 6: Aborted
> 2012-12-22 21:27:23 EST [26611]: DETAIL: Failed process was running: END;
>
> The important part:
>
> TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line:
> 1703)
>
> That's the check done by "AtEOXact_Buffers - clean up at end of
> transaction". It fired while executing the "END;"" statement at the end of
> the standard pgbench test. The test looks for loose things that weren't
> pinned/unpinned correctly by the transaction. I'm not sure if getting a
> stack trace at that point will be useful. The ref count mistake could
> easily have been made by an earlier statement in the transaction block,
> right?
>
> This is on a server where shared_buffers=2GB, checkpoint_segments=64. Test
> runs were automated with pgbench-tools, using the following configuration:
>
> MAX_WORKERS="4"
> SCRIPT="tpc-b.sql"
> SCALES="500 1000"
> SETCLIENTS="4 8 16 32 64 96"
> SETTIMES=3
> RUNTIME=600
>
> All of the crashes so far have been at scale=500, and I haven't seen those
> finish yet. It failed 7 hours in the first time, then 4 hours in.
>
> For code reference, the last commit in the source code tree I built against
> was:
>
> af275a12dfeecd621ed9899a9382f26a68310263
> Thu Dec 20 14:23:31 2012 +0200
>
> Looking at recent activity, the only buffer pin related changes I saw was
> this one:
>
> commit 62656617dbe49cca140f3da588a5e311b3fc35ea
> Date: Mon Dec 3 18:53:31 2012 +0000
> Avoid holding vmbuffer pin after VACUUM.

I don't immediately see anything bad here.

> This is my first test like this against 9.3 development though, so the cause
> could be an earlier commit. I'm just starting with the most recent work as
> the first suspect. Next I think I'll try autovacuum=off and see if the
> crash goes away. Other ideas are welcome.

Something like the (untested) debug message below might be helpful:

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7be767b..20f61a1 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1700,7 +1700,13 @@ AtEOXact_Buffers(bool isCommit)

for (i = 0; i < NBuffers; i++)
{
- Assert(PrivateRefCount[i] == 0);
+ if (PrivateRefCount[i] != 0)
+ {
+ BufferDesc *bufHdr = &BufferDescriptors[i];
+ elog(PANIC, "refcount of %s is %u should be 0, globally: %u",
+ relpathbackend(bufHdr->tag.rnode, InvalidBackendId, bufHdr->tag.forkNum),
+ PrivateRefCount[i], bufHdr->refcount);
+ }
}
}
#endif

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Marko Kreen 2012-12-23 09:58:40 Re: pgcrypto seeding problem when ssl=on
Previous Message Greg Smith 2012-12-23 07:36:42 buffer assertion tripping under repeat pgbench load