Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)
Date: 2012-02-20 09:09:14
Message-ID: CAHGQGwGRuNJ=_ctXwteNkFkdvMDNFYxFdn0D1cd-CqL0OgNCLg@mail.gmail.com
Lists: pgsql-hackers

On Sun, Feb 19, 2012 at 3:01 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> I've tested your v9 patch.  I no longer see any inconsistencies or
> lost transactions in the recovered database.  But occasionally I get
> databases that fail to recover at all.
> It has always been with the exact same failed assertion, at xlog.c line 2154.
>
> I've only seen this 4 times out of 2202 cycles of crash and recovery,
> so it must be some rather obscure situation.
>
> LOG:  database system was not properly shut down; automatic recovery in progress
> LOG:  redo starts at 0/180001B0
> LOG:  unexpected pageaddr 0/15084000 in log file 0, segment 25, offset 540672
> LOG:  redo done at 0/19083FD0
> LOG:  last completed transaction was at log time 2012-02-17 11:13:50.369488-08
> LOG:  checkpoint starting: end-of-recovery immediate
> TRAP: FailedAssertion("!(((((((uint64) (NewPageEndPtr).xlogid *
> (uint64) (((uint32) 0xffffffff) / ((uint32) (16 * 1024 * 1024))) *
> ((uint32) (16 * 1024 * 1024))) + (NewPageEndPtr).xrecoff - 1)) / 8192)
> % (XLogCtl->XLogCacheBlck + 1)) == nextidx)", File: "xlog.c", Line:
> 2154)
> LOG:  startup process (PID 5390) was terminated by signal 6: Aborted
> LOG:  aborting startup due to startup process failure

I was able to reproduce this by crashing the server (immediate shutdown)
right after executing "select pg_switch_xlog()".

$ initdb -D data
$ pg_ctl -D data start
$ psql -c "select pg_switch_xlog()"
$ pg_ctl -D data stop -m i
$ pg_ctl -D data start
...
LOG: redo done at 0/16E3B0C
TRAP: FailedAssertion("!(((((((uint64) (NewPageEndPtr).xlogid *
(uint64) (((uint32) 0xffffffff) / ((uint32) (16 * 1024 * 1024))) *
((uint32) (16 * 1024 * 1024))) + (NewPageEndPtr).xrecoff - 1)) / 8192)
% (XLogCtl->XLogCacheBlck + 1)) == nextidx)", File: "xlog.c", Line:
2154)
LOG: startup process (PID 16361) was terminated by signal 6: Aborted
LOG: aborting startup due to startup process failure

Though I've not read the new patch yet, I suspect that the xlog switch
code still has a bug.
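
For anyone decoding the expanded assertion in the TRAP message, here is a
rough sketch of the arithmetic it performs: it maps the end pointer of a
new WAL page to a slot in the WAL buffer ring and expects that slot to be
nextidx. The constants are copied from the assertion text. The struct, the
function name page_end_to_buf_idx and the cache_blck parameter are
stand-ins I made up for illustration, not the actual xlog.c definitions.

```c
#include <stdint.h>

/* Constants as they appear in the expanded assertion text. */
#define SEG_SIZE      ((uint32_t) (16 * 1024 * 1024))         /* 16 MB segment */
#define SEGS_PER_FILE (((uint32_t) 0xffffffff) / SEG_SIZE)    /* 255 */
#define BLCKSZ_WAL    8192                                    /* WAL page size */

/* Stand-in for the two-field WAL pointer seen in the assertion. */
typedef struct
{
    uint32_t xlogid;   /* log file number */
    uint32_t xrecoff;  /* byte offset within that log file */
} WalPtr;

/*
 * Hypothetical helper: compute the buffer slot that the assertion compares
 * against nextidx. cache_blck plays the role of XLogCtl->XLogCacheBlck,
 * i.e. the highest buffer index, so the ring has cache_blck + 1 slots.
 */
static uint64_t
page_end_to_buf_idx(WalPtr page_end, uint32_t cache_blck)
{
    /* Linear byte position of the last byte on the page. */
    uint64_t last_byte = (uint64_t) page_end.xlogid
        * (uint64_t) SEGS_PER_FILE * SEG_SIZE
        + page_end.xrecoff - 1;

    /* Page number, wrapped around the buffer ring. */
    return (last_byte / BLCKSZ_WAL) % ((uint64_t) cache_blck + 1);
}
```

With an assumed ring of 8 buffers (cache_blck = 7), a page ending at
0/4000 maps to slot 1. The assertion fires when the slot computed this way
from NewPageEndPtr differs from nextidx, which is what one would expect if
the insert position after the xlog switch is not where the buffer
bookkeeping thinks it is.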

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
