From: Hans-Juergen Schoenig <postgres(at)cybertec(dot)at>
To: pgsql-patches(at)postgresql(dot)org
Subject: Endless recovery
Date: 2008-02-11 08:29:39
Message-ID: 12208CE8-238D-4D39-948A-F67427B891D1@cybertec.at
Lists: pgsql-patches

Last week we saw a problem on a horribly configured machine.
The disk filled up (bad FSM ;) ) and once this happened the sysadmin
killed the system (-9).
After two days PostgreSQL had still not started up, so they tried to
restart it again and again, which meant the consistency check was
started over and over again (causing more and more downtime).
From the admin's point of view there was no way to tell whether the
machine was actually dead or still recovering.

Here is a small patch which issues a log message indicating that the
recovery process can take ages.
Maybe this can prevent some admins from interrupting the recovery
process.
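
To give an idea, the message is along these lines (sketch only -- for
the real wording and placement see the attached diff):

    ereport(LOG,
            (errmsg("database recovery in progress; this can take a long time, do not interrupt it")));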

In our case, the recovery process took 3.5 days!

best regards,

hans

--
Cybertec Schönig & Schönig GmbH
PostgreSQL Solutions and Support
Gröhrmühlgasse 26, 2700 Wiener Neustadt
Tel: +43/1/205 10 35 / 340
www.postgresql.at, www.cybertec.at


From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: "Hans-Juergen Schoenig" <postgres(at)cybertec(dot)at>
Cc: <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Endless recovery
Date: 2008-02-11 09:26:08
Message-ID: 47B014B0.3010400@enterprisedb.com
Lists: pgsql-patches

Hans-Juergen Schoenig wrote:
> Last week we saw a problem on a horribly configured machine.
> The disk filled up (bad FSM ;) ) and once this happened the sysadmin killed the
> system (-9).
> After two days PostgreSQL had still not started up, so they tried to restart it
> again and again, which meant the consistency check was started over and over
> again (causing more and more downtime).
> From the admin's point of view there was no way to tell whether the machine
> was actually dead or still recovering.
>
> Here is a small patch which issues a log message indicating that the recovery
> process can take ages.
> Maybe this can prevent some admins from interrupting the recovery process.

Wait, are you saying that the time was spent in the rm_cleanup phase?
That sounds unbelievable. Surely the time was spent in the redo phase, no?

> In our case, the recovery process took 3.5 days !!

That's a ridiculously long time. Was this a normal recovery, not a PITR
archive recovery? Any idea why the recovery took so long? Given the max.
checkpoint timeout of 1h, I would expect the recovery to take a maximum
of a few hours even with an extremely write-heavy workload.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Hans-Juergen Schoenig <postgres(at)cybertec(dot)at>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: Endless recovery
Date: 2008-02-11 09:44:20
Message-ID: 4F5A2B70-BB4F-4501-B420-4ADCDC17F268@cybertec.at
Lists: pgsql-patches


On Feb 11, 2008, at 10:26 AM, Heikki Linnakangas wrote:

> Hans-Juergen Schoenig wrote:
>> Last week we saw a problem on a horribly configured
>> machine.
>> The disk filled up (bad FSM ;) ) and once this happened the
>> sysadmin killed the system (-9).
>> After two days PostgreSQL had still not started up, so they tried
>> to restart it again and again, which meant the consistency
>> check was started over and over again (causing more and more
>> downtime).
>> From the admin's point of view there was no way to tell whether
>> the machine was actually dead or still recovering.
>> Here is a small patch which issues a log message indicating that
>> the recovery process can take ages.
>> Maybe this can prevent some admins from interrupting the recovery
>> process.
>
> Wait, are you saying that the time was spent in the rm_cleanup
> phase? That sounds unbelievable. Surely the time was spent in the
> redo phase, no?

It was a seek-heavy workload, with backtraces like this one:

testdb(at)testdb:/srv/daten$ gdb --batch -x gdb.cmd /usr/local/pgsql-8.1.11/bin/postmaster 15727
Using host libthread_db library "/lib/tls/i686/cmov/libthread_db.so.1".
[Thread debugging using libthread_db enabled]
[New Thread -1214167376 (LWP 15727)]
0xffffe410 in __kernel_vsyscall ()
#0 0xffffe410 in __kernel_vsyscall ()
#1 0xb7c55c53 in read () from /lib/tls/i686/cmov/libc.so.6
#2 0x081b11e2 in FileRead (file=54, buffer=0xa7bb7540 "1", amount=8192) at fd.c:1015
#3 0x081be33f in mdread (reln=0x8371e90, blocknum=6553044, buffer=0xa7bb7540 "1") at md.c:497
#4 0x081be88a in smgrread (reln=0x8371e90, blocknum=6553044, buffer=0xa7bb7540 "1") at smgr.c:527
#5 0x081ae4af in ReadBuffer (reln=0x834f0d0, blockNum=6553044) at bufmgr.c:252
#6 0x080b9869 in XLogReadBuffer (extend=0 '\0', reln=0x834f0d0, blkno=6553044) at xlogutils.c:57
#7 0x0808b3a2 in gistXLogReadAndLockBuffer (r=0x834f0d0, blkno=6553044) at gistxlog.c:530
#8 0x08087641 in gistFindPath (r=0x834f0d0, child=8558565, myReadBuffer=0x808b380 <gistXLogReadAndLockBuffer>) at gist.c:677
#9 0x0808c7bb in gist_xlog_cleanup () at gistxlog.c:549
#10 0x080b7c5c in StartupXLOG () at xlog.c:4769
#11 0x080c2635 in BootstrapMain (argc=4, argv=<value optimized out>) at bootstrap.c:428
#12 0x0819a2cc in StartChildProcess (xlop=2) at postmaster.c:3581
#13 0x0819cb38 in PostmasterMain (argc=1, argv=0x83275a8) at postmaster.c:931
#14 0x0815f909 in main (argc=1, argv=0x6a27e4b4) at main.c:265

redo was done fast ...

[2008-02-07 22:24:50 CET ]LOG: database system was interrupted while in recovery at 2008-02-04 11:09:04 CET
[2008-02-07 22:24:50 CET ]HINT: This probably means that some data is corrupted and you will have to use the last backup for recovery.
[2008-02-07 22:24:50 CET ]LOG: checkpoint record is at 35/3BAA131C
[2008-02-07 22:24:50 CET ]LOG: redo record is at 35/3BA11AB4; undo record is at 0/0; shutdown FALSE
[2008-02-07 22:24:50 CET ]LOG: next transaction ID: 194549334; next OID: 16586886
[2008-02-07 22:24:50 CET ]LOG: next MultiXactId: 1; next MultiXactOffset: 0
[2008-02-07 22:24:50 CET ]LOG: database system was not properly shut down; automatic recovery in progress
[2008-02-07 22:24:50 CET ]LOG: redo starts at 35/3BA11AB4
[2008-02-07 22:24:53 CET ]LOG: record with zero length at 35/3C8317C8
[2008-02-07 22:24:53 CET ]LOG: redo done at 35/3C8317A0

note that redo itself finished fast ...
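
a quick calculation from the LSNs above shows how little WAL redo
actually had to replay:

    0x3C8317A0 - 0x3BA11AB4 = 14,810,348 bytes, i.e. roughly 14 MB of WAL
    (less than a single 16 MB segment), which fits the ~3 seconds between
    "redo starts" and "redo done".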

>
>> In our case, the recovery process took 3.5 days !!
>
> That's a ridiculously long time. Was this a normal recovery, not a
> PITR archive recovery? Any idea why the recovery took so long?
> Given the max. checkpoint timeout of 1h, I would expect the
> recovery to take a maximum of a few hours even with an extremely
> write-heavy workload.
>

there was no PITR in place - just a normal DB ...
my first idea when they called me was that it had to be related to
checkpoint_segments - maybe some 2 million segments and some insanely
long timeout. but no, this was not the case. segments were at 12 and the
timeout was just several minutes. basically from "outside" everything
looked fine ...
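
(rough numbers: with checkpoint_segments = 12 a checkpoint is forced at
the latest after about 12 x 16 MB = 192 MB of WAL, so crash recovery
should normally never have more than a couple of hundred MB to replay)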

we used a binary copy of the data on two boxes (one for debugging).
the entire process worked like a charm - it just took ages.
we saw a lot of random I/O here.

this was quite a small machine with insanely small memory settings.
no errors were issued during the process - all fine; just long ...

the DB version is 8.1.11. The entire DB is 116 GB. It is more or less
one table along with a 65 GB GiST index.

hans

--
Cybertec Schönig & Schönig GmbH
PostgreSQL Solutions and Support
Gröhrmühlgasse 26, 2700 Wiener Neustadt
Tel: +43/1/205 10 35 / 340
www.postgresql.at, www.cybertec.at


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Hans-Juergen Schoenig <postgres(at)cybertec(dot)at>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: Endless recovery
Date: 2008-02-11 10:02:37
Message-ID: 1202724157.4247.165.camel@ebony.site
Lists: pgsql-patches

On Mon, 2008-02-11 at 09:29 +0100, Hans-Juergen Schoenig wrote:
> Last week we saw a problem on a horribly configured
> machine.
> The disk filled up (bad FSM ;) ) and once this happened the sysadmin
> killed the system (-9).
> After two days PostgreSQL had still not started up, so they tried to
> restart it again and again, which meant the consistency check was
> started over and over again (causing more and more downtime).
> From the admin's point of view there was no way to tell whether the
> machine was actually dead or still recovering.

I'm sorry to hear about this problem.

Not sure we need a LOG message to warn people about the possible length
of recovery time. The chances of a recovery taking that much time seem
very low for normal Postgres, even with checkpoint parameters set at
their maximum values.

I note that the configuration section does not mention the likely
increase in recovery time that will result from setting those parameters
higher. That needs a patch. ISTM a serious omission that should be
treated as a bug and backpatched.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: "Hans-Juergen Schoenig" <postgres(at)cybertec(dot)at>, "Teodor Sigaev" <teodor(at)postgresql(dot)org>
Cc: <pgsql-patches(at)postgreSQL(dot)org>
Subject: Re: Endless recovery
Date: 2008-02-11 10:16:30
Message-ID: 47B0207E.6090607@enterprisedb.com
Lists: pgsql-patches

Hans-Juergen Schoenig wrote:
> this is the last info which was issued ...
>
> nothing in between ...
>
> during the rm_cleanup() nothing was logged into the logs. this is the
> last log from early this morning:
>
> [2008-02-11 03:45:16 CET ]LOG: lost parent for block 8558565
> [2008-02-11 03:45:16 CET ]LOG: index 1663/16384/16578435 needs VACUUM or REINDEX to finish crash recovery
> [2008-02-11 03:45:16 CET ]DETAIL: Incomplete insertion detected during crash replay.
> [2008-02-11 03:47:54 CET ]LOG: database system is ready
> [2008-02-11 03:47:54 CET ]LOG: transaction ID wrap limit is 1073742476, limited by database "blids"
>
> that's where it finished, nothing else was logged between the "redo
> done" and the last log messages

I bet you've bumped into a bug in the GiST redo code; the cleanup phase
shouldn't take long. It's just for completing any incomplete splits by
inserting pointers to upper-level pages, and there shouldn't be more
than a few of those active at any point in time.
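
In pseudo-code, the cleanup pass amounts to roughly this (just a sketch
to illustrate, not the actual 8.1 gistxlog.c):

    foreach(cell, incomplete_inserts)   /* in-memory list built during redo */
    {
        gistIncompleteInsert *insert = (gistIncompleteInsert *) lfirst(cell);

        /*
         * Re-find the parent of the orphaned child page by searching
         * the index from the root (gistFindPath in your backtrace),
         * then add the missing downlink(s) to the parent level.
         */
        ...
    }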

It looks like there have been quite a few changes to gistxlog.c that
haven't been back-patched. This one in particular might be relevant here:

> revision 1.15
> date: 2006-04-03 17:45:50 +0100; author: tgl; state: Exp; lines: +15 -13;
> Fix thinko in gistRedoPageUpdateRecord: if XLR_BKP_BLOCK_1 is set, we
> don't have anything to do to the page, but we still have to adjust the
> incomplete_inserts list that we're maintaining in memory.

Teodor, what do you think?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Hans-Juergen Schoenig <postgres(at)cybertec(dot)at>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: Endless recovery
Date: 2008-02-11 10:50:00
Message-ID: 1D1D7200-4C5C-46DA-A517-70CEB623208A@cybertec.at
Lists: pgsql-patches

>
> I'm sorry to hear about this problem.

No worries - nothing lost ...

>
> Not sure we need a LOG message to warn people about the possible
> length
> of recovery time. The chances of a recovery taking that much time seem
> very low for normal Postgres, even with checkpoint parameters set at
> their maximum values.

It should be quite low - this is what we expected.

>
> I note that the configuration section does not mention the likely
> increase in recovery time that will result from setting those
> parameters
> higher. That needs a patch. ISTM a serious omission that should be
> treated as a bug and backpatched.
>

I have that feeling too - it could very well be some GiST issue here,
but given what we saw in the debugger this was not too obvious.
At first glance it rather felt like a full check of the entire index.

best regards,

hans

--
Cybertec Schönig & Schönig GmbH
PostgreSQL Solutions and Support
Gröhrmühlgasse 26, 2700 Wiener Neustadt
Tel: +43/1/205 10 35 / 340
www.postgresql.at, www.cybertec.at


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Hans-Juergen Schoenig <postgres(at)cybertec(dot)at>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Endless recovery
Date: 2008-02-11 11:09:42
Message-ID: 1202728182.16770.20.camel@ebony.site
Lists: pgsql-patches

On Mon, 2008-02-11 at 10:44 +0100, Hans-Juergen Schoenig wrote:
> On Feb 11, 2008, at 10:26 AM, Heikki Linnakangas wrote:

> > Wait, are you saying that the time was spent in the rm_cleanup
> > phase? That sounds unbelievable. Surely the time was spent in the
> > redo phase, no?

> redo was done fast ...

...

> the DB version is 8.1.11. The entire DB is 116gb. It is more or less a
> table along with a 65 GB Gist index.

rmgrs might need to issue their own warnings, but the main recovery
system doesn't.

Slightly OT, but reading this I definitely want to split resource
managers into "essential" and "non-essential" groups, so we can come up
when essential operations are complete. I want the non-essential rmgrs
to be cancellable, plus give them the opportunity to rebuild corrupt
data structures without interfering with other aspects of the DB.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Hans-Juergen Schoenig <postgres(at)cybertec(dot)at>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: Endless recovery
Date: 2008-02-11 11:17:49
Message-ID: 1202728669.16770.24.camel@ebony.site
Lists: pgsql-patches

On Mon, 2008-02-11 at 11:50 +0100, Hans-Juergen Schoenig wrote:

> I have that feeling too - it could very well be some GiST issue here,
> but given what we saw in the debugger this was not too obvious.
> At first glance it rather felt like a full check of the entire index.

Yeah, I'm happy for resource managers to do heavy lifting when they think
it's the right thing to do (only they know); I just want to make sure
that doesn't affect anybody else. (Thinking of the future, not as a bug
fix.)

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Hans-Juergen Schoenig" <postgres(at)cybertec(dot)at>
Cc: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Teodor Sigaev" <teodor(at)postgresql(dot)org>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Endless recovery
Date: 2008-02-11 16:54:08
Message-ID: 2882.1202748848@sss.pgh.pa.us
Lists: pgsql-patches

"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> writes:
> Hans-Juergen Schoenig wrote:
>> that's where it finished, nothing else was logged between the "redo
>> done" and the last log messages

> I bet you've bumped into a bug in gist redo code, the cleanup phase
> shouldn't take long.

That's what it smells like to me too. Any chance that you still have
the WAL log for examination?

regards, tom lane


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Hans-Juergen Schoenig <postgres(at)cybertec(dot)at>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Endless recovery
Date: 2008-03-05 17:00:57
Message-ID: 200803051700.m25H0vu24693@momjian.us
Lists: pgsql-patches

Simon Riggs wrote:
> On Mon, 2008-02-11 at 09:29 +0100, Hans-Juergen Schoenig wrote:
> > Last week we saw a problem on a horribly configured
> > machine.
> > The disk filled up (bad FSM ;) ) and once this happened the sysadmin
> > killed the system (-9).
> > After two days PostgreSQL had still not started up, so they tried to
> > restart it again and again, which meant the consistency check was
> > started over and over again (causing more and more downtime).
> > From the admin's point of view there was no way to tell whether the
> > machine was actually dead or still recovering.
>
> I'm sorry to hear about this problem.
>
> Not sure we need a LOG message to warn people about the possible length
> of recovery time. The chances of a recovery taking that much time seem
> very low for normal Postgres, even with checkpoint parameters set at
> their maximum values.
>
> I note that the configuration section does not mention the likely
> increase in recovery time that will result from setting those parameters
> higher. That needs a patch. ISTM a serious omission that should be
> treated as a bug and backpatched.

Patch attached and applied, and backpatched to 8.3.X.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

Attachment Content-Type Size
/rtmp/diff text/x-diff 1.6 KB