Re: [GENERAL] PANIC: heap_update_redo: no block

From: "Alex bahdushka" <bahdushka(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: PANIC: heap_update_redo: no block
Date: 2006-03-19 06:55:35
Message-ID: e0bf43760603182255q7eee6109n3f2afa0b10d3af71@mail.gmail.com
Lists: pgsql-general pgsql-hackers

Hi.

Upon rebooting one of our main production database servers, we were
greeted with this:

(@)<2006-03-18 23:30:32.687 MST>[3791]LOG: database system was
interrupted while in recovery at 2006-03-18 23:30:26 MST
(@)<2006-03-18 23:30:32.687 MST>[3791]HINT: This probably means that
some data is corrupted and you will have to use the last backup for
recovery.
(@)<2006-03-18 23:30:32.688 MST>[3791]LOG: checkpoint record is at D/1919D5F0
(@)<2006-03-18 23:30:32.688 MST>[3791]LOG: redo record is at
D/191722C8; undo record is at 0/0; shutdown FALSE
(@)<2006-03-18 23:30:32.688 MST>[3791]LOG: next transaction ID:
81148900; next OID: 16566476
(@)<2006-03-18 23:30:32.688 MST>[3791]LOG: next MultiXactId: 1; next
MultiXactOffset: 0
(@)<2006-03-18 23:30:32.689 MST>[3791]LOG: database system was not
properly shut down; automatic recovery in progress
(@)<2006-03-18 23:30:33.032 MST>[3791]LOG: redo starts at D/191722C8
(@)<2006-03-18 23:30:33.035 MST>[3791]PANIC: heap_update_redo: no block
(@)<2006-03-18 23:30:33.036 MST>[3790]LOG: startup process (PID 3791)
was terminated by signal 6
(@)<2006-03-18 23:30:33.036 MST>[3790]LOG: aborting startup due to
startup process failure

As far as I know, PostgreSQL was shut down properly before the
reboot (pg_ctl stop -m fast). Though I'm not positive; I was only
brought in when they couldn't figure out why PostgreSQL would not
start. Any ideas as to how this happened or how to fix it?

Right now I'm copying over the database, and then going to try
pg_resetxlog. Just to make sure: the only data that could be lost is
whatever is/would be in the xlog, right? So I don't need to go
looking at all the tables, just the ones I know were modified around then.

Are there any other solutions that don't involve possibly losing data?
(Yes, I know: backups. Unfortunately the last backup was about 2 hours
ago and is not as up to date as I would like.)

Just curious: I've also been investigating PITR instead of doing
backups every 2 hours. If this problem were to surface while I was
using PITR as a backup solution, would all my data then be hosed (or
at least whatever pg_resetxlog cannot restore)?


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Alex bahdushka <bahdushka(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: PANIC: heap_update_redo: no block
Date: 2006-03-19 09:56:46
Message-ID: 20060319095646.GA30346@svana.org

On Sat, Mar 18, 2006 at 11:55:35PM -0700, Alex bahdushka wrote:
> Hi.
>
> Upon rebooting one of our main production database servers, we were
> greeted with this:

To help you at all, we *really* need to know what version and platform
you're running. In particular, are you running the most recent release
of your branch? There have been bug fixes related to WAL recovery in
some versions...

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.


From: "Alex bahdushka" <bahdushka(at)gmail(dot)com>
To: "Martijn van Oosterhout" <kleptog(at)svana(dot)org>, "Alex bahdushka" <bahdushka(at)gmail(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: PANIC: heap_update_redo: no block
Date: 2006-03-19 16:57:50
Message-ID: e0bf43760603190857l17a758b6h1c49d80b8f06ba90@mail.gmail.com

On 3/19/06, Martijn van Oosterhout <kleptog(at)svana(dot)org> wrote:
> On Sat, Mar 18, 2006 at 11:55:35PM -0700, Alex bahdushka wrote:
> > Hi.
> >
> > Upon rebooting one of our main production database servers, we were
> > greeted with this:
>
> To help you at all, we *really* need to know what version and platform
> you're running. In particular, are you running the most recent release
> of your branch. There have been bug fixes related to WAL recovery in
> some versions...

Ahh, of course! Sorry!

select version();
version
--------------------------------------------------------------------------------------------------------------------
PostgreSQL 8.1.3 on i686-pc-linux-gnu, compiled by GCC gcc-3.4 (GCC)
3.4.4 20050314 (prerelease) (Debian 3.4.3-13)



From: "Alex bahdushka" <bahdushka(at)gmail(dot)com>
To: "Martijn van Oosterhout" <kleptog(at)svana(dot)org>, "Alex bahdushka" <bahdushka(at)gmail(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: PANIC: heap_update_redo: no block
Date: 2006-03-19 17:37:38
Message-ID: e0bf43760603190937j683dcen2197e4262553222e@mail.gmail.com

On 3/19/06, Alex bahdushka <bahdushka(at)gmail(dot)com> wrote:
> On 3/19/06, Martijn van Oosterhout <kleptog(at)svana(dot)org> wrote:
> > On Sat, Mar 18, 2006 at 11:55:35PM -0700, Alex bahdushka wrote:
> > > Hi.
> > >
> > > Upon rebooting one of our main production database servers, we were
> > > greeted with this:
> >
> > To help you at all, we *really* need to know what version and platform
> > you're running. In particular, are you running the most recent release
> > of your branch. There have been bug fixes related to WAL recovery in
> > some versions...
>
> Ahh of course! sorry!
>
> select version();
> version
> --------------------------------------------------------------------------------------------------------------------
> PostgreSQL 8.1.3 on i686-pc-linux-gnu, compiled by GCC gcc-3.4 (GCC)
> 3.4.4 20050314 (prerelease) (Debian 3.4.3-13)
>

After doing some more digging, it looks like that server was missing
the appropriate Kpostgresql symlink in /etc/rc0.d/. So upon shutdown
(shutdown -h now), my guess is it got a SIGTERM (you know, where it
says "Sending all processes the TERM signal" or whatever), then init
waited 5 seconds (or whatever the timeout is) and sent a SIGKILL.

If PostgreSQL took longer to shut down than that timeout, it would
have been given a SIGKILL and then the server turned off... Could that
do it? (Not to mention I don't remember exactly when the file systems
get unmounted, before or after those signals are sent; I think it's
before, so it might have remounted the filesystem read-only (it
couldn't have unmounted it because it was in use by PostgreSQL).)

I'm mainly asking because I would love for this to be user error. It
scares the hell out of me (and my boss, obviously). Though I must say,
for the 2+ years we have been using PostgreSQL it has proven to be very
stable, robust and fast.

Thanks!

<snipped>


From: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: PANIC: heap_update_redo: no block
Date: 2006-03-20 11:54:17
Message-ID: dvm5bj$12l$1@news.hub.org


"Alex bahdushka" <bahdushka(at)gmail(dot)com> wrote
>
> After doing some more digging, it looks like that server was missing
> the appropriate Kpostgresql symlink in /etc/rc0.d/. So upon shutdown
> (shutdown -h now)... my guess is it got a sigterm (you know where it
> says Sending all processes a TERM signal or whatever), then it (init)
> waited 5 seconds or whatever the timeout is and sent a sigkill.
>
> If postgresql took longer to shutdown than that timeout and so was
> then given a sigkill and then server turned off.... Could that do it?
>

I don't believe this explanation, actually. According to the startup
message, the error "heap_update_redo: no block" most likely happens
when PostgreSQL tries to read an existing block but finds that the file
is not long enough to contain it. How could a SIGKILL truncate a data
file like that?

Regards,
Qingqing


From: "Alex bahdushka" <bahdushka(at)gmail(dot)com>
To: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: PANIC: heap_update_redo: no block
Date: 2006-03-20 17:54:46
Message-ID: e0bf43760603200954p57dc0e83x6f956895fa3df957@mail.gmail.com

On 3/20/06, Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu> wrote:
>
> ""Alex bahdushka"" <bahdushka(at)gmail(dot)com> wrote
> >
> > After doing some more digging, it looks like that server was missing
> > the appropriate Kpostgresql symlink in /etc/rc0.d/. So upon shutdown
> > (shutdown -h now)... my guess is it got a sigterm (you know where it
> > says Sending all processes a TERM signal or whatever), then it (init)
> > waited 5 seconds or whatever the timeout is and sent a sigkill.
> >
> > If postgresql took longer to shutdown than that timeout and so was
> > then given a sigkill and then server turned off.... Could that do it?
> >
>
> I don't believe in this explaination actually. According the startup
> message, the error "heap_update_redo: no block" could most possibly happen
> when PostgreSQL tried to read an existing block but found that the file
> length is not long enough to have it. How could a SIGKILL truncate a data
> file like that?
>

Hrm... well, I have obviously restored the database by now (using
pg_resetxlog; pg_dump; initdb; pg_restore). However, I did make a
backup of the broken directory before I created the new database. If
anyone has anything they would like me to try, to help track down this
possible bug, I would be more than glad to do it.

Since it sounds like something is wrong with the xlog, here are the
contents of the dir... I'm not sure how useful this is, but here it is
anyway.

pg_xlog# du -ak
16404 ./000000010000000D00000022
16404 ./000000010000000D0000001E
16404 ./000000010000000D00000019
16404 ./000000010000000D0000001A
16404 ./000000010000000D0000001D
16404 ./000000010000000D0000001C
16404 ./000000010000000D00000020
16404 ./000000010000000D00000021
16404 ./000000010000000D0000001B
4 ./archive_status
16404 ./000000010000000D00000023
16404 ./000000010000000D0000001F
16404 ./000000010000000D00000018
196856 .

They are all the same size, so it does not look like a truncated
file... Or am I just misinterpreting the error message, and it's one of
the files elsewhere?

The file system is ext3 and it fscked fine, and nothing is in the
lost+found dir. As far as I know the computer was allowed to sync
its buffers to disk before the reboot (the plug was not pulled or
anything).

Any ideas?

Otherwise it sounds like I'll just have to chalk this one up to the
gods, and hope it's fixed in 8.1.4...


From: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: PANIC: heap_update_redo: no block
Date: 2006-03-21 09:35:46
Message-ID: dvohju$gaf$1@news.hub.org


"Alex bahdushka" <bahdushka(at)gmail(dot)com> wrote
>
> Hrm... well i obviously have restored the database by now (using
> pg_resetxlog; pg_dump; initdb; pg_restore). However i did make a
> backup of the broken directory before I created the new database. If
> anyone has any thing they would like me to try to possibly help track
> down this possible bug. I would be more than glad to do it.
>

pg_resetxlog is a last resort that "avoids" the real problem rather
than solving it. Once you reset, the xlog past a certain offset will
not get replayed (so the problem disappears), and possibly some data
gets lost :-(.

Can you patch heap_xlog_update() in src/backend/access/heap/heapam.c
like this:

-    elog(PANIC, "heap_update_redo: no block");
+    elog(PANIC, "heap_update_redo: no block: target block: %u, relation length: %u",
+         ItemPointerGetBlockNumber(&(xlrec->target.tid)),
+         RelationGetNumberOfBlocks(reln));

and restart your database to see what the output is?

Regards,
Qingqing


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: PANIC: heap_update_redo: no block
Date: 2006-03-21 17:07:48
Message-ID: 17851.1142960868@sss.pgh.pa.us

"Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu> writes:
> Can you patch the heap/heapam.c/heap_xlog_update() like this:

> - elog(PANIC, "heap_update_redo: no block");
> + elog(PANIC, "heap_update_redo: no block: target block: %u, relation
> length: %u",
> + ItemPointerGetBlockNumber(&(xlrec->target.tid)),
> + RelationGetNumberOfBlocks(reln));

> And restart your database to see what's the output?

While at it, you should extend the error message to include the relation
ID, so you have some idea which table is affected ... this is certainly
not a very informative message ...

regards, tom lane


From: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: PANIC: heap_update_redo: no block
Date: 2006-03-22 03:18:40
Message-ID: dvqfso$1mgg$1@news.hub.org


"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote
>
> While at it, you should extend the error message to include the relation
> ID, so you have some idea which table is affected ... this is certainly
> not a very informative message ...
>

Exactly. Please use the following version:

-    elog(PANIC, "heap_update_redo: no block");
+    elog(PANIC, "heap_update_redo: no block: target blcknum: %u, relation(%u/%u/%u) length: %u",
+         ItemPointerGetBlockNumber(&(xlrec->target.tid)),
+         reln->rd_node.spcNode,
+         reln->rd_node.dbNode,
+         reln->rd_node.relNode,
+         RelationGetNumberOfBlocks(reln));

BTW: I just realized there is another (better) way to do this: enable
WAL_DEBUG in xlog.h and SET XLOG_DEBUG=true. That is also why we don't
have much error detail in the xlog redo messages.
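
For anyone trying this at home, the two knobs look roughly like this (a sketch; WAL_DEBUG is a compile-time symbol, so the wal_debug GUC only exists in builds where it was defined):

```
/* compile time, in src/include/access/xlog.h (or -DWAL_DEBUG in CFLAGS): */
#define WAL_DEBUG

# run time, in postgresql.conf (only present in WAL_DEBUG builds):
wal_debug = true
```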

Regards,
Qingqing


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: PANIC: heap_update_redo: no block
Date: 2006-03-22 03:56:02
Message-ID: 9709.1142999762@sss.pgh.pa.us

"Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu> writes:
> BTW: I just realized that there is another (better) way to do so is to
> enable WAL_DEBUG in xlog.h and SET XLOG_DEBUG=true. And that's why we don't
> have much error message in xlog redo.

That was probably Vadim's original reasoning for not being very verbose
in the redo routines' various PANIC messages. But for failures in the
field it'd be awfully nice to be able to see this info from a standard
build, so I'm thinking we should improve the elog messages. If you feel
like creating a patch I'll be glad to apply it ...

regards, tom lane


From: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: PANIC: heap_update_redo: no block
Date: 2006-03-22 04:32:32
Message-ID: dvqk7f$2980$1@news.hub.org


"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote
> "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu> writes:
> > BTW: I just realized that there is another (better) way to do so is to
> > enable WAL_DEBUG in xlog.h and SET XLOG_DEBUG=true. And that's why we
> > don't have much error message in xlog redo.
>
> That was probably Vadim's original reasoning for not being very verbose
> in the redo routines' various PANIC messages. But for failures in the
> field it'd be awfully nice to be able to see this info from a standard
> build, so I'm thinking we should improve the elog messages. If you feel
> like creating a patch I'll be glad to apply it ...
>

So there are three ways to do it:
(1) Enable WAL_DEBUG by default.
It is then the user's responsibility to set XLOG_DEBUG to print verbose
information at error time. This adds no cost during normal running, but
the problem is that too much information (only the last bit of which is
useful) may pollute the log file.

(2) Print verbose information after an error.
We can change StartupXLOG like this:

PG_TRY();
{
    RmgrTable[record->xl_rmid].rm_redo(EndRecPtr, record);
}
PG_CATCH();
{
    RmgrTable[record->xl_rmid].rm_desc(buf, record->xl_info,
                                       XLogRecGetData(record));
    abort();
}
PG_END_TRY();

Also, change errfinish() so that on a PANIC while InRecovery it does a
PG_RE_THROW(). The problem is this looks like a bit of a hack.

(3) Replace the elog() in every xlog_ABC_redo() with some xlog_elog(),
so the description information can be appended automatically.

I vote for method 2.

Regards,
Qingqing


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: PANIC: heap_update_redo: no block
Date: 2006-03-22 20:06:18
Message-ID: 19159.1143057978@sss.pgh.pa.us

"Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu> writes:
> (2) print verbose information after errror

If you want to do it that way, the correct thing is for the WAL replay
logic to add an error context hook to print the current WAL record,
not invent weird hacks in the error processing logic.
Compare the way bufmgr.c reports errors during smgrwrite() operations
(buffer_write_error_callback()).

One problem with the approach is that I'm not sure how robust the WAL
record description routines really are. They seem at the least
vulnerable to buffer-overflow issues. A crash while trying to describe
the current record wouldn't be a net improvement :-(. I'd kind of
want to modify them to have a more robust API, eg write into a
StringInfo instead of an unspecified-size buffer.

regards, tom lane


From: "Alex bahdushka" <bahdushka(at)gmail(dot)com>
To: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: PANIC: heap_update_redo: no block
Date: 2006-03-22 21:17:38
Message-ID: e0bf43760603221317j79910b34nad3205c29eb3778d@mail.gmail.com

On 3/21/06, Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu> wrote:
>
> "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote
> >
> > While at it, you should extend the error message to include the relation
> > ID, so you have some idea which table is affected ... this is certainly
> > not a very informative message ...
> >
>
> Exactly. Please use the following version:
>
> - elog(PANIC, "heap_update_redo: no block");
> + elog(PANIC, "heap_update_redo: no block: target blcknum: %u,
> relation(%u/%u/%u) length: %u",
> + ItemPointerGetBlockNumber(&(xlrec->target.tid)),
> + reln->rd_node.spcNode,
> + reln->rd_node.dbNode,
> + reln->rd_node.relNode,
> + RelationGetNumberOfBlocks(reln));
>
> BTW: I just realized that there is another (better) way to do so is to
> enable WAL_DEBUG in xlog.h and SET XLOG_DEBUG=true. And that's why we don't
> have much error message in xlog redo.
>

Hrm, OK, I'll see about doing either the patch or setting wal_debug to
true (or both). However, I'm currently on vacation till Saturday, so
I'll do this first thing when I'm back and report the results. Thank
you very much!



From: "Alex bahdushka" <bahdushka(at)gmail(dot)com>
To: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: PANIC: heap_update_redo: no block
Date: 2006-03-26 03:57:22
Message-ID: e0bf43760603251957r5b10e334j33149e3c20a0e8d8@mail.gmail.com

<snip>

> Hrm ok ill see about doing either the patch or setting wal_debug to
> true (or both). However im currently on vacation till Saturday so ill
> do this first thing then and report the results back. Thank you very
> much!

OK, here is the output with wal_debug set to 1 in the config and the
final patch supplied by Qingqing.

(@)<2006-03-25 20:54:17.509 MST>[26571]LOG: database system was
interrupted while in recovery at 2006-03-25 20:51:58 MST
(@)<2006-03-25 20:54:17.509 MST>[26571]HINT: This probably means that
some data is corrupted and you will have to use the last backup for
recovery.
(@)<2006-03-25 20:54:17.509 MST>[26571]LOG: checkpoint record is at D/1919D5F0
(@)<2006-03-25 20:54:17.509 MST>[26571]LOG: redo record is at
D/191722C8; undo record is at 0/0; shutdown FALSE
(@)<2006-03-25 20:54:17.509 MST>[26571]LOG: next transaction ID:
81148900; next OID: 16566476
(@)<2006-03-25 20:54:17.509 MST>[26571]LOG: next MultiXactId: 1; next
MultiXactOffset: 0
(@)<2006-03-25 20:54:17.509 MST>[26571]LOG: database system was not
properly shut down; automatic recovery in progress
(@)<2006-03-25 20:54:17.524 MST>[26571]LOG: redo starts at D/191722C8
(@)<2006-03-25 20:54:17.524 MST>[26571]LOG: REDO @ D/191722C8; LSN
D/19172678: prev D/191722A0; xid 81148908: Heap - update: rel
1663/16386/2619; tid
(@)<2006-03-25 20:54:17.524 MST>[26571]LOG: REDO @ D/19172678; LSN
D/191726B8: prev D/191722C8; xid 81148908: Btree - insert: rel
1663/16386/2696; tid
(@)<2006-03-25 20:54:17.524 MST>[26571]LOG: REDO @ D/191726B8; LSN
D/19172788: prev D/19172678; xid 81148908: Heap - update: rel
1663/16386/2619; tid
(@)<2006-03-25 20:54:17.524 MST>[26571]LOG: REDO @ D/19172788; LSN
D/191727C8: prev D/191726B8; xid 81148908: Btree - insert: rel
1663/16386/2696; tid
(@)<2006-03-25 20:54:17.524 MST>[26571]LOG: REDO @ D/191727C8; LSN
D/19172A30: prev D/19172788; xid 81148908: Heap - update: rel
1663/16386/2619; tid
(@)<2006-03-25 20:54:17.524 MST>[26571]LOG: REDO @ D/19172A30; LSN
D/19172A70: prev D/191727C8; xid 81148908: Btree - insert: rel
1663/16386/2696; tid
(@)<2006-03-25 20:54:17.524 MST>[26571]LOG: REDO @ D/19172A70; LSN
D/19172F68: prev D/19172A30; xid 81148908: Heap - update: rel
1663/16386/2619; tid
(@)<2006-03-25 20:54:17.524 MST>[26571]LOG: REDO @ D/19172F68; LSN
D/19172FA8: prev D/19172A70; xid 81148908: Btree - insert: rel
1663/16386/2696; tid
(@)<2006-03-25 20:54:17.524 MST>[26571]LOG: REDO @ D/19172FA8; LSN
D/19172FD0: prev D/19172F68; xid 81148908: Transaction - commit:
2006-03-18 22:51:5
(@)<2006-03-25 20:54:17.525 MST>[26571]LOG: REDO @ D/19172FD0; LSN
D/19173088: prev D/19172FA8; xid 81148942: Heap - insert: rel
1663/16386/2619; tid
(@)<2006-03-25 20:54:17.525 MST>[26571]LOG: REDO @ D/19173088; LSN
D/191730C8: prev D/19172FD0; xid 81148942: Btree - insert: rel
1663/16386/2696; tid
(@)<2006-03-25 20:54:17.525 MST>[26571]LOG: REDO @ D/191730C8; LSN
D/191731E8: prev D/19173088; xid 81148942: Heap - insert: rel
1663/16386/2619; tid
(@)<2006-03-25 20:54:17.525 MST>[26571]LOG: REDO @ D/191731E8; LSN
D/19173228: prev D/191730C8; xid 81148942: Btree - insert: rel
1663/16386/2696; tid
(@)<2006-03-25 20:54:17.525 MST>[26571]LOG: REDO @ D/19173228; LSN
D/19173250: prev D/191731E8; xid 81148942: Transaction - commit:
2006-03-18 22:51:5
(@)<2006-03-25 20:54:17.525 MST>[26571]LOG: REDO @ D/19173250; LSN
D/19173414: prev D/19173228; xid 81148944: Heap - update: rel
1663/16386/2619; tid
(@)<2006-03-25 20:54:17.525 MST>[26571]LOG: REDO @ D/19173414; LSN
D/19173454: prev D/19173250; xid 81148944: Btree - insert: rel
1663/16386/2696; tid
(@)<2006-03-25 20:54:17.525 MST>[26571]LOG: REDO @ D/19173454; LSN
D/19173760: prev D/19173414; xid 81148944: Heap - update: rel
1663/16386/2619; tid
(@)<2006-03-25 20:54:17.525 MST>[26571]LOG: REDO @ D/19173760; LSN
D/191737A0: prev D/19173454; xid 81148944: Btree - insert: rel
1663/16386/2696; tid
(@)<2006-03-25 20:54:17.525 MST>[26571]LOG: REDO @ D/191737A0; LSN
D/191737C8: prev D/19173760; xid 81148944: Transaction - commit:
2006-03-18 22:51:5
(@)<2006-03-25 20:54:17.525 MST>[26571]LOG: REDO @ D/191737C8; LSN
D/1917380C: prev D/191737A0; xid 81148946: Heap - update: rel
1663/16386/16865; tid
(@)<2006-03-25 20:54:17.525 MST>[26571]LOG: REDO @ D/1917380C; LSN
D/19173834: prev D/191737C8; xid 81148946: Transaction - commit:
2006-03-18 22:51:5
(@)<2006-03-25 20:54:17.525 MST>[26571]LOG: REDO @ D/19173834; LSN
D/191739B8: prev D/1917380C; xid 81148948: Heap - insert: rel
1663/16386/2619; tid
(@)<2006-03-25 20:54:17.525 MST>[26571]LOG: REDO @ D/191739B8; LSN
D/191739F8: prev D/19173834; xid 81148948: Btree - insert: rel
1663/16386/2696; tid
(@)<2006-03-25 20:54:17.525 MST>[26571]LOG: REDO @ D/191739F8; LSN
D/19173AE4: prev D/191739B8; xid 81148948: Heap - insert: rel
1663/16386/2619; tid
(@)<2006-03-25 20:54:17.525 MST>[26571]LOG: REDO @ D/19173AE4; LSN
D/19173B24: prev D/191739F8; xid 81148948: Btree - insert: rel
1663/16386/2696; tid
(@)<2006-03-25 20:54:17.525 MST>[26571]LOG: REDO @ D/19173B24; LSN
D/19173B4C: prev D/19173AE4; xid 81148948: Transaction - commit:
2006-03-18 22:51:5
(@)<2006-03-25 20:54:17.525 MST>[26571]LOG: REDO @ D/19173B4C; LSN
D/19173C38: prev D/19173B24; xid 81148960: Heap - insert: rel
1663/16386/2619; tid
(@)<2006-03-25 20:54:17.525 MST>[26571]LOG: REDO @ D/19173C38; LSN
D/19173C78: prev D/19173B4C; xid 81148960: Btree - insert: rel
1663/16386/2696; tid
(@)<2006-03-25 20:54:17.525 MST>[26571]LOG: REDO @ D/19173C78; LSN
D/19173EFC: prev D/19173C38; xid 81148960: Heap - insert: rel
1663/16386/2619; tid
(@)<2006-03-25 20:54:17.526 MST>[26571]LOG: REDO @ D/19173EFC; LSN
D/19173F3C: prev D/19173C78; xid 81148960: Btree - insert: rel
1663/16386/2696; tid
(@)<2006-03-25 20:54:17.526 MST>[26571]LOG: REDO @ D/19173F3C; LSN
D/19173F64: prev D/19173EFC; xid 81148960: Transaction - commit:
2006-03-18 22:51:5
(@)<2006-03-25 20:54:17.526 MST>[26571]LOG: REDO @ D/19173F64; LSN
D/19174064: prev D/19173F3C; xid 81148962: Heap - insert: rel
1663/16386/2619; tid
(@)<2006-03-25 20:54:17.526 MST>[26571]LOG: REDO @ D/19174064; LSN
D/191740A4: prev D/19173F64; xid 81148962: Btree - insert: rel
1663/16386/2696; tid
(@)<2006-03-25 20:54:17.526 MST>[26571]LOG: REDO @ D/191740A4; LSN
D/19174294: prev D/19174064; xid 81148962: Heap - insert: rel
1663/16386/2619; tid
(@)<2006-03-25 20:54:17.526 MST>[26571]LOG: REDO @ D/19174294; LSN
D/191742D4: prev D/191740A4; xid 81148962: Btree - insert: rel
1663/16386/2696; tid
(@)<2006-03-25 20:54:17.526 MST>[26571]LOG: REDO @ D/191742D4; LSN
D/191742FC: prev D/19174294; xid 81148962: Transaction - commit:
2006-03-18 22:51:5
(@)<2006-03-25 20:54:17.526 MST>[26571]LOG: REDO @ D/191742FC; LSN
D/191743E8: prev D/191742D4; xid 81148964: Heap - insert: rel
1663/16386/2619; tid
(@)<2006-03-25 20:54:17.526 MST>[26571]LOG: REDO @ D/191743E8; LSN
D/19174428: prev D/191742FC; xid 81148964: Btree - insert: rel
1663/16386/2696; tid
(@)<2006-03-25 20:54:17.526 MST>[26571]LOG: REDO @ D/19174428; LSN
D/191746FC: prev D/191743E8; xid 81148964: Heap - insert: rel
1663/16386/2619; tid
(@)<2006-03-25 20:54:17.526 MST>[26571]LOG: REDO @ D/191746FC; LSN
D/1917473C: prev D/19174428; xid 81148964: Btree - insert: rel
1663/16386/2696; tid
(@)<2006-03-25 20:54:17.526 MST>[26571]LOG: REDO @ D/1917473C; LSN
D/19174764: prev D/191746FC; xid 81148964: Transaction - commit:
2006-03-18 22:51:5
(@)<2006-03-25 20:54:17.526 MST>[26571]LOG: REDO @ D/19174764; LSN
D/19174824: prev D/1917473C; xid 81148966: Heap - insert: rel
1663/16386/2619; tid
(@)<2006-03-25 20:54:17.526 MST>[26571]LOG: REDO @ D/19174824; LSN
D/19174864: prev D/19174764; xid 81148966: Btree - insert: rel
1663/16386/2696; tid
(@)<2006-03-25 20:54:17.526 MST>[26571]LOG: REDO @ D/19174864; LSN
D/191749A4: prev D/19174824; xid 81148966: Heap - insert: rel
1663/16386/2619; tid
(@)<2006-03-25 20:54:17.526 MST>[26571]LOG: REDO @ D/191749A4; LSN
D/191749E4: prev D/19174864; xid 81148966: Btree - insert: rel
1663/16386/2696; tid
(@)<2006-03-25 20:54:17.526 MST>[26571]LOG: REDO @ D/191749E4; LSN
D/19174A0C: prev D/191749A4; xid 81148966: Transaction - commit:
2006-03-18 22:51:5
(@)<2006-03-25 20:54:17.526 MST>[26571]LOG: REDO @ D/19174A0C; LSN
D/19174D18: prev D/191749E4; xid 81148968: Heap - update: rel
1663/16386/2619; tid
(@)<2006-03-25 20:54:17.526 MST>[26571]LOG: REDO @ D/19174D18; LSN
D/19174D58: prev D/19174A0C; xid 81148968: Btree - insert: rel
1663/16386/2696; tid
(@)<2006-03-25 20:54:17.526 MST>[26571]LOG: REDO @ D/19174D58; LSN
D/19174E60: prev D/19174D18; xid 81148968: Heap - update: rel
1663/16386/2619; tid
(@)<2006-03-25 20:54:17.527 MST>[26571]LOG: REDO @ D/19174E60; LSN
D/19174EA0: prev D/19174D58; xid 81148968: Btree - insert: rel
1663/16386/2696; tid
(@)<2006-03-25 20:54:17.527 MST>[26571]LOG: REDO @ D/19174EA0; LSN
D/19174EC8: prev D/19174E60; xid 81148968: Transaction - commit:
2006-03-18 22:52:0
(@)<2006-03-25 20:54:17.527 MST>[26571]LOG: REDO @ D/19174EC8; LSN
D/1917508C: prev D/19174EA0; xid 81148970: Heap - update: rel
1663/16386/2619; tid
(@)<2006-03-25 20:54:17.527 MST>[26571]LOG: REDO @ D/1917508C; LSN
D/191750CC: prev D/19174EC8; xid 81148970: Btree - insert: rel
1663/16386/2696; tid
(@)<2006-03-25 20:54:17.527 MST>[26571]LOG: REDO @ D/191750CC; LSN
D/19175194: prev D/1917508C; xid 81148970: Heap - update: rel
1663/16386/2619; tid
(@)<2006-03-25 20:54:17.527 MST>[26571]LOG: REDO @ D/19175194; LSN
D/191751D4: prev D/191750CC; xid 81148970: Btree - insert: rel
1663/16386/2696; tid
(@)<2006-03-25 20:54:17.527 MST>[26571]LOG: REDO @ D/191751D4; LSN
D/191752C8: prev D/19175194; xid 81148970: Heap - update: rel
1663/16386/2619; tid
(@)<2006-03-25 20:54:17.527 MST>[26571]LOG: REDO @ D/191752C8; LSN
D/19175308: prev D/191751D4; xid 81148970: Btree - insert: rel
1663/16386/2696; tid
(@)<2006-03-25 20:54:17.527 MST>[26571]LOG: REDO @ D/19175308; LSN
D/191753D0: prev D/191752C8; xid 81148970: Heap - update: rel
1663/16386/2619; tid
(@)<2006-03-25 20:54:17.527 MST>[26571]LOG: REDO @ D/191753D0; LSN
D/19175410: prev D/19175308; xid 81148970: Btree - insert: rel
1663/16386/2696; tid
(@)<2006-03-25 20:54:17.527 MST>[26571]LOG: REDO @ D/19175410; LSN
D/19175438: prev D/191753D0; xid 81148970: Transaction - commit:
2006-03-18 22:52:0
(@)<2006-03-25 20:54:17.527 MST>[26571]LOG: REDO @ D/19175438; LSN
D/19175524: prev D/19175410; xid 81148972: Heap - insert: rel
1663/16386/2619; tid
(@)<2006-03-25 20:54:17.527 MST>[26571]LOG: REDO @ D/19175524; LSN
D/19175564: prev D/19175438; xid 81148972: Btree - insert: rel
1663/16386/2696; tid
(@)<2006-03-25 20:54:17.527 MST>[26571]LOG: REDO @ D/19175564; LSN
D/19175844: prev D/19175524; xid 81148972: Heap - insert: rel
1663/16386/2619; tid
(@)<2006-03-25 20:54:17.527 MST>[26571]LOG: REDO @ D/19175844; LSN
D/19175884: prev D/19175564; xid 81148972: Btree - insert: rel
1663/16386/2696; tid
(@)<2006-03-25 20:54:17.527 MST>[26571]LOG: REDO @ D/19175884; LSN
D/1917596C: prev D/19175844; xid 81148972: Heap - insert: rel
1663/16386/2619; tid
(@)<2006-03-25 20:54:17.527 MST>[26571]LOG: REDO @ D/1917596C; LSN
D/191759AC: prev D/19175884; xid 81148972: Btree - insert: rel
1663/16386/2696; tid
(@)<2006-03-25 20:54:17.527 MST>[26571]LOG: REDO @ D/191759AC; LSN
D/191759D4: prev D/1917596C; xid 81148972: Transaction - commit:
2006-03-18 22:52:0
(@)<2006-03-25 20:54:17.527 MST>[26571]LOG: REDO @ D/191759D4; LSN
D/19175B98: prev D/191759AC; xid 81148974: Heap - update: rel
1663/16386/2619; tid
(@)<2006-03-25 20:54:17.527 MST>[26571]LOG: REDO @ D/19175B98; LSN
D/19175BD8: prev D/191759D4; xid 81148974: Btree - insert: rel
1663/16386/2696; tid
(@)<2006-03-25 20:54:17.527 MST>[26571]LOG: REDO @ D/19175BD8; LSN
D/19175EE4: prev D/19175B98; xid 81148974: Heap - update: rel
1663/16386/2619; tid
(@)<2006-03-25 20:54:17.528 MST>[26571]LOG: REDO @ D/19175EE4; LSN
D/19175F24: prev D/19175BD8; xid 81148974: Btree - insert: rel
1663/16386/2696; tid
(@)<2006-03-25 20:54:17.528 MST>[26571]LOG: REDO @ D/19175F24; LSN
D/19175F4C: prev D/19175EE4; xid 81148974: Transaction - commit:
2006-03-18 22:52:0
(@)<2006-03-25 20:54:17.528 MST>[26571]LOG: REDO @ D/19175F4C; LSN
D/19176104: prev D/19175F24; xid 81148976: Heap - insert: rel
1663/16386/2619; tid
(@)<2006-03-25 20:54:17.528 MST>[26571]LOG: REDO @ D/19176104; LSN
D/19176144: prev D/19175F4C; xid 81148976: Btree - insert: rel
1663/16386/2696; tid
(@)<2006-03-25 20:54:17.528 MST>[26571]LOG: REDO @ D/19176144; LSN
D/19176230: prev D/19176104; xid 81148976: Heap - insert: rel
1663/16386/2619; tid
(@)<2006-03-25 20:54:17.528 MST>[26571]LOG: REDO @ D/19176230; LSN
D/19176270: prev D/19176144; xid 81148976: Btree - insert: rel
1663/16386/2696; tid
(@)<2006-03-25 20:54:17.528 MST>[26571]LOG: REDO @ D/19176270; LSN
D/19176298: prev D/19176230; xid 81148976: Transaction - commit:
2006-03-18 22:52:0
(@)<2006-03-25 20:54:17.528 MST>[26571]LOG: REDO @ D/19176298; LSN
D/19176384: prev D/19176270; xid 81148978: Heap - insert: rel
1663/16386/2619; tid
(@)<2006-03-25 20:54:17.528 MST>[26571]LOG: REDO @ D/19176384; LSN
D/191763C4: prev D/19176298; xid 81148978: Btree - insert: rel
1663/16386/2696; tid
(@)<2006-03-25 20:54:17.528 MST>[26571]LOG: REDO @ D/191763C4; LSN
D/191765A8: prev D/19176384; xid 81148978: Heap - insert: rel
1663/16386/2619; tid
(@)<2006-03-25 20:54:17.528 MST>[26571]LOG: REDO @ D/191765A8; LSN
D/191765E8: prev D/191763C4; xid 81148978: Btree - insert: rel
1663/16386/2696; tid
(@)<2006-03-25 20:54:17.528 MST>[26571]LOG: REDO @ D/191765E8; LSN
D/19176610: prev D/191765A8; xid 81148978: Transaction - commit:
2006-03-18 22:52:0
(@)<2006-03-25 20:54:17.528 MST>[26571]LOG: REDO @ D/19176610; LSN
D/19176644: prev D/191765E8; xid 81148979: Heap - clean: rel
1663/16386/16559898; b
(@)<2006-03-25 20:54:17.528 MST>[26571]LOG: REDO @ D/19176644; LSN
D/191766A4: prev D/19176610; xid 81148979: Heap - move: rel
1663/16386/16559898; ti
(@)<2006-03-25 20:54:17.528 MST>[26571]PANIC: heap_update_redo: no
block: target blcknum: 1, relation(1663/16386/16559898) length: 1
(@)<2006-03-25 20:54:17.529 MST>[26570]LOG: startup process (PID
26571) was terminated by signal 6
(@)<2006-03-25 20:54:17.530 MST>[26570]LOG: aborting startup due to
startup process failure

Ok, so what's next?


From: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: PANIC: heap_update_redo: no block
Date: 2006-03-27 04:40:31
Message-ID: e07qid$1vtl$1@news.hub.org


""Alex bahdushka"" <bahdushka(at)gmail(dot)com> wrote
>
> (@)<2006-03-25 20:54:17.528 MST>[26571]LOG: REDO @ D/19176610; LSN
> D/19176644: prev D/191765E8; xid 81148979: Heap - clean: rel
> 1663/16386/16559898; b
> (@)<2006-03-25 20:54:17.528 MST>[26571]LOG: REDO @ D/19176644; LSN
> D/191766A4: prev D/19176610; xid 81148979: Heap - move: rel
> 1663/16386/16559898; ti
> (@)<2006-03-25 20:54:17.528 MST>[26571]PANIC: heap_update_redo: no
> block: target blcknum: 1, relation(1663/16386/16559898) length: 1
> (@)<2006-03-25 20:54:17.529 MST>[26570]LOG: startup process (PID
> 26571) was terminated by signal 6
>

It looks like a problem due to a conflict between vacuum and update: the
update is on a page that vacuum had just removed.

Before we try to understand/attack the bug exactly, first I'd like to see
the complete xlog output. Your xlog output is incomplete -- for example, the
first line after "b" should be "lk %u", the second line after "ti" should be
"d %u%u". Can you check out why the output looks like that? If you want to
examine the source code, look at heapam.c/heap_desc().

Regards,
Qingqing


From: "Alex bahdushka" <bahdushka(at)gmail(dot)com>
To: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: PANIC: heap_update_redo: no block
Date: 2006-03-27 05:03:50
Message-ID: e0bf43760603262103l7b0e1d8cy37320a95300e811a@mail.gmail.com

> It looks like a problem due to a conflict between vacuum and update: the
> update is on a page that vacuum had just removed.
>
> Before we try to understand/attack the bug exactly, first I'd like to see
> the complete xlog output. Your xlog output is incomplete -- for example, the
> first line after "b" should be "lk %u", the second line after "ti" should be
> "d %u%u". Can you check out why the output looks like that? If you want to
> examine the source code, look at heapam.c/heap_desc().

Gah, yeah, it looks like a stupid copy-and-paste error; I'm attaching it
this time... hope that's ok.

Attachment Content-Type Size
postmaster.err application/octet-stream 10.6 KB

From: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: PANIC: heap_update_redo: no block
Date: 2006-03-27 12:52:11
Message-ID: e08nc2$2jvd$1@news.hub.org


""Alex bahdushka"" <bahdushka(at)gmail(dot)com> wrote
>
> LOG: REDO @ D/19176610; LSN D/19176644: prev D/191765E8; xid 81148979:
> Heap - clean: rel 1663/16386/16559898; blk 0
> LOG: REDO @ D/19176644; LSN D/191766A4: prev D/19176610; xid 81148979:
> Heap - move: rel 1663/16386/16559898; tid 1/1; new 0/10
> PANIC: heap_update_redo: no block: target blcknum: 1,
> relation(1663/16386/16559898) length: 1
>
What happened is something like this: heap 16559898 had two pages {0, 1}. A
VACUUM FULL first examined page 0 and did some cleaning; then it came to the
second page and moved item 1/1 to page 0 -- and as a byproduct the heap was
truncated to only 1 page, since 1/1 was the only item on the second page. At
that point the system crashed, leaving heap 16559898 with only 1 page. At
xlog startup time, for some unknown reason (or I am totally wrong),
PostgreSQL didn't extend the heap back to 2 blocks, so heap_update_redo
(more exactly, it should be heap_move_redo) failed. But what that unknown
reason is, I really don't have a clue :-(.
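The failure sequence above can be sketched as a toy model (Python, my own simplification -- this is an illustration of the replay logic, not PostgreSQL code; the `replay` function and its `extend` flag loosely mirror the role of XLogReadBuffer's extend parameter):

```python
# Toy model of the scenario: after the crash the on-disk heap has been
# truncated to one page, but the WAL still holds a record that touches
# the old block 1. Replaying it without being willing to extend the
# file fails, analogous to the "heap_update_redo: no block" PANIC.

def replay(record, heap, extend=False):
    """Apply a WAL-like record to a list of pages ('heap')."""
    blk = record["block"]
    if blk >= len(heap):
        if not extend:
            raise RuntimeError(
                f"no block: target block: {blk}, length: {len(heap)}")
        # Extend the relation with empty pages up to the target block.
        heap.extend([{} for _ in range(blk - len(heap) + 1)])
    heap[blk].update(record["change"])

heap = [{"items": ["moved tuple"]}]            # page 0 is all that is left
move_record = {"block": 1, "change": {"items": []}}

try:
    replay(move_record, heap)                  # extend=False -> failure
except RuntimeError as e:
    print(e)
```

With `extend=True` the same record succeeds: the missing page is recreated empty and then overwritten, which is the behavior the thread converges on for records that fully rewrite their target page.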

Actually I tried to simulate your situation, but every time I got a clean
recovery -- the only caveat is that, depending on whether the
XLOG_SMGR_TRUNCATE record was written out or not, there may be one extra
useless page at the end of the heap, which is not a problem at all.

Regards,
Qingqing


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Alex bahdushka" <bahdushka(at)gmail(dot)com>
Cc: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [GENERAL] PANIC: heap_update_redo: no block
Date: 2006-03-27 15:22:49
Message-ID: 14448.1143472969@sss.pgh.pa.us

[ redirecting to a more appropriate mailing list ]

"Alex bahdushka" <bahdushka(at)gmail(dot)com> writes:

> LOG: REDO @ D/19176610; LSN D/19176644: prev D/191765E8; xid 81148979: Heap - clean: rel 1663/16386/16559898; blk 0
> LOG: REDO @ D/19176644; LSN D/191766A4: prev D/19176610; xid 81148979: Heap - move: rel 1663/16386/16559898; tid 1/1; new 0/10
> PANIC: heap_update_redo: no block: target blcknum: 1, relation(1663/16386/16559898) length: 1

I think what's happened here is that VACUUM FULL moved the only tuple
off page 1 of the relation, then truncated off page 1, and now
heap_update_redo is panicking because it can't find page 1 to replay the
move. Curious that we've not seen a case like this before, because it
seems like a generic hazard for WAL replay.

The simplest fix would be to treat WAL records as no-ops if they refer
to nonexistent pages, but that seems much too prone to hide real failure
conditions. Another thought is to remember that we ignored this record,
and then complain if we don't see a TRUNCATE that would've removed the
page. That would be pretty complicated but not impossible. Anyone have
a better idea?

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: PANIC: heap_update_redo: no block
Date: 2006-03-27 16:45:04
Message-ID: 15168.1143477904@sss.pgh.pa.us

"Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu> writes:
> Actually I tried to simulate your situation, but everytime I got a neat
> recovery --

You probably filled the test table and then vacuumed within a single
checkpoint cycle, so that the replay sequence included loading data into
page 1 in the first place. The risk case is fill table, checkpoint,
vacuum, crash; because then the replay starts from the checkpoint and
won't re-create page 1.
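The timing distinction can be sketched as a toy model (Python, my own simplification, not PostgreSQL code): replay begins at the last checkpoint, so a record that created page 1 *before* the checkpoint is never replayed, and a later record touching page 1 then finds nothing.

```python
# Model WAL as a list of (kind, block) events. Replaying the full log
# recreates page 1; replaying only from the checkpoint does not.

wal = [
    ("insert", 1),        # fills page 1 -- happens before the checkpoint
    ("checkpoint", None),
    ("move", 1),          # VACUUM FULL moves the tuple off page 1
]

def missing_pages(log):
    """Return blocks referenced by 'move' records never created in 'log'."""
    created = set()
    missing = []
    for kind, blk in log:
        if kind == "insert":
            created.add(blk)
        elif kind == "move" and blk not in created:
            missing.append(blk)
    return missing

full_replay = missing_pages(wal)
from_checkpoint = missing_pages(wal[wal.index(("checkpoint", None)) + 1:])
print(full_replay, from_checkpoint)
```

Here `full_replay` comes back empty while `from_checkpoint` reports block 1 as missing, matching Tom's fill-checkpoint-vacuum-crash risk case.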

regards, tom lane


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Alex bahdushka" <bahdushka(at)gmail(dot)com>, "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] PANIC: heap_update_redo: no block
Date: 2006-03-27 22:01:49
Message-ID: 87lkuvtulu.fsf@stark.xeocode.com


Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> I think what's happened here is that VACUUM FULL moved the only tuple
> off page 1 of the relation, then truncated off page 1, and now
> heap_update_redo is panicking because it can't find page 1 to replay the
> move. Curious that we've not seen a case like this before, because it
> seems like a generic hazard for WAL replay.

This sounds familiar

http://archives.postgresql.org/pgsql-hackers/2005-05/msg01369.php

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: "Alex bahdushka" <bahdushka(at)gmail(dot)com>, "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] PANIC: heap_update_redo: no block
Date: 2006-03-27 22:09:13
Message-ID: 17887.1143497353@sss.pgh.pa.us

Greg Stark <gsstark(at)mit(dot)edu> writes:
> This sounds familiar
> http://archives.postgresql.org/pgsql-hackers/2005-05/msg01369.php

Hm, I had totally forgotten about that todo item :-(. Time to push it
back up the priority list.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: "Alex bahdushka" <bahdushka(at)gmail(dot)com>, "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] PANIC: heap_update_redo: no block
Date: 2006-03-28 03:03:59
Message-ID: 26340.1143515039@sss.pgh.pa.us

Greg Stark <gsstark(at)mit(dot)edu> writes:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>> I think what's happened here is that VACUUM FULL moved the only tuple
>> off page 1 of the relation, then truncated off page 1, and now
>> heap_update_redo is panicking because it can't find page 1 to replay the
>> move. Curious that we've not seen a case like this before, because it
>> seems like a generic hazard for WAL replay.

> This sounds familiar
> http://archives.postgresql.org/pgsql-hackers/2005-05/msg01369.php

After further review I've concluded that there is not a systemic bug
here, but there are several nearby local bugs. The reason it's not
a systemic bug is that this scenario is supposed to be handled by the
same mechanism that prevents torn-page writes: the first XLOG record
that touches a given page after a checkpoint is supposed to rewrite
the entire page, rather than update it incrementally. Since XLOG replay
always begins at a checkpoint, this means we should always be able to
write a fresh copy of the page, even after relation deletion or
truncation. Furthermore, during XLOG replay we are willing to create
a table (or even a whole tablespace or database directory) if it's not
there when touched. The subsequent replay of the deletion or truncation
will get rid of any unwanted data again.

Therefore, there is no systemic bug --- unless you are running with
full_page_writes=off. I assert that that GUC variable is broken and
must be removed.

There are, however, a bunch of local bugs, including these:

* On a symlink-less platform (ie, Windows), TablespaceCreateDbspace is
#ifdef'd to be a no-op. This is wrong because it performs the essential
function of re-creating a tablespace or database directory if needed
during replay. AFAICS the #if can just be removed and have the same
code with or without symlinks.

* log_heap_update decides that it can set XLOG_HEAP_INIT_PAGE instead
of storing the full destination page, if the destination contains only
the single tuple being moved. This is fine, except it also resets the
buffer indicator for the *source* page, which is wrong --- that page
may still need to be re-generated from the xlog record. This is the
proximate cause of the bug report that started this thread.

* btree_xlog_split passes extend=false to XLogReadBuffer for the left
sibling, which is silly because it is going to rewrite that whole page
from the xlog record anyway. It should pass true so that there's no
complaint if the left sib page was later truncated away. This accounts
for one of the bug reports mentioned in the message cited above.

* btree_xlog_delete_page passes extend=false for the target page,
which is likewise silly because it's going to init the page (not that
there was any useful data on it anyway). This accounts for the other
bug report mentioned in the message cited above.

Clearly, we need to go through the xlog code with a fine tooth comb
and convince ourselves that all pages touched by any xlog record will
be properly reconstituted if they've later been truncated off. I have
not yet examined any of the code except the above.

Notice that these are each, individually, pretty low-probability
scenarios, which is why we've not seen many bug reports. If we had had
a systemic bug I'm sure we'd be seeing far more.

regards, tom lane


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Alex bahdushka <bahdushka(at)gmail(dot)com>, Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] PANIC: heap_update_redo: no block
Date: 2006-03-28 10:01:27
Message-ID: 1143540087.3839.304.camel@localhost.localdomain

On Mon, 2006-03-27 at 22:03 -0500, Tom Lane wrote:
> Greg Stark <gsstark(at)mit(dot)edu> writes:
> > Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> >> I think what's happened here is that VACUUM FULL moved the only tuple
> >> off page 1 of the relation, then truncated off page 1, and now
> >> heap_update_redo is panicking because it can't find page 1 to replay the
> >> move. Curious that we've not seen a case like this before, because it
> >> seems like a generic hazard for WAL replay.
>
> > This sounds familiar
> > http://archives.postgresql.org/pgsql-hackers/2005-05/msg01369.php

Yes, I remember that also.

> After further review I've concluded that there is not a systemic bug
> here, but there are several nearby local bugs.

IMHO it's amazing to find so many bugs in a code review of existing
production code. Cool.

> The reason it's not
> a systemic bug is that this scenario is supposed to be handled by the
> same mechanism that prevents torn-page writes: the first XLOG record
> that touches a given page after a checkpoint is supposed to rewrite
> the entire page, rather than update it incrementally. Since XLOG replay
> always begins at a checkpoint, this means we should always be able to
> write a fresh copy of the page, even after relation deletion or
> truncation. Furthermore, during XLOG replay we are willing to create
> a table (or even a whole tablespace or database directory) if it's not
> there when touched. The subsequent replay of the deletion or truncation
> will get rid of any unwanted data again.

That will all work, agreed.

> The subsequent replay of the deletion or truncation
> will get rid of any unwanted data again.

Trouble is, it is not a watertight assumption that there *will be* a
subsequent truncation, even if it is a strong one. If there is not a
later truncation, we will just ignore what we now ought to know is an
error, and then try to continue as if the database were fine, which it
would not be.

The overall problem is that auto-extension fails to take action or provide
notification with regard to filesystem corruption. Clearly we would like
xlog replay to work even in the face of severe file corruption, but we
should make an attempt to identify this situation and notify people that it
has occurred.

I'd suggest both WARNING messages in the log and something more extreme
still: anyone touching a corrupt table should receive a NOTICE saying
"database recovery displayed errors for this table" "HINT: check the
database logfiles for specific messages". Indexes should have a log
WARNING saying "database recovery displayed errors for this index"
"HINT: use REINDEX to rebuild this index".

So I guess I had better help if we agree this is beneficial.

> Therefore, there is no systemic bug --- unless you are running with
> full_page_writes=off. I assert that that GUC variable is broken and
> must be removed.

On this analysis, I would agree for current production systems. But what
this says is something deeper: we must log full pages, not because we
fear a partial page write has occurred, but because the xlog mechanism
intrinsically depends upon the existence of those full pages after each
checkpoint.

The writing of full pages in this way is a serious performance issue
that it would be good to improve upon. Perhaps this is the spur to
discuss a new xlog format that would support higher performance logging
as well as log-mining for replication?

> There are, however, a bunch of local bugs, including these:

...

> Notice that these are each, individually, pretty low-probability
> scenarios, which is why we've not seen many bug reports.

Most people don't file bug reports. If we have a recovery mode that
ignores filesystem corruption we'll get even fewer, because any errors that
do occur will be written off as gamma rays or some other excuse.

> a systemic bug

Perhaps we do have one systemic problem: systems documentation.

The xlog code is distinct from other parts of the codebase in that it has
almost zero comments, and the overall mechanisms are relatively poorly
documented in README form. Methinks there are very few people who could
attempt such a code review, and even fewer who would find any bugs by
inspection. I'll think some more on that...

Best Regards, Simon Riggs


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Alex bahdushka <bahdushka(at)gmail(dot)com>, Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] PANIC: heap_update_redo: no block
Date: 2006-03-28 15:07:35
Message-ID: 1689.1143558455@sss.pgh.pa.us

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> On Mon, 2006-03-27 at 22:03 -0500, Tom Lane wrote:
>> The subsequent replay of the deletion or truncation
>> will get rid of any unwanted data again.

> Trouble is, it is not a watertight assumption that there *will be* a
> subsequent truncation, even if it is a strong one.

Well, in fact we'll have correctly recreated the page, so I'm not
thinking that it's necessary or desirable to check this. What's the
point? "PANIC: we think your filesystem screwed up. We don't know
exactly how or why, and we successfully rebuilt all our data, but
we're gonna refuse to start up anyway." Doesn't seem like robust
behavior to me. If you check the archives you'll find that we've
backed off panic-for-panic's-sake behaviors in replay several times
before, after concluding they made the system less robust rather than
more so. This just seems like another one of the same.

> Perhaps we do have one systemic problem: systems documentation.

I agree on that ;-). The xlog code is really poorly documented.
I'm going to try to improve the comments for at least the xlogutils
routines while I'm fixing this.

regards, tom lane


From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Alex bahdushka <bahdushka(at)gmail(dot)com>, Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] PANIC: heap_update_redo: no block
Date: 2006-03-28 16:12:09
Message-ID: 20060328161209.GJ75181@pervasive.com

On Tue, Mar 28, 2006 at 10:07:35AM -0500, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > On Mon, 2006-03-27 at 22:03 -0500, Tom Lane wrote:
> >> The subsequent replay of the deletion or truncation
> >> will get rid of any unwanted data again.
>
> > Trouble is, it is not a watertight assumption that there *will be* a
> > subsequent truncation, even if it is a strong one.
>
> Well, in fact we'll have correctly recreated the page, so I'm not
> thinking that it's necessary or desirable to check this. What's the
> point? "PANIC: we think your filesystem screwed up. We don't know
> exactly how or why, and we successfully rebuilt all our data, but
> we're gonna refuse to start up anyway." Doesn't seem like robust
> behavior to me. If you check the archives you'll find that we've
> backed off panic-for-panic's-sake behaviors in replay several times
> before, after concluding they made the system less robust rather than
> more so. This just seems like another one of the same.

Would the suggestion made in
http://archives.postgresql.org/pgsql-hackers/2005-05/msg01374.php help
in this regard? (Sorry, much of this is over my head, but not everyone
may have read that...)
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Alex bahdushka <bahdushka(at)gmail(dot)com>, Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] PANIC: heap_update_redo: no block
Date: 2006-03-28 16:27:10
Message-ID: 2788.1143563230@sss.pgh.pa.us

"Jim C. Nasby" <jnasby(at)pervasive(dot)com> writes:
> On Tue, Mar 28, 2006 at 10:07:35AM -0500, Tom Lane wrote:
>> Well, in fact we'll have correctly recreated the page, so I'm not
>> thinking that it's necessary or desirable to check this.

> Would the suggestion made in
> http://archives.postgresql.org/pgsql-hackers/2005-05/msg01374.php help
> in this regard?

That's exactly what we are debating: whether it's still necessary/useful
to make such a check, given that we now realize the failures are just
isolated bugs and not a systemic problem with truncated files.

regards, tom lane


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Alex bahdushka <bahdushka(at)gmail(dot)com>, Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] PANIC: heap_update_redo: no block
Date: 2006-03-28 18:11:53
Message-ID: 1143569513.32384.35.camel@localhost.localdomain

On Tue, 2006-03-28 at 10:07 -0500, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > On Mon, 2006-03-27 at 22:03 -0500, Tom Lane wrote:
> >> The subsequent replay of the deletion or truncation
> >> will get rid of any unwanted data again.
>
> > Trouble is, it is not a watertight assumption that there *will be* a
> > subsequent truncation, even if it is a strong one.
>
> Well, in fact we'll have correctly recreated the page, so I'm not
> thinking that it's necessary or desirable to check this. What's the
> point?

We recreated *a* page but we are shying away from exploring *why* we
needed to in the first place. If there was no later truncation then
there absolutely should have been a page there already and the fact
there wasn't one needs to be reported.

I don't want to write that code either, I just think we should.

> "PANIC: we think your filesystem screwed up. We don't know
> exactly how or why, and we successfully rebuilt all our data, but
> we're gonna refuse to start up anyway." Doesn't seem like robust
> behavior to me.

Agreed, which is why I explicitly said we shouldn't do that.

grass_up_filesystem = on should be the only setting we support; you're
right that we can't know why it's wrong, but the sysadmin might.

> > Perhaps we do have one systemic problem: systems documentation.
>
> I agree on that ;-). The xlog code is really poorly documented.
> I'm going to try to improve the comments for at least the xlogutils
> routines while I'm fixing this.

I'll take a look also.

Best Regards, Simon Riggs


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Alex bahdushka" <bahdushka(at)gmail(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] PANIC: heap_update_redo: no block
Date: 2006-03-28 19:44:21
Message-ID: 4523.1143575061@sss.pgh.pa.us

I wrote:
> * log_heap_update decides that it can set XLOG_HEAP_INIT_PAGE instead
> of storing the full destination page, if the destination contains only
> the single tuple being moved. This is fine, except it also resets the
> buffer indicator for the *source* page, which is wrong --- that page
> may still need to be re-generated from the xlog record. This is the
> proximate cause of the bug report that started this thread.

I have to retract that particular bit of analysis: I had misread the
log_heap_update code. It seems to be doing the right thing, and in any
case, given Alex's output

LOG: REDO @ D/19176644; LSN D/191766A4: prev D/19176610; xid 81148979: Heap - move: rel 1663/16386/16559898; tid 1/1; new 0/10

we can safely conclude that log_heap_update did not set the INIT_PAGE
bit, because the "new" tid doesn't have offset=1. (The fact that the
WAL_DEBUG printout doesn't report the bit's state is an oversight I plan
to fix, but anyway we can be pretty sure it's not set here.)

What we should be seeing, and don't see, is an indication of a backup
block attached to this WAL record. Furthermore, I don't see any
indication of a backup block attached to *any* of the WAL records in
Alex's printout. The only conclusion I can draw is that he had
full_page_writes turned OFF, and as we have just realized that that
setting is completely unsafe, that is the explanation for his failure.

> Clearly, we need to go through the xlog code with a fine tooth comb
> and convince ourselves that all pages touched by any xlog record will
> be properly reconstituted if they've later been truncated off. I have
> not yet examined any of the code except the above.

I've finished going through the xlog code looking for related problems,
and AFAICS this is the score:

* full_page_writes = OFF doesn't work.

* btree_xlog_split and btree_xlog_delete_page should pass TRUE not FALSE
to XLogReadBuffer for all pages that they are going to re-initialize.

* the recently-added gist xlog code is badly broken --- it pays no
attention whatever to preventing torn pages :-(. It's not going to be
easy to fix, either, because the page split code assumes that a single
WAL record can describe changes to any number of pages, which is not
the case.

Everything else seems to be getting it right.

regards, tom lane


From: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] PANIC: heap_update_redo: no block
Date: 2006-03-29 02:59:29
Message-ID: e0ctco$fl1$1@news.hub.org


"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote
>
> What we should be seeing, and don't see, is an indication of a backup
> block attached to this WAL record. Furthermore, I don't see any
> indication of a backup block attached to *any* of the WAL records in
> Alex's printout. The only conclusion I can draw is that he had
> full_page_writes turned OFF, and as we have just realized that that
> setting is completely unsafe, that is the explanation for his failure.
>

This might be the answer. I tried the fill-checkpoint-vacuum-crash sequence
as you suggested, but still got a clean recovery. That's because, IMHO, even
after a checkpoint, the moved page will still be saved into WAL (since it is
new again relative to the checkpoint) if full_page_writes is on.

Regards,
Qingqing


From: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: PANIC: heap_update_redo: no block
Date: 2006-03-31 09:16:59
Message-ID: e0is8n$21qc$1@news.hub.org


""Alex bahdushka"" <bahdushka(at)gmail(dot)com> wrote
>
> (@)<2006-03-18 23:30:33.035 MST>[3791]PANIC: heap_update_redo: no block
>

According to the discussion in pgsql-hackers, to close out this case: did
you turn off the full_page_writes parameter? I hope the answer is "yes" ...

Regards,
Qingqing


From: "Alex bahdushka" <bahdushka(at)gmail(dot)com>
To: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: PANIC: heap_update_redo: no block
Date: 2006-03-31 17:20:29
Message-ID: e0bf43760603310920w70956f6cx3e035dc5275fab26@mail.gmail.com

On 3/31/06, Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu> wrote:
>
> ""Alex bahdushka"" <bahdushka(at)gmail(dot)com> wrote
> >
> > (@)<2006-03-18 23:30:33.035 MST>[3791]PANIC: heap_update_redo: no block
> >
>
> According to the discussion in pgsql-hackers, to finish this case, did you
> turn off the full_page_writes parameter? I hope the answer is "yes" ...
>

If by off you mean full_page_writes = on, then yes.

Thanks for all your help!


From: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] PANIC: heap_update_redo: no block
Date: 2006-04-01 12:34:21
Message-ID: e0ls0h$1fdi$1@news.hub.org


"Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu> wrote
>
> "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote
>>
>> What we should be seeing, and don't see, is an indication of a backup
>> block attached to this WAL record. Furthermore, I don't see any
>> indication of a backup block attached to *any* of the WAL records in
>> Alex's printout. The only conclusion I can draw is that he had
>> full_page_writes turned OFF, and as we have just realized that that
>> setting is completely unsafe, that is the explanation for his failure.
>>
>

According to Alex, it seems the problem is not caused by full_page_writes being OFF:

>
>> According to the discussion in pgsql-hackers, to finish this case, did you
>> turn off the full_page_writes parameter? I hope the answer is "yes" ...
>>
>
> If by off you mean full_page_writes = on, then yes.

Regards,
Qingqing


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Alex bahdushka <bahdushka(at)gmail(dot)com>, Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] PANIC: heap_update_redo: no block
Date: 2006-04-10 22:03:29
Message-ID: 200604102203.k3AM3TD12648@candle.pha.pa.us

Tom Lane wrote:
> There are, however, a bunch of local bugs, including these:
>
> * On a symlink-less platform (ie, Windows), TablespaceCreateDbspace is
> #ifdef'd to be a no-op. This is wrong because it performs the essential
> function of re-creating a tablespace or database directory if needed
> during replay. AFAICS the #if can just be removed and have the same
> code with or without symlinks.

FYI, Win32 in Win2k and XP has symlinks implemented using junction
points, and we use them. It is just that pre-Win2k (NT4) does not have
them, and at this point pginstaller doesn't even support that platform.

--
Bruce Momjian http://candle.pha.pa.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +