Re: Fwd: index corruption in PG 8.3.13

From: Nikhil Sontakke <nikhil(dot)sontakke(at)enterprisedb(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: index corruption in PG 8.3.13
Date: 2011-03-09 12:42:22
Message-ID: AANLkTi=NPR9iRMFhojKAUwHOvVk+xjxv8H53KzNvrFmG@mail.gmail.com
Lists: pgsql-hackers

Hi,

I am currently looking at a physical file belonging to a corrupt
index. This is with PG 8.3.13 on an ext3 filesystem on SUSE Linux with
a RAID1 setup. There is nothing suspicious about the sequence of
operations:

Create a table, create a bunch of indexes on that table and do
insert/update activity. The one twist is that VACUUM FULL may be run
in between, alongside the usual autovacuum activity, while the
inserts/updates are going on.

One of the vacuum calls then errored out on a corrupt (in this case
completely zeroed out) block, with the usual hint to reindex.
This is a single-machine setup with no slaves. Some analysis using
pg_filedump comes up with the following:

There are a total of 525 blocks (numbered 0 to 524). The root page is
block #3. The corrupt or zeroed out page is block #523.

The basic issue is that item 2 in the root block points to block #523,
which is completely zeroed out, and this has broken the index chain
completely!

Another peculiarity of the index file is that we found a lot of
zeroed-out pages. Blocks #279 to #518 are all completely zeroed out,
without any sign of even a page header. Any ideas on how we can end up
with so many zeroed-out blocks? Apart from the relation-extension code
path, I fail to see any other, and this is an unusually large number
of zero pages...

Blocks 519 to 521 are DELETED. They do not have the LEAF flag set,
meaning they could be internal pages, but that is strange since the
ROOT page is at level 1. Also, importantly, their next XID is set to
FrozenXid, meaning VACUUM FULL was at play. Maybe the deletes reduced
the hierarchy level or something?

I am attaching the data file, the binary and the 8.3-compatible
pg_filedump sources with this mail. Just copy the pg_filedump
directory into contrib and use Makefile.contrib to compile it. I
cannot attach the output of running pg_filedump since my email would
exceed the attachment size limit. I am also attaching my dotty attempt
as a PNG to show the scenario visually; hopefully that makes things
clearer.

Feedback, comments and insights appreciated. This appears to be a
tough nut to crack :( :), and although blaming the hardware for the
zero pages might come to mind, the deletion activity done by
VACUUM/VACUUM FULL still seems a bit suspicious.

Regards,
Nikhils

Attachment Content-Type Size
index_corruption.tar.gz application/x-gzip 303.9 KB
index_corruption.png image/png 131.0 KB

From: Nikhil Sontakke <nikhil(dot)sontakke(at)enterprisedb(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Fwd: index corruption in PG 8.3.13
Date: 2011-03-09 13:51:50
Message-ID: AANLkTik5tXoeuVjtcW0y+Yh46sETmH4bLo_6DQa-REYn@mail.gmail.com
Lists: pgsql-hackers

Re-sending without the attachments. Can someone please allow my
attachments through from the previous email?

TIA
Nikhils


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Nikhil Sontakke <nikhil(dot)sontakke(at)enterprisedb(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fwd: index corruption in PG 8.3.13
Date: 2011-03-09 14:02:28
Message-ID: 1299679176-sup-8891@alvh.no-ip.org
Lists: pgsql-hackers

Excerpts from Nikhil Sontakke's message of Wed Mar 09 10:51:50 -0300 2011:
> Re-sending without the attachments. Can someone please allow my
> attachments through from the previous email?

They are not in the moderation queue, so presumably they got eaten by
the antispam grue.

> Blocks 519 to 521 are DELETED. They do not have the LEAF flag set,
> meaning they could be internal pages, but that is strange since ROOT
> page is at level 1. Also importantly their next XID is set FrozenXid,
> meaning VACUUM FULL was at play. Maybe due to deletes, we reduced the
> hierarchy level or something?

Hierarchy level is never reduced.

I'll send you a perl program we wrote for a customer to check for
strange issues in btrees. Please give it a spin; it may give you more
clues. If you find additional checks to add, please let me know!

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Nikhil Sontakke <nikhil(dot)sontakke(at)enterprisedb(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fwd: index corruption in PG 8.3.13
Date: 2011-03-09 14:16:22
Message-ID: AANLkTikEkMxO6XSVc9j8=E3JWhU=FCDtJwWeruOwhhDd@mail.gmail.com
Lists: pgsql-hackers

>> Re-sending without the attachments. Can someone please allow my
>> attachments through from the previous email?
>
> They are not in the moderation queue, so presumably they got eaten by
> the antispam grue.

Ouch. Attempting to attach the dotty image again..

>
>
>> Blocks 519 to 521 are DELETED. They do not have the LEAF flag set,
>> meaning they could be internal pages, but that is strange since ROOT
>> page is at level 1. Also importantly their next XID is set FrozenXid,
>> meaning VACUUM FULL was at play. Maybe due to deletes, we reduced the
>> hierarchy level or something?
>
> Hierarchy level is never reduced.
>
> I'll send you a perl program we wrote for a customer to check for
> strange issues in btrees.  Please give it a spin; it may give you more
> clues.  If you find additional checks to add, please let me know!
>

Ok, thanks a lot!

Regards,
Nikhils

Attachment Content-Type Size
index_corruption.png image/png 131.0 KB

From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Nikhil Sontakke <nikhil(dot)sontakke(at)enterprisedb(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fwd: index corruption in PG 8.3.13
Date: 2011-03-09 14:35:40
Message-ID: 1299681299-sup-9110@alvh.no-ip.org
Lists: pgsql-hackers

Excerpts from Nikhil Sontakke's message of Wed Mar 09 11:16:22 -0300 2011:
> >> Re-sending without the attachments. Can someone please allow my
> >> attachments through from the previous email?
> >
> > They are not in the moderation queue, so presumably they got eaten by
> > the antispam grue.
>
> Ouch. Attempting to attach the dotty image again..

I don't understand this graph. What are the arrows? Downlinks or
sibling pointers?

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Nikhil Sontakke <nikhil(dot)sontakke(at)enterprisedb(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fwd: index corruption in PG 8.3.13
Date: 2011-03-09 14:38:26
Message-ID: AANLkTikGEKYsdKDLPEvB2U8JY5sFzcAzr9ydi0+3Ft13@mail.gmail.com
Lists: pgsql-hackers

>> Ouch. Attempting to attach the dotty image again..
>
> I don't understand this graph.  What are the arrows?  Downlinks or
> sibling pointers?
>

Sorry, they are sibling previous and next pointers. The ROOT is at
level 1; all the other live blocks are at level 0. #524 is the
leftmost page.

Regards,
Nikhils


From: Nikhil Sontakke <nikhil(dot)sontakke(at)enterprisedb(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fwd: index corruption in PG 8.3.13
Date: 2011-03-09 23:28:19
Message-ID: AANLkTi=wwfFdB33zuFijDR5SzCos=EoEMViSoNPnAfDV@mail.gmail.com
Lists: pgsql-hackers

Hi,

>>> Blocks 519 to 521 are DELETED. They do not have the LEAF flag set,
>>> meaning they could be internal pages, but that is strange since ROOT
>>> page is at level 1. Also importantly their next XID is set FrozenXid,
>>> meaning VACUUM FULL was at play. Maybe due to deletes, we reduced the
>>> hierarchy level or something?
>>
>> Hierarchy level is never reduced.
>>
>> I'll send you a perl program we wrote for a customer to check for
>> strange issues in btrees.  Please give it a spin; it may give you more
>> clues.  If you find additional checks to add, please let me know!
>>
>

I tried to run Alvaro's Perl script, but since the index chain is
broken pretty early on by the zeroed-out page, it could not traverse
very far.

While I rummage around the code more, does anyone have any theories on
the following?

"Another peculiarity of the index file is that we found a lot of
zeroed-out pages. Blocks #279 to #518 are all completely zeroed out,
without any sign of even a page header. Any ideas on how we can end up
with so many zeroed-out blocks? Apart from the relation-extension code
path, I fail to see any other, and this is an unusually large number
of zero pages."

Comments appreciated.

Regards,
Nikhils


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Nikhil Sontakke <nikhil(dot)sontakke(at)enterprisedb(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fwd: index corruption in PG 8.3.13
Date: 2011-03-10 00:14:40
Message-ID: AANLkTimZSyjWGOsNEHxdLanMv7K4BaGbhHnpxgz1V8hp@mail.gmail.com
Lists: pgsql-hackers

On Wed, Mar 9, 2011 at 11:28 PM, Nikhil Sontakke
<nikhil(dot)sontakke(at)enterprisedb(dot)com> wrote:
> "Other peculiarity in the index file is that we found a lot of zeroed
> out pages. Blocks from #279 to #518 are all completely zeroed out
> without any signs of even a page header. Any ideas on how we can get
> so many zeroed out blocks? Apart from the extend code path, I fail to
> see any other. And this is an unusually large number of zero pages"
>

What does stat say for the index data file? Are the Size and Blocks
values the same (modulo block size)? Or are these blocks actually not
allocated?

Postgres always forces blocks to be allocated but if they were lost
due to filesystem corruption maybe they're not allocated any more.

--
greg


From: Nikhil Sontakke <nikhil(dot)sontakke(at)enterprisedb(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fwd: index corruption in PG 8.3.13
Date: 2011-03-10 00:24:30
Message-ID: AANLkTincXZsrWOcghjcqba7xcwCfM7kYck-v-zdgN7od@mail.gmail.com
Lists: pgsql-hackers

>
> What does stat say for the index data file? Are the Size and Blocks
> values the same (modulo block size)? Or are these blocks actually not
> allocated?
>
stat 58401
File: `58401'
Size: 4300800 Blocks: 8400 IO Block: 4096 regular file
Device: 801h/2049d Inode: 13901264 Links: 1

Dividing the size by 8192 gives 525, which is the number of blocks we
see (and 8400 * 512 is exactly 4300800 bytes, so the file appears to
be fully allocated on disk, with no filesystem holes).

> Postgres always forces blocks to be allocated but if they were lost
> due to filesystem corruption maybe they're not allocated any more.
>

Actually these blocks form a big hole in the middle of the file:
blocks 279 to 518 are zeroed out, and then 519 to 524 contain index
data. Block 523 is the one that has been zeroed out too, causing all
the mayhem.
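
Just to spell out the check, here is a rough Python sketch (purely
illustrative; the block-size constant and the whole-page zero test are
assumptions on my part) that scans a segment for all-zero 8 KB pages
and compares st_size against st_blocks to see whether the kernel
really allocated the space:

import os, sys

BLCKSZ = 8192          # PostgreSQL page size

def scan(path):
    st = os.stat(path)
    # st_blocks is in 512-byte units on Linux, so a fully allocated file
    # has st_blocks * 512 == st_size (i.e. no filesystem holes).
    print("st_size = %d bytes (%d pages), allocated = %d bytes"
          % (st.st_size, st.st_size // BLCKSZ, st.st_blocks * 512))
    zero_page = b"\x00" * BLCKSZ
    with open(path, "rb") as f:
        blkno = 0
        while True:
            page = f.read(BLCKSZ)
            if not page:
                break
            if page == zero_page:
                print("block %d is completely zeroed" % blkno)
            blkno += 1

if __name__ == "__main__":
    scan(sys.argv[1])   # e.g. point it at the 58401 segment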

Regards,
Nikhils


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Nikhil Sontakke <nikhil(dot)sontakke(at)enterprisedb(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fwd: index corruption in PG 8.3.13
Date: 2011-03-10 00:50:37
Message-ID: 1299717506-sup-7371@alvh.no-ip.org
Lists: pgsql-hackers

Excerpts from Nikhil Sontakke's message of Wed Mar 09 20:28:19 -0300 2011:

> While I rummage around the code more, does anyone have any theories on
> the below?
>
> "Other peculiarity in the index file is that we found a lot of zeroed
> out pages. Blocks from #279 to #518 are all completely zeroed out
> without any signs of even a page header. Any ideas on how we can get
> so many zeroed out blocks? Apart from the extend code path, I fail to
> see any other. And this is an unusually large number of zero pages"

Well, if you look at the root block's child pointers, they go to pages
2, 4-277, 522-524. So pages 278 to 521 seem unused for some reason,
which is rather strange. But apparently only page 523 is zeroed and
shouldn't be.

It seems hard to believe that there would be 245 unsuccessful attempts
at extending the file.

Page 522 is suspect too ... why does it have a single item? Note its
LSN is very close to the one on page 521.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Nikhil Sontakke <nikhil(dot)sontakke(at)enterprisedb(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fwd: index corruption in PG 8.3.13
Date: 2011-03-10 11:44:48
Message-ID: AANLkTimhqmM0SWVh3HzqU2BDx+gyyUBW50788XJxZkwi@mail.gmail.com
Lists: pgsql-hackers

Hi,

>> "Other peculiarity in the index file is that we found a lot of zeroed
>> out pages. Blocks from #279 to #518 are all completely zeroed out
>> without any signs of even a page header. Any ideas on how we can get
>> so many zeroed out blocks? Apart from the extend code path, I fail to
>> see any other. And this is an unusually large number of zero pages"
>
> Well, if you see the root block child pointers, they go to pages
> 2, 4-277,522-524.  So pages 278 to 521 seem unused for some reason,
> which is rather strange.  But apparently only page 523 is zeroed and
> shouldn't be.
>

Yeah, all this is the very definition of strange.

> It seems hard to believe that there would be 245 unsuccessful attempts
> at extending the file.
>
> Page 522 is suspect too ... why does it have a single item?  Note its
> LSN is very close to the one on page 521.
>

If you look at my png, you will notice that there is a deleted block
chain at play around 522.

Deleted 519's previous points to Deleted 520.

Deleted 520's previous points to Deleted 521.

Deleted 521's previous points to 522.

Want to know where the next pointers of all four of these blocks (519
to 522) point? Block 277.

Another interesting thing is that all these deleted blocks have their
next XID set to FrozenXid. And who sets it to that? Only VACUUM FULL.
That's why my suspicions center on VF...

Regards,
Nikhils


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Nikhil Sontakke <nikhil(dot)sontakke(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fwd: index corruption in PG 8.3.13
Date: 2011-03-10 13:45:22
Message-ID: AANLkTinUaGCL491fu==7h5FYJ-=hV7LqORP8QhHbZLgK@mail.gmail.com
Lists: pgsql-hackers

On Wed, Mar 9, 2011 at 7:14 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> On Wed, Mar 9, 2011 at 11:28 PM, Nikhil Sontakke
> <nikhil(dot)sontakke(at)enterprisedb(dot)com> wrote:
>> "Other peculiarity in the index file is that we found a lot of zeroed
>> out pages. Blocks from #279 to #518 are all completely zeroed out
>> without any signs of even a page header. Any ideas on how we can get
>> so many zeroed out blocks? Apart from the extend code path, I fail to
>> see any other. And this is an unusually large number of zero pages"
>>
>
> What does stat say for the index data file? Are the Size and Blocks
> values the same (modulo block size)? Or are these blocks actually not
> allocated?
>
> Postgres always forces blocks to be allocated but if they were lost
> due to filesystem corruption maybe they're not allocated any more.

Hmm, that gives me an idea. What if the index relation were extended
repeatedly, but the blocks are still all zero on disk because they've
never been written out of shared_buffers? Then, before the next
checkpoint, the system crashes. All those blocks are now zero on
disk. In theory, WAL replay should fix most of it, but there could be
portions of the WAL that haven't been flushed; or there could be some
other problem in WAL replay. For example, it strikes me that this
situation would be consistent with:

1. Somebody inserts a bunch of new tuples into the relation, causing
growth in the index.
2. Before the blocks are written to disk, system crash.
3. pg_resetxlog

But it wouldn't have to be exactly that. Like what if during replay,
you hit a WAL record that had a bit flipped so it failed the checksum.
I think the system would just treat that as end-of-WAL and start up.
Now you could have some corruption, and a big chunk of zero blocks in
the index. Now you go along in normal running, maybe not realizing
that those zero blocks are there, and start extending the relation as
you need to update the index...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Nikhil Sontakke <nikhil(dot)sontakke(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fwd: index corruption in PG 8.3.13
Date: 2011-03-10 17:52:34
Message-ID: AANLkTinFKsP_yv0vswBjW4c1Mx9fhwNUE13_HYLzPujv@mail.gmail.com
Lists: pgsql-hackers

On Thu, Mar 10, 2011 at 1:45 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> 1. Somebody inserts a bunch of new tuples into the relation, causing
> growth in the index.

In case it's not obvious VACUUM FULL would do precisely that.

> 2. Before the blocks are written to disk, system crash.
> 3. pg_resetxlog

Do you have any record of the VACUUM FULL committing? Do you know what
LSN its commit record is at?

--
greg


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Nikhil Sontakke <nikhil(dot)sontakke(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fwd: index corruption in PG 8.3.13
Date: 2011-03-10 18:29:45
Message-ID: AANLkTimmSyxhkBckE8H+YnqtHLgvqeQF4r5UhShS8ggw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Mar 10, 2011 at 12:52 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> On Thu, Mar 10, 2011 at 1:45 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> 1. Somebody inserts a bunch of new tuples into the relation, causing
>> growth in the index.
>
> In case it's not obvious VACUUM FULL would do precisely that.

Oh, I didn't even think about that. Yeah, that could be it, too. So
maybe VACUUM FULL - crash before checkpoint - problem with recovery?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Nikhil Sontakke <nikhil(dot)sontakke(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fwd: index corruption in PG 8.3.13
Date: 2011-03-11 11:17:58
Message-ID: AANLkTimSXMshgbfe0vc=Z4TG-p31R6MPM_YQBXCCw3qK@mail.gmail.com
Lists: pgsql-hackers

Hi,

>>> 1. Somebody inserts a bunch of new tuples into the relation, causing
>>> growth in the index.
>>
>> In case it's not obvious VACUUM FULL would do precisely that.
>
> Oh, I didn't even think about that.  Yeah, that could be it, too.

Thanks a lot, Greg and Robert. This theory seems very plausible. VF
must have rearranged the heap tuples for compaction, and that would
have created new index entries, which might explain the index being
extended by that many blocks.

> maybe VACUUM FULL - crash before checkpoint - problem with recovery?

Did I mention that an immediate shutdown was thrown into the mix just
after the VF? That is practically a crash, and it means the shared
buffers were not written back to the index data file, which should
also account for all these pages still being zeroed out. So change the
above to:

> VACUUM FULL - immediate shutdown - problem with recovery?

But WAL replay should still have handled this. I would presume even an
immediate shutdown ensures that WAL is flushed to disk properly?

So that means that either there is a corner case bug in VF which adds
incorrect WAL logging in some specific btree layout scenarios or there
was indeed some bit flipping in the WAL, which caused the recovery to
prematurely end during WAL replay. What are the scenarios that you
would think can cause WAL bit flipping?

I was trying to repro this on the setup by repeatedly creating a table
with large inserts, doing lots of deletes, running VF and then issuing
an immediate shutdown. However, inspecting the index data file at that
point in the test is not very useful, as the on-disk file is largely
out of date because its dirty shared buffers have not been flushed.
That leaves me with the option of restarting and then checking the
index data file for problems. If we see problems after the restart, it
should generally mean WAL-logging errors (though I guess we still
cannot discount the bit-flipping case).
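
Roughly, one round of such a test could look like the sketch below
(only an outline: the table shape, row counts, data directory path and
the psycopg2/pg_ctl invocations are stand-ins, not the actual setup
used here):

import subprocess
import psycopg2

PGDATA = "/path/to/data"        # placeholder data directory

def one_round():
    conn = psycopg2.connect("dbname=postgres")   # placeholder DSN
    conn.autocommit = True   # VACUUM cannot run inside a transaction block
    cur = conn.cursor()
    cur.execute("DROP TABLE IF EXISTS t")
    cur.execute("CREATE TABLE t (id int PRIMARY KEY, payload text)")
    cur.execute("INSERT INTO t SELECT g, repeat('x', 100) "
                "FROM generate_series(1, 500000) g")
    cur.execute("DELETE FROM t WHERE id % 3 <> 0")  # leave plenty of dead tuples
    cur.execute("VACUUM FULL t")
    conn.close()
    # simulate the crash right after the VACUUM FULL
    subprocess.check_call(["pg_ctl", "-D", PGDATA, "stop", "-m", "immediate"])
    subprocess.check_call(["pg_ctl", "-D", PGDATA, "-w", "start"])

if __name__ == "__main__":
    one_round()

After the restart, the primary key index's segment can then be checked
for zero pages and broken sibling links.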

I guess a closer look at the WAL activity generated by VF's btree
index insertions is the need of the hour...

Regards,
Nikhils


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Nikhil Sontakke <nikhil(dot)sontakke(at)enterprisedb(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fwd: index corruption in PG 8.3.13
Date: 2011-03-11 13:14:22
Message-ID: AANLkTi=F6XQkVRFpLLO+JwbnDT+=LXj5=hDh=gZHMxmD@mail.gmail.com
Lists: pgsql-hackers

On Fri, Mar 11, 2011 at 6:17 AM, Nikhil Sontakke
<nikhil(dot)sontakke(at)enterprisedb(dot)com> wrote:
>> VACUUM FULL - immediate shutdown - problem with recovery?

An immediate shutdown == an intentional crash. OK, so you have the
VACUUM FULL and the immediate shutdown just afterward. So we just
need to figure out what happened during recovery.

> But WAL replay should still have handled this. I would presume even an
> immediate shutdown ensures that WAL is flushed to disk properly?

I'm not sure, but I doubt it. If the VACUUM FULL committed, then the
WAL records should be on disk, but if the immediate shutdown happened
while it was still running, then the WAL records might still be in
wal_buffers, in which case I don't think they'll get written out and
thus zero pages in the index are to be expected. Now that doesn't
explain any other corruption in the file, but I believe all-zeroes
pages in a relation are an expected consequence of an unclean
shutdown. But assuming the VF actually committed before the immediate
shutdown, there must be something else going on, since by that point
XLOG should have been flushed.

> So that means that either there is a corner case bug in VF which adds
> incorrect WAL logging in some specific btree layout scenarios or there
> was indeed some bit flipping in the WAL, which caused the recovery to
> prematurely end during WAL replay. What are the scenarios that you
> would think can cause WAL bit flipping?

Some kind of fluke hard drive malfunction, maybe? I know that the
incidence of a hard drive being told to write A and actually writing B
is very low, but it's probably not exactly zero. Do you have the logs
from the recovery following the immediate shutdown? Anything
interesting there?

Or, as you say, there could be a corner-case VF bug.

> I was trying to repro this on the setup by repeatedly creating a table
> with large inserts, doing lotta deletes, running VF and then issuing
> immediate shutdown. However if I try to inspect the index data file at
> this point in the test case, it is inconsequential as the file is
> largely out of sync since its dirty shared buffers have not been
> flushed. That leaves me with the option to restart and check the index
> data file again for problems. If we see problems after the restart it
> should generally mean WAL logging errors (but we still cannot discount
> the bit flipping case I guess).

contrib/pageinspect might help.
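
For instance, once the server is back up after the restart, something
along these lines could be used to look at the suspect blocks from SQL
(a sketch only: the index name is a placeholder, the contrib module
has to be installed first, and a genuinely zeroed page will presumably
just raise an error):

import psycopg2

conn = psycopg2.connect("dbname=postgres")   # placeholder DSN
conn.autocommit = True    # so one failing block doesn't abort the rest
cur = conn.cursor()

cur.execute("SELECT * FROM bt_metap('corrupt_idx')")
print("meta:", cur.fetchone())               # root block, level, fast root, ...

for blkno in (3, 277, 522, 523, 524):        # root plus the blocks discussed above
    try:
        cur.execute("SELECT * FROM bt_page_stats('corrupt_idx', %s)", (blkno,))
        print(blkno, cur.fetchone())         # live/dead items, btpo_prev/next, flags
    except psycopg2.Error as e:
        print(blkno, "error:", str(e).strip())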

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Nikhil Sontakke <nikhil(dot)sontakke(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fwd: index corruption in PG 8.3.13
Date: 2011-03-11 14:28:05
Message-ID: AANLkTin8jifMsE-jE8P9x-+BbpVsQD4uoxWkBbO3xQ3h@mail.gmail.com
Lists: pgsql-hackers

>>> VACUUM FULL - immediate shutdown - problem with recovery?
>
> An immediate shutdown == an intentional crash. OK, so you have the
> VACUUM FULL and the immediate shutdown just afterward. So we just
> need to figure out what happened during recovery.
>

Right.

>> But WAL replay should still have handled this. I would presume even an
>> immediate shutdown ensures that WAL is flushed to disk properly?
>
> I'm not sure, but I doubt it. If the VACUUM FULL committed, then the
> WAL records should be on disk, but if the immediate shutdown happened
> while it was still running, then the WAL records might still be in
> wal_buffers, in which case I don't think they'll get written out and
> thus zero pages in the index are to be expected. Now that doesn't
> explain any other corruption in the file, but I believe all-zeroes
> pages in a relation are an expected consequence of an unclean
> shutdown. But assuming the VF actually committed before the immediate
> shutdown, there must be something else going on, since by that point
> XLOG should have been flushed.
>

Oh yeah, so if VF committed, the xlog should have been ok too, but
can't say the same about the shared buffers.

>> So that means that either there is a corner case bug in VF which adds
>> incorrect WAL logging in some specific btree layout scenarios or there
>> was indeed some bit flipping in the WAL, which caused the recovery to
>> prematurely end during WAL replay. What are the scenarios that you
>> would think can cause WAL bit flipping?
>
> Some kind of fluke hard drive malfunction, maybe? I know that the
> incidence of a hard drive being told to write A and actually writing B
> is very low, but it's probably not exactly zero. Do you have the logs
> from the recovery following the immediate shutdown? Anything
> interesting there?
>

Unfortunately we do not have the recovery logs. I would have loved to
see some sign of trouble during WAL replay to confirm the theory about
bit flipping...

> Or, as you say, there could be a corner-case VF bug.
>

Yeah, much harder to find by just eyeballing the code I guess :)

Regards,
Nikhils


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Nikhil Sontakke <nikhil(dot)sontakke(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fwd: index corruption in PG 8.3.13
Date: 2011-03-11 18:21:20
Message-ID: AANLkTik9ds9+a=JROSvEvqD5LK0tST7LnYvtgBnuOVQb@mail.gmail.com
Lists: pgsql-hackers

On Fri, Mar 11, 2011 at 2:28 PM, Nikhil Sontakke
<nikhil(dot)sontakke(at)enterprisedb(dot)com> wrote:
>> I'm not sure, but I doubt it.  If the VACUUM FULL committed, then the
>> WAL records should be on disk, but if the immediate shutdown happened
>> while it was still running, then the WAL records might still be in
>> wal_buffers, in which case I don't think they'll get written out and
>> thus zero pages in the index are to be expected.
>>...
>
> Oh yeah, so if VF committed, the xlog should have been ok too, but
> can't say the same about the shared buffers.

But there was a later block that *was* written out. What was the LSN
on that block? Everything in the WAL should have been fsynced up to
that point when that buffer was flushed.

Was there a machine restart in the picture as well?

--
greg


From: Nikhil Sontakke <nikhil(dot)sontakke(at)enterprisedb(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fwd: index corruption in PG 8.3.13
Date: 2011-03-12 03:06:21
Message-ID: AANLkTim_SK+Ecv8c0ogLYUJu4gkOaMR3wNSG1Bzvd8Dc@mail.gmail.com
Lists: pgsql-hackers

>> Oh yeah, so if VF committed, the xlog should have been ok too, but
>> can't say the same about the shared buffers.
>
> But there was a later block that *was* written out. What was the LSN
> on that block? everything in the WAL log should have been fsynced up
> to that point when that buffer was flushed.
>

Which block are you referring to?
After the hole from 279 to 518, we have:

Deleted 519's (LSN: logid 29, recoff 0xd1fab5bc) previous points to Deleted 520.

Deleted 520's (LSN: logid 29, recoff 0xd1fac918) previous points to Deleted 521.

Deleted 521's (LSN: logid 29, recoff 0xd1fadc60) previous points to 522

Note that all the above deleted blocks have next xid set to FrozenXid,
meaning VF did this.

Live 522's (LSN: logid 29, recoff 0xd1fade3c) previous points to the
zeroed-out block 523. Note that this seems to be the latest LSN in the
data file.

The next pointers of all four of these blocks (519 to 522) point to
live block 277, which has (LSN: logid 29, recoff 0xd1fadc60). The ROOT
block also has this same LSN, so blocks 521, 277 and 3 all share it.
The rest of the live blocks appear to have lower LSNs with lower
logids.

The last block in the index data file, number 524, which also seems to
be the current leftmost block, has an LSN of (logid 29, recoff
0xd1fa97b8) and its next sibling set to 523. One interesting
observation is that there is another deleted block, 278 (again deleted
by VF), with this same LSN and its next pointer set to 524. So 278
must have been the original leftmost block before 524 was promoted to
that status.
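
For reference, per-page LSNs like the ones above can be pulled
straight out of the segment file with a small script. The sketch below
assumes 8.3's page layout, where pd_lsn is the first 8 bytes of each
page, stored as two uint32s (xlogid, xrecoff) in host byte order
(little-endian on this kind of hardware):

import struct, sys

BLCKSZ = 8192

with open(sys.argv[1], "rb") as f:
    blkno = 0
    while True:
        page = f.read(BLCKSZ)
        if not page:
            break
        if page == b"\x00" * BLCKSZ:
            print("block %5d: zeroed" % blkno)
        else:
            # pd_lsn = (xlogid, xrecoff) at offset 0 of the page header
            xlogid, xrecoff = struct.unpack_from("<II", page, 0)
            print("block %5d: LSN logid %d, recoff 0x%08x"
                  % (blkno, xlogid, xrecoff))
        blkno += 1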

To summarize, as I see it: the zeroed-out block 523 should have been
the second leftmost leaf and should have pointed to 522, thus
re-establishing the index chain

524 -> 523 -> 522 -> 277 -> ...

Phew, I hope the above made some sense and was decipherable. Looking
at my png alongside might help a bit more too. Also maybe some bug is
indeed hidden in the guts of VF.

> Was there a machine restart in the picture as well?

I don't think so.

Regards,
Nikhils


From: Daniel Farina <daniel(at)heroku(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Nikhil Sontakke <nikhil(dot)sontakke(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fwd: index corruption in PG 8.3.13
Date: 2011-03-12 11:59:53
Message-ID: AANLkTi=+iMdEGtUm5xyJys_Kw3YAw0ircbzWvO7Zv+yU@mail.gmail.com
Lists: pgsql-hackers

On Wed, Mar 9, 2011 at 6:02 AM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:
> I'll send you a perl program we wrote for a customer to check for
> strange issues in btrees.  Please give it a spin; it may give you more
> clues.  If you find additional checks to add, please let me know!

I have also, coincidentally, encountered corruption of a system
catalog index -- 8.3.11 -- I have saved the file for forensics. Is it
possible that I also receive a copy of this program?

--
fdr


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Nikhil Sontakke <nikhil(dot)sontakke(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fwd: index corruption in PG 8.3.13
Date: 2011-03-12 16:13:19
Message-ID: AANLkTin95+5VuiWpXYVFSQmn=nPJ11KO786z267OV0of@mail.gmail.com
Lists: pgsql-hackers

On Sat, Mar 12, 2011 at 3:06 AM, Nikhil Sontakke
<nikhil(dot)sontakke(at)enterprisedb(dot)com> wrote:
> Live 522's      (LSN: logid 29, recoff 0xd1fade3c) previous points to
> the zeroed out 523 block. Note that this seems to be latest LSN in the
> data file.
>

So do you have logs from the server when it was restarted? It should
say how far it recovered before it started up

--
greg


From: Nikhil Sontakke <nikhil(dot)sontakke(at)enterprisedb(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fwd: index corruption in PG 8.3.13
Date: 2011-03-13 23:34:05
Message-ID: AANLkTimiviTZ0EKo+dwLDt136Vf5DLBshkRnm1WBVXrO@mail.gmail.com
Lists: pgsql-hackers

>> Live 522's      (LSN: logid 29, recoff 0xd1fade3c) previous points to
>> the zeroed out 523 block. Note that this seems to be latest LSN in the
>> data file.
>>
>
> So do you have logs from the server when it was restarted? It should
> say how far it recovered before it started up
>

Unfortunately, no..

Regards,
Nikhils


From: Nikhil Sontakke <nikhil(dot)sontakke(at)enterprisedb(dot)com>
To: Daniel Farina <daniel(at)heroku(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fwd: index corruption in PG 8.3.13
Date: 2011-03-14 23:42:04
Message-ID: AANLkTi=T+n+CiqbxH5zvfVd7FVFKWTGvSSzcke8dqXYM@mail.gmail.com
Lists: pgsql-hackers

Hi Daniel,

>
> I have also, coincidentally, encountered corruption of a system
> catalog index -- 8.3.11 -- I have saved the file for forensics.  Is it
> possible that I also receive a copy of this program?
>

Would it be possible for you to share the file/logs off-list with me?
I can do some analysis to see whether the situation is similar to the
one we are facing right now, and report back here if so.

TIA,
Nikhils


From: Nikhil Sontakke <nikhil(dot)sontakke(at)enterprisedb(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fwd: index corruption in PG 8.3.13
Date: 2011-03-16 11:51:00
Message-ID: AANLkTi=fPsoTmAYeYk9tPhLYUgwWWJqPg4TPV-mcO9dn@mail.gmail.com
Lists: pgsql-hackers

Hi,

> To summarize, as I see it - the zeroed out block 523 should have been
> the second left-most leaf and should have pointed out to 522. Thus
> re-establishing the index chain
>
> 524 -> 523 -> 522 -> 277 -> ...
>
>> Was there a machine restart in the picture as well?
>

It seems there might have been a machine restart involved too. So I
guess even WAL writing could have been impacted.

But even if VF was ongoing at the time of restart, the WAL replay on
restart should not do anything since this will be a non-committed
transaction?

Also, I was looking at ReadRecord and saw that it logs a message for
records that fail the CRC check, but WAL replay just stops at that
point since it returns NULL. Is there a way to find out whether more
records follow in the wake of the failed one (it should be a matter of
calling ReadRecord with NULL as the first argument, I think)? If so,
maybe we could additionally warn that the error was encountered in the
middle of WAL replay. However, the last record too could be a
legitimate CRC-failure candidate...

BTW, is it possible to regularly encounter trailing records with CRC
failures, for transactions that were in progress at the time of the
shutdown and did not get a chance to commit or WAL-log properly?

Regards,
Nikhils


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Nikhil Sontakke <nikhil(dot)sontakke(at)enterprisedb(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fwd: index corruption in PG 8.3.13
Date: 2011-03-16 12:06:42
Message-ID: AANLkTi=yC=ePCx5zD-___W0BR9OpoAvTPtvkUTw1DX36@mail.gmail.com
Lists: pgsql-hackers

On Wed, Mar 16, 2011 at 7:51 AM, Nikhil Sontakke
<nikhil(dot)sontakke(at)enterprisedb(dot)com> wrote:
> Hi,
>
>> To summarize, as I see it - the zeroed out block 523 should have been
>> the second left-most leaf and should have pointed out to 522. Thus
>> re-establishing the index chain
>>
>> 524 -> 523 -> 522 -> 277 -> ...
>>
>>> Was there a machine restart in the picture as well?
>>
>
> It seems there might have been a machine restart involved too. So I
> guess even WAL writing could have been impacted.
>
> But even if VF was ongoing at the time of restart, the WAL replay on
> restart should not do anything since this will be a non-committed
> transaction?

That's not how it works. Replaying an uncommitted transaction
shouldn't result in any user-visible changes, but it still does stuff.

> Also I was looking at ReadRecord and saw that it logs a message for
> failed CRC blocks but the WAL replay just stops at that point since it
> returns a NULL. Is there a way to find out if more blocks follow in
> the wake of this failed block (should be a matter of calling
> ReadRecord with NULL as a first argument I think)? If so maybe we can
> warn further that error was encountered in the middle of WAL replay.
> However the last block too could be CRC check-fail candidate...

In general, when we WAL-log, we're writing over a previous WAL segment
that's been recycled. A failed CRC is indistinguishable from
end-of-WAL, because we expect there to be arbitrary garbage bytes in
the file after the end of WAL position.
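
A toy illustration of why the two cases cannot be told apart (this
uses zlib.crc32 and a made-up record format, nothing like the real 8.3
WAL layout or CRC): replay simply stops at the first record whose
stored CRC does not match, whether that record is recycled garbage
past the true end of WAL or a genuinely corrupted record.

import zlib

def make_record(payload):
    return (zlib.crc32(payload) & 0xffffffff, payload)

def replay(records):
    applied = []
    for stored_crc, payload in records:
        if (zlib.crc32(payload) & 0xffffffff) != stored_crc:
            break            # looks exactly like end-of-WAL, so stop here
        applied.append(payload)
    return applied

good = [make_record(b"btree insert"),
        make_record(b"btree split"),
        make_record(b"commit")]
crc, payload = good[1]
flipped = [good[0], (crc ^ 1, payload), good[2]]  # one bit flipped in the stored CRC

print(replay(good))      # all three records are applied
print(replay(flipped))   # stops after the first record; the commit never gets replayed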

> BTW, is there a possibility to encounter trailing blocks with CRC
> failures regularly? For transactions that were ongoing at the time of
> shutdown and did not get a chance to commit or WAL log properly?

Well you might have a torn page if there was a *system* crash in the
middle of recovery, but in theory even that shouldn't break anything,
because the system shouldn't rely on the fsync being complete until it
actually is. Of course, as you mentioned earlier, it's not impossible
there's a bug in the recovery code. But if an OS crash is involved,
another possibility is that something went wrong with the fsync -
maybe there's a lying writeback cache between PG and the platter, for
example.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Nikhil Sontakke <nikhil(dot)sontakke(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fwd: index corruption in PG 8.3.13
Date: 2011-03-16 12:27:31
Message-ID: AANLkTinEbdMUrNoc+_hw7GNdLNDTtx5sVPS2wAgdzqVj@mail.gmail.com
Lists: pgsql-hackers

Hi,

> Of course, as you mentioned earlier, it's not impossible
> there's a bug in the recovery code.

Yeah, I was looking at the repair_frag function in 8.3.13 (yup, it's
ugly!) and found that the normal ExecInsertIndexTuples call is used to
insert the index entries. That is standard index code used everywhere,
so btree WAL bugs in this code path should be pretty rare, I would
think...

> But if an OS crash is involved,
> another possibility is that something went wrong with the fsync -
> maybe there's a lying writeback cache between PG and the platter, for
> example.
>

Yup, plan to re-confirm this too.

Thanks Robert!

Regards,
Nikhils


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Nikhil Sontakke <nikhil(dot)sontakke(at)enterprisedb(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fwd: index corruption in PG 8.3.13
Date: 2011-03-16 13:26:29
Message-ID: 1300281862-sup-4259@alvh.no-ip.org
Lists: pgsql-hackers

Excerpts from Nikhil Sontakke's message of Wed Mar 16 08:51:00 -0300 2011:
> Hi,
>
> > To summarize, as I see it - the zeroed out block 523 should have been
> > the second left-most leaf and should have pointed out to 522. Thus
> > re-establishing the index chain
> >
> > 524 -> 523 -> 522 -> 277 -> ...
> >
> >> Was there a machine restart in the picture as well?
> >
>
> It seems there might have been a machine restart involved too. So I
> guess even WAL writing could have been impacted.

Maybe the filesystem decided to fill a couple of FS pages (one PG
page) with zeroes during its own recovery, believing they hadn't been
fsynced at the time of the crash.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support