Re: strange nbtree corruption report

Lists: pgsql-hackers
From: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: strange nbtree corruption report
Date: 2011-11-21 23:00:21
Message-ID: 1321915576-sup-7558@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Hi,

We got a very strange nbtree corruption report some time ago. This was
a btree index on a vey high churn table -- entries are updated and
deleted very quickly, so the index grows very large and also shrinks
quickly (AFAICT this is a work queue of sorts).

The most strange thing of all is that there was this error:

ERROR: left link changed unexpectedly in block 3378 of index "index_name"
CONTEXT: automatic vacuum of table "table_name"

This was reported not once, but several dozens of times, by each new
autovacuum worker that tried to vacuum the table.

As far as I can see, there is just no way for this to happen ... much
less happen repeatedly. I thought it might be related to concurrent
insertions somehow managing to split the page under deletion very
quickly (given the load these systems are under, this is plausible).
But I can't find how.

(There were other error reports of btree indexes going awry here, such
as "ERROR: right sibling's left-link doesn't match: block 67 links to
2118 instead of expected 2228 in index "pg_depend_depender_index").

These guys are running 8.3.14 here, and this is a Londiste slave
database. Sadly, it seems that the index files in our case are gone
now.

I see three independent reports of this error message in the archives
(Ulrich Wisser, Mark Kirkwood, Gabriel Ferro), but no one seems to have
carried the investigation forward enough to discover what is exactly
going wrong.

Any thoughts about this?

--
Álvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>


From: Noah Misch <noah(at)leadboat(dot)com>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: strange nbtree corruption report
Date: 2011-11-22 03:17:45
Message-ID: 20111122031745.GA10556@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Nov 21, 2011 at 08:00:21PM -0300, Alvaro Herrera wrote:
> We got a very strange nbtree corruption report some time ago. This was
> a btree index on a vey high churn table -- entries are updated and
> deleted very quickly, so the index grows very large and also shrinks
> quickly (AFAICT this is a work queue of sorts).
>
> The most strange thing of all is that there was this error:
>
> ERROR: left link changed unexpectedly in block 3378 of index "index_name"
> CONTEXT: automatic vacuum of table "table_name"

Just a suspicion ... when looking at the B-tree page reclamation algorithm, I
had a thought that the logic in _bt_page_recyclable() was obsolete as of the
introduction (in 8.3) of xid-free read-only transactions. A transaction
without a persistent xid does not hold back RecentXmin, so how could waiting
for a RecentXmin window to pass prove that no scan still holds a link to the
page? Similarly, running VACUUMs do not hold back RecentXmin. I had made a
note to try to reproduce a concrete problem, but I haven't tried yet.

> This was reported not once, but several dozens of times, by each new
> autovacuum worker that tried to vacuum the table.
>
> As far as I can see, there is just no way for this to happen ... much
> less happen repeatedly. I thought it might be related to concurrent
> insertions somehow managing to split the page under deletion very
> quickly (given the load these systems are under, this is plausible).
> But I can't find how.

Yes, nothing comes to mind explaining that duration of persistence.

nm


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: strange nbtree corruption report
Date: 2011-11-22 03:43:23
Message-ID: 1189.1321933403@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Noah Misch <noah(at)leadboat(dot)com> writes:
> Just a suspicion ... when looking at the B-tree page reclamation algorithm, I
> had a thought that the logic in _bt_page_recyclable() was obsolete as of the
> introduction (in 8.3) of xid-free read-only transactions. A transaction
> without a persistent xid does not hold back RecentXmin, so how could waiting
> for a RecentXmin window to pass prove that no scan still holds a link to the
> page? Similarly, running VACUUMs do not hold back RecentXmin.

Uh, sure they do. It's their advertised snapshot xmin that counts, not
their own xid (if any).

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: strange nbtree corruption report
Date: 2011-11-22 04:14:33
Message-ID: 1731.1321935273@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> writes:
> We got a very strange nbtree corruption report some time ago. This was
> a btree index on a vey high churn table -- entries are updated and
> deleted very quickly, so the index grows very large and also shrinks
> quickly (AFAICT this is a work queue of sorts).

> The most strange thing of all is that there was this error:

> ERROR: left link changed unexpectedly in block 3378 of index "index_name"
> CONTEXT: automatic vacuum of table "table_name"

> This was reported not once, but several dozens of times, by each new
> autovacuum worker that tried to vacuum the table.

> As far as I can see, there is just no way for this to happen ... much
> less happen repeatedly.

It's not hard to believe that that would happen repeatedly given a
corrupted set of sibling links, eg deletable page A links left to page
B, which links right to C, which links right to A. The question is how
the index got into such a state. A dropped update during a page split
would explain it (ie, B used to be A's left sibling, then at some point
B got split into B and C, but A's left-link never got updated on disk).
I wonder how reliable their disk+filesystem is ...

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: strange nbtree corruption report
Date: 2011-11-22 04:17:40
Message-ID: 1811.1321935460@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> Noah Misch <noah(at)leadboat(dot)com> writes:
>> Just a suspicion ... when looking at the B-tree page reclamation algorithm, I
>> had a thought that the logic in _bt_page_recyclable() was obsolete as of the
>> introduction (in 8.3) of xid-free read-only transactions. A transaction
>> without a persistent xid does not hold back RecentXmin, so how could waiting
>> for a RecentXmin window to pass prove that no scan still holds a link to the
>> page? Similarly, running VACUUMs do not hold back RecentXmin.

> Uh, sure they do. It's their advertised snapshot xmin that counts, not
> their own xid (if any).

No, wait a second, I think you're right. The existing mechanism should
protect against transactions that might be updating the btree, so the
worst possible consequences can't happen; but it seems possible that a
read-only transaction in flight to the page could get confused and give
wrong answers. That would only explain transient failures not persistent
ones, though.

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: strange nbtree corruption report
Date: 2011-11-22 17:23:17
Message-ID: 1321981982-sup-2149@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Excerpts from Tom Lane's message of mar nov 22 01:14:33 -0300 2011:
> Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> writes:

> > ERROR: left link changed unexpectedly in block 3378 of index "index_name"
> > CONTEXT: automatic vacuum of table "table_name"
>
> > This was reported not once, but several dozens of times, by each new
> > autovacuum worker that tried to vacuum the table.
>
> > As far as I can see, there is just no way for this to happen ... much
> > less happen repeatedly.
>
> It's not hard to believe that that would happen repeatedly given a
> corrupted set of sibling links, eg deletable page A links left to page
> B, which links right to C, which links right to A. The question is how
> the index got into such a state. A dropped update during a page split
> would explain it (ie, B used to be A's left sibling, then at some point
> B got split into B and C, but A's left-link never got updated on disk).
> I wonder how reliable their disk+filesystem is ...

Well, there are no other signs of random data corruption, such as toast
pointers getting corrupted which is the number one symptom showing up
when underlying storage is flaky. However, it may be possible that
there was a transient storage problem which only affected this one page;
if this persisted in the way you describe, it might well explain these
symptoms.

Another thing I noticed is that there was corruption in heap pages (not
the same server, though; it was a different Londiste slave). This was
even more strange; the pages would be completely fine, except the first
six words corresponding to the page header; they would be all zeros.
When filled with valid-looking data (mostly I copied the bytes from
neighbor pages), the rest of the page would decode fine.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: strange nbtree corruption report
Date: 2011-11-28 21:44:05
Message-ID: 1322516294-sup-4592@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Excerpts from Tom Lane's message of mar nov 22 01:14:33 -0300 2011:
> Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> writes:
> > We got a very strange nbtree corruption report some time ago. This was
> > a btree index on a vey high churn table -- entries are updated and
> > deleted very quickly, so the index grows very large and also shrinks
> > quickly (AFAICT this is a work queue of sorts).
>
> > The most strange thing of all is that there was this error:
>
> > ERROR: left link changed unexpectedly in block 3378 of index "index_name"
> > CONTEXT: automatic vacuum of table "table_name"
>
> > This was reported not once, but several dozens of times, by each new
> > autovacuum worker that tried to vacuum the table.
>
> > As far as I can see, there is just no way for this to happen ... much
> > less happen repeatedly.
>
> It's not hard to believe that that would happen repeatedly given a
> corrupted set of sibling links, eg deletable page A links left to page
> B, which links right to C, which links right to A. The question is how
> the index got into such a state. A dropped update during a page split
> would explain it (ie, B used to be A's left sibling, then at some point
> B got split into B and C, but A's left-link never got updated on disk).
> I wonder how reliable their disk+filesystem is ...

While summarizing this, a (relatively) frequent problem with unique
indexes came to my mind: there would be a UNIQUE index but when the
admin tries to rebuild it for whatever reason, duplicate values are
found. We've seen dozens of reports of this kind of problem (in the
pgsql lists I mean -- I don't think we've seen this problem in this
customer's servers). I wonder if it's related, because it seems pretty
much the same mechanism: sometimes, a btree index insert would be
randomly forgotten (its page write lost in vacuum, so to speak), and
thus when the second heap item comes along, there's no entry in the
index and the insert completes, and there you have your duplicate value.

I wonder if it would be worthwhile to build some sort of tool to scan
the heap and ensure that there are index entries for all heap items,
just to test the hypothesis. Not that this would enlighten on the
source of the actual problem.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Jim Nasby <jim(at)nasby(dot)net>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: strange nbtree corruption report
Date: 2011-11-29 01:07:28
Message-ID: A3C798C4-998C-489C-B438-FB338F3ABAF9@nasby.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Nov 28, 2011, at 3:44 PM, Alvaro Herrera wrote:
> I wonder if it would be worthwhile to build some sort of tool to scan
> the heap and ensure that there are index entries for all heap items,
> just to test the hypothesis. Not that this would enlighten on the
> source of the actual problem.

There was a project to create a sanity-checker for Postgres databases... did that ever go anywhere? It seems like this would be a good addition for that tool, if it exists.

In either case, I am all for better capabilities to detect data problems (I'm looking at you, block checksum project! ;)
--
Jim C. Nasby, Database Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Jim Nasby <jim(at)nasby(dot)net>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: strange nbtree corruption report
Date: 2011-11-29 01:11:09
Message-ID: 201111290111.pAT1B9D24857@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jim Nasby wrote:
> On Nov 28, 2011, at 3:44 PM, Alvaro Herrera wrote:
> > I wonder if it would be worthwhile to build some sort of tool to scan
> > the heap and ensure that there are index entries for all heap items,
> > just to test the hypothesis. Not that this would enlighten on the
> > source of the actual problem.
>
> There was a project to create a sanity-checker for Postgres databases...
> did that ever go anywhere? It seems like this would be a good addition
> for that tool, if it exists.

Not that I know of.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: strange nbtree corruption report
Date: 2011-11-29 01:40:01
Message-ID: 1637.1322530801@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> I wonder if it would be worthwhile to build some sort of tool to scan
> the heap and ensure that there are index entries for all heap items,
> just to test the hypothesis. Not that this would enlighten on the
> source of the actual problem.

Seems like the hypothesis could be proven or disproven just by counting
the heap and index entries while the DB is otherwise idle. It used to
be that VACUUM VERBOSE was sufficient for that sort of cross-check ...
but I'm not totally sure what push-ups are required nowadays to prevent
it from deciding that it's smarter than you are so it needn't scan the
whole table. Is VACUUM FREEZE VERBOSE still trustworthy for this?

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: strange nbtree corruption report
Date: 2011-11-29 13:08:45
Message-ID: CA+TgmoYbnA6nmvaftiiiAuaUiSGZPpGX0Zy-L8YpWZm1xjLBaw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Nov 28, 2011 at 4:44 PM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:
> I wonder if it's related, because it seems pretty
> much the same mechanism: sometimes, a btree index insert would be
> randomly forgotten (its page write lost in vacuum, so to speak), ...

Groan.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: strange nbtree corruption report
Date: 2011-11-29 13:14:23
Message-ID: CA+TgmoZNo3qo5jTgeGUk8TxOfaBWBopJBcUJVkJa9vUaNP6LBQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Nov 28, 2011 at 8:40 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
>> I wonder if it would be worthwhile to build some sort of tool to scan
>> the heap and ensure that there are index entries for all heap items,
>> just to test the hypothesis.  Not that this would enlighten on the
>> source of the actual problem.
>
> Seems like the hypothesis could be proven or disproven just by counting
> the heap and index entries while the DB is otherwise idle.  It used to
> be that VACUUM VERBOSE was sufficient for that sort of cross-check ...
> but I'm not totally sure what push-ups are required nowadays to prevent
> it from deciding that it's smarter than you are so it needn't scan the
> whole table.  Is VACUUM FREEZE VERBOSE still trustworthy for this?

Yes. It will scan the whole table if relfrozenxid is too far back,
and FREEZE defines "too far back" to 0 transaction IDs, which is
always satisfied.

It might be useful to add an option to VACUUM to scan the whole table
without otherwise altering the behavior, in case you distrust the
visibility map or whatever but don't care about freezing anything.
The obvious name for the option would be "full", but since that's
already taken we'd need to invent something different. PARANOID?
ALL? SCAN_ALL? INEFFICIENTLY?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company