Re: DropRelFileNodeBuffers API change (was Re: [BUGS] BUG #5599: Vacuum fails due to index corruption issues)

Lists: pgsql-bugspgsql-hackers
From: "Hitesh Bhambhani" <hiteshb(at)asg(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #5599: Vacuum fails due to index corruption issues
Date: 2010-08-04 21:47:12
Message-ID: 201008042147.o74LlCge057659@wwwmaster.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers


The following bug has been logged online:

Bug reference: 5599
Logged by: Hitesh Bhambhani
Email address: hiteshb(at)asg(dot)com
PostgreSQL version: 8.2.9-1
Operating system: Microsoft Windows Server 2003, Enterprise Edition
Description: Vacuum fails due to index corruption issues
Details:

Hi,

We are seeing a problem in our application where the table indexes get
corrupted. This application is a Java Webapp with postgre as the backend.
The Webapp kicks off Vacuum at regular intervals. After running the
application for a while, one of our customers noted that the Vacuum fails
and Webapp gets very slow.

A re-index of the full database works fine and resolves the issue. But it
re-occurs within a day.

Based on some logs in the Webapp I can see that there were some errors in
truncating relations. Once those errors disappear the index corruption
errors start. I'm not sure if there is a connection here.

Here is a sample log message from the Webapp that shows the truncate error:

2010-07-08 06:29:54,641 WARN DefaultQuartzScheduler_Worker-4
maintenance.PostgreSqlVacuumer:32 - runVacuumFull(): running VACUUM FULL
VERBOSE
2010-07-08 06:29:56,672 ERROR DefaultQuartzScheduler_Worker-4
core.JobRunShell:211 - Job DEFAULT.postgreSqlVacuumJob threw an unhandled
Exception:
org.springframework.jdbc.BadSqlGrammarException: Hibernate-related JDBC
operation;
bad SQL grammar []; nested exception is org.postgresql.util.PSQLException:
ERROR: could not truncate relation 1663/16403/41274 to 30 blocks: Permission
denied

After a couple of such truncate errors the Vacuum starts failing due to
'failed to re-find parent key in index', as seen in sample error log below:

2010-07-08 06:44:56,060 ERROR DefaultQuartzScheduler_Worker-1
core.JobRunShell:2
11 - Job DEFAULT.postgreSqlVacuumJob threw an unhandled Exception:
org.springframework.jdbc.UncategorizedSQLException: Hibernate-related JDBC
operation;
uncategorized SQLException for SQL []; SQL state [XX000]; error code [0];
ERROR: failed to re-find parent key in index "pmoinstance_idx_pmotypeid" for
deletion target page 30;
nested exception is org.postgresql.util.PSQLException: ERROR: failed to
re-find parent key in index "pmoinstance_idx_pmotypeid" for deletion target
page 30

Here the index "pmoinstance_idx_pmotypeid" is one of several application
specific indexes.

Once this index corruption issue occurs, the Vacuum job keeps failing until
a re-index is done. In the long run, we don't want to keep re-indexing the
database as a scheduled job so we would like your help to get to the bottom
of this.

So how can we avoid these index corruption errors and are there any known
causes?

Also, please let me know if there is a direct link between the truncate
relation errors that I saw preceded the index corruption errors?

At the time these Webapp logs were collected, the database was set to
produce verbose logging so I don't have database logs, sorry. Please do let
me know what other information I can provide to help you diagnose this
situation.

Thanks for your time.

Regards,
Hitesh


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Hitesh Bhambhani <hiteshb(at)asg(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #5599: Vacuum fails due to index corruption issues
Date: 2010-08-05 01:35:12
Message-ID: AANLkTi=QHZ4R905JwP8ANZwu1vGoK_WksN++N52M-Ak7@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On Wed, Aug 4, 2010 at 10:47 PM, Hitesh Bhambhani <hiteshb(at)asg(dot)com> wrote:
>

> PostgreSQL version: 8.2.9-1

Firstly, the current release of 8.2 is 8.2.17. There are a long list
of bugs fixed in those intervening releases including one involving
vacuum truncating relations. I don't think it's the same problem but I
would recommend updating immediately to 8.2.17.

> 2010-07-08 06:29:54,641  WARN DefaultQuartzScheduler_Worker-4
> maintenance.PostgreSqlVacuumer:32 - runVacuumFull(): running VACUUM FULL
> VERBOSE

Secondly we don't recommend running VACUUM FULL routinely. It should
only be necessary in extraordinary circumstances. Normally a plain
VACUUM (or VACUUM ANALYZE or VACUUM VERBOSE) should be sufficient as
long as it's being run regularly. Regular VACUUM without the "FULL"
has much less impact on the system.

> 2010-07-08 06:29:56,672 ERROR DefaultQuartzScheduler_Worker-4
> core.JobRunShell:211 - Job DEFAULT.postgreSqlVacuumJob threw an unhandled
> Exception:
> org.springframework.jdbc.BadSqlGrammarException: Hibernate-related JDBC
> operation;
> bad SQL grammar []; nested exception is org.postgresql.util.PSQLException:
> ERROR: could not truncate relation 1663/16403/41274 to 30 blocks: Permission
> denied

"Permission denied" smells like a Windows problem with concurrent file
operations. Are you sure you're not running any anti-virus software or
backup software which could have these files open and prevent Postgres
from performing regular file operations on its files? Many people have
reported other problems with anti-virus software in particular.

> Also, please let me know if there is a direct link between the truncate
> relation errors that I saw preceded the index corruption errors?

I'm a bit surprised at that, but I haven't checked what exactly
happens when an error occurs truncating files myself so I'm not sure
if I should be or not.

--
greg


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Hitesh Bhambhani <hiteshb(at)asg(dot)com>
Cc: pgsql-bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #5599: Vacuum fails due to index corruption issues
Date: 2010-08-05 03:29:31
Message-ID: 1280978913-sup-5475@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Excerpts from Hitesh Bhambhani's message of mié ago 04 17:47:12 -0400 2010:

> Based on some logs in the Webapp I can see that there were some errors in
> truncating relations. Once those errors disappear the index corruption
> errors start. I'm not sure if there is a connection here.

There probably is. What kind of relation are the ones unable to
truncate? Please see in pg_class where relfilenode = '41274' in this
case:

> bad SQL grammar []; nested exception is org.postgresql.util.PSQLException:
> ERROR: could not truncate relation 1663/16403/41274 to 30 blocks: Permission
> denied

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Hitesh Bhambhani <hitesh(dot)bhambhani(at)asg(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: "pgsql-bugs(at)postgresql(dot)org" <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #5599: Vacuum fails due to index corruption issues
Date: 2010-08-05 11:55:44
Message-ID: 8B7319E089E48047A0B688CE9CF54FD855C3C90D@USRES8MAIL.asg.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Greg, thanks for your answers.

My comments below...

> From: gsstark(at)gmail(dot)com [mailto:gsstark(at)gmail(dot)com] On Behalf Of Greg Stark
> Sent: Wednesday, August 04, 2010 9:35 PM
> Firstly, the current release of 8.2 is 8.2.17. There are a long list of bugs
> fixed in those intervening releases including one involving vacuum truncating
> relations. I don't think it's the same problem but I would recommend updating
> immediately to 8.2.17.
>
[HiteshB] I have noted your recommendation and will work with our Product Management to upgrade to the latest and greatest. Although we can't change the version that the customer has installed (8.2.9-1).

>
>
> Secondly we don't recommend running VACUUM FULL routinely. It should only be
> necessary in extraordinary circumstances. Normally a plain VACUUM (or VACUUM
> ANALYZE or VACUUM VERBOSE) should be sufficient as long as it's being run
> regularly. Regular VACUUM without the "FULL"
> has much less impact on the system.
>
[HiteshB] Point noted. The decision to use VACUUM FULL is something we'll re-examine.
Could you give an example of what an 'extraordinary circumstance' would be?

>
> "Permission denied" smells like a Windows problem with concurrent file
> operations. Are you sure you're not running any anti-virus software or backup
> software which could have these files open and prevent Postgres from
> performing regular file operations on its files? Many people have reported
> other problems with anti-virus software in particular.
>
[HiteshB] We have worked with the customer to exclude the postgre directory from their antivirus scans. Hopefully we won't see this Permission issue again.

Thanks again.
Regards,
Hitesh


From: Hitesh Bhambhani <hitesh(dot)bhambhani(at)asg(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: pgsql-bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #5599: Vacuum fails due to index corruption issues
Date: 2010-08-05 12:04:54
Message-ID: 8B7319E089E48047A0B688CE9CF54FD855C3C916@USRES8MAIL.asg.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Alvaro, thanks for your response.
My comments below...

> From: Alvaro Herrera [mailto:alvherre(at)commandprompt(dot)com]
> Sent: Wednesday, August 04, 2010 11:30 PM
> There probably is. What kind of relation are the ones unable to truncate?
> Please see in pg_class where relfilenode = '41274' in this
> case:
>
[HiteshB] the relation is called pmoinstance_idx_pmotypeid. This is an index on a table created by our product.
The definition for this index is:
CREATE INDEX pmoinstance_idx_pmotypeid ON pmoinstances USING btree (pmotype_id);

Regards,
Hitesh


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Hitesh Bhambhani <hitesh(dot)bhambhani(at)asg(dot)com>
Cc: "pgsql-bugs(at)postgresql(dot)org" <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #5599: Vacuum fails due to index corruption issues
Date: 2010-08-05 12:24:42
Message-ID: AANLkTimHn-bkaiKkcBVVsY2EqU_EnzQkyBe93EZEFFdn@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On Thu, Aug 5, 2010 at 12:55 PM, Hitesh Bhambhani
<hitesh(dot)bhambhani(at)asg(dot)com> wrote:
> [HiteshB] I have noted your recommendation and will work with our Product Management to upgrade to the latest and greatest. Although we can't change the version that the customer has installed (8.2.9-1).
>

The latest and greatest is 8.4.x which has lots of new features
including improvements to VACUUM. 9.0 will be out soon and will
replace VACUUM FULL entirely too.

However 8.2.17 is the same as the version you're using except with
dozens of known bugs fixed and security holes patched.

--
greg


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Hitesh Bhambhani" <hitesh(dot)bhambhani(at)asg(dot)com>, "Greg Stark" <gsstark(at)mit(dot)edu>
Cc: "pgsql-bugs(at)postgresql(dot)org" <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #5599: Vacuum fails due to index corruption issues
Date: 2010-08-05 13:56:32
Message-ID: 4C5A7CC00200002500034232@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Hitesh Bhambhani <hitesh(dot)bhambhani(at)asg(dot)com> wrote:

> Could you give an example of what an 'extraordinary circumstance'
> would be?

Normal vacuums will remove old tuples (versions of rows) which can
no longer be seen by any transaction, and make that space available
for re-use within the PostgreSQL files. It will not normally give
space back to the OS, but that's usually a *good* thing, because
normally the space will soon be needed again by PostgreSQL, and it
would be less efficient to constantly be giving space back and
allocating it again.

If you neglect to vacuum aggressively enough, or do a mass UPDATE or
DELETE which affects a large percentage of your rows, without
anticipating that you will need that space again soon, you might
want to do aggressive maintenance to shrink the PostgreSQL files.
VACUUM FULL will move tuples around within the table to free up
space at the end so that it can be released. But wait -- you
probably *still* don't want to use VACUUM FULL, because it is *very*
slow and will bloat your indexes, requiring a REINDEX to restore
decent performance. CLUSTER will rewrite the table without dead
space and will rebuild the indexes -- usually much faster than
VACUUM FULL. But CLUSTER needs room for a second copy of the table
in order to copy it. If you have a very bloated table which you
want to shrink and you don't have room for a second copy of it,
*that* is the time to consider VACUUM FULL (usually followed by
REINDEX).

If you ever find you *do* need to run VACUUM FULL, you probably need
to re-evaluate your maintenance procedures to see how you can avoid
having to do it again.

-Kevin


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Hitesh Bhambhani <hitesh(dot)bhambhani(at)asg(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #5599: Vacuum fails due to index corruption issues
Date: 2010-08-05 15:06:57
Message-ID: 1880.1281020817@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Hitesh Bhambhani <hitesh(dot)bhambhani(at)asg(dot)com> writes:
>> From: Alvaro Herrera [mailto:alvherre(at)commandprompt(dot)com]
>> Sent: Wednesday, August 04, 2010 11:30 PM
>> There probably is. What kind of relation are the ones unable to truncate?
>> Please see in pg_class where relfilenode = '41274' in this
>> case:
>>
> [HiteshB] the relation is called pmoinstance_idx_pmotypeid.

So the relation that couldn't be truncated is indeed the same one
complained of in the later message.

What it looks like to me is that page 30 was deleted, and then vacuum
tried to truncate it off the index altogether, and that failed because
of Windows randomness, and then later it's trying to delete page 30
again. Which naturally fails because the parent downlink entry is long
gone. But it's odd that it tries to delete page 30 twice. The page
should have been in BTP_DELETED state before the truncate attempt, and
therefore should still be dead later, so why's it trying again?

[ thinks for a bit ... ] I do see a way that could happen. The page
could still be live on disk when we reach smgrtruncate(): the update
to BTP_DELETED state might only exist in a dirty shared buffer. And
lookee here what smgrtruncate does:

/*
* Get rid of any buffers for the about-to-be-deleted blocks. bufmgr will
* just drop them without bothering to write the contents.
*/
DropRelFileNodeBuffers(reln->smgr_rnode, forknum, isTemp, nblocks);

So we throw away the BTP_DELETED state update without ever writing it to
disk, and then when the truncate fails, the old page state is still out
there, ready to confuse us later.

Seems like we need to think harder about recovering from a truncate
failure. A few random ideas:

1. Write the dirty buffers before dropping them. Kind of ugly from a
performance viewpoint, but simple and safe.

2. Treat truncation failure as a PANIC condition; then WAL replay will
fix things. Unpleasant. Actually, worse than unpleasant: I think what
the comment in RelationTruncate() is pointing out is that if the
failure is repeatable, we'd fail to recover at all.

3. Don't discard the dirty buffers until after successfully truncating.
The hard part here is to be sure no other process (like bgwriter) will
try to write them in between. I don't see a way to guarantee that,
at least not without interlock infrastructure that doesn't exist today.

And see also that comment in RelationTruncate(). Seems like the whole
problem of coping with truncation failure needs more thought than we've
given it.

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Hitesh Bhambhani <hitesh(dot)bhambhani(at)asg(dot)com>, pgsql-bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #5599: Vacuum fails due to index corruption issues
Date: 2010-08-05 16:28:20
Message-ID: 1281025234-sup-1704@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Excerpts from Tom Lane's message of jue ago 05 11:06:57 -0400 2010:

> Seems like we need to think harder about recovering from a truncate
> failure. A few random ideas:

Ugh.

> 1. Write the dirty buffers before dropping them. Kind of ugly from a
> performance viewpoint, but simple and safe.

I think "simple" is good, considering that this code is gone in 9.0 and
HEAD. IMHO investing too much effort on this problem is not worth it.

Perhaps it'd be good to come up with a better solution for HEAD:

> 3. Don't discard the dirty buffers until after successfully truncating.
> The hard part here is to be sure no other process (like bgwriter) will
> try to write them in between. I don't see a way to guarantee that,
> at least not without interlock infrastructure that doesn't exist today.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Hitesh Bhambhani <hitesh(dot)bhambhani(at)asg(dot)com>, pgsql-bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #5599: Vacuum fails due to index corruption issues
Date: 2010-08-05 16:36:24
Message-ID: 16374.1281026184@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Excerpts from Tom Lane's message of jue ago 05 11:06:57 -0400 2010:
>> 1. Write the dirty buffers before dropping them. Kind of ugly from a
>> performance viewpoint, but simple and safe.

> I think "simple" is good, considering that this code is gone in 9.0 and
> HEAD. IMHO investing too much effort on this problem is not worth it.

Gone? Looks like it's still there to me.

> Perhaps it'd be good to come up with a better solution for HEAD:

Yeah, the panic-on-replay aspect is troublesome. I feel like we need a
rethink here. But I agree that solution #1 is the only one that feels
safe enough for backpatching.

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Hitesh Bhambhani <hitesh(dot)bhambhani(at)asg(dot)com>, pgsql-bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #5599: Vacuum fails due to index corruption issues
Date: 2010-08-05 16:57:46
Message-ID: 1281027356-sup-6865@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Excerpts from Tom Lane's message of jue ago 05 12:36:24 -0400 2010:
> Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> > Excerpts from Tom Lane's message of jue ago 05 11:06:57 -0400 2010:
> >> 1. Write the dirty buffers before dropping them. Kind of ugly from a
> >> performance viewpoint, but simple and safe.
>
> > I think "simple" is good, considering that this code is gone in 9.0 and
> > HEAD. IMHO investing too much effort on this problem is not worth it.
>
> Gone? Looks like it's still there to me.

I mean the btree code that does the truncation on vacuum full is
truncated. There are other uses for truncation, but it doesn't look to
that they are as problematic ... or are they?

Hmm, I guess truncation of heap on lazy vacuum is still a problem
precisely because page compaction will be forgotten.

> > Perhaps it'd be good to come up with a better solution for HEAD:
>
> Yeah, the panic-on-replay aspect is troublesome. I feel like we need a
> rethink here. But I agree that solution #1 is the only one that feels
> safe enough for backpatching.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Hitesh Bhambhani <hitesh(dot)bhambhani(at)asg(dot)com>, pgsql-bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #5599: Vacuum fails due to index corruption issues
Date: 2010-08-05 17:19:41
Message-ID: 23590.1281028781@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Excerpts from Tom Lane's message of jue ago 05 12:36:24 -0400 2010:
>> Gone? Looks like it's still there to me.

> I mean the btree code that does the truncation on vacuum full is
> truncated. There are other uses for truncation, but it doesn't look to
> that they are as problematic ... or are they?

I think what Hitesh reported is a special case of a generic problem.

> Hmm, I guess truncation of heap on lazy vacuum is still a problem
> precisely because page compaction will be forgotten.

Page compaction is the least of it :-(

Imagine that we have some rows at the end of a table, we delete them,
we vacuum before the next checkpoint. Vacuum decides it can now
truncate away the last pages, but fails to do so. The original page
state is still on disk, which means we have lost the fact of the
deletion --- the rows are now effectively live again, though their
index entries are probably gone.

In any case, the removal of VACUUM FULL didn't completely disable
shrinking of btree indexes did it? I don't recall having removed
that.

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Hitesh Bhambhani <hitesh(dot)bhambhani(at)asg(dot)com>, pgsql-bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #5599: Vacuum fails due to index corruption issues
Date: 2010-08-05 17:41:01
Message-ID: 1281029938-sup-3072@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Excerpts from Tom Lane's message of jue ago 05 13:19:41 -0400 2010:

> In any case, the removal of VACUUM FULL didn't completely disable
> shrinking of btree indexes did it? I don't recall having removed
> that.

I see no call to RelationTruncate in the btvacuumscan code, but then it
was only called in vacuum full before. I'm not sure how it worked
previously with only lazy vacuum. Did we simply leave the pages as free
for a later extension?

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Hitesh Bhambhani <hitesh(dot)bhambhani(at)asg(dot)com>, pgsql-bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #5599: Vacuum fails due to index corruption issues
Date: 2010-08-05 18:01:15
Message-ID: 25994.1281031275@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Excerpts from Tom Lane's message of jue ago 05 13:19:41 -0400 2010:
>> In any case, the removal of VACUUM FULL didn't completely disable
>> shrinking of btree indexes did it? I don't recall having removed
>> that.

> I see no call to RelationTruncate in the btvacuumscan code, but then it
> was only called in vacuum full before. I'm not sure how it worked
> previously with only lazy vacuum. Did we simply leave the pages as free
> for a later extension?

You're right, I misremembered. That code is just plain gone in 9.0:
http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/access/nbtree/nbtree.c.diff?r1=1.174;r2=1.175;f=h

Still, we have a live issue with heap truncation during plain VACUUM.
However, the scope of the problem seems a lot less than I was thinking.
Maybe write-the-buffers-first is a sufficient longterm solution.

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Hitesh Bhambhani <hitesh(dot)bhambhani(at)asg(dot)com>, pgsql-bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #5599: Vacuum fails due to index corruption issues
Date: 2010-08-05 18:28:35
Message-ID: 1281032780-sup-1394@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Excerpts from Tom Lane's message of jue ago 05 14:01:15 -0400 2010:

> You're right, I misremembered. That code is just plain gone in 9.0:
> http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/access/nbtree/nbtree.c.diff?r1=1.174;r2=1.175;f=h
>
> Still, we have a live issue with heap truncation during plain VACUUM.
> However, the scope of the problem seems a lot less than I was thinking.

The scope is further reduced by the fact that this only seems to happen
on Windows, and then only when the antivirus is messing around with the
files.

> Maybe write-the-buffers-first is a sufficient longterm solution.

Yeah, perhaps it is, though it's a pity that a single platform problem
is going to slow down everyone else.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Hitesh Bhambhani <hitesh(dot)bhambhani(at)asg(dot)com>, pgsql-bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #5599: Vacuum fails due to index corruption issues
Date: 2010-08-05 20:08:36
Message-ID: 4C5B1A44.1080504@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On 05/08/10 21:28, Alvaro Herrera wrote:
> Excerpts from Tom Lane's message of jue ago 05 14:01:15 -0400 2010:
>> Maybe write-the-buffers-first is a sufficient longterm solution.
>
> Yeah, perhaps it is, though it's a pity that a single platform problem
> is going to slow down everyone else.

How about performing the truncate as usual, but if it fails, fill the
truncated portion with zeros instead? Zeroed pages should be handled
gracefully, and this would be a very non-invasive fix. Now if the
write() to zero the pages fails too, I guess we'll have to panic, but
that's not much different from flushing the dirty pages first.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Hitesh Bhambhani <hitesh(dot)bhambhani(at)asg(dot)com>, pgsql-bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #5599: Vacuum fails due to index corruption issues
Date: 2010-08-05 20:29:26
Message-ID: 29285.1281040166@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> On 05/08/10 21:28, Alvaro Herrera wrote:
>> Excerpts from Tom Lane's message of jue ago 05 14:01:15 -0400 2010:
>>> Maybe write-the-buffers-first is a sufficient longterm solution.
>>
>> Yeah, perhaps it is, though it's a pity that a single platform problem
>> is going to slow down everyone else.

> How about performing the truncate as usual, but if it fails, fill the
> truncated portion with zeros instead?

This is blithely assuming that you're going to be able to do anything at
all to the file after having failed to truncate it. The permissions
problem might be persistent, or you might crash, or several other
things. If we could do this safely, we could also do the original #3
solution (write the dirty buffers for the to-be-truncated block range
only after failing to truncate).

regards, tom lane


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Hitesh Bhambhani <hitesh(dot)bhambhani(at)asg(dot)com>, pgsql-bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #5599: Vacuum fails due to index corruption issues
Date: 2010-08-06 02:44:30
Message-ID: AANLkTikwnU3g=G9gwZ7r_Vkg4R3hrdL9Eie0+18ZK-Wk@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On Thu, Aug 5, 2010 at 7:28 PM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:
>
> The scope is further reduced by the fact that this only seems to happen
> on Windows, and then only when the antivirus is messing around with the
> files.

So I suspect this could be triggered lots of ways. Imagine a ZFS
volume that runs out of space temporarily. Even truncate would need
additional blocks to replace the meta information. Or a network
filesystem like AFS where your kerberos tickets have expired and you
get a permission failure until they've been renewed. Or, in the case
of a very large table being truncated I suspect there's a
CHECK_FOR_INTERRUPTS lying around that can cancel the backend at the
wrong time.

It would be nice to have a regression test harness that caused system
calls to fail randomly -- the difficult part would be testing the
results.

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: DropRelFileNodeBuffers API change (was Re: [BUGS] BUG #5599: Vacuum fails due to index corruption issues)
Date: 2010-08-15 18:58:10
Message-ID: 3092.1281898690@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

In the discussion of bug #5599 we pretty much agreed to do this:
> Seems like we need to think harder about recovering from a truncate
> failure. A few random ideas:
> 1. Write the dirty buffers before dropping them. Kind of ugly from a
> performance viewpoint, but simple and safe.

I looked at making this happen, and noted that DropRelFileNodeBuffers
is used both for the truncation case and for dropping relation buffers
during smgrdounlink. In the latter case, it's still appropriate to
drop dirty buffers without writing them, both for performance reasons
and because we don't really care about any errors: we have already
committed the relation DROP, and are not going to look at the file
contents again in any case. So this means that two different behaviors
are now required for truncation and dropping.

The cleanest fix is an API change to add a boolean write-or-not
parameter to DropRelFileNodeBuffers. That's what I want to do in HEAD
and 9.0, but I'm unsure whether it's a safe fix in the back branches.
Does anyone have an opinion whether it's likely that any third-party
code is calling DropRelFileNodeBuffers directly? If there is, then
changing its API in a minor release would be an unfriendly thing to do.
We could avoid that by some ugly expedient like inserting a second copy
of the function in back branches.

Comments?

regards, tom lane


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: DropRelFileNodeBuffers API change (was Re: [BUGS] BUG #5599: Vacuum fails due to index corruption issues)
Date: 2010-08-15 20:48:34
Message-ID: 4C6852A2.70503@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On 15/08/10 21:58, Tom Lane wrote:
> Does anyone have an opinion whether it's likely that any third-party
> code is calling DropRelFileNodeBuffers directly?

I doubt it. External modules shouldn't be modifying relations at such a
low level.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: DropRelFileNodeBuffers API change (was Re: [BUGS] BUG #5599: Vacuum fails due to index corruption issues)
Date: 2010-08-15 21:17:50
Message-ID: AANLkTimcna8WUNfEGpAzpp5PEJ5pX+zP072NJXWTaN-D@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On Sun, Aug 15, 2010 at 2:58 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> In the discussion of bug #5599 we pretty much agreed to do this:
>> Seems like we need to think harder about recovering from a truncate
>> failure.  A few random ideas:
>> 1. Write the dirty buffers before dropping them.  Kind of ugly from a
>> performance viewpoint, but simple and safe.
>
> I looked at making this happen, and noted that DropRelFileNodeBuffers
> is used both for the truncation case and for dropping relation buffers
> during smgrdounlink.  In the latter case, it's still appropriate to
> drop dirty buffers without writing them, both for performance reasons
> and because we don't really care about any errors: we have already
> committed the relation DROP, and are not going to look at the file
> contents again in any case.  So this means that two different behaviors
> are now required for truncation and dropping.
>
> The cleanest fix is an API change to add a boolean write-or-not
> parameter to DropRelFileNodeBuffers.  That's what I want to do in HEAD
> and 9.0, but I'm unsure whether it's a safe fix in the back branches.
> Does anyone have an opinion whether it's likely that any third-party
> code is calling DropRelFileNodeBuffers directly?  If there is, then
> changing its API in a minor release would be an unfriendly thing to do.
> We could avoid that by some ugly expedient like inserting a second copy
> of the function in back branches.
>
> Comments?

I really hate this solution, because writing out data that we're about
to throw away just in case we can't actually throw it away seems like
a real waste from a performance standpoint. Could we avoid this
altogether by allocating a new relfilenode on truncate?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: DropRelFileNodeBuffers API change (was Re: [BUGS] BUG #5599: Vacuum fails due to index corruption issues)
Date: 2010-08-15 21:17:54
Message-ID: AANLkTinQbJWb9=seuPpBf_oBAQDq_OiZhYUC6Rn8KdAd@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On Sun, Aug 15, 2010 at 9:48 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> On 15/08/10 21:58, Tom Lane wrote:
>>
>> Does anyone have an opinion whether it's likely that any third-party
>> code is calling DropRelFileNodeBuffers directly?
>
> I doubt it. External modules shouldn't be modifying relations at such a low
> level.

Really? What about an index access method?

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: DropRelFileNodeBuffers API change (was Re: [BUGS] BUG #5599: Vacuum fails due to index corruption issues)
Date: 2010-08-15 21:39:08
Message-ID: 5884.1281908348@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> I really hate this solution, because writing out data that we're about
> to throw away just in case we can't actually throw it away seems like
> a real waste from a performance standpoint.

We went around on this already in the pgsql-bugs thread. I don't much
like it either, but there isn't any better solution that seems safe
enough to back-patch.

> Could we avoid this
> altogether by allocating a new relfilenode on truncate?

Then we'd have to copy all the data we *didn't* truncate, which is
hardly likely to be a win.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: DropRelFileNodeBuffers API change (was Re: [BUGS] BUG #5599: Vacuum fails due to index corruption issues)
Date: 2010-08-15 21:42:43
Message-ID: 5947.1281908563@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Greg Stark <gsstark(at)mit(dot)edu> writes:
> On Sun, Aug 15, 2010 at 9:48 PM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> On 15/08/10 21:58, Tom Lane wrote:
>>> Does anyone have an opinion whether it's likely that any third-party
>>> code is calling DropRelFileNodeBuffers directly?
>>
>> I doubt it. External modules shouldn't be modifying relations at such a low
>> level.

> Really? What about an index access method?

An index AM might have a reason to call smgrtruncate, but I can't see an
argument why it should need to call DropRelFileNodeBuffers directly.
The built-in AMs certainly never did.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: DropRelFileNodeBuffers API change (was Re: [BUGS] BUG #5599: Vacuum fails due to index corruption issues)
Date: 2010-08-15 21:59:06
Message-ID: 6183.1281909546@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

I wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> Could we avoid this
>> altogether by allocating a new relfilenode on truncate?

> Then we'd have to copy all the data we *didn't* truncate, which is
> hardly likely to be a win.

There is one thing we could do pretty easily. AFAICS there are
presently just two users of smgrtruncate: vacuumlazy.c and
heap_truncate. The latter is used primarily for ON COMMIT DELETE ROWS.
AFAICS we could still use drop-dirty-buffers semantics for that.
In event of a truncate failure, we'd have garbage data on disk,
but it doesn't matter because that data would represent rows inserted
by the transaction that was aborted by the truncation failure, so
later transactions would ignore it anyway. Now the indexes on such
a temp table are a bigger problem, but we have pretty fatal
heap-to-index consistency issues anyhow if one of the interrelated
truncates fails. We could probably minimize the danger if we did
things in the right order: truncate indexes to zero length, truncate
table to zero length, rebuild indexes as empty.

However, I'm not sure that we want to do that in the back branches,
because it would require adding write-or-drop booleans to some
higher-level functions like smgrtruncate, thus greatly increasing
the risk of breaking 3rd-party code.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: DropRelFileNodeBuffers API change (was Re: [BUGS] BUG #5599: Vacuum fails due to index corruption issues)
Date: 2010-08-16 01:11:10
Message-ID: AANLkTinOSX8-EUr4P7EkY_oqeWeCaC2NRAj+f2J8=KWH@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On Sun, Aug 15, 2010 at 5:39 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Could we avoid this
>> altogether by allocating a new relfilenode on truncate?
>
> Then we'd have to copy all the data we *didn't* truncate, which is
> hardly likely to be a win.

Oh, sorry. I was thinking we were talking about complete truncation
rather than partial truncation. I'm still pretty unhappy with the
proposed fix, though, because it gives up performance in a broad range
of cases to cater to an extremely narrow failure case. Considering
the rarity of the proposed problem, are we sure that it isn't better
to adopt a solution like what Heikki proposed? If truncation fails,
try to zero the pages; if that also fails, PANIC. I'm really
reluctant to back-patch a performance regression. Perhaps, as Greg
Stark says, there are a variety of ways that this can happen - but
they're all pretty rare, and seem to require a fairly substantial
amount of broken-ness. If we're in a situation where we can't
reliably update our disk files, it seems optimistic to assume that
keeping on running is going to be a whole lot better than PANICing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: DropRelFileNodeBuffers API change (was Re: [BUGS] BUG #5599: Vacuum fails due to index corruption issues)
Date: 2010-08-16 01:25:54
Message-ID: 10048.1281921954@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> Oh, sorry. I was thinking we were talking about complete truncation
> rather than partial truncation. I'm still pretty unhappy with the
> proposed fix, though, because it gives up performance in a broad range
> of cases to cater to an extremely narrow failure case.

It doesn't matter: broken is broken, and failure to recover from a
truncate() error is broken. You mustn't think that this is a
Windows-only problem.

> Considering
> the rarity of the proposed problem, are we sure that it isn't better
> to adopt a solution like what Heikki proposed? If truncation fails,
> try to zero the pages; if that also fails, PANIC.

... and PANIC is absolutely, entirely, 100% unacceptable here. I don't
think you understand the context. We've already written the truncate
action to WAL (as we must: WAL before data change). If we PANIC, that
means we'll PANIC again during WAL replay. So that solution means a
down, and perhaps unrecoverably broken, database. There's also the
little problem that "zero the page" is only a safe recovery action for
heap pages not index pages; smgrtruncate is certainly not entitled to
assume its working on a heap.

I think the performance argument is based on fear not facts anyway.
In practice, in modern installations, you're only going to be talking
about marginally more work in autovacuum. ISTM we should fix the bug,
in a simple/reliable/backpatchable way, and then anyone who's worried
about performance can measure the size of the problem, and try to
think of a workable solution if he doesn't like the results.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: DropRelFileNodeBuffers API change (was Re: [BUGS] BUG #5599: Vacuum fails due to index corruption issues)
Date: 2010-08-16 01:58:13
Message-ID: AANLkTik3=YZTCcuZ_feJ0vrTdN1rN3gf4dR0Z6950wi0@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On Sun, Aug 15, 2010 at 9:25 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> Oh, sorry.  I was thinking we were talking about complete truncation
>> rather than partial truncation.  I'm still pretty unhappy with the
>> proposed fix, though, because it gives up performance in a broad range
>> of cases to cater to an extremely narrow failure case.
>
> It doesn't matter: broken is broken, and failure to recover from a
> truncate() error is broken.  You mustn't think that this is a
> Windows-only problem.
>
>> Considering
>> the rarity of the proposed problem, are we sure that it isn't better
>> to adopt a solution like what Heikki proposed?  If truncation fails,
>> try to zero the pages; if that also fails, PANIC.
>
> ... and PANIC is absolutely, entirely, 100% unacceptable here.  I don't
> think you understand the context.  We've already written the truncate
> action to WAL (as we must: WAL before data change).  If we PANIC, that
> means we'll PANIC again during WAL replay.  So that solution means a
> down, and perhaps unrecoverably broken, database.

All right, that would be bad.

> There's also the
> little problem that "zero the page" is only a safe recovery action for
> heap pages not index pages; smgrtruncate is certainly not entitled to
> assume its working on a heap.

This doesn't seem right to me; we don't WAL log relation extension,
even on an index.

> I think the performance argument is based on fear not facts anyway.
> In practice, in modern installations, you're only going to be talking
> about marginally more work in autovacuum.  ISTM we should fix the bug,
> in a simple/reliable/backpatchable way, and then anyone who's worried
> about performance can measure the size of the problem, and try to
> think of a workable solution if he doesn't like the results.

So, what's the worst case for this bug? "DELETE FROM table; VACUUM table;"?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: DropRelFileNodeBuffers API change (was Re: [BUGS] BUG #5599: Vacuum fails due to index corruption issues)
Date: 2010-08-16 02:54:31
Message-ID: 11503.1281927271@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Sun, Aug 15, 2010 at 9:25 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> ... and PANIC is absolutely, entirely, 100% unacceptable here. I don't
>> think you understand the context. We've already written the truncate
>> action to WAL (as we must: WAL before data change). If we PANIC, that
>> means we'll PANIC again during WAL replay. So that solution means a
>> down, and perhaps unrecoverably broken, database.

> All right, that would be bad.

Actually ... after some study of the code, I find that I'm holding this
proposal to a higher standard than the current code maintains.
According to our normal rules for applying WAL-loggable data changes,
there should be a critical section wrapping the application of the
data changes with making the WAL entry. RelationTruncate fails to
do any such thing: it just pushes out the WAL entry and then calls
smgrtruncate, and so any error inside the latter just results in
elog(ERROR). Whereupon the system state is entirely different from what
WAL says it should be. So my previous gut feeling that we need to
rethink this whole thing seems justified.

I traced through everything that leads to an ftruncate() call in the
backend as of HEAD, and found that we have these cases to worry about:

mdunlink calls ftruncate directly, and does nothing worse than
elog(WARNING) on failure. This is fine because all it wants to do
is save some disk space until the file gets deleted "for real" at
the next checkpoint. Failure only means we waste disk space
temporarily, and is surely not cause for panic.

All other calls proceed (sometimes indirectly) from RelationTruncate
or replay of the WAL record it emits. We have not thought hard
about the order of the various truncations this does and whether or
not we have a self-consistent state if it fails partway through.
If we don't want to make the whole thing a critical section, we need
to figure that out.

RelationTruncate (and its subsidiary RelationTruncateIndexes) is called
from heap_truncate_one_rel (which itself does things in an unsafe order),
and from lazy_truncate_heap in VACUUM.

heap_truncate_one_rel has (indirectly) two call sources:

from ExecuteTruncate for a locally created rel, where we don't care,
and would definitely rather NOT have a panic on error: just rolling back
the transaction is fine thank you very much.

from PreCommit_on_commit_actions, to process ON COMMIT DELETE ROWS.
Here again, so far as heaps are concerned, rollback on error would be
plenty since any inserted rows would then just be dead. The tricky
part is the indexes for such a table. If we truncate them before
truncating the heap, then the worst possible case is an internally
inconsistent index on a temp table, which will be automatically cleaned
up during the next successful commit in its session. So it's pretty
hard to justify a PANIC response here either.

So it seems like the only case where there is really grounds for PANIC
on failure is the VACUUM case. And there we might apply Heikki's idea
of trying to zero the untruncatable pages first.

I'm thinking that we need some sort of what-to-do-on-error flag passed
into RelationTruncate, plus at least order-of-operations fixes in
several other places, if not a wholesale refactoring of this whole call
stack. But I'm running out of steam and don't have a concrete proposal
to make right now. In any case, we've got more problems here than just
the original one of forgetting dirty buffers too soon.

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: DropRelFileNodeBuffers API change (was Re: [BUGS] BUG #5599: Vacuum fails due to index corruption issues)
Date: 2010-11-29 14:20:15
Message-ID: 1291040334-sup-3561@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Excerpts from Tom Lane's message of dom ago 15 22:54:31 -0400 2010:

> I'm thinking that we need some sort of what-to-do-on-error flag passed
> into RelationTruncate, plus at least order-of-operations fixes in
> several other places, if not a wholesale refactoring of this whole call
> stack. But I'm running out of steam and don't have a concrete proposal
> to make right now. In any case, we've got more problems here than just
> the original one of forgetting dirty buffers too soon.

I think this fell through the cracks. What are we going to do here?

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: DropRelFileNodeBuffers API change (was Re: [BUGS] BUG #5599: Vacuum fails due to index corruption issues)
Date: 2011-02-18 21:43:19
Message-ID: 201102182143.p1ILhJo27167@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers


Did we make any progress on this? Is it a TODO?

---------------------------------------------------------------------------

Tom Lane wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> > On Sun, Aug 15, 2010 at 9:25 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> ... and PANIC is absolutely, entirely, 100% unacceptable here. I don't
> >> think you understand the context. We've already written the truncate
> >> action to WAL (as we must: WAL before data change). If we PANIC, that
> >> means we'll PANIC again during WAL replay. So that solution means a
> >> down, and perhaps unrecoverably broken, database.
>
> > All right, that would be bad.
>
> Actually ... after some study of the code, I find that I'm holding this
> proposal to a higher standard than the current code maintains.
> According to our normal rules for applying WAL-loggable data changes,
> there should be a critical section wrapping the application of the
> data changes with making the WAL entry. RelationTruncate fails to
> do any such thing: it just pushes out the WAL entry and then calls
> smgrtruncate, and so any error inside the latter just results in
> elog(ERROR). Whereupon the system state is entirely different from what
> WAL says it should be. So my previous gut feeling that we need to
> rethink this whole thing seems justified.
>
> I traced through everything that leads to an ftruncate() call in the
> backend as of HEAD, and found that we have these cases to worry about:
>
> mdunlink calls ftruncate directly, and does nothing worse than
> elog(WARNING) on failure. This is fine because all it wants to do
> is save some disk space until the file gets deleted "for real" at
> the next checkpoint. Failure only means we waste disk space
> temporarily, and is surely not cause for panic.
>
> All other calls proceed (sometimes indirectly) from RelationTruncate
> or replay of the WAL record it emits. We have not thought hard
> about the order of the various truncations this does and whether or
> not we have a self-consistent state if it fails partway through.
> If we don't want to make the whole thing a critical section, we need
> to figure that out.
>
> RelationTruncate (and its subsidiary RelationTruncateIndexes) is called
> from heap_truncate_one_rel (which itself does things in an unsafe order),
> and from lazy_truncate_heap in VACUUM.
>
> heap_truncate_one_rel has (indirectly) two call sources:
>
> from ExecuteTruncate for a locally created rel, where we don't care,
> and would definitely rather NOT have a panic on error: just rolling back
> the transaction is fine thank you very much.
>
> from PreCommit_on_commit_actions, to process ON COMMIT DELETE ROWS.
> Here again, so far as heaps are concerned, rollback on error would be
> plenty since any inserted rows would then just be dead. The tricky
> part is the indexes for such a table. If we truncate them before
> truncating the heap, then the worst possible case is an internally
> inconsistent index on a temp table, which will be automatically cleaned
> up during the next successful commit in its session. So it's pretty
> hard to justify a PANIC response here either.
>
> So it seems like the only case where there is really grounds for PANIC
> on failure is the VACUUM case. And there we might apply Heikki's idea
> of trying to zero the untruncatable pages first.
>
> I'm thinking that we need some sort of what-to-do-on-error flag passed
> into RelationTruncate, plus at least order-of-operations fixes in
> several other places, if not a wholesale refactoring of this whole call
> stack. But I'm running out of steam and don't have a concrete proposal
> to make right now. In any case, we've got more problems here than just
> the original one of forgetting dirty buffers too soon.
>
> regards, tom lane
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: DropRelFileNodeBuffers API change (was Re: [BUGS] BUG #5599: Vacuum fails due to index corruption issues)
Date: 2011-02-18 21:46:01
Message-ID: 17230.1298065561@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Bruce Momjian <bruce(at)momjian(dot)us> writes:
> Did we make any progress on this? Is it a TODO?

AFAIR nothing's been done about it, so it's a TODO.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: DropRelFileNodeBuffers API change (was Re: [BUGS] BUG #5599: Vacuum fails due to index corruption issues)
Date: 2011-02-18 21:47:13
Message-ID: AANLkTinehCGW7O10u2JpRJmxReUjKZMRXBTvhcJy5EGj@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On Fri, Feb 18, 2011 at 4:46 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Bruce Momjian <bruce(at)momjian(dot)us> writes:
>> Did we make any progress on this?  Is it a TODO?
>
> AFAIR nothing's been done about it, so it's a TODO.

I was thinking of adding it to the 9.1 open items list, but the wiki's
been down every time I've tried to go there.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: DropRelFileNodeBuffers API change (was Re: [BUGS] BUG #5599: Vacuum fails due to index corruption issues)
Date: 2011-02-18 21:50:22
Message-ID: 201102182150.p1ILoMw28056@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Robert Haas wrote:
> On Fri, Feb 18, 2011 at 4:46 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > Bruce Momjian <bruce(at)momjian(dot)us> writes:
> >> Did we make any progress on this? ?Is it a TODO?
> >
> > AFAIR nothing's been done about it, so it's a TODO.
>
> I was thinking of adding it to the 9.1 open items list, but the wiki's
> been down every time I've tried to go there.

It is up now. :-)

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: DropRelFileNodeBuffers API change (was Re: [BUGS] BUG #5599: Vacuum fails due to index corruption issues)
Date: 2011-02-18 21:59:38
Message-ID: 25991.1298066378@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Fri, Feb 18, 2011 at 4:46 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> AFAIR nothing's been done about it, so it's a TODO.

> I was thinking of adding it to the 9.1 open items list, but the wiki's
> been down every time I've tried to go there.

Since the problem's been there since forever, I don't see that it's an
open item for 9.1. That list normally is for "must fix before ship"
items, not development projects.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: DropRelFileNodeBuffers API change (was Re: [BUGS] BUG #5599: Vacuum fails due to index corruption issues)
Date: 2011-02-18 22:05:14
Message-ID: AANLkTikk5a8p-YhwDfx_U+JNw2z92eXtk+napKCtBmdk@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On Fri, Feb 18, 2011 at 4:59 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Fri, Feb 18, 2011 at 4:46 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> AFAIR nothing's been done about it, so it's a TODO.
>
>> I was thinking of adding it to the 9.1 open items list, but the wiki's
>> been down every time I've tried to go there.
>
> Since the problem's been there since forever, I don't see that it's an
> open item for 9.1.  That list normally is for "must fix before ship"
> items, not development projects.

OK. If you don't feel it warrants being on that list, then the TODO
is OK with me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: DropRelFileNodeBuffers API change (was Re: [BUGS] BUG #5599: Vacuum fails due to index corruption issues)
Date: 2011-02-18 22:10:47
Message-ID: 201102182210.p1IMAlk02903@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Robert Haas wrote:
> On Fri, Feb 18, 2011 at 4:59 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> >> On Fri, Feb 18, 2011 at 4:46 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >>> AFAIR nothing's been done about it, so it's a TODO.
> >
> >> I was thinking of adding it to the 9.1 open items list, but the wiki's
> >> been down every time I've tried to go there.
> >
> > Since the problem's been there since forever, I don't see that it's an
> > open item for 9.1. ?That list normally is for "must fix before ship"
> > items, not development projects.
>
> OK. If you don't feel it warrants being on that list, then the TODO
> is OK with me.

Agreed. Do you want me to do it, or will you?

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: DropRelFileNodeBuffers API change (was Re: [BUGS] BUG #5599: Vacuum fails due to index corruption issues)
Date: 2011-02-18 22:48:21
Message-ID: AANLkTika0ZWQO0aYgdct_N7XarVCvj1eGZZRRH9uGfRw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On Fri, Feb 18, 2011 at 5:10 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> Robert Haas wrote:
>> On Fri, Feb 18, 2011 at 4:59 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> > Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> >> On Fri, Feb 18, 2011 at 4:46 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> >>> AFAIR nothing's been done about it, so it's a TODO.
>> >
>> >> I was thinking of adding it to the 9.1 open items list, but the wiki's
>> >> been down every time I've tried to go there.
>> >
>> > Since the problem's been there since forever, I don't see that it's an
>> > open item for 9.1. ?That list normally is for "must fix before ship"
>> > items, not development projects.
>>
>> OK.  If you don't feel it warrants being on that list, then the TODO
>> is OK with me.
>
> Agreed.  Do you want me to do it, or will you?

You do it. :-)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: DropRelFileNodeBuffers API change (was Re: [BUGS] BUG #5599: Vacuum fails due to index corruption issues)
Date: 2011-02-18 23:33:44
Message-ID: 201102182333.p1INXir22974@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Robert Haas wrote:
> On Fri, Feb 18, 2011 at 5:10 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> > Robert Haas wrote:
> >> On Fri, Feb 18, 2011 at 4:59 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> > Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> >> >> On Fri, Feb 18, 2011 at 4:46 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> >>> AFAIR nothing's been done about it, so it's a TODO.
> >> >
> >> >> I was thinking of adding it to the 9.1 open items list, but the wiki's
> >> >> been down every time I've tried to go there.
> >> >
> >> > Since the problem's been there since forever, I don't see that it's an
> >> > open item for 9.1. ?That list normally is for "must fix before ship"
> >> > items, not development projects.
> >>
> >> OK. ?If you don't feel it warrants being on that list, then the TODO
> >> is OK with me.
> >
> > Agreed. ?Do you want me to do it, or will you?
>
> You do it. :-)

Done:

Restructure truncation logic is more resistant to failure

This also involves not writing dirty buffers for a truncated or
dropped relation

* http://archives.postgresql.org/pgsql-hackers/2010-08/msg01032.php

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +