Quick Links

Re: Re: [COMMITTERS] pgsql: Properly set relpersistence for fake relcache entries.

Lists:	pgsql-committerspgsql-hackers

From:	Robert Haas <rhaas(at)postgresql(dot)org>
To:	pgsql-committers(at)postgresql(dot)org
Subject:	pgsql: Properly set relpersistence for fake relcache entries.
Date:	2012-09-14 13:42:58
Message-ID:	E1TCWAc-0007M9-Q6@gemulon.postgresql.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-committers pgsql-hackers

Properly set relpersistence for fake relcache entries.

This can result in buffers failing to be properly flushed at
checkpoint time, leading to data loss.

Report, diagnosis, and patch by Jeff Davis.

Branch
------
REL9_1_STABLE

Details
-------
http://git.postgresql.org/pg/commitdiff/fef2c17807e095a04441e4f1fe05f75d5578ead2

Modified Files
--------------
src/backend/access/transam/xlogutils.c | 5 +++++
src/backend/storage/buffer/bufmgr.c | 2 ++
2 files changed, 7 insertions(+), 0 deletions(-)

From:	"Devrim GUNDUZ" <devrim(at)gunduz(dot)org>
To:	"Robert Haas" <rhaas(at)postgresql(dot)org>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: [COMMITTERS] pgsql: Properly set relpersistence for fake relcache entries.
Date:	2012-09-14 14:20:55
Message-ID:	48c7fc1b94d9b9db72c31f249b3ed65e.squirrel@tombalak.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-committers pgsql-hackers

Hi,

Does this mean we need to wrap new tarballs soon, because of the data loss
bug?

Regards,
Devrim

14 Eylül 2012, Cuma, 4:42 pm tarihinde, Robert Haas yazmış:
> Properly set relpersistence for fake relcache entries.
>
> This can result in buffers failing to be properly flushed at
> checkpoint time, leading to data loss.
>
> Report, diagnosis, and patch by Jeff Davis.
>
> Branch
> ------
> REL9_1_STABLE
>
> Details
> -------
> http://git.postgresql.org/pg/commitdiff/fef2c17807e095a04441e4f1fe05f75d5578ead2
>
> Modified Files
> --------------
> src/backend/access/transam/xlogutils.c | 5 +++++
> src/backend/storage/buffer/bufmgr.c | 2 ++
> 2 files changed, 7 insertions(+), 0 deletions(-)
>
>
> --
> Sent via pgsql-committers mailing list (pgsql-committers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-committers
>

--
Temporarily using a webmail program.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	devrim(at)gunduz(dot)org
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Re: [COMMITTERS] pgsql: Properly set relpersistence for fake relcache entries.
Date:	2012-09-14 16:28:05
Message-ID:	15739.1347640085@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-committers pgsql-hackers

"Devrim GUNDUZ" <devrim(at)gunduz(dot)org> writes:
> Does this mean we need to wrap new tarballs soon, because of the data loss
> bug?

I'll defer to Robert on whether this bug is serious enough to merit a
near-term release on its own. But historically, we've wanted to push
out a .1 update two or three weeks after a .0 release, to mop up
whatever new bugs are nasty enough for people to trip over right away.
Maybe we should be thinking in terms of a set of releases around the
end of September.

regards, tom lane

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	devrim(at)gunduz(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Re: [COMMITTERS] pgsql: Properly set relpersistence for fake relcache entries.
Date:	2012-09-14 17:17:27
Message-ID:	CA+U5nM+vMK8Qa1-=F_ycVoQ2OG2pJFts8SsnkQE6C+0OB2io7w@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-committers pgsql-hackers

On 14 September 2012 17:28, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> "Devrim GUNDUZ" <devrim(at)gunduz(dot)org> writes:
>> Does this mean we need to wrap new tarballs soon, because of the data loss
>> bug?
>
> I'll defer to Robert on whether this bug is serious enough to merit a
> near-term release on its own. But historically, we've wanted to push
> out a .1 update two or three weeks after a .0 release, to mop up
> whatever new bugs are nasty enough for people to trip over right away.
> Maybe we should be thinking in terms of a set of releases around the
> end of September.

The bug itself is not major, but the extent and user impact is serious.

In my opinion we should issue a new release of 9.1 immediately.

9.2 might have different treatment but does suffer the same problem.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "devrim(at)gunduz(dot)org" <devrim(at)gunduz(dot)org>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Re: [COMMITTERS] pgsql: Properly set relpersistence for fake relcache entries.
Date:	2012-09-15 13:05:23
Message-ID:	D3B1605C-6B5D-4F19-BE3C-4398762E4412@gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-committers pgsql-hackers

On Sep 14, 2012, at 12:17 PM, Simon Riggs <simon(at)2ndQuadrant(dot)com> wrote:
> The bug itself is not major, but the extent and user impact is serious.

I don't think I understand how you're using the word major there. I seem to recall some previous disputation between you and I about the use of that term, so maybe it would be good to get that cleared up. To me major and serious mean about the same thing, so it can't for me be one but not the other.

Definitions aside, I think it's a pretty scary issue. It basically means that if you have a recovery (crash or archive) during which you read a buffer into memory, the buffer won't be checkpointed. So if, before the buffer is next evicted, you have a crash, and if at least one checkpoint has intervened between the most recent WAL-logged operation on the buffer and the crash, you're hosed. That's not a terribly unlikely scenario.

While I can't claim to understand exactly what our standards for forcing an immediate minor release are, I think this is pretty darn bad. I certainly don't want my customers running with this for a minute longer than necessary, and I feel really bad for letting it get into a release, let alone go undetected for this long. :-(

...Robert

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Simon Riggs <simon(at)2ndQuadrant(dot)com>, "devrim(at)gunduz(dot)org" <devrim(at)gunduz(dot)org>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Re: [COMMITTERS] pgsql: Properly set relpersistence for fake relcache entries.
Date:	2012-09-15 16:29:25
Message-ID:	16378.1347726565@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-committers pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> Definitions aside, I think it's a pretty scary issue. It basically means that if you have a recovery (crash or archive) during which you read a buffer into memory, the buffer won't be checkpointed. So if, before the buffer is next evicted, you have a crash, and if at least one checkpoint has intervened between the most recent WAL-logged operation on the buffer and the crash, you're hosed. That's not a terribly unlikely scenario.

This is only an issue on standby slaves or when doing a PITR recovery, no?
As far as I can tell from the discussion, it would *not* affect crash
recovery, because we don't do restartpoints during crash recovery.

> While I can't claim to understand exactly what our standards for forcing an immediate minor release are, I think this is pretty darn bad. I certainly don't want my customers running with this for a minute longer than necessary, and I feel really bad for letting it get into a release, let alone go undetected for this long. :-(

There's been some discussion about it among -core already. The earliest
we could possibly do anything would be a release this coming week (that
is, wrap Thursday for release Monday 9/24). However, considering that
a lot of key people will be attending PG Open between now and Thursday,
I'm not sure how practical that really would be. Waiting a week might
be better, and it would give more time for initial bug reports against
9.2.0 to filter in.

regards, tom lane

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "devrim(at)gunduz(dot)org" <devrim(at)gunduz(dot)org>
Subject:	Re: [COMMITTERS] pgsql: Properly set relpersistence for fake relcache entries.
Date:	2012-09-15 19:00:13
Message-ID:	201209152100.13244.andres@2ndquadrant.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-committers pgsql-hackers

On Saturday, September 15, 2012 06:29:25 PM Tom Lane wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> > Definitions aside, I think it's a pretty scary issue. It basically means
> > that if you have a recovery (crash or archive) during which you read a
> > buffer into memory, the buffer won't be checkpointed. So if, before the
> > buffer is next evicted, you have a crash, and if at least one checkpoint
> > has intervened between the most recent WAL-logged operation on the
> > buffer and the crash, you're hosed. That's not a terribly unlikely
> > scenario.
>
> This is only an issue on standby slaves or when doing a PITR recovery, no?
> As far as I can tell from the discussion, it would not affect crash
> recovery, because we don't do restartpoints during crash recovery.
I think unfortunately it does. At the end of recovery we perform a
END_OF_RECOVERY checkpoint that seems to suffer from these issues. While
CreateCheckPoint() itself treats that kind of checkpoint similarly to a
shutdown checkpoint it doesn't pass that similarity to BufferSync (via
CheckPointGuts->CheckPointBuffers).

I hope I missed something ...

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndQuadrant(dot)com>, "devrim(at)gunduz(dot)org" <devrim(at)gunduz(dot)org>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Re: [COMMITTERS] pgsql: Properly set relpersistence for fake relcache entries.
Date:	2012-09-16 02:39:51
Message-ID:	99C1FCEF-43B0-417F-832A-CDC8938B7A14@gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-committers pgsql-hackers

On Sep 15, 2012, at 11:29 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> Definitions aside, I think it's a pretty scary issue. It basically means that if you have a recovery (crash or archive) during which you read a buffer into memory, the buffer won't be checkpointed. So if, before the buffer is next evicted, you have a crash, and if at least one checkpoint has intervened between the most recent WAL-logged operation on the buffer and the crash, you're hosed. That's not a terribly unlikely scenario.
>
> This is only an issue on standby slaves or when doing a PITR recovery, no?
> As far as I can tell from the discussion, it would *not* affect crash
> recovery, because we don't do restartpoints during crash recovery.

No, I think it does affect crash recovery. Whether or not restartspoints happen during recovery doesn't matter; what does matter is that after recovery there may be shared buffers that are erroneously not marked as permanent. Such buffers won't be checkpointed except at shutdown time, which is wrong.

...Robert

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Simon Riggs <simon(at)2ndQuadrant(dot)com>, "devrim(at)gunduz(dot)org" <devrim(at)gunduz(dot)org>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Re: [COMMITTERS] pgsql: Properly set relpersistence for fake relcache entries.
Date:	2012-09-16 04:32:18
Message-ID:	205.1347769938@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-committers pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Sep 15, 2012, at 11:29 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> This is only an issue on standby slaves or when doing a PITR recovery, no?
>> As far as I can tell from the discussion, it would *not* affect crash
>> recovery, because we don't do restartpoints during crash recovery.

> No, I think it does affect crash recovery. Whether or not restartspoints happen during recovery doesn't matter; what does matter is that after recovery there may be shared buffers that are erroneously not marked as permanent. Such buffers won't be checkpointed except at shutdown time, which is wrong.

Right, but we do a shutdown checkpoint at the end of crash recovery.

I could believe that this case gets missed, but it's not clear from
what's been said --- and if that case *is* broken, there should have
been a whole lot more corruption reports recently than what we've seen.

regards, tom lane

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndQuadrant(dot)com>, "devrim(at)gunduz(dot)org" <devrim(at)gunduz(dot)org>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Re: [COMMITTERS] pgsql: Properly set relpersistence for fake relcache entries.
Date:	2012-09-16 12:10:38
Message-ID:	27AF9583-3084-4EE1-BC45-6F69A2DB076C@gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-committers pgsql-hackers

On Sep 15, 2012, at 11:32 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Right, but we do a shutdown checkpoint at the end of crash recovery.

Yes, but that only writes the buffers that are dirty. It doesn't fix the lack of a BM_PERMANENT flag on a buffer that ought to have had one. So that page can now get modified AGAIN, after recovery, and not checkpointed.

> I could believe that this case gets missed, but it's not clear from
> what's been said --- and if that case *is* broken, there should have
> been a whole lot more corruption reports recently than what we've seen.

It's pretty clear from Jeff's example at the top of the thread, which involves two crashes and no archive recovery.

...Robert

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Simon Riggs <simon(at)2ndQuadrant(dot)com>, "devrim(at)gunduz(dot)org" <devrim(at)gunduz(dot)org>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Re: [COMMITTERS] pgsql: Properly set relpersistence for fake relcache entries.
Date:	2012-09-17 05:35:06
Message-ID:	7579.1347860106@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-committers pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Sep 15, 2012, at 11:32 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Right, but we do a shutdown checkpoint at the end of crash recovery.

> Yes, but that only writes the buffers that are dirty. It doesn't fix the lack of a BM_PERMANENT flag on a buffer that ought to have had one. So that page can now get modified AGAIN, after recovery, and not checkpointed.

Ugh. Yeah, we need to fix this ASAP. I've notified pgsql-packagers to
expect a release this week.

regards, tom lane

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "devrim(at)gunduz(dot)org" <devrim(at)gunduz(dot)org>
Subject:	Re: Re: [COMMITTERS] pgsql: Properly set relpersistence for fake relcache entries.
Date:	2012-09-17 06:44:42
Message-ID:	201209170844.43077.andres@2ndquadrant.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-committers pgsql-hackers

On Monday, September 17, 2012 07:35:06 AM Tom Lane wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> > On Sep 15, 2012, at 11:32 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> Right, but we do a shutdown checkpoint at the end of crash recovery.
(as noted somewhere else and tackled by Simon, a END_OF_RECOVERY didn't sync
those before)

> > Yes, but that only writes the buffers that are dirty. It doesn't fix the
> > lack of a BM_PERMANENT flag on a buffer that ought to have had one. So
> > that page can now get modified AGAIN, after recovery, and not
> > checkpointed.
>
> Ugh. Yeah, we need to fix this ASAP. I've notified pgsql-packagers to
> expect a release this week.
Btw, I played with this some more on Saturday and I think, while definitely a
bad bug, the actual consequences aren't as bad as at least I initially feared.

Fake relcache entries are currently set in 3 scenarios during recovery:
1. removal of ALL_VISIBLE in heapam.c
2. incomplete splits and incomplete deletions in nbtxlog.c
3. incomplete splits in ginxlog.c

As Jeff nicely showed its easy to corrupt the visibilitymap with this.
Fortunately in < 9.2 that doesn't have too bad consequences and will be fixed
by the next vacuum. In 9.2 that obviously can result in wrong query results but
will still be fixed by a later (probably full table) vacuum.

To hit 2) and 3) the server needs to have crashed (or a strange
recovery_target_* set) while doing a multilevel operation. I have only
cursorily looked at gin but it looks to me like in both, nbtree and gin, the
window during logging the multiple steps in a split/deletion is fairly short
and its not likely that we crashed exactly during that. Unless we haven't read
the complete multistep operation during recovery we won't ever create fake
relcache entries for those relations/buffers! And even if that edge case is hit
it seems somewhat likely that the few pages that are read with the fake entry
are still in the cache (the incomplete operation has to have been soon before)
and thus won't get the bad relpersistence flag set.

So I think while that bug had the possibility of being really bad we were
pretty lucky...

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, "devrim(at)gunduz(dot)org" <devrim(at)gunduz(dot)org>
Subject:	Re: Re: [COMMITTERS] pgsql: Properly set relpersistence for fake relcache entries.
Date:	2012-09-17 08:44:39
Message-ID:	CA+U5nMJ=Y8FNqFcJuxwBHDjMv7T5gcFgO1a=Rj8wPOUY=mx6XA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-committers pgsql-hackers

On 17 September 2012 07:44, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:

> So I think while that bug had the possibility of being really bad we were
> pretty lucky...

Yes, agreed. The impact is not as severe as I originally thought.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "devrim(at)gunduz(dot)org" <devrim(at)gunduz(dot)org>
Subject:	Re: Re: [COMMITTERS] pgsql: Properly set relpersistence for fake relcache entries.
Date:	2012-09-17 13:58:37
Message-ID:	234.1347890317@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-committers pgsql-hackers

Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> Btw, I played with this some more on Saturday and I think, while definitely a
> bad bug, the actual consequences aren't as bad as at least I initially feared.

> Fake relcache entries are currently set in 3 scenarios during recovery:
> 1. removal of ALL_VISIBLE in heapam.c
> 2. incomplete splits and incomplete deletions in nbtxlog.c
> 3. incomplete splits in ginxlog.c
> [ #1 doesn't really hurt in 9.1, and the others are low probability ]

OK, that explains why we've not seen a blizzard of trouble reports.
Still seems like a good idea to fix it ASAP, though.

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "devrim(at)gunduz(dot)org" <devrim(at)gunduz(dot)org>
Subject:	Re: Re: [COMMITTERS] pgsql: Properly set relpersistence for fake relcache entries.
Date:	2012-09-19 17:47:31
Message-ID:	1255.1348076851@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-committers pgsql-hackers

BTW, what should our advice be for recovering from corruption due to
this bug? As far as the btree and GIN problems go, we can tell people
that REINDEX will fix it. And in 9.1, you don't really need to worry
about the visibility map being bad. But what do you do in 9.2, if
you have a bad visibility map? Is there any fix short of VACUUM FULL?

regards, tom lane

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>, "devrim(at)gunduz(dot)org" <devrim(at)gunduz(dot)org>
Subject:	Re: Re: [COMMITTERS] pgsql: Properly set relpersistence for fake relcache entries.
Date:	2012-09-19 18:21:29
Message-ID:	CA+U5nMJh4_Q1qxh-rbrAZh6jMs2omZ+-SsgR18GWM1uZQtB9qg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-committers pgsql-hackers

On 19 September 2012 18:47, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> BTW, what should our advice be for recovering from corruption due to
> this bug? As far as the btree and GIN problems go, we can tell people
> that REINDEX will fix it. And in 9.1, you don't really need to worry
> about the visibility map being bad. But what do you do in 9.2, if
> you have a bad visibility map? Is there any fix short of VACUUM FULL?

SET vacuum_freeze_table_age = 0;
VACUUM;

I'm writing the full notes out now.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "devrim(at)gunduz(dot)org" <devrim(at)gunduz(dot)org>
Subject:	Re: [COMMITTERS] pgsql: Properly set relpersistence for fake relcache entries.
Date:	2012-09-20 21:55:05
Message-ID:	201209202355.05332.andres@2ndquadrant.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-committers pgsql-hackers

On Monday, September 17, 2012 03:58:37 PM Tom Lane wrote:
> Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> > Btw, I played with this some more on Saturday and I think, while
> > definitely a bad bug, the actual consequences aren't as bad as at least
> > I initially feared.
> >
> > Fake relcache entries are currently set in 3 scenarios during recovery:
> > 1. removal of ALL_VISIBLE in heapam.c
> > 2. incomplete splits and incomplete deletions in nbtxlog.c
> > 3. incomplete splits in ginxlog.c
> > [ #1 doesn't really hurt in 9.1, and the others are low probability ]
>
> OK, that explains why we've not seen a blizzard of trouble reports.
> Still seems like a good idea to fix it ASAP, though.
Btw, I think RhodiumToad/Andrew Gierth and I some time ago helped a user in the
IRC Channel that had symptoms matching this bug.

Situation was that he started to get very high IO and xid wraparound shutdown
warnings due to never finishing and not canceleable autovacuums. After some
investigation it turned out that btree indexes were processed at that time. We
found they had cyclic btpo_next pointers leading to an endless loop in
_bt_pagedel.
We solved the issue by forcing leftsib = P_NONE inside the
while (P_ISDELETED(opaque) || opaque->btpo_next != target)
which let a queue DROP INDEX get the necessary locks.

Unfortuantely this was on a busy production system with a nearing shutdown, so
not much was kept for further diagnosis.

After this bug was discovered I asked the user and indeed they previously
shutdown the database twice in quick succession during heavy activity with -m
immediate which could exactly lead to such a problem due to incompletely
processed page splits.

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Marko Tiikkaja <pgmail(at)joh(dot)to>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "devrim(at)gunduz(dot)org" <devrim(at)gunduz(dot)org>
Subject:	Re: [COMMITTERS] pgsql: Properly set relpersistence for fake relcache entries.
Date:	2012-09-21 13:30:31
Message-ID:	505C6BF7.3090808@joh.to
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-committers pgsql-hackers

On 9/20/12 11:55 PM, Andres Freund wrote:
> On Monday, September 17, 2012 03:58:37 PM Tom Lane wrote:
>> OK, that explains why we've not seen a blizzard of trouble reports.
>> Still seems like a good idea to fix it ASAP, though.
> Btw, I think RhodiumToad/Andrew Gierth and I some time ago helped a user in the
> IRC Channel that had symptoms matching this bug.

Another such user reporting in. :-(

Our slave started accumulating WAL files and ran out of disk space
yesterday. After investigation from Andres and Andrew, it turns out
that we were most likely hit by this very same bug.

Here's what they have to say:
"If the db crashes between logging the split and the parent-node insert,
then in recovery, since relpersistence is not initialized correctly,
when the recovery process tries to complete the operation, no xlog
record is written for the insert. If there's a slave server, then the
missing xlog record for the insert means that the slave's
incomplete_actions queue never becomes empty, therefore the slave can no
longer do recovery restartpoints."

Some relevant information:

[cur:92/314BC870, xid:76872047, rmid:10(Heap), len/tot_len:91/123,
info:0, prev:92/314BB890] insert: s/d/r:1663/408841/415746
blk/off:13904/65 header: t_infomask2 8 t_infomask 2050 t_hoff 24
[cur:92/314BC8F0, xid:76872047, rmid:11(Btree), len/tot_len:702/734,
info:64, prev:92/314BC870] split_r: s/d/r:1663/408841/475676 leftsib 2896
[cur:92/314BCBD0, xid:0, rmid:0(XLOG), len/tot_len:56/88, info:0,
prev:92/314BC8F0] checkpoint: redo 146/314BCBD0; tli 1; nextxid
76872048; nextoid 764990; nextmulti 62062; nextoffset 132044; shutdown
at 2012-09-11 14:26:26 CEST

2012-09-11 14:26:26.719 CEST,,,44620,,504f2df2.ae4c,5,,2012-09-11
14:26:26 CEST,,0,LOG,00000,"redo done at
92/314BC8F0",,,,,,,,"StartupXLOG, xlog.c:6641",""

And apparently the relpersistence check in RelationNeedsWAL() call in
_bt_insertonpg had a role in this as well.

Regards,
Marko Tiikkaja

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Cc:	Marko Tiikkaja <pgmail(at)joh(dot)to>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "devrim(at)gunduz(dot)org" <devrim(at)gunduz(dot)org>
Subject:	Re: [COMMITTERS] pgsql: Properly set relpersistence for fake relcache entries.
Date:	2012-09-21 14:41:12
Message-ID:	201209211641.13259.andres@2ndquadrant.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-committers pgsql-hackers

On Friday, September 21, 2012 03:30:31 PM Marko Tiikkaja wrote:
> On 9/20/12 11:55 PM, Andres Freund wrote:
> > On Monday, September 17, 2012 03:58:37 PM Tom Lane wrote:
> >> OK, that explains why we've not seen a blizzard of trouble reports.
> >> Still seems like a good idea to fix it ASAP, though.
> >
> > Btw, I think RhodiumToad/Andrew Gierth and I some time ago helped a user
> > in the IRC Channel that had symptoms matching this bug.
>
> Another such user reporting in. :-(
>
> Our slave started accumulating WAL files and ran out of disk space
> yesterday. After investigation from Andres and Andrew, it turns out
> that we were most likely hit by this very same bug.
>
> Here's what they have to say:
> "If the db crashes between logging the split and the parent-node insert,
> then in recovery, since relpersistence is not initialized correctly,
> when the recovery process tries to complete the operation, no xlog
> record is written for the insert. If there's a slave server, then the
> missing xlog record for the insert means that the slave's
> incomplete_actions queue never becomes empty, therefore the slave can no
> longer do recovery restartpoints."
>
> Some relevant information:
>
> [cur:92/314BC870, xid:76872047, rmid:10(Heap), ... insert: ...
> [cur:92/314BC8F0, xid:76872047, rmid:11(Btree), ... split_r: ...
> [cur:92/314BCBD0, xid:0, rmid:0(XLOG), len/tot_len:56/88, info:0,
> prev:92/314BC8F0] checkpoint: redo 146/314BCBD0; ... shutdown
> ... "redo done at 92/314BC8F0",,,,,,,,"StartupXLOG, xlog.c:6641",""
Which means that an insert into the heap, triggered a btree split. At that
point the database crashed. During recovery the split was supposed to be
finished by the btree cleanup code.

> And apparently the relpersistence check in RelationNeedsWAL() call in
> _bt_insertonpg had a role in this as well.
When detecting an incomplete split the nbtree cleanup code calls
_bt_insert_parent, which calls _bt_insertonpg. Which finishes the split. BUT:
it doesn't log that it finished because RelationNeedsWal() says it doesn't need
to.

That means:
* indexes on stanbys will *definitely* be corrupted
* a standby won't perform any restartpoints anymore till restarted
* if the primary crashes corruption is likely.

Hrm. I retract my earlier statement about the low likelihood of corruption due
to this.

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org, Marko Tiikkaja <pgmail(at)joh(dot)to>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "devrim(at)gunduz(dot)org" <devrim(at)gunduz(dot)org>
Subject:	Re: [COMMITTERS] pgsql: Properly set relpersistence for fake relcache entries.
Date:	2012-09-26 00:42:35
Message-ID:	CA+TgmoZkB0tXUWnRWQJbejeVbiZao1BGNVF5v0aXrR2XRX54xw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-committers pgsql-hackers

On Fri, Sep 21, 2012 at 10:41 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> Hrm. I retract my earlier statement about the low likelihood of corruption due
> to this.

Yeah. :-(

We've recently had at least one report of autovacuum failing to
terminate due to a series of index pages forming a circular loop, and
at least one case where it appears that the data became not-unique on
a column upon which a unique index existed, in releases that contain
this bug.

It seems therefore that REINDEX + VACUUM with
vacuum_freeze_table_age=0 is not quite sufficient to recover from this
problem. If your index has come to contain a circularity, vacuum will
fail to terminate, and you'll need to drop it completely to recover.
And if you were relying on your index to enforce a unique constraint
and it didn't, you'll need to do manual data repair before it will be
possible to rebuild or replace that index.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Виктор Егоров <vyegorov(at)gmail(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: [COMMITTERS] pgsql: Properly set relpersistence for fake relcache entries.
Date:	2012-09-26 05:57:06
Message-ID:	CAGnEboi4bNEH-QQ4xs=9gzrN9VAWh16-775bL=zev61sMNMx-g@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-committers pgsql-hackers

I'm afraid I'm exactly in this situation now.

Last entry from the 9.1.6 recommended VACUUM (FREEZE, VERBOSE, ANALYZE) was:
INFO: "meta_version_chunks": found 55363 removable, 32566245 nonremovable
row versions in 450292 out of 450292 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 588315 unused item pointers.
0 pages are entirely empty.
CPU 2.44s/5.77u sec elapsed 2150.18 sec.
INFO: vacuuming "pg_toast.pg_toast_16582"

And here're are the locks held by the VACCUM backend:
select
oid,relname,relkind,relpages,reltuples::numeric(15,0),reltoastrelid,reltoastidxid
from pg_class
where oid in (select relation from pg_locks where pid = 1380);
oid | relname | relkind | relpages | reltuples |
reltoastrelid | reltoastidxid
-------+----------------------+---------+----------+-----------+---------------+---------------
16585 | pg_toast_16582 | t | 16460004 | 58161600 |
0 | 16587
16587 | pg_toast_16582_index | i | 188469 | 58161600 |
0 | 0
16582 | meta_version_chunks | r | 450292 | 32566200 |
16585 | 0

I will not touch anything and would like to get some recommendations on how
to proceed.

2012/9/26 Robert Haas <robertmhaas(at)gmail(dot)com>

> On Fri, Sep 21, 2012 at 10:41 AM, Andres Freund <andres(at)2ndquadrant(dot)com>
> wrote:
> > Hrm. I retract my earlier statement about the low likelihood of
> corruption due
> > to this.
>
> Yeah. :-(
>
> We've recently had at least one report of autovacuum failing to
> terminate due to a series of index pages forming a circular loop, and
> at least one case where it appears that the data became not-unique on
> a column upon which a unique index existed, in releases that contain
> this bug.
>
> It seems therefore that REINDEX + VACUUM with
> vacuum_freeze_table_age=0 is not quite sufficient to recover from this
> problem. If your index has come to contain a circularity, vacuum will
> fail to terminate, and you'll need to drop it completely to recover.
> And if you were relying on your index to enforce a unique constraint
> and it didn't, you'll need to do manual data repair before it will be
> possible to rebuild or replace that index.
>

--
Victor Y. Yegorov

From:	Виктор Егоров <vyegorov(at)gmail(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: [COMMITTERS] pgsql: Properly set relpersistence for fake relcache entries.
Date:	2012-09-26 06:12:37
Message-ID:	CAGnEboiCwP9V67M9GyJa3V4=0-BF120jjD54xddVH4bxV3vb_A@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-committers pgsql-hackers

Forget to mention, that:
- VACUUM is running on the master;
- current state is unchanged for 20 hours.

2012/9/26 Виктор Егоров <vyegorov(at)gmail(dot)com>:
> I'm afraid I'm exactly in this situation now.
>
> Last entry from the 9.1.6 recommended VACUUM (FREEZE, VERBOSE, ANALYZE) was:
> INFO: "meta_version_chunks": found 55363 removable, 32566245 nonremovable
> row versions in 450292 out of 450292 pages
> DETAIL: 0 dead row versions cannot be removed yet.
> There were 588315 unused item pointers.
> 0 pages are entirely empty.
> CPU 2.44s/5.77u sec elapsed 2150.18 sec.
> INFO: vacuuming "pg_toast.pg_toast_16582"

--
Victor Y. Yegorov

From:	Devrim GÜNDÜZ <devrim(at)gunduz(dot)org>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org, Marko Tiikkaja <pgmail(at)joh(dot)to>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject:	Re: [COMMITTERS] pgsql: Properly set relpersistence for fake relcache entries.
Date:	2012-09-26 06:46:32
Message-ID:	1348641992.3516.45.camel@lenovo01-laptop03.gunduz.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-committers pgsql-hackers

Hi,

On Tue, 2012-09-25 at 20:42 -0400, Robert Haas wrote:

> It seems therefore that REINDEX + VACUUM with
> vacuum_freeze_table_age=0 is not quite sufficient to recover from this
> problem. If your index has come to contain a circularity, vacuum will
> fail to terminate, and you'll need to drop it completely to recover.

What is the difference between REINDEX and dropping and recreating
index? (or say, creating the same index with another name, dropping old
one, and renaming the new one with old one?)

Regards,
--
Devrim GÜNDÜZ
Principal Systems Engineer @ EnterpriseDB: http://www.enterprisedb.com
PostgreSQL Danışmanı/Consultant, Red Hat Certified Engineer
Community: devrim~PostgreSQL.org, devrim.gunduz~linux.org.tr
http://www.gunduz.org Twitter: http://twitter.com/devrimgunduz

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Cc:	Виктор Егоров <vyegorov(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject:	Re: [COMMITTERS] pgsql: Properly set relpersistence for fake relcache entries.
Date:	2012-09-26 08:45:56
Message-ID:	201209261045.56608.andres@2ndquadrant.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-committers pgsql-hackers

Hi,

On Wednesday, September 26, 2012 07:57:06 AM Виктор Егоров wrote:
> I'm afraid I'm exactly in this situation now.
:(

> Last entry from the 9.1.6 recommended VACUUM (FREEZE, VERBOSE, ANALYZE)
It recommended doing a REINDEX first though? I guess you didn't do that?

> was: INFO: "meta_version_chunks": found 55363 removable, 32566245
> nonremovable row versions in 450292 out of 450292 pages
> DETAIL: 0 dead row versions cannot be removed yet.
> There were 588315 unused item pointers.
> 0 pages are entirely empty.
> CPU 2.44s/5.77u sec elapsed 2150.18 sec.
> INFO: vacuuming "pg_toast.pg_toast_16582"
>
> And here're are the locks held by the VACCUM backend:
> select
> oid,relname,relkind,relpages,reltuples::numeric(15,0),reltoastrelid,reltoas
> tidxid from pg_class
> where oid in (select relation from pg_locks where pid = 1380);
> oid | relname | relkind | relpages | reltuples |
> reltoastrelid | reltoastidxid
> -------+----------------------+---------+----------+-----------+-----------
> ----+--------------- 16585 | pg_toast_16582 | t | 16460004 |
> 58161600 |
> 0 | 16587
> 16587 | pg_toast_16582_index | i | 188469 | 58161600 |
> 0 | 0
> 16582 | meta_version_chunks | r | 450292 | 32566200 |
> 16585 | 0
>
> I will not touch anything and would like to get some recommendations on how
> to proceed.

On Wednesday, September 26, 2012 08:12:37 AM Виктор Егоров wrote:
> Forget to mention, that:
> - VACUUM is running on the master;
> - current state is unchanged for 20 hours.
I guess you cannot cancel the vacuum? Last time it was in a cycle without
checking interrupts inbetween.

Can you restart the server?

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Виктор Егоров <vyegorov(at)gmail(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: [COMMITTERS] pgsql: Properly set relpersistence for fake relcache entries.
Date:	2012-09-26 08:59:49
Message-ID:	CAGnEbohjarUE5a2pRMbyy6GxOoLEGY4rWYBHpgQq6DbPy7SB6A@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-committers pgsql-hackers

You're right, REINDEX was not done.

I've stopped the VACUUM, did a proper server restart (pg_ctl -m fast -w restart)
and will work on rebuilding relations.
Seems like I have another issue with a bunch of bloated tables on my way also.

Thanks for the support.

2012/9/26 Andres Freund <andres(at)2ndquadrant(dot)com>:
>> Last entry from the 9.1.6 recommended VACUUM (FREEZE, VERBOSE, ANALYZE)
> It recommended doing a REINDEX first though? I guess you didn't do that?
>
> ...
>
> I guess you cannot cancel the vacuum? Last time it was in a cycle without
> checking interrupts inbetween.
>
> Can you restart the server?

--
Victor Y. Yegorov