Lists: | pgsql-hackers |
---|
From: | Andres Freund <andres(at)2ndquadrant(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | all_visible replay aborting due to uninitialized pages |
Date: | 2013-05-28 17:58:02 |
Message-ID: | 20130528175802.GA26645@awork2.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
A customer of ours reporting a standby loosing sync with the primary due
to the following error:
CONTEXT: xlog redo visible: rel 1663/XXX/XXX; blk 173717
WARNING: page 173717 of relation base/XXX/XXX is uninitialized
...
PANIC: WAL contains references to invalid pages
Guessing around I looked and noticed the following problematic pattern:
1) A: wants to do an update, doesn't have enough freespace
2) A: extends the relation on the filesystem level (RelationGetBufferForTuple)
3) A: does PageInit (RelationGetBufferForTuple)
4) A: aborts, e.g. due to a serialization failure (heap_update)
At this point the page is initialized in memory, but not wal logged. It
isn't pinned or locked either.
5) B: vacuum finds that page and it's empty. So it marks it all
visible. But since the page wasn't written out (we haven't even marked
it dirty in 3.) the standby doesn't know that and reports the page as
being uninitialized.
ISTM the best backbranchable fix for this is to teach lazy_scan_heap to
log an FPI for the heap page via visibilitymap_set in that rather
limited case.
Happy to provide a patch unless somebody has a better idea?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Andres Freund <andres(at)2ndquadrant(dot)com> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: all_visible replay aborting due to uninitialized pages |
Date: | 2013-05-29 01:36:17 |
Message-ID: | CA+TgmobBLPZE=77pab54AmO38h7rgVewQB5J6AoViNBSSexp1g@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Tue, May 28, 2013 at 1:58 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> Guessing around I looked and noticed the following problematic pattern:
> 1) A: wants to do an update, doesn't have enough freespace
> 2) A: extends the relation on the filesystem level (RelationGetBufferForTuple)
> 3) A: does PageInit (RelationGetBufferForTuple)
> 4) A: aborts, e.g. due to a serialization failure (heap_update)
>
> At this point the page is initialized in memory, but not wal logged. It
> isn't pinned or locked either.
>
> 5) B: vacuum finds that page and it's empty. So it marks it all
> visible. But since the page wasn't written out (we haven't even marked
> it dirty in 3.) the standby doesn't know that and reports the page as
> being uninitialized.
>
> ISTM the best backbranchable fix for this is to teach lazy_scan_heap to
> log an FPI for the heap page via visibilitymap_set in that rather
> limited case.
>
> Happy to provide a patch unless somebody has a better idea?
Good catch. However, I would suggest using log_newpage() before
visibilitymap_set() rather than trying to stick extra logic into
visibilitymap_set(). I think that will be cleaner and simpler.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Andres Freund <andres(at)2ndquadrant(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: all_visible replay aborting due to uninitialized pages |
Date: | 2013-05-29 01:56:38 |
Message-ID: | 20130529015638.GB13486@awork2.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 2013-05-28 21:36:17 -0400, Robert Haas wrote:
> On Tue, May 28, 2013 at 1:58 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > Guessing around I looked and noticed the following problematic pattern:
> > 1) A: wants to do an update, doesn't have enough freespace
> > 2) A: extends the relation on the filesystem level (RelationGetBufferForTuple)
> > 3) A: does PageInit (RelationGetBufferForTuple)
> > 4) A: aborts, e.g. due to a serialization failure (heap_update)
> >
> > At this point the page is initialized in memory, but not wal logged. It
> > isn't pinned or locked either.
> >
> > 5) B: vacuum finds that page and it's empty. So it marks it all
> > visible. But since the page wasn't written out (we haven't even marked
> > it dirty in 3.) the standby doesn't know that and reports the page as
> > being uninitialized.
> >
> > ISTM the best backbranchable fix for this is to teach lazy_scan_heap to
> > log an FPI for the heap page via visibilitymap_set in that rather
> > limited case.
> >
> > Happy to provide a patch unless somebody has a better idea?
>
> Good catch. However, I would suggest using log_newpage() before
> visibilitymap_set() rather than trying to stick extra logic into
> visibilitymap_set(). I think that will be cleaner and simpler.
Thought about that, but given that 9.3's visibilitymap_set already will
already FPI heap pages I concluded it wouldn't really be an improvement
since it's only one ||log_heap_page or so there. Not sure what's
better. Will write the patch and see how it goes.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
From: | Andres Freund <andres(at)2ndquadrant(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: all_visible replay aborting due to uninitialized pages |
Date: | 2013-05-29 13:57:49 |
Message-ID: | 20130529135749.GB3955@alap2.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 2013-05-29 03:56:38 +0200, Andres Freund wrote:
> On 2013-05-28 21:36:17 -0400, Robert Haas wrote:
> > On Tue, May 28, 2013 at 1:58 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > > Guessing around I looked and noticed the following problematic pattern:
> > > 1) A: wants to do an update, doesn't have enough freespace
> > > 2) A: extends the relation on the filesystem level (RelationGetBufferForTuple)
> > > 3) A: does PageInit (RelationGetBufferForTuple)
> > > 4) A: aborts, e.g. due to a serialization failure (heap_update)
> > >
> > > At this point the page is initialized in memory, but not wal logged. It
> > > isn't pinned or locked either.
> > >
> > > 5) B: vacuum finds that page and it's empty. So it marks it all
> > > visible. But since the page wasn't written out (we haven't even marked
> > > it dirty in 3.) the standby doesn't know that and reports the page as
> > > being uninitialized.
> > >
> > > ISTM the best backbranchable fix for this is to teach lazy_scan_heap to
> > > log an FPI for the heap page via visibilitymap_set in that rather
> > > limited case.
> > >
> > > Happy to provide a patch unless somebody has a better idea?
> >
> > Good catch. However, I would suggest using log_newpage() before
> > visibilitymap_set() rather than trying to stick extra logic into
> > visibilitymap_set(). I think that will be cleaner and simpler.
>
> Thought about that, but given that 9.3's visibilitymap_set already will
> already FPI heap pages I concluded it wouldn't really be an improvement
> since it's only one ||log_heap_page or so there. Not sure what's
> better. Will write the patch and see how it goes.
Ended up using log_newpage_buffer since reusing visibilitymap_set's
record would break the wal format as we currently do not accept an FPI
on the heap pages during replay when < 9.3. Forcing to upgrade the
client first would be rather unfriendly...
That has the disadvantage of logging a full heap page since it doesn't
use the hole optimization but this happens really infrequently, so ...
I don't think this can happen < 9.2 in at least this code path. I wonder
though if there are other codepaths were this can happen although I
haven't found any on a cursory inspection.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment | Content-Type | Size |
---|---|---|
0001-Ensure-that-all_visible-WAL-records-operate-on-an-in.patch | text/x-patch | 2.1 KB |
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Andres Freund <andres(at)2ndquadrant(dot)com> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: all_visible replay aborting due to uninitialized pages |
Date: | 2013-05-30 03:01:31 |
Message-ID: | CA+Tgmob2-UahB8vYwDet3VV3tjhShwLa8_JSirRV8HuU7z0Lvw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, May 29, 2013 at 9:57 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> Thought about that, but given that 9.3's visibilitymap_set already will
>> already FPI heap pages I concluded it wouldn't really be an improvement
>> since it's only one ||log_heap_page or so there. Not sure what's
>> better. Will write the patch and see how it goes.
>
> Ended up using log_newpage_buffer since reusing visibilitymap_set's
> record would break the wal format as we currently do not accept an FPI
> on the heap pages during replay when < 9.3. Forcing to upgrade the
> client first would be rather unfriendly...
>
> That has the disadvantage of logging a full heap page since it doesn't
> use the hole optimization but this happens really infrequently, so ...
Yeah, I think it's fine. The patch also looks fine, although I think
the comments could use a bit of tidying. I guess we need to
back-patch this all the way back to 8.4? It will require some
adjustments for the older branches.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Andres Freund <andres(at)2ndquadrant(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: all_visible replay aborting due to uninitialized pages |
Date: | 2013-05-30 06:29:31 |
Message-ID: | 20130530062931.GA4201@awork2.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 2013-05-29 23:01:31 -0400, Robert Haas wrote:
> On Wed, May 29, 2013 at 9:57 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> >> Thought about that, but given that 9.3's visibilitymap_set already will
> >> already FPI heap pages I concluded it wouldn't really be an improvement
> >> since it's only one ||log_heap_page or so there. Not sure what's
> >> better. Will write the patch and see how it goes.
> >
> > Ended up using log_newpage_buffer since reusing visibilitymap_set's
> > record would break the wal format as we currently do not accept an FPI
> > on the heap pages during replay when < 9.3. Forcing to upgrade the
> > client first would be rather unfriendly...
> >
> > That has the disadvantage of logging a full heap page since it doesn't
> > use the hole optimization but this happens really infrequently, so ...
>
> Yeah, I think it's fine. The patch also looks fine, although I think
> the comments could use a bit of tidying. I guess we need to
> back-patch this all the way back to 8.4? It will require some
> adjustments for the older branches.
I think 9.2 is actually far enough and it should apply there. Before
that we only logged the unsetting of all_visible via
heap_(inset|update|delete)'s wal records not the setting as far as I can
tell. So I don't immediately see a danger < 9.2.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Andres Freund <andres(at)2ndquadrant(dot)com> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: all_visible replay aborting due to uninitialized pages |
Date: | 2013-06-06 14:22:14 |
Message-ID: | CA+TgmoZQuXOhdR_XEYLC4fiYWTqAVc6RCx6XUvdVPXvY8=7=Vg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, May 30, 2013 at 2:29 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> Yeah, I think it's fine. The patch also looks fine, although I think
>> the comments could use a bit of tidying. I guess we need to
>> back-patch this all the way back to 8.4? It will require some
>> adjustments for the older branches.
>
> I think 9.2 is actually far enough and it should apply there. Before
> that we only logged the unsetting of all_visible via
> heap_(inset|update|delete)'s wal records not the setting as far as I can
> tell. So I don't immediately see a danger < 9.2.
OK. I have committed this. For 9.2, I had to backport
log_newpage_buffer() and use XLByteEQ rather than ==.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Andres Freund <andres(at)2ndquadrant(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: all_visible replay aborting due to uninitialized pages |
Date: | 2013-06-06 14:28:13 |
Message-ID: | 20130606142813.GD29964@alap2.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 2013-06-06 10:22:14 -0400, Robert Haas wrote:
> On Thu, May 30, 2013 at 2:29 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> >> Yeah, I think it's fine. The patch also looks fine, although I think
> >> the comments could use a bit of tidying. I guess we need to
> >> back-patch this all the way back to 8.4? It will require some
> >> adjustments for the older branches.
> >
> > I think 9.2 is actually far enough and it should apply there. Before
> > that we only logged the unsetting of all_visible via
> > heap_(inset|update|delete)'s wal records not the setting as far as I can
> > tell. So I don't immediately see a danger < 9.2.
>
> OK. I have committed this. For 9.2, I had to backport
> log_newpage_buffer() and use XLByteEQ rather than ==.
Thanks!
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
From: | Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: all_visible replay aborting due to uninitialized pages |
Date: | 2013-09-23 11:41:16 |
Message-ID: | 524028DC.60800@vmware.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 06.06.2013 17:22, Robert Haas wrote:
> On Thu, May 30, 2013 at 2:29 AM, Andres Freund<andres(at)2ndquadrant(dot)com> wrote:
>>> Yeah, I think it's fine. The patch also looks fine, although I think
>>> the comments could use a bit of tidying. I guess we need to
>>> back-patch this all the way back to 8.4? It will require some
>>> adjustments for the older branches.
>>
>> I think 9.2 is actually far enough and it should apply there. Before
>> that we only logged the unsetting of all_visible via
>> heap_(inset|update|delete)'s wal records not the setting as far as I can
>> tell. So I don't immediately see a danger< 9.2.
>
> OK. I have committed this. For 9.2, I had to backport
> log_newpage_buffer() and use XLByteEQ rather than ==.
I'm afraid this patch was a few bricks shy of a load. The
log_newpage_buffer() function asserts that:
> /* We should be in a critical section. */
> Assert(CritSectionCount > 0);
But the call in vacuumlazy.c is not inside a critical section. Also, the
comments in log_newpage_buffer() say that the caller should mark the
buffer dirty *before* calling log_newpage_buffer(), but in vacuumlazy.c,
it's marked dirty afterwards. I'm not sure what consequences that might
have, but at least it contradicts the comment.
(spotted this while working on a patch, and ran into the assertion on
crash recovery)
- Heikki
From: | Andres Freund <andres(at)2ndquadrant(dot)com> |
---|---|
To: | Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: all_visible replay aborting due to uninitialized pages |
Date: | 2013-09-23 12:06:07 |
Message-ID: | 20130923120607.GA27222@awork2.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 2013-09-23 14:41:16 +0300, Heikki Linnakangas wrote:
> On 06.06.2013 17:22, Robert Haas wrote:
> >On Thu, May 30, 2013 at 2:29 AM, Andres Freund<andres(at)2ndquadrant(dot)com> wrote:
> >>>Yeah, I think it's fine. The patch also looks fine, although I think
> >>>the comments could use a bit of tidying. I guess we need to
> >>>back-patch this all the way back to 8.4? It will require some
> >>>adjustments for the older branches.
> >>
> >>I think 9.2 is actually far enough and it should apply there. Before
> >>that we only logged the unsetting of all_visible via
> >>heap_(inset|update|delete)'s wal records not the setting as far as I can
> >>tell. So I don't immediately see a danger< 9.2.
> >
> >OK. I have committed this. For 9.2, I had to backport
> >log_newpage_buffer() and use XLByteEQ rather than ==.
>
> I'm afraid this patch was a few bricks shy of a load. The
> log_newpage_buffer() function asserts that:
>
> > /* We should be in a critical section. */
> > Assert(CritSectionCount > 0);
>
> But the call in vacuumlazy.c is not inside a critical section.
Hrmpf. Sorry for that. Will provide a patch.
> Also, the
> comments in log_newpage_buffer() say that the caller should mark the buffer
> dirty *before* calling log_newpage_buffer(), but in vacuumlazy.c, it's
> marked dirty afterwards. I'm not sure what consequences that might have, but
> at least it contradicts the comment.
We generally should do that for wal logging - I am not sure why
log_newpage is not doing that itself, but whatever.
> (spotted this while working on a patch, and ran into the assertion on crash
> recovery)
You got the assertion failure about CritSectionCount during recovery?
If so, I do not understand, that code shouldn't be executed there? Or do
you mean you patched a version that didn't include that patch and it
Asserted during recovery because of the missing lsn?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
From: | Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> |
---|---|
To: | Andres Freund <andres(at)2ndquadrant(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>,pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: all_visible replay aborting due to uninitialized pages |
Date: | 2013-09-23 12:23:07 |
Message-ID: | 046df64f-b5e1-4112-a4b5-34a2fbf2104c@email.android.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>On 2013-09-23 14:41:16 +0300, Heikki Linnakangas wrote:
>> (spotted this while working on a patch, and ran into the assertion on
>crash
>> recovery)
>
>You got the assertion failure about CritSectionCount during recovery?
>If so, I do not understand, that code shouldn't be executed there? Or
>do
>you mean you patched a version that didn't include that patch and it
>Asserted during recovery because of the missing lsn?
Sorry, I was imprecise. I ran into it after recovery, on vacuum.
- Heikki
From: | Andres Freund <andres(at)2ndquadrant(dot)com> |
---|---|
To: | Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: all_visible replay aborting due to uninitialized pages |
Date: | 2013-09-24 11:25:41 |
Message-ID: | 20130924112541.GA23917@awork2.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 2013-09-23 14:41:16 +0300, Heikki Linnakangas wrote:
> On 06.06.2013 17:22, Robert Haas wrote:
> >On Thu, May 30, 2013 at 2:29 AM, Andres Freund<andres(at)2ndquadrant(dot)com> wrote:
> >>>Yeah, I think it's fine. The patch also looks fine, although I think
> >>>the comments could use a bit of tidying. I guess we need to
> >>>back-patch this all the way back to 8.4? It will require some
> >>>adjustments for the older branches.
> >>
> >>I think 9.2 is actually far enough and it should apply there. Before
> >>that we only logged the unsetting of all_visible via
> >>heap_(inset|update|delete)'s wal records not the setting as far as I can
> >>tell. So I don't immediately see a danger< 9.2.
> >
> >OK. I have committed this. For 9.2, I had to backport
> >log_newpage_buffer() and use XLByteEQ rather than ==.
>
> I'm afraid this patch was a few bricks shy of a load. The
> log_newpage_buffer() function asserts that:
>
> > /* We should be in a critical section. */
> > Assert(CritSectionCount > 0);
>
> But the call in vacuumlazy.c is not inside a critical section. Also, the
> comments in log_newpage_buffer() say that the caller should mark the buffer
> dirty *before* calling log_newpage_buffer(), but in vacuumlazy.c, it's
> marked dirty afterwards. I'm not sure what consequences that might have, but
> at least it contradicts the comment.
>
> (spotted this while working on a patch, and ran into the assertion on crash
> recovery)
What about the attached patches (one for 9.3 and master, the other for
9.2)? I've tested that I can trigger the assert before and not after by
inserting faults...
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment | Content-Type | Size |
---|---|---|
HEAD-0001-Use-critical-section-when-ensuring-empty-pages-are-i.patch | text/x-patch | 1.8 KB |
9.2-0001-Use-critical-section-when-ensuring-empty-pages-are-i.patch | text/x-patch | 1.8 KB |
From: | Andres Freund <andres(at)2ndquadrant(dot)com> |
---|---|
To: | Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: all_visible replay aborting due to uninitialized pages |
Date: | 2013-10-22 11:14:55 |
Message-ID: | 20131022111455.GA7435@awork2.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi Robert, Heikki,
On 2013-09-24 13:25:41 +0200, Andres Freund wrote:
> > I'm afraid this patch was a few bricks shy of a load. The
> > log_newpage_buffer() function asserts that:
> >
> > > /* We should be in a critical section. */
> > > Assert(CritSectionCount > 0);
> >
> > But the call in vacuumlazy.c is not inside a critical section. Also, the
> > comments in log_newpage_buffer() say that the caller should mark the buffer
> > dirty *before* calling log_newpage_buffer(), but in vacuumlazy.c, it's
> > marked dirty afterwards. I'm not sure what consequences that might have, but
> > at least it contradicts the comment.
> >
> > (spotted this while working on a patch, and ran into the assertion on crash
> > recovery)
>
> What about the attached patches (one for 9.3 and master, the other for
> 9.2)? I've tested that I can trigger the assert before and not after by
> inserting faults...
Yould either of you commit those patches to the corresponding branches?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
From: | Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> |
---|---|
To: | Andres Freund <andres(at)2ndquadrant(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: all_visible replay aborting due to uninitialized pages |
Date: | 2013-10-23 11:33:19 |
Message-ID: | 5267B3FF.1040008@vmware.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 22.10.2013 14:14, Andres Freund wrote:
> Hi Robert, Heikki,
>
> On 2013-09-24 13:25:41 +0200, Andres Freund wrote:
>>> I'm afraid this patch was a few bricks shy of a load. The
>>> log_newpage_buffer() function asserts that:
>>>
>>>> /* We should be in a critical section. */
>>>> Assert(CritSectionCount> 0);
>>>
>>> But the call in vacuumlazy.c is not inside a critical section. Also, the
>>> comments in log_newpage_buffer() say that the caller should mark the buffer
>>> dirty *before* calling log_newpage_buffer(), but in vacuumlazy.c, it's
>>> marked dirty afterwards. I'm not sure what consequences that might have, but
>>> at least it contradicts the comment.
>>>
>>> (spotted this while working on a patch, and ran into the assertion on crash
>>> recovery)
>>
>> What about the attached patches (one for 9.3 and master, the other for
>> 9.2)? I've tested that I can trigger the assert before and not after by
>> inserting faults...
>
> Yould either of you commit those patches to the corresponding branches?
Committed, thanks.
- Heikki
From: | Noah Yetter <nyetter(at)gmail(dot)com> |
---|---|
To: | Andres Freund <andres(at)2ndquadrant(dot)com> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: all_visible replay aborting due to uninitialized pages |
Date: | 2013-11-11 00:40:31 |
Message-ID: | CAPuoA+nYm_DHtBcFsPXkq3wKR3kv9yb3H4VD8JVW1sjH4kBpPg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Like your customer, this bug has blown up my standby servers, twice in the
last month: the first time all 4 replicas, the second time (mysteriously
but luckily) only 1 of them.
At any rate, since the fix isn't available yet, is/are there any
configuration changes that can be made or maintenance procedures that can
be undertaken to prevent or reduce the probability of this bug popping up
again in the meantime? I really can't afford to be without my standby
servers during the holidays, even for the few hours it takes to build a new
one.
On Tue, May 28, 2013 at 11:58 AM, Andres Freund <andres(at)2ndquadrant(dot)com>wrote:
> Hi,
>
> A customer of ours reporting a standby loosing sync with the primary due
> to the following error:
> CONTEXT: xlog redo visible: rel 1663/XXX/XXX; blk 173717
> WARNING: page 173717 of relation base/XXX/XXX is uninitialized
> ...
> PANIC: WAL contains references to invalid pages
>
> Guessing around I looked and noticed the following problematic pattern:
> 1) A: wants to do an update, doesn't have enough freespace
> 2) A: extends the relation on the filesystem level
> (RelationGetBufferForTuple)
> 3) A: does PageInit (RelationGetBufferForTuple)
> 4) A: aborts, e.g. due to a serialization failure (heap_update)
>
> At this point the page is initialized in memory, but not wal logged. It
> isn't pinned or locked either.
>
> 5) B: vacuum finds that page and it's empty. So it marks it all
> visible. But since the page wasn't written out (we haven't even marked
> it dirty in 3.) the standby doesn't know that and reports the page as
> being uninitialized.
>
> ISTM the best backbranchable fix for this is to teach lazy_scan_heap to
> log an FPI for the heap page via visibilitymap_set in that rather
> limited case.
>
> Happy to provide a patch unless somebody has a better idea?
>
> Greetings,
>
> Andres Freund
>
> --
> Andres Freund http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>
From: | Andres Freund <andres(at)2ndquadrant(dot)com> |
---|---|
To: | Noah Yetter <nyetter(at)gmail(dot)com> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: all_visible replay aborting due to uninitialized pages |
Date: | 2013-11-11 00:42:07 |
Message-ID: | 20131111004207.GK32665@awork2.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
On 2013-11-10 17:40:31 -0700, Noah Yetter wrote:
> Like your customer, this bug has blown up my standby servers, twice in the
> last month: the first time all 4 replicas, the second time (mysteriously
> but luckily) only 1 of them.
>
> At any rate, since the fix isn't available yet, is/are there any
> configuration changes that can be made or maintenance procedures that can
> be undertaken to prevent or reduce the probability of this bug popping up
> again in the meantime? I really can't afford to be without my standby
> servers during the holidays, even for the few hours it takes to build a new
> one.
The fix is included in 9.2.5, it's just not noted in the release notes.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
From: | Bruce Momjian <bruce(at)momjian(dot)us> |
---|---|
To: | Andres Freund <andres(at)2ndquadrant(dot)com> |
Cc: | Noah Yetter <nyetter(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: all_visible replay aborting due to uninitialized pages |
Date: | 2013-11-11 19:26:49 |
Message-ID: | 20131111192649.GB15562@momjian.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Nov 11, 2013 at 01:42:07AM +0100, Andres Freund wrote:
> Hi,
>
> On 2013-11-10 17:40:31 -0700, Noah Yetter wrote:
> > Like your customer, this bug has blown up my standby servers, twice in the
> > last month: the first time all 4 replicas, the second time (mysteriously
> > but luckily) only 1 of them.
> >
> > At any rate, since the fix isn't available yet, is/are there any
> > configuration changes that can be made or maintenance procedures that can
> > be undertaken to prevent or reduce the probability of this bug popping up
> > again in the meantime? I really can't afford to be without my standby
> > servers during the holidays, even for the few hours it takes to build a new
> > one.
>
> The fix is included in 9.2.5, it's just not noted in the release notes.
Yes, I missed it because I didn't understand the importance of these
commit messages:
commit 17fa4c321ccf9693de406faffe6b235e949aa25f
Author: Robert Haas <rhaas(at)postgresql(dot)org>
Date: Thu Jun 6 10:15:45 2013 -0400
Ensure that XLOG_HEAP2_VISIBLE always targets an initialized page.
Andres Freund
commit 4c641d994e19676ef2fec574d52d2156ffc2b3ce
Author: Robert Haas <rhaas(at)postgresql(dot)org>
Date: Thu Jun 6 10:14:46 2013 -0400
Backport log_newpage_buffer.
Andres' fix for XLOG_HEAP2_VISIBLE on unitialized pages requires
this.
Sorry.
--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Bruce Momjian <bruce(at)momjian(dot)us> |
Cc: | Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Noah Yetter <nyetter(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: all_visible replay aborting due to uninitialized pages |
Date: | 2013-11-11 19:54:16 |
Message-ID: | 12061.1384199656@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Bruce Momjian <bruce(at)momjian(dot)us> writes:
> On Mon, Nov 11, 2013 at 01:42:07AM +0100, Andres Freund wrote:
>> The fix is included in 9.2.5, it's just not noted in the release notes.
> Yes, I missed it because I didn't understand the importance of these
> commit messages:
> commit 17fa4c321ccf9693de406faffe6b235e949aa25f
> Author: Robert Haas <rhaas(at)postgresql(dot)org>
> Date: Thu Jun 6 10:15:45 2013 -0400
> Ensure that XLOG_HEAP2_VISIBLE always targets an initialized page.
> Andres Freund
I would say that's not your fault, it's the fault of the author of the
commit message. These messages are supposed to provide enough information
for release note writing. This one seems a bit ... um ... terse.
regards, tom lane