Re: Inadequate thought about buffer locking during hot standby replay

Lists: pgsql-hackers
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Cc: daniel(at)heroku(dot)com
Subject: Inadequate thought about buffer locking during hot standby replay
Date: 2012-11-09 23:24:25
Message-ID: 21405.1352503465@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

During normal running, operations such as btree page splits are
extremely careful about the order in which they acquire and release
buffer locks, if they're doing something that concurrently modifies
multiple pages.

During WAL replay, that all goes out the window. Even if an individual
WAL-record replay function does things in the right order for "standard"
cases, RestoreBkpBlocks has no idea what it's doing. So if one or more
of the referenced pages gets treated as a full-page image, we are left
with no guarantee whatsoever about what order the pages are restored in.
That never mattered when the code was originally designed, but it sure
matters during Hot Standby when other queries might be able to see the
intermediate states.

I can't prove that this is the cause of bug #7648, but it's fairly easy
to see that it could explain the symptom. You only need to assume that
the page-being-split had been handled as a full-page image, and that the
new right-hand page had gotten allocated by extending the relation.
Then there will be an interval just after RestoreBkpBlocks does its
thing where the updated left-hand sibling is in the index and is not
locked in any way, but its right-link points off the end of the index.
If a few indexscans come along before the replay process gets to
continue, you'd get exactly the reported errors.

I'm inclined to think that we need to fix this by getting rid of
RestoreBkpBlocks per se, and instead having the per-WAL-record restore
routines dictate when each full-page image is restored (and whether or
not to release the buffer lock immediately). That's not going to be a
small change unfortunately :-(

regards, tom lane


From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org, daniel(at)heroku(dot)com
Subject: Re: Inadequate thought about buffer locking during hot standby replay
Date: 2012-11-09 23:42:52
Message-ID: 20121109234252.GD16999@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2012-11-09 18:24:25 -0500, Tom Lane wrote:
> I can't prove that this is the cause of bug #7648, but it's fairly easy
> to see that it could explain the symptom. You only need to assume that
> the page-being-split had been handled as a full-page image, and that the
> new right-hand page had gotten allocated by extending the relation.
> Then there will be an interval just after RestoreBkpBlocks does its
> thing where the updated left-hand sibling is in the index and is not
> locked in any way, but its right-link points off the end of the index.
> If a few indexscans come along before the replay process gets to
> continue, you'd get exactly the reported errors.

Sounds plausible.

> I'm inclined to think that we need to fix this by getting rid of
> RestoreBkpBlocks per se, and instead having the per-WAL-record restore
> routines dictate when each full-page image is restored (and whether or
> not to release the buffer lock immediately). That's not going to be a
> small change unfortunately :-(

I wonder if we couldn't instead "fix" it by ensuring the backup blocks
are in the right order in the backup blocks at the inserting
location. That would "just" need some care about the order of
XLogRecData blocks.
I am pretty unfamiliar with the nbtree locking but I seem to remember
that we should be fine if we always restore from left to right?

Greetings,

Andres Freund


From: Daniel Farina <daniel(at)heroku(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Inadequate thought about buffer locking during hot standby replay
Date: 2012-11-09 23:50:22
Message-ID: CAAZKuFZUSEOVUumMut9Ejeo07yTk0Dna+PV0vnd_3ut9XZDThg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Nov 9, 2012 at 3:24 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> During normal running, operations such as btree page splits are
> extremely careful about the order in which they acquire and release
> buffer locks, if they're doing something that concurrently modifies
> multiple pages.
>
> During WAL replay, that all goes out the window. Even if an individual
> WAL-record replay function does things in the right order for "standard"
> cases, RestoreBkpBlocks has no idea what it's doing. So if one or more
> of the referenced pages gets treated as a full-page image, we are left
> with no guarantee whatsoever about what order the pages are restored in.
> That never mattered when the code was originally designed, but it sure
> matters during Hot Standby when other queries might be able to see the
> intermediate states.
>
> I can't prove that this is the cause of bug #7648,

(I was the reporter of 7648)

To lend slightly more circumstantial evidence in support of this, I
also happened to note that the relfile in question was the last
segment and it was about a quarter full, so the access attempt was
definitely at the extreme outermost edge of the index most generally.

--
fdr


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgreSQL(dot)org, daniel(at)heroku(dot)com
Subject: Re: Inadequate thought about buffer locking during hot standby replay
Date: 2012-11-10 00:14:48
Message-ID: 22443.1352506488@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andres Freund <andres(at)anarazel(dot)de> writes:
> On 2012-11-09 18:24:25 -0500, Tom Lane wrote:
>> I'm inclined to think that we need to fix this by getting rid of
>> RestoreBkpBlocks per se, and instead having the per-WAL-record restore
>> routines dictate when each full-page image is restored (and whether or
>> not to release the buffer lock immediately). That's not going to be a
>> small change unfortunately :-(

> I wonder if we couldn't instead "fix" it by ensuring the backup blocks
> are in the right order in the backup blocks at the inserting
> location. That would "just" need some care about the order of
> XLogRecData blocks.

I don't think that's a good way to go. In the first place, if we did
that the fix would require incompatible changes in the contents of WAL
streams. In the second place, there are already severe constraints on
the positioning of backup blocks to ensure that WAL records can be
uniquely decoded (the section of access/transam/README about WAL coding
touches on this) --- I don't think it's a good plan to add still more
constraints there. And in the third place, the specific problem we're
positing here results from a failure to hold the buffer lock for a
full-page image until after we're done restoring a *non* full-page image
represented elsewhere in the same WAL record. In general, of the set
of pages touched by a WAL record, any arbitrary subset of them might be
converted to FPIs during XLogInsert; but the replay-time locking
requirements are going to be the same regardless of that. So AFAICS,
any design in which RestoreBkpBlocks acts independently of the
non-full-page-image updates is just broken.

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org, daniel(at)heroku(dot)com
Subject: Re: Inadequate thought about buffer locking during hot standby replay
Date: 2012-11-10 17:05:31
Message-ID: CA+U5nM+p6ze4PMd6YvcMDrVru01_OcYZ=43T7EzsoqQgEpt8eQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 9 November 2012 23:24, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> During normal running, operations such as btree page splits are
> extremely careful about the order in which they acquire and release
> buffer locks, if they're doing something that concurrently modifies
> multiple pages.
>
> During WAL replay, that all goes out the window. Even if an individual
> WAL-record replay function does things in the right order for "standard"
> cases, RestoreBkpBlocks has no idea what it's doing. So if one or more
> of the referenced pages gets treated as a full-page image, we are left
> with no guarantee whatsoever about what order the pages are restored in.
> That never mattered when the code was originally designed, but it sure
> matters during Hot Standby when other queries might be able to see the
> intermediate states.
>
> I can't prove that this is the cause of bug #7648, but it's fairly easy
> to see that it could explain the symptom. You only need to assume that
> the page-being-split had been handled as a full-page image, and that the
> new right-hand page had gotten allocated by extending the relation.
> Then there will be an interval just after RestoreBkpBlocks does its
> thing where the updated left-hand sibling is in the index and is not
> locked in any way, but its right-link points off the end of the index.
> If a few indexscans come along before the replay process gets to
> continue, you'd get exactly the reported errors.
>
> I'm inclined to think that we need to fix this by getting rid of
> RestoreBkpBlocks per se, and instead having the per-WAL-record restore
> routines dictate when each full-page image is restored (and whether or
> not to release the buffer lock immediately). That's not going to be a
> small change unfortunately :-(

No, but it looks like a clear bug scenario and a clear resolution also.

I'll start looking at it.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, daniel(at)heroku(dot)com
Subject: Re: Inadequate thought about buffer locking during hot standby replay
Date: 2012-11-10 17:41:27
Message-ID: 24768.1352569287@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> On 9 November 2012 23:24, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> I'm inclined to think that we need to fix this by getting rid of
>> RestoreBkpBlocks per se, and instead having the per-WAL-record restore
>> routines dictate when each full-page image is restored (and whether or
>> not to release the buffer lock immediately). That's not going to be a
>> small change unfortunately :-(

> No, but it looks like a clear bug scenario and a clear resolution also.
> I'll start looking at it.

I'm on it already.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Inadequate thought about buffer locking during hot standby replay
Date: 2012-11-10 18:16:46
Message-ID: 25470.1352571406@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> I'm inclined to think that we need to fix this by getting rid of
> RestoreBkpBlocks per se, and instead having the per-WAL-record restore
> routines dictate when each full-page image is restored (and whether or
> not to release the buffer lock immediately). That's not going to be a
> small change unfortunately :-(

Here's a WIP patch that attacks it in this way. I've only gone as far as
fixing btree_xlog_split() for the moment, since the main point is to see
whether the coding change seems reasonable. At least in this function,
it seems that locking considerations are handled correctly as long as no
full-page image is used, so it's pretty straightforward to tweak it to
handle the case with full-page image(s) as well. I've not looked
through any other replay functions yet --- some of them may need more
invasive hacking. But so far this looks pretty good.

Note that this patch isn't correct yet even for btree_xlog_split(),
because I've not removed the RestoreBkpBlocks() call in btree_redo().
All the btree replay routines will have to get fixed before it's
testable at all.

One thing that seems a bit annoying is the use of zero-based backup
block indexes in RestoreBackupBlock, when most (not all) of the callers
are referencing macros with one-based names (XLR_BKP_BLOCK_1 etc).
That's a bug waiting to happen. We could address it by changing
RestoreBackupBlock to accept a one-based argument, but what I'm thinking
of doing instead is getting rid of the one-based convention entirely;
that is, remove XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4 and instead have
all code use the XLR_SET_BKP_BLOCK() macro, which is zero-based. One
advantage of doing that is that it'll force inspection of all code
related to this.

Comments, opinions?

regards, tom lane

Attachment Content-Type Size
restore-backup-blocks-individually-0.patch text/x-patch 10.7 KB

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Inadequate thought about buffer locking during hot standby replay
Date: 2012-11-10 20:59:13
Message-ID: CA+U5nMLAuMzfOd4eFmU-MBGWFey8frJkYGXMiqy-OMwTFthfMA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10 November 2012 18:16, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I wrote:
>> I'm inclined to think that we need to fix this by getting rid of
>> RestoreBkpBlocks per se, and instead having the per-WAL-record restore
>> routines dictate when each full-page image is restored (and whether or
>> not to release the buffer lock immediately). That's not going to be a
>> small change unfortunately :-(
>
> Here's a WIP patch that attacks it in this way. I've only gone as far as
> fixing btree_xlog_split() for the moment, since the main point is to see
> whether the coding change seems reasonable. At least in this function,
> it seems that locking considerations are handled correctly as long as no
> full-page image is used, so it's pretty straightforward to tweak it to
> handle the case with full-page image(s) as well. I've not looked
> through any other replay functions yet --- some of them may need more
> invasive hacking. But so far this looks pretty good.

Looks fine, but we need a top-level comment in btree code explaining
why/how it follows the very well explained rules in the README.

> Note that this patch isn't correct yet even for btree_xlog_split(),
> because I've not removed the RestoreBkpBlocks() call in btree_redo().
> All the btree replay routines will have to get fixed before it's
> testable at all.

I was just about to write you back you missed that...

> One thing that seems a bit annoying is the use of zero-based backup
> block indexes in RestoreBackupBlock, when most (not all) of the callers
> are referencing macros with one-based names (XLR_BKP_BLOCK_1 etc).
> That's a bug waiting to happen. We could address it by changing
> RestoreBackupBlock to accept a one-based argument, but what I'm thinking
> of doing instead is getting rid of the one-based convention entirely;
> that is, remove XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4 and instead have
> all code use the XLR_SET_BKP_BLOCK() macro, which is zero-based. One
> advantage of doing that is that it'll force inspection of all code
> related to this.

I wouldn't do that in a back branch, but I can see why its a good idea.

Thanks for fixing.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Inadequate thought about buffer locking during hot standby replay
Date: 2012-11-10 21:24:22
Message-ID: 1448.1352582662@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> On 10 November 2012 18:16, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Here's a WIP patch that attacks it in this way.

> Looks fine, but we need a top-level comment in btree code explaining
> why/how it follows the very well explained rules in the README.

Not sure what you'd want it to say exactly? Really these issues are per
WAL-replay function, not global. I've now been through all of the btree
functions, and AFAICS btree_xlog_split is the only one of them that
actually has a bug; the others can be a bit lax about the order in which
things get done.

>> One thing that seems a bit annoying is the use of zero-based backup
>> block indexes in RestoreBackupBlock, when most (not all) of the callers
>> are referencing macros with one-based names (XLR_BKP_BLOCK_1 etc).
>> That's a bug waiting to happen. We could address it by changing
>> RestoreBackupBlock to accept a one-based argument, but what I'm thinking
>> of doing instead is getting rid of the one-based convention entirely;
>> that is, remove XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4 and instead have
>> all code use the XLR_SET_BKP_BLOCK() macro, which is zero-based. One
>> advantage of doing that is that it'll force inspection of all code
>> related to this.

> I wouldn't do that in a back branch, but I can see why its a good idea.

If any of this stuff were getting used by external modules, changing it
would be problematic ... but fortunately, it isn't, because we lack
support for plug-in rmgrs. So I'm not worried about back-patching the
change, and would prefer to keep the 9.x branches in sync.

Attached is an updated patch in which nbtxlog.c has been fully converted,
but I've not touched other rmgrs yet. I've tested this to the extent of
having a replication slave follow a master that's running the regression
tests, and it seems to work. It's difficult to prove much by testing
about concurrent behavior, of course.

I found that there's another benefit of managing backup-block replay
this way, which is that we can de-contort the handling of conflict
resolution by moving it into the per-record-type subroutines, instead of
needing an extra switch before the RestoreBkpBlocks call. So that gives
me more confidence that this is a good way to go.

regards, tom lane

Attachment Content-Type Size
restore-backup-blocks-individually-1.patch text/x-patch 30.9 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Inadequate thought about buffer locking during hot standby replay
Date: 2012-11-10 21:37:43
Message-ID: 1727.1352583463@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

While I'm looking at this: there seems to be a bug/inconsistency in
heap_xlog_freeze(). It uses a cleanup lock in the "normal" code path,
but it doesn't tell RestoreBkpBlocks to use a cleanup lock. Which
choice is correct?

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Inadequate thought about buffer locking during hot standby replay
Date: 2012-11-11 07:43:44
Message-ID: CA+U5nMLXFm5fR5AaBVNSvnEL0kE3dT_zuufptXOiiJsGTqgt0w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10 November 2012 21:24, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>> On 10 November 2012 18:16, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Here's a WIP patch that attacks it in this way.
>
>> Looks fine, but we need a top-level comment in btree code explaining
>> why/how it follows the very well explained rules in the README.
>
> Not sure what you'd want it to say exactly? Really these issues are per
> WAL-replay function, not global. I've now been through all of the btree
> functions, and AFAICS btree_xlog_split is the only one of them that
> actually has a bug; the others can be a bit lax about the order in which
> things get done.
>
>>> One thing that seems a bit annoying is the use of zero-based backup
>>> block indexes in RestoreBackupBlock, when most (not all) of the callers
>>> are referencing macros with one-based names (XLR_BKP_BLOCK_1 etc).
>>> That's a bug waiting to happen. We could address it by changing
>>> RestoreBackupBlock to accept a one-based argument, but what I'm thinking
>>> of doing instead is getting rid of the one-based convention entirely;
>>> that is, remove XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4 and instead have
>>> all code use the XLR_SET_BKP_BLOCK() macro, which is zero-based. One
>>> advantage of doing that is that it'll force inspection of all code
>>> related to this.
>
>> I wouldn't do that in a back branch, but I can see why its a good idea.
>
> If any of this stuff were getting used by external modules, changing it
> would be problematic ... but fortunately, it isn't, because we lack
> support for plug-in rmgrs. So I'm not worried about back-patching the
> change, and would prefer to keep the 9.x branches in sync.
>
> Attached is an updated patch in which nbtxlog.c has been fully converted,
> but I've not touched other rmgrs yet. I've tested this to the extent of
> having a replication slave follow a master that's running the regression
> tests, and it seems to work. It's difficult to prove much by testing
> about concurrent behavior, of course.
>
> I found that there's another benefit of managing backup-block replay
> this way, which is that we can de-contort the handling of conflict
> resolution by moving it into the per-record-type subroutines, instead of
> needing an extra switch before the RestoreBkpBlocks call. So that gives
> me more confidence that this is a good way to go.

It's a good way to go, I'm just glad you're handling it for the back
branches ;-)

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Inadequate thought about buffer locking during hot standby replay
Date: 2012-11-11 23:24:23
Message-ID: 8807.1352676263@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Attached is a complete draft patch against HEAD for this issue. I found
a depressingly large number of pre-existing bugs:

Practically all WAL record types that touch multiple pages have some
bug of this type. In addition to btree_xlog_split, I found that
heap_xlog_update, ginRedoDeletePage, spgRedoAddLeaf, spgRedoMoveLeafs,
spgRedoAddNode, spgRedoSplitTuple, and spgRedoPickSplit fail to hold
locks as required to make their updates safe for concurrent queries.
(I'm not totally sure about ginRedoDeletePage, but the original action
definitely locks the pages simultaneously, and it's not clear that it's
safe not to.) Most of these are okay in cases without any full-page
images, but could fail if the wrong subset of the pages-to-be-touched
were processed by RestoreBkpBlocks. Some had bugs even without that :-(

Worse than that, GIST WAL replay seems to be a total disaster from this
standpoint, and I'm not convinced that it's even fixable without
incompatible WAL changes. There are several cases where index
insertion operations generate more than one WAL record while holding
exclusive lock on buffer(s). If the lock continuity is actually
necessary to prevent concurrent queries from seeing inconsistent index
states, then we have a big problem, because WAL replay isn't designed to
hold any buffer lock for more than one WAL action. This needs to be
looked at closer by somebody who's more up on the GIST code than I am.
The attached patch doesn't try very hard to make the GIST functions
safe, it just transposes them to the new API.

I also found that spgRedoAddNode could fail to update the parent
downlink at all, if the parent tuple is in the same page as either the
old or new split tuple --- it would get fooled by the LSN having been
advanced already. This is an index corruption bug not just a
transient-failure problem.

Also, ginHeapTupleFastInsert's "merge lists" case fails to mark the old
tail page as a candidate for a full-page image; in the worst case this
could result in torn-page corruption.

I already pointed out the inconsistency in heap_xlog_freeze about
whether a cleanup lock is needed. As is, this patch uses a cleanup
lock, but I suspect that a regular lock is sufficient --- comments?

Also, gistRedoPageDeleteRecord seems to be dead code? I don't see
anything that emits XLOG_GIST_PAGE_DELETE.

regards, tom lane

#application/octet-stream [restore-backup-blocks-individually-2.patch.gz] /home/tgl/pgsql/restore-backup-blocks-individually-2.patch.gz


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Inadequate thought about buffer locking during hot standby replay
Date: 2012-11-11 23:25:57
Message-ID: 8859.1352676357@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

[ this time *with* the patch ]

Attached is a complete draft patch against HEAD for this issue. I found
a depressingly large number of pre-existing bugs:

Practically all WAL record types that touch multiple pages have some
bug of this type. In addition to btree_xlog_split, I found that
heap_xlog_update, ginRedoDeletePage, spgRedoAddLeaf, spgRedoMoveLeafs,
spgRedoAddNode, spgRedoSplitTuple, and spgRedoPickSplit fail to hold
locks as required to make their updates safe for concurrent queries.
(I'm not totally sure about ginRedoDeletePage, but the original action
definitely locks the pages simultaneously, and it's not clear that it's
safe not to.) Most of these are okay in cases without any full-page
images, but could fail if the wrong subset of the pages-to-be-touched
were processed by RestoreBkpBlocks. Some had bugs even without that :-(

Worse than that, GIST WAL replay seems to be a total disaster from this
standpoint, and I'm not convinced that it's even fixable without
incompatible WAL changes. There are several cases where index
insertion operations generate more than one WAL record while holding
exclusive lock on buffer(s). If the lock continuity is actually
necessary to prevent concurrent queries from seeing inconsistent index
states, then we have a big problem, because WAL replay isn't designed to
hold any buffer lock for more than one WAL action. This needs to be
looked at closer by somebody who's more up on the GIST code than I am.
The attached patch doesn't try very hard to make the GIST functions
safe, it just transposes them to the new API.

I also found that spgRedoAddNode could fail to update the parent
downlink at all, if the parent tuple is in the same page as either the
old or new split tuple --- it would get fooled by the LSN having been
advanced already. This is an index corruption bug not just a
transient-failure problem.

Also, ginHeapTupleFastInsert's "merge lists" case fails to mark the old
tail page as a candidate for a full-page image; in the worst case this
could result in torn-page corruption.

I already pointed out the inconsistency in heap_xlog_freeze about
whether a cleanup lock is needed. As is, this patch uses a cleanup
lock, but I suspect that a regular lock is sufficient --- comments?

Also, gistRedoPageDeleteRecord seems to be dead code? I don't see
anything that emits XLOG_GIST_PAGE_DELETE.

regards, tom lane

Attachment Content-Type Size
restore-backup-blocks-individually-2.patch.gz application/octet-stream 17.1 KB

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Inadequate thought about buffer locking during hot standby replay
Date: 2012-11-12 01:12:47
Message-ID: 20121112011246.GA2144@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2012-11-10 16:24:22 -0500, Tom Lane wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> >> One thing that seems a bit annoying is the use of zero-based backup
> >> block indexes in RestoreBackupBlock, when most (not all) of the callers
> >> are referencing macros with one-based names (XLR_BKP_BLOCK_1 etc).
> >> That's a bug waiting to happen. We could address it by changing
> >> RestoreBackupBlock to accept a one-based argument, but what I'm thinking
> >> of doing instead is getting rid of the one-based convention entirely;
> >> that is, remove XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4 and instead have
> >> all code use the XLR_SET_BKP_BLOCK() macro, which is zero-based. One
> >> advantage of doing that is that it'll force inspection of all code
> >> related to this.
>
> > I wouldn't do that in a back branch, but I can see why its a good idea.
>
> If any of this stuff were getting used by external modules, changing it
> would be problematic ... but fortunately, it isn't, because we lack
> support for plug-in rmgrs. So I'm not worried about back-patching the
> change, and would prefer to keep the 9.x branches in sync.

XLR_BKP_BLOCK_* might be used by things like pg_lesslog and its surely
used by xlogdump. Not sure if either are worth that much attention, but
it seems worth noticing that such a change will break stuff.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Inadequate thought about buffer locking during hot standby replay
Date: 2012-11-12 08:22:59
Message-ID: 50A0B1E3.5050902@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12.11.2012 01:25, Tom Lane wrote:
> Worse than that, GIST WAL replay seems to be a total disaster from this
> standpoint, and I'm not convinced that it's even fixable without
> incompatible WAL changes. There are several cases where index
> insertion operations generate more than one WAL record while holding
> exclusive lock on buffer(s). If the lock continuity is actually
> necessary to prevent concurrent queries from seeing inconsistent index
> states, then we have a big problem, because WAL replay isn't designed to
> hold any buffer lock for more than one WAL action.

Hmm. After the rewrite of that logic I did in 9.1 (commit 9de3aa65f0),
the GiST index is always in a consistent state. Before the downlink for
a newly split page has been inserted yet, its left sibling is flagged so
that a search knows to follow the right-link to find it. In normal
operation, the lock continuity is needed to prevent another backend from
seeing the incomplete split and trying to fix it, but in hot standby, we
never try to fix incomplete splits anyway.

So I think we're good on >= 9.1. The 9.0 code is broken, however. In
9.0, when a child page is split, the parent and new children are kept
locked until the downlinks are inserted/updated. If a concurrent scan
comes along and sees that incomplete state, it will miss tuples on the
new right siblings. We rely on a rm_cleanup operation at the end of WAL
replay to fix that situation, if the downlink insertion record is not
there. I don't see any easy way to fix that, unfortunately. Perhaps we
could backpatch the 9.1 rewrite, now that it's gotten some real-world
testing, but it was a big change so I don't feel very comfortable doing
that.

> Also, gistRedoPageDeleteRecord seems to be dead code? I don't see
> anything that emits XLOG_GIST_PAGE_DELETE.

Yep. It was used in VACUUM FULL, but has been dead since VACUUM FULL was
removed.

- Heikki


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Inadequate thought about buffer locking during hot standby replay
Date: 2012-11-12 14:23:45
Message-ID: CA+Tgmob6NS=jhz=y6NYnfqmfW5y6ten5YXi3WMOc9v3vwgkMCA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Nov 11, 2012 at 6:24 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I already pointed out the inconsistency in heap_xlog_freeze about
> whether a cleanup lock is needed. As is, this patch uses a cleanup
> lock, but I suspect that a regular lock is sufficient --- comments?

Well, according to storage/buffer/README:

1. To scan a page for tuples, one must hold a pin and either shared or
exclusive content lock. To examine the commit status (XIDs and status bits)
of a tuple in a shared buffer, one must likewise hold a pin and either shared
or exclusive lock.

That does indeed make it sound like an x-lock is enough.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Inadequate thought about buffer locking during hot standby replay
Date: 2012-11-12 14:36:18
Message-ID: 13323.1352730978@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Sun, Nov 11, 2012 at 6:24 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> I already pointed out the inconsistency in heap_xlog_freeze about
>> whether a cleanup lock is needed. As is, this patch uses a cleanup
>> lock, but I suspect that a regular lock is sufficient --- comments?

> Well, according to storage/buffer/README:

> 1. To scan a page for tuples, one must hold a pin and either shared or
> exclusive content lock. To examine the commit status (XIDs and status bits)
> of a tuple in a shared buffer, one must likewise hold a pin and either shared
> or exclusive lock.

> That does indeed make it sound like an x-lock is enough.

Yeah. AFAICS, the only possible downside is that somebody might be
using the tuple (while holding just a buffer pin), and that its xmin
might change while that's happening. So for instance a nestloop join
might read out different xmin values for the same row while the join
proceeds. But that could happen anyway if a different plan had been
chosen (viz, this table on the inside not the outside of the nestloop).
The xmin update ought to be physically atomic, so you shouldn't be able
to get a garbage result, just the old value or the new.

The cleanup lock is needed for cases where we'd like to remove or move a
tuple, but that's not required here.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Inadequate thought about buffer locking during hot standby replay
Date: 2012-11-12 14:51:30
Message-ID: 13637.1352731890@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> writes:
> On 12.11.2012 01:25, Tom Lane wrote:
>> Worse than that, GIST WAL replay seems to be a total disaster from this
>> standpoint, and I'm not convinced that it's even fixable without
>> incompatible WAL changes.

> Hmm. After the rewrite of that logic I did in 9.1 (commit 9de3aa65f0),
> the GiST index is always in a consistent state. Before the downlink for
> a newly split page has been inserted yet, its left sibling is flagged so
> that a search knows to follow the right-link to find it. In normal
> operation, the lock continuity is needed to prevent another backend from
> seeing the incomplete split and trying to fix it, but in hot standby, we
> never try to fix incomplete splits anyway.

> So I think we're good on >= 9.1.

Okay. I'll update the patch to make sure that the individual WAL replay
functions hold all locks, but not worry about holding locks across
actions.

> The 9.0 code is broken, however. In
> 9.0, when a child page is split, the parent and new children are kept
> locked until the downlinks are inserted/updated. If a concurrent scan
> comes along and sees that incomplete state, it will miss tuples on the
> new right siblings. We rely on a rm_cleanup operation at the end of WAL
> replay to fix that situation, if the downlink insertion record is not
> there. I don't see any easy way to fix that, unfortunately. Perhaps we
> could backpatch the 9.1 rewrite, now that it's gotten some real-world
> testing, but it was a big change so I don't feel very comfortable doing
> that.

Me either. Given the lack of field complaints, I think we're better
advised to just leave it unfixed in 9.0. It'd not be a step forward
if we broke something trying to make this work.

>> Also, gistRedoPageDeleteRecord seems to be dead code? I don't see
>> anything that emits XLOG_GIST_PAGE_DELETE.

> Yep. It was used in VACUUM FULL, but has been dead since VACUUM FULL was
> removed.

Okay. I see we bumped XLOG_PAGE_MAGIC in 9.0, so there's no longer
any way that 9.0 or later versions could see this WAL record type.
I'll delete that replay function rather than bothering to fix it.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Inadequate thought about buffer locking during hot standby replay
Date: 2012-11-12 15:19:09
Message-ID: 14264.1352733549@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> On 2012-11-10 16:24:22 -0500, Tom Lane wrote:
>> If any of this stuff were getting used by external modules, changing it
>> would be problematic ... but fortunately, it isn't, because we lack
>> support for plug-in rmgrs. So I'm not worried about back-patching the
>> change, and would prefer to keep the 9.x branches in sync.

> XLR_BKP_BLOCK_* might be used by things like pg_lesslog and its surely
> used by xlogdump. Not sure if either are worth that much attention, but
> it seems worth noticing that such a change will break stuff.

Hm. Okay, how about we leave the old macros in place in the back
branches?

regards, tom lane


From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Inadequate thought about buffer locking during hot standby replay
Date: 2012-11-12 15:42:10
Message-ID: 20121112154210.GA10179@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2012-11-12 10:19:09 -0500, Tom Lane wrote:
> Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> > On 2012-11-10 16:24:22 -0500, Tom Lane wrote:
> >> If any of this stuff were getting used by external modules, changing it
> >> would be problematic ... but fortunately, it isn't, because we lack
> >> support for plug-in rmgrs. So I'm not worried about back-patching the
> >> change, and would prefer to keep the 9.x branches in sync.
>
> > XLR_BKP_BLOCK_* might be used by things like pg_lesslog and its surely
> > used by xlogdump. Not sure if either are worth that much attention, but
> > it seems worth noticing that such a change will break stuff.
>
> Hm. Okay, how about we leave the old macros in place in the back
> branches?

Sounds good to me. The RestoreBkpBlocks change seems unproblematic to
me. If anything its good that it has been renamed.

Thanks,

Andres


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Inadequate thought about buffer locking during hot standby replay
Date: 2012-11-12 19:39:09
Message-ID: CA+U5nM+ThXPJfxZnDkCJHkY-2iA05XMmWm4UWdJeMAwNMuz3oQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11 November 2012 23:24, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Practically all WAL record types that touch multiple pages have some
> bug of this type. In addition to btree_xlog_split, I found that
> heap_xlog_update, ginRedoDeletePage, spgRedoAddLeaf, spgRedoMoveLeafs,
> spgRedoAddNode, spgRedoSplitTuple, and spgRedoPickSplit fail to hold
> locks as required to make their updates safe for concurrent queries.
> (I'm not totally sure about ginRedoDeletePage, but the original action
> definitely locks the pages simultaneously, and it's not clear that it's
> safe not to.) Most of these are okay in cases without any full-page
> images, but could fail if the wrong subset of the pages-to-be-touched
> were processed by RestoreBkpBlocks. Some had bugs even without that :-(

Hmm, not good. Thanks for spotting.

Do these changes do anything to actions that occur across multiple
records? I assume not and think those are OK, agreed?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Inadequate thought about buffer locking during hot standby replay
Date: 2012-11-12 19:40:53
Message-ID: CA+U5nMLjiR_+32KX_F46heLDaFyZtPzCtKDFRoKNqLBMhZuWTg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12 November 2012 14:51, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

>> The 9.0 code is broken, however. In
>> 9.0, when a child page is split, the parent and new children are kept
>> locked until the downlinks are inserted/updated. If a concurrent scan
>> comes along and sees that incomplete state, it will miss tuples on the
>> new right siblings. We rely on a rm_cleanup operation at the end of WAL
>> replay to fix that situation, if the downlink insertion record is not
>> there. I don't see any easy way to fix that, unfortunately. Perhaps we
>> could backpatch the 9.1 rewrite, now that it's gotten some real-world
>> testing, but it was a big change so I don't feel very comfortable doing
>> that.
>
> Me either. Given the lack of field complaints, I think we're better
> advised to just leave it unfixed in 9.0. It'd not be a step forward
> if we broke something trying to make this work.

Agreed. Most people running 9.0 with GIST indexes have already upgraded.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Inadequate thought about buffer locking during hot standby replay
Date: 2012-11-12 19:59:21
Message-ID: 3269.1352750361@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> On 11 November 2012 23:24, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Practically all WAL record types that touch multiple pages have some
>> bug of this type. In addition to btree_xlog_split, I found that
>> heap_xlog_update, ginRedoDeletePage, spgRedoAddLeaf, spgRedoMoveLeafs,
>> spgRedoAddNode, spgRedoSplitTuple, and spgRedoPickSplit fail to hold
>> locks as required to make their updates safe for concurrent queries.
>> (I'm not totally sure about ginRedoDeletePage, but the original action
>> definitely locks the pages simultaneously, and it's not clear that it's
>> safe not to.) Most of these are okay in cases without any full-page
>> images, but could fail if the wrong subset of the pages-to-be-touched
>> were processed by RestoreBkpBlocks. Some had bugs even without that :-(

> Hmm, not good. Thanks for spotting.

> Do these changes do anything to actions that occur across multiple
> records? I assume not and think those are OK, agreed?

Right, we were and still are assuming that any state that exists between
WAL records is consistent and safe to expose to hot-standby queries.
The important thing here is that these WAL replay functions were failing
to ensure that their own actions appear atomic to onlookers. This is
basically hangover from pre-hot-standby coding conventions, when no such
concern existed.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Inadequate thought about buffer locking during hot standby replay
Date: 2012-11-12 20:53:43
Message-ID: 5730.1352753623@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Here's an updated patch that fixes the GIST replay functions as well as
the other minor issues that were mentioned. Barring objections, I'll
set about back-patching this as far as 9.0.

One thing that could use verification is my fix for
gistRedoPageSplitRecord. AFAICS, the first page listed in the WAL
record is always the "original" page, and the ones following it are
pages that were split off from it, and can (as yet) only be reached by
following right-links from the "original" page. As such, it should be
okay to release locks on the non-first pages as soon as we've written
them. We have to hold lock on the original page though to avoid letting
readers follow dangling right-links. Also, the update of
NSN/FOLLOW_RIGHT on the child page (if any) has to be done atomically
with all this, so that has to be done before releasing the original-page
lock as well. Does that sound right?

regards, tom lane

Attachment Content-Type Size
restore-backup-blocks-individually-3.patch.gz application/octet-stream 19.4 KB

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Inadequate thought about buffer locking during hot standby replay
Date: 2012-11-13 08:49:47
Message-ID: 50A209AB.8080804@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12.11.2012 22:53, Tom Lane wrote:
> Here's an updated patch that fixes the GIST replay functions as well as
> the other minor issues that were mentioned. Barring objections, I'll
> set about back-patching this as far as 9.0.

Ok. It won't help all that much on 9.0, though.

> One thing that could use verification is my fix for
> gistRedoPageSplitRecord. AFAICS, the first page listed in the WAL
> record is always the "original" page, and the ones following it are
> pages that were split off from it, and can (as yet) only be reached by
> following right-links from the "original" page. As such, it should be
> okay to release locks on the non-first pages as soon as we've written
> them. We have to hold lock on the original page though to avoid letting
> readers follow dangling right-links. Also, the update of
> NSN/FOLLOW_RIGHT on the child page (if any) has to be done atomically
> with all this, so that has to be done before releasing the original-page
> lock as well. Does that sound right?

Yep.

- Heikki


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Inadequate thought about buffer locking during hot standby replay
Date: 2012-11-13 15:03:59
Message-ID: 18352.1352819039@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> writes:
> On 12.11.2012 22:53, Tom Lane wrote:
>> Here's an updated patch that fixes the GIST replay functions as well as
>> the other minor issues that were mentioned. Barring objections, I'll
>> set about back-patching this as far as 9.0.

> Ok. It won't help all that much on 9.0, though.

Well, it won't help GIST much, but the actually-reported-from-the-field
case is in btree, and it does fix that.

It occurs to me that if we're sufficiently scared of this case, we could
probably hack the planner (in 9.0 only) to refuse to use GIST indexes
in hot-standby queries. That cure might be worse than the disease though.

regards, tom lane


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Inadequate thought about buffer locking during hot standby replay
Date: 2012-11-13 15:32:33
Message-ID: CAHyXU0ww4=wTK34Me4iBbhOPkFVFLKR2DjnXAZ4ZnhXvQFeOuw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Nov 13, 2012 at 9:03 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Ok. It won't help all that much on 9.0, though.
>
> Well, it won't help GIST much, but the actually-reported-from-the-field
> case is in btree, and it does fix that.
>
> It occurs to me that if we're sufficiently scared of this case, we could
> probably hack the planner (in 9.0 only) to refuse to use GIST indexes
> in hot-standby queries. That cure might be worse than the disease though.

if anything, it should be documented. if you do this kind of thing
people will stop installing bugfix releases.

merlin


From: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Inadequate thought about buffer locking during hot standby replay
Date: 2012-11-13 18:23:38
Message-ID: 50A2902A.1020209@archidevsys.co.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 14/11/12 04:32, Merlin Moncure wrote:
> On Tue, Nov 13, 2012 at 9:03 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Ok. It won't help all that much on 9.0, though.
>> Well, it won't help GIST much, but the actually-reported-from-the-field
>> case is in btree, and it does fix that.
>>
>> It occurs to me that if we're sufficiently scared of this case, we could
>> probably hack the planner (in 9.0 only) to refuse to use GIST indexes
>> in hot-standby queries. That cure might be worse than the disease though.
> if anything, it should be documented. if you do this kind of thing
> people will stop installing bugfix releases.
>
> merlin
>
>
How about displaying a warning, when people try to use the 'feature', as
well as document it?

Cheers,
Gavin


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Inadequate thought about buffer locking during hot standby replay
Date: 2012-11-13 18:43:34
Message-ID: CA+TgmoZ3dHfqTp2xbXjx6Fxhfu-kv6vn=NB-5mvYk_gN_MyCiA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Nov 13, 2012 at 10:32 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Tue, Nov 13, 2012 at 9:03 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Ok. It won't help all that much on 9.0, though.
>>
>> Well, it won't help GIST much, but the actually-reported-from-the-field
>> case is in btree, and it does fix that.
>>
>> It occurs to me that if we're sufficiently scared of this case, we could
>> probably hack the planner (in 9.0 only) to refuse to use GIST indexes
>> in hot-standby queries. That cure might be worse than the disease though.
>
> if anything, it should be documented. if you do this kind of thing
> people will stop installing bugfix releases.

Agreed. I think doing that in a back-branch release would be
extremely user-hostile.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company