index-only scans vs. Hot Standby, round two

Lists: pgsql-hackers
From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: index-only scans vs. Hot Standby, round two
Date: 2012-04-13 16:33:06
Message-ID: CA+TgmoaRhoYP=13PhkevEXhUDUscu7hjmN4GsRhH=tzGorKNRQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Currently, we have a problem with index-only scans in Hot Standby
mode: the xmin horizon on the standby might lag the master, and thus
an index-only scan might mistakenly conclude that no heap fetch is
needed when in fact it is. I suggested that we handle this by
suppressing generation of index-only scan plans in Hot Standby mode,
but Simon, Noah, and Dimitri were arguing that we should instead do
the following, which is now on the open items list:

* Make XLOG_HEAP2_VISIBLE records generate recovery snapshot conflicts
so that IndexOnlyScans work on Hot Standby

Ideally, this would also allow us to remove the following hack from
heapgetpage(), which would improve sequential-scan performance for Hot
Standby.

/*
* If the all-visible flag indicates that all tuples on the page are
* visible to everyone, we can skip the per-tuple visibility tests. But
* not in hot standby mode. A tuple that's already visible to all
* transactions in the master might still be invisible to a read-only
* transaction in the standby.
*/
all_visible = PageIsAllVisible(dp) && !snapshot->takenDuringRecovery;

As far as I can see, to make this work, we'd need to have
lazy_scan_heap() compute the latest xmin on any potentially
all-visible page and treat that as a cutoff XID, cancelling queries
with snapshots whose xmins equal or precede that value, just as we
currently do when freezing tuples (cf xlog_heap_freeze). There are
two things to worry about: correctness, and more query cancellations.

In the department of query cancellations, I believe Noah argued
previously that this wasn't really going to cause a problem. And,
indeed, if the master has a mix of inserts, updates, and deletes, then
it seems likely that any recovery conflicts generated this way would
be hitting about the same place in the XID space as the ones caused by
pruning away dead tuples. What will be different, if we go this
route, is that an insert-only master, which right now only generates
conflicts at freezing time, will instead generate conflicts much
sooner, as soon as the relation is vacuumed. I do not use Hot Standby
enough to know whether this is a problem, and I'm not trying to block
this approach, but I do want to make sure that everyone agrees that
it's acceptable before we go do it, because inevitably somebody is
going to get bit. In making our decision, I think we should keep in
mind that we are currently pretty laid-back about marking pages
all-visible, and that's probably something we're going to want to
change over time: for example, we recently discussed the fact that a
page with one dead tuple in it currently needs *two* vacuums to become
all-visible, because only the first vacuum pass is smart enough to
mark things all-visible, and the first heap pass only truncates the
dead tuple to a dead line pointer, so the page isn't all-visible a
that point. When we fix that, which I think we're almost certainly
going to want to do at some point, then that means these conflicts
will occur that much sooner. I think we will likely also want to
consider marking things all-visible on the fly at some point in the
future; for example, if we pull a buffer in for an index-only scan and
dirty it by setting a hint bit, we might want to take the plunge and
mark it all-visible before evicting it, to save effort the next time.
Again, not trying to dissuade anyone, just making sure we've thought
through it enough to avoid being unhappy later. It's worth noting
that none of this will normally matter if hot_standby_feedback=on, so
part of the analysis here may depend on how many people we think have
flipped that switch.

As to correctness, after further review of lazy_scan_heap(), I think
there are some residual bugs here that need to be fixed whatever we
decide to do about the Hot Standby case:

1. I noticed that when we do PageSetAllVisible() today we follow it up
with SetBufferCommitInfoNeedsSave(). PD_ALL_VISIBLE is not merely a
hint, so I think these should be changed to MarkBufferDirty(), which
shouldn't make much difference given current code, but proposals to
handle hint changes differently than non-hint changes abound, so it's
probably not good to rely on those meaning the same thing forever.

2. More seriously, lazy_scan_heap() releases the buffer lock before
setting the all-visible bit on the heap. This looks just plain wrong
(and it's my fault, to be clear). Somebody else can come along after
we've released the buffer lock and mark the page not-all-visible.
That concurrent process will check the visibility map bit, find it
already clear, and go on its merry way. Then, we set the visibility
map bit and go on our merry way. Now we have a not-all-visible page
with the all-visible bit set in the visibility map. Oops. I think
this probably needs to be rewritten so that lazy_scan_heap() acquires
a pin on the visibility map page before locking the buffer for cleanup
and holds onto that pin until either we exit the range of heap pages
covered by that visibility map page or trigger index vac due to
maintenance_work_mem exhaustion. With that infrastructure in place,
we can set the visibility map bit at the same time we set the
page-level bit without risking holding the buffer lock across a buffer
I/O (we might have to hold the buffer lock across a WAL flush in the
worst case, but that hazard exists in a number of other places as well
and fixing it here is well out of scope).

Now, suppose we fix the above bugs and also add logic to generate
query-conflicts as described above. How far can the standby now trust
the visibility information? Setting the visibility map bit is a fully
WAL-logged operation, so anyone for whom seeing the bit as set would
potentially be a problem is certain to be killed before they get the
chance to become confused. And when we first open for read-only
connections after a new base backup, the initial snapshot had better
be able to see all XIDs which have committed prior to that point, and
only such things should be marked all-visible, so AFAICT the door is
nailed shut pretty tight here. I am a little less certain about the
page-level bit. On the master, it's possible for the PD_ALL_VISIBLE
bit to make it to disk before the WAL record that sets the visibility
map bit; the LSN interlock only protects the visibility map page
itself, as part of a complicated strategy to avoid emitting FPWs for
the entire heap when we vacuum an insert-only table. On the standby,
those same bits are going to get set when the xlog records covering
the setting of visibility map bit get replayed (which would be OK, in
the sense that the page-level bit would be trustworthy in this case),
or when a copy of the buffer makes it from master to standby by some
other means (which is the potentially problematic case). This could
happen either as part of a base backup, or via a FPW. I don't think
the base backup case is a problem for the same reasons that it's OK
for the visibility map bit case. If an FPW turns the bit on, then
somehow the page-level bit got set without the visibility map bit
getting set. I think the only for that to happen is:

1. vacuum on master sets the page-level bit and the visibility map bit
2. the heap page with the bit is written to disk BEFORE the WAL entry
generated in (1) gets flushed; this is allowable, since it's not an
error for the page-level bit to be set while the visibility-map bit is
unset, only the other way around
3. crash (still before the WAL entry is flushed)
4. some operation on the master emits an FPW for the page without
first rendering it not-all-visible

At present, I'm not sure there's any real problem here, since normally
an all-visible heap page is only going to get another WAL entry if
somebody does an insert, update, or delete on it, in which case the
visibility map bit is going to get cleared anyway. Freezing the page
might emit a new FPW, but that's going to generate conflicts anyway,
so no problem there. So I think there's no real problem here, but it
doesn't seem totally future-proof - any future type of WAL record that
might modify the page without rendering it not-all-visible would
create a very subtle hazard for Hot Standby. We could dodge the whole
issue by leaving the hack in heapgetpage() intact and just be happy
with making index-only scans work, or maybe somebody has a more clever
idea.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans vs. Hot Standby, round two
Date: 2012-04-15 10:02:37
Message-ID: CA+U5nMLY0nz-Y=+cyv9BAqYvWm+S9=zWXms9HAMfCBGgP7uNFA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Apr 13, 2012 at 5:33 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Currently, we have a problem with index-only scans in Hot Standby
> mode: the xmin horizon on the standby might lag the master, and thus
> an index-only scan might mistakenly conclude that no heap fetch is
> needed when in fact it is.  I suggested that we handle this by
> suppressing generation of index-only scan plans in Hot Standby mode,
> but Simon, Noah, and Dimitri were arguing that we should instead do
> the following, which is now on the open items list:
>
> * Make XLOG_HEAP2_VISIBLE records generate recovery snapshot conflicts
> so that IndexOnlyScans work on Hot Standby
...
<snip> very long email </snip>

Luckily its much simpler than all of that suggests. It'll take a few
hours for me to write a short reply but its Sunday today, so that will
happen later.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Noah Misch <noah(at)leadboat(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans vs. Hot Standby, round two
Date: 2012-04-16 07:02:56
Message-ID: 20120416070256.GC22182@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Apr 13, 2012 at 12:33:06PM -0400, Robert Haas wrote:
> In the department of query cancellations, I believe Noah argued
> previously that this wasn't really going to cause a problem. And,
> indeed, if the master has a mix of inserts, updates, and deletes, then
> it seems likely that any recovery conflicts generated this way would
> be hitting about the same place in the XID space as the ones caused by
> pruning away dead tuples. What will be different, if we go this
> route, is that an insert-only master, which right now only generates
> conflicts at freezing time, will instead generate conflicts much
> sooner, as soon as the relation is vacuumed. I do not use Hot Standby
> enough to know whether this is a problem, and I'm not trying to block
> this approach, but I do want to make sure that everyone agrees that
> it's acceptable before we go do it, because inevitably somebody is
> going to get bit.

Given sufficient doubt, we could add a GUC, say hot_standby_use_all_visible.
A standby with the GUC "off" would ignore all-visible indicators and also
decline to generate conflicts when flipping them on.

> As to correctness, after further review of lazy_scan_heap(), I think
> there are some residual bugs here that need to be fixed whatever we
> decide to do about the Hot Standby case:
>
> 1. I noticed that when we do PageSetAllVisible() today we follow it up
> with SetBufferCommitInfoNeedsSave(). PD_ALL_VISIBLE is not merely a
> hint, so I think these should be changed to MarkBufferDirty(), which
> shouldn't make much difference given current code, but proposals to
> handle hint changes differently than non-hint changes abound, so it's
> probably not good to rely on those meaning the same thing forever.

Do you refer to PD_ALL_VISIBLE as "not merely a hint" due to the requirement
to prevent a page from simultaneously having a negative PD_ALL_VISIBLE and a
positive visibility map bit? That is to say, PD_ALL_VISIBLE is fully a hint
in its role as a witness of tuple visibility, but it is not a hint in its role
as a witness of the visibility map status? In any event, I agree that those
call sites should use MarkBufferDirty().

The large comment in SetBufferCommitInfoNeedsSave() seems incorrect. On
systems with weaker memory ordering, the FlushBuffer() process's clearing of
BM_JUST_DIRTIED may not be visible to the SetBufferCommitInfoNeedsSave()
process until some time after the former retrieved buffer contents from shared
memory. Memory barriers could make the comment true, but we should probably
just update the comment to explain that a race condition may eat the update
entirely. Incidentally, is there a good reason for MarkBufferDirty() to
increment pgBufferUsage.shared_blks_dirtied and SetBufferCommitInfoNeedsSave()
not to do so? Looks like an oversight.

> 2. More seriously, lazy_scan_heap() releases the buffer lock before
> setting the all-visible bit on the heap. This looks just plain wrong
> (and it's my fault, to be clear). Somebody else can come along after
> we've released the buffer lock and mark the page not-all-visible.
> That concurrent process will check the visibility map bit, find it
> already clear, and go on its merry way. Then, we set the visibility
> map bit and go on our merry way. Now we have a not-all-visible page
> with the all-visible bit set in the visibility map. Oops.

Hmm, indeed. This deserves its own open item.

> I think
> this probably needs to be rewritten so that lazy_scan_heap() acquires
> a pin on the visibility map page before locking the buffer for cleanup
> and holds onto that pin until either we exit the range of heap pages
> covered by that visibility map page or trigger index vac due to
> maintenance_work_mem exhaustion. With that infrastructure in place,
> we can set the visibility map bit at the same time we set the
> page-level bit without risking holding the buffer lock across a buffer
> I/O (we might have to hold the buffer lock across a WAL flush in the
> worst case, but that hazard exists in a number of other places as well
> and fixing it here is well out of scope).

Looks reasonable at a glance.

> 1. vacuum on master sets the page-level bit and the visibility map bit
> 2. the heap page with the bit is written to disk BEFORE the WAL entry
> generated in (1) gets flushed; this is allowable, since it's not an
> error for the page-level bit to be set while the visibility-map bit is
> unset, only the other way around
> 3. crash (still before the WAL entry is flushed)
> 4. some operation on the master emits an FPW for the page without
> first rendering it not-all-visible
>
> At present, I'm not sure there's any real problem here, since normally
> an all-visible heap page is only going to get another WAL entry if
> somebody does an insert, update, or delete on it, in which case the
> visibility map bit is going to get cleared anyway. Freezing the page
> might emit a new FPW, but that's going to generate conflicts anyway,
> so no problem there. So I think there's no real problem here, but it
> doesn't seem totally future-proof - any future type of WAL record that
> might modify the page without rendering it not-all-visible would
> create a very subtle hazard for Hot Standby. We could dodge the whole
> issue by leaving the hack in heapgetpage() intact and just be happy
> with making index-only scans work, or maybe somebody has a more clever
> idea.

Good point. We could finagle things so the standby notices end-of-recovery
checkpoints and blocks the optimization for older snapshots. For example,
maintain an integer count of end-of-recovery checkpoints seen and store that
in each Snapshot instead of takenDuringRecovery; use 0 on a master. When the
global value is greater than the snapshot's value, disable the optimization
for that snapshot. I don't know whether the optimization is powerful enough
to justify that level of trouble, but it's an option to consider.

For a different approach, targeting the fragility, we could add assertions to
detect unexpected cases of a page bearing PD_ALL_VISIBLE submitted as a
full-page image.

Thanks,
nm


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans vs. Hot Standby, round two
Date: 2012-04-16 07:38:49
Message-ID: CA+U5nML5wLmP9SHqKr5meTzW-5riWKg3=+xrV-YZhERUs2EGhg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Apr 16, 2012 at 8:02 AM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> On Fri, Apr 13, 2012 at 12:33:06PM -0400, Robert Haas wrote:
>> In the department of query cancellations, I believe Noah argued
>> previously that this wasn't really going to cause a problem.  And,
>> indeed, if the master has a mix of inserts, updates, and deletes, then
>> it seems likely that any recovery conflicts generated this way would
>> be hitting about the same place in the XID space as the ones caused by
>> pruning away dead tuples.  What will be different, if we go this
>> route, is that an insert-only master, which right now only generates
>> conflicts at freezing time, will instead generate conflicts much
>> sooner, as soon as the relation is vacuumed.  I do not use Hot Standby
>> enough to know whether this is a problem, and I'm not trying to block
>> this approach, but I do want to make sure that everyone agrees that
>> it's acceptable before we go do it, because inevitably somebody is
>> going to get bit.
>
> Given sufficient doubt, we could add a GUC, say hot_standby_use_all_visible.
> A standby with the GUC "off" would ignore all-visible indicators and also
> decline to generate conflicts when flipping them on.

I'm against adding a new GUC, especially for that fairly thin reason.

>> 1. vacuum on master sets the page-level bit and the visibility map bit
>> 2. the heap page with the bit is written to disk BEFORE the WAL entry
>> generated in (1) gets flushed; this is allowable, since it's not an
>> error for the page-level bit to be set while the visibility-map bit is
>> unset, only the other way around
>> 3. crash (still before the WAL entry is flushed)
>> 4. some operation on the master emits an FPW for the page without
>> first rendering it not-all-visible
>>
>> At present, I'm not sure there's any real problem here, since normally
>> an all-visible heap page is only going to get another WAL entry if
>> somebody does an insert, update, or delete on it, in which case the
>> visibility map bit is going to get cleared anyway.  Freezing the page
>> might emit a new FPW, but that's going to generate conflicts anyway,
>> so no problem there.  So I think there's no real problem here, but it
>> doesn't seem totally future-proof - any future type of WAL record that
>> might modify the page without rendering it not-all-visible would
>> create a very subtle hazard for Hot Standby.  We could dodge the whole
>> issue by leaving the hack in heapgetpage() intact and just be happy
>> with making index-only scans work, or maybe somebody has a more clever
>> idea.
>
> Good point.  We could finagle things so the standby notices end-of-recovery
> checkpoints and blocks the optimization for older snapshots.  For example,
> maintain an integer count of end-of-recovery checkpoints seen and store that
> in each Snapshot instead of takenDuringRecovery; use 0 on a master.  When the
> global value is greater than the snapshot's value, disable the optimization
> for that snapshot.  I don't know whether the optimization is powerful enough
> to justify that level of trouble, but it's an option to consider.
>
> For a different approach, targeting the fragility, we could add assertions to
> detect unexpected cases of a page bearing PD_ALL_VISIBLE submitted as a
> full-page image.

We don't need a future proof solution, especially not at this stage of
the release cycle. Every time you add a WAL record, we need to
consider the possible conflict impact.

We just need a simple and clear solution. I'll work on that.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans vs. Hot Standby, round two
Date: 2012-04-16 08:26:24
Message-ID: 4F8BD7B0.5040802@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 16.04.2012 10:38, Simon Riggs wrote:
> On Mon, Apr 16, 2012 at 8:02 AM, Noah Misch<noah(at)leadboat(dot)com> wrote:
>> On Fri, Apr 13, 2012 at 12:33:06PM -0400, Robert Haas wrote:
>>> In the department of query cancellations, I believe Noah argued
>>> previously that this wasn't really going to cause a problem. And,
>>> indeed, if the master has a mix of inserts, updates, and deletes, then
>>> it seems likely that any recovery conflicts generated this way would
>>> be hitting about the same place in the XID space as the ones caused by
>>> pruning away dead tuples. What will be different, if we go this
>>> route, is that an insert-only master, which right now only generates
>>> conflicts at freezing time, will instead generate conflicts much
>>> sooner, as soon as the relation is vacuumed. I do not use Hot Standby
>>> enough to know whether this is a problem, and I'm not trying to block
>>> this approach, but I do want to make sure that everyone agrees that
>>> it's acceptable before we go do it, because inevitably somebody is
>>> going to get bit.
>>
>> Given sufficient doubt, we could add a GUC, say hot_standby_use_all_visible.
>> A standby with the GUC "off" would ignore all-visible indicators and also
>> decline to generate conflicts when flipping them on.
>
> I'm against adding a new GUC, especially for that fairly thin reason.

Same here.

Can we have a "soft" hot standby conflict that doesn't kill the query,
but disables index-only-scans?

In the long run, perhaps we need to store XIDs in the visibility map
instead of just a bit. If you we only stored one xid per 100 pages or
something like that, the storage overhead would not be much higher than
what we have at the moment. But that's obviously not going to happen for
9.2...

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans vs. Hot Standby, round two
Date: 2012-04-16 14:19:36
Message-ID: 24054.1334585976@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> Can we have a "soft" hot standby conflict that doesn't kill the query,
> but disables index-only-scans?

Well, there wouldn't be any way for the planner to know whether an
index-only scan would be safe or not. I think this would have to look
like a run-time fallback. Could it be structured as "return that the
page's all-visible bit is not set, instead of failing?" Or am I
confused about where the conflict is coming from?

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans vs. Hot Standby, round two
Date: 2012-04-16 17:58:03
Message-ID: CA+U5nMKkr0mvqv4Fpjza+xOHwOZx+5NSgMsp2FJBM3zpxKd7sA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Apr 16, 2012 at 3:19 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>> Can we have a "soft" hot standby conflict that doesn't kill the query,
>> but disables index-only-scans?
>
> Well, there wouldn't be any way for the planner to know whether an
> index-only scan would be safe or not.  I think this would have to look
> like a run-time fallback.  Could it be structured as "return that the
> page's all-visible bit is not set, instead of failing?"  Or am I
> confused about where the conflict is coming from?

The starting point is that HS now offers 4 different mechanisms for
avoiding conflicts, probably the best of which is hot standby
feedback. So we only need to improve things if those techniques don't
work for people. So initially, my attitude is lets throw a conflict
and wait for feedback during beta.

If we do need to do something, then introduce concept of a visibility conflict.

On replay:
If feedback not set, set LSN of visibility conflict on PROCs that
conflict, if not already set.

On query:
If feedback not set, check conflict LSN against page, if page is
later, check visibility.

In terms of optimisation, I wouldn't expect to have to adjust costs at
all. The difference would only show for long running queries and
typically they don't touch too much new data as a fraction of total.
The cost model for index only is pretty crude anyhow.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans vs. Hot Standby, round two
Date: 2012-04-16 19:56:32
Message-ID: CA+TgmoaPb4mZHTwpFT_gB4ZffbxXj0oHTWt1Qpo2WN4rXgx-eg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Apr 16, 2012 at 4:26 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> Can we have a "soft" hot standby conflict that doesn't kill the query, but
> disables index-only-scans?

Yeah, something like that seems possible.

For example, suppose the master includes, in each
mark-heap-page-all-visible record, the newest XID on the page. On the
standby, we keep track of the most recent such XID ever seen in shared
memory. After noting that a page is all-visible, the standby
cross-checks its snapshot against this XID and does the heap fetch
anyway if it's too new. Conceivably this can be done with memory
barriers but not without locks: we must update the XID in shared
memory *before* marking the page all-visible, and we must check the
visibility map or page-level bit *before* fetching the XID from shared
memory - but the actual reads and writes of the XID are atomic.

Now, if an index-only scan suffers a very high number of these "soft
conflicts" and consequently does a lot more heap fetches than
expected, performance might suck. But that should be rare, and could
be mitigated by turning on hot_standby_feedback. Also, there might be
some hit for repeatedly reading that XID from memory.

Alternatively, we could try to forbid index-only scan plans from being
created in the first place *only when* the underlying snapshot is too
old. But then what happens if a conflict happens later, after the
plan has been created but before it's fully executed? At that point
it's too late to switch gears, so we'd still need something like the
above. And the above might be adequate all by itself, since in
practice index-only scans are mostly going to be useful when the data
isn't changing all that fast, so most of the queries that would be
cancelled by "hard" conflicts wouldn't have actually had a problem
anyway.

> In the long run, perhaps we need to store XIDs in the visibility map instead
> of just a bit. If you we only stored one xid per 100 pages or something like
> that, the storage overhead would not be much higher than what we have at the
> moment. But that's obviously not going to happen for 9.2...

Well, that would also have the undesirable side effect of greatly
reducing the granularity of the map. As it is, updating a tiny
fraction of the tuples in the table could result in the entire table
ceasing to be not-all-visible if you happen to update exactly one
tuple per page through the entire heap. Or more generally, updating
X% of the rows in the table can cause Y% of the rows in the table to
no longer be visible for index-only-scan purposes, where Y >> X. What
you're proposing would make that much worse.

I think we're actually going to find that the current system isn't
tight enough; my suspicion is that users are going to complain that
we're not aggressive enough about marking pages all-visible, because
autovac won't kick in until updates+deletes>20%, but (1) index-only
scans also care about inserts and (2) by the time we've got 20% dead
tuples index-only scans may well be worthless.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans vs. Hot Standby, round two
Date: 2012-04-16 20:02:04
Message-ID: CA+TgmobEJCm=0oXcdeDernCYV4+82yzBttYi1p8UC=tcE+efJQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Apr 16, 2012 at 3:02 AM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> Do you refer to PD_ALL_VISIBLE as "not merely a hint" due to the requirement
> to prevent a page from simultaneously having a negative PD_ALL_VISIBLE and a
> positive visibility map bit?  That is to say, PD_ALL_VISIBLE is fully a hint
> in its role as a witness of tuple visibility, but it is not a hint in its role
> as a witness of the visibility map status?

Exactly.

> The large comment in SetBufferCommitInfoNeedsSave() seems incorrect.  On
> systems with weaker memory ordering, the FlushBuffer() process's clearing of
> BM_JUST_DIRTIED may not be visible to the SetBufferCommitInfoNeedsSave()
> process until some time after the former retrieved buffer contents from shared
> memory.

True.

> Memory barriers could make the comment true, but we should probably
> just update the comment to explain that a race condition may eat the update
> entirely.

Agreed.

> Incidentally, is there a good reason for MarkBufferDirty() to
> increment pgBufferUsage.shared_blks_dirtied and SetBufferCommitInfoNeedsSave()
> not to do so?  Looks like an oversight.

Also agreed.

>> 2. More seriously, lazy_scan_heap() releases the buffer lock before
>> setting the all-visible bit on the heap.  This looks just plain wrong
>> (and it's my fault, to be clear).  Somebody else can come along after
>> we've released the buffer lock and mark the page not-all-visible.
>> That concurrent process will check the visibility map bit, find it
>> already clear, and go on its merry way.  Then, we set the visibility
>> map bit and go on our merry way.  Now we have a not-all-visible page
>> with the all-visible bit set in the visibility map.   Oops.
>
> Hmm, indeed.  This deserves its own open item.

Also agreed.

> Good point.  We could finagle things so the standby notices end-of-recovery
> checkpoints and blocks the optimization for older snapshots.  For example,
> maintain an integer count of end-of-recovery checkpoints seen and store that
> in each Snapshot instead of takenDuringRecovery; use 0 on a master.  When the
> global value is greater than the snapshot's value, disable the optimization
> for that snapshot.  I don't know whether the optimization is powerful enough
> to justify that level of trouble, but it's an option to consider.

I suspect not. It seems we can make index-only scans work without
doing something like this; it's only the sequential-scan optimization
we might lose. But nobody's ever really complained about not having
that, to my knowledge.

> For a different approach, targeting the fragility, we could add assertions to
> detect unexpected cases of a page bearing PD_ALL_VISIBLE submitted as a
> full-page image.

The holes are narrow enough that I fear such cases would be detected
only on high-velocity production systems, which is not exactly where
you want to find out about such problems.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans vs. Hot Standby, round two
Date: 2012-04-16 20:13:48
Message-ID: CA+Tgmobo4=d_WZZQW6XrCRW5WsFdS6KhMnSPJSoQT3vk5EaP_Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Apr 16, 2012 at 1:58 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> If we do need to do something, then introduce concept of a visibility conflict.
>
> On replay:
> If feedback not set, set LSN of visibility conflict on PROCs that
> conflict, if not already set.
>
> On query:
> If feedback not set, check conflict LSN against page, if page is
> later, check visibility.

Hmm, should have read the whole thread before replying. This similar
to what I just proposed in response to Heikki's message, but using LSN
in lieu of (or maybe you mean in addition to) XID.

I don't think we can ignore the need to throw conflicts just because
hot_standby_feedback is set; there are going to be corner cases, for
example, when it's just recently been turned on and the master has
already done cleanup; or if the master and standby have recently
gotten disconnected for even just a few seconds.

But fundamentally we all seem to be converging on some variant of the
"soft conflict" idea.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans vs. Hot Standby, round two
Date: 2012-04-27 00:03:44
Message-ID: CA+TgmoY-YEfiOk6FcUhR3+KxAz9vhh3dKBNpxk3uvTwnQFWZ4Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Apr 16, 2012 at 4:13 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> But fundamentally we all seem to be converging on some variant of the
> "soft conflict" idea.

So, as a first step, I've committed a patch that just throws a hard
conflict. I think we probably want to optimize this further, and I'm
going to work investigate that next. But it seemed productive to get
this much out of the way first, so I did.

In studying this, it strikes me that it would be rather nicer if we
recovery conflicts could somehow arrange to roll back the active
transaction by some means short of a FATAL error. I think there are
some protocol issues with doing that, but I still wish we could figure
out a way.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans vs. Hot Standby, round two
Date: 2012-05-02 12:41:02
Message-ID: CA+TgmoYg6R+JZi+zwbaKmL4C_YeiDLTUeqiuK5O-CcuTzeZz5w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Apr 26, 2012 at 8:03 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> So, as a first step, I've committed a patch that just throws a hard
> conflict.  I think we probably want to optimize this further, and I'm
> going to work investigate that next.  But it seemed productive to get
> this much out of the way first, so I did.

I've been thinking about this some more. What's worrying me is that a
visibility conflict, however we implement it, could be *worse* from
the user's point of view than just killing the query. After all,
there's a reasonable likelihood that a single visibility map page
covers the whole relation (or all the blocks that the user is
interested in), so any sort of conflict is basically going to turn the
index-only scan into an index-scan plus some extra overhead. And if
the planner had known that the user was going to get an index-only
scan rather than just a plain index scan, it might well have picked
some other plan in the first place.

Another problem is that, if we add a test for visibility conflicts
into visibilitymap_test(), I'm afraid we're going to drive up the cost
of that function very significantly. Previous testing suggests that
that efficiency or lack thereof of that function is already a
performance problem for index-only scans, which kinda makes me not
that excited about adding another branch in there somewhere (and even
less excited about any proposed implementation that would add an
lwlock acquire/release or similar).

So on further reflection I'm thinking it may be best just to stick
with a hard conflict for now and see what feedback we get from beta
testers.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans vs. Hot Standby, round two
Date: 2012-05-04 08:57:13
Message-ID: CA+U5nMK6rT1s9ChY3dv0kmBiUe7xRSr+rqQuf6FvqacgLM3x9Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2 May 2012 13:41, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> So on further reflection I'm thinking it may be best just to stick
> with a hard conflict for now and see what feedback we get from beta
> testers.

Which is what I was expecting y'all to conclude once you'd looked at
the task in more detail.

And I'm happy with the concept of beta being a period where we listen
to feedback, not just bugs, and decide whether further refinements are
required.

What I think is possible is to alter the conflict as it hits the
backend. If a backend has enable_indexonly = off then it wouldn't be
affected by those conflicts anyhow. So if we simply record whether we
are using an index only plan then we'll know whether to ignore it or
abort. I think we can handle that by marking the snapshot at executor
startup time. Needs a few other pushups but not that hard.

The likelihood is that SQL that uses index only won't run for long
enough to be cancelled anyway.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services