index-only scans vs. Hot Standby, round two

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: index-only scans vs. Hot Standby, round two
Date: 2012-04-13 16:33:06
Message-ID: CA+TgmoaRhoYP=13PhkevEXhUDUscu7hjmN4GsRhH=tzGorKNRQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Currently, we have a problem with index-only scans in Hot Standby
mode: the xmin horizon on the standby might lag the master, and thus
an index-only scan might mistakenly conclude that no heap fetch is
needed when in fact it is. I suggested that we handle this by
suppressing generation of index-only scan plans in Hot Standby mode,
but Simon, Noah, and Dimitri were arguing that we should instead do
the following, which is now on the open items list:

* Make XLOG_HEAP2_VISIBLE records generate recovery snapshot conflicts
so that IndexOnlyScans work on Hot Standby

Ideally, this would also allow us to remove the following hack from
heapgetpage(), which would improve sequential-scan performance for Hot
Standby.

/*
* If the all-visible flag indicates that all tuples on the page are
* visible to everyone, we can skip the per-tuple visibility tests. But
* not in hot standby mode. A tuple that's already visible to all
* transactions in the master might still be invisible to a read-only
* transaction in the standby.
*/
all_visible = PageIsAllVisible(dp) && !snapshot->takenDuringRecovery;

As far as I can see, to make this work, we'd need to have
lazy_scan_heap() compute the latest xmin on any potentially
all-visible page and treat that as a cutoff XID, cancelling queries
with snapshots whose xmins equal or precede that value, just as we
currently do when freezing tuples (cf xlog_heap_freeze). There are
two things to worry about: correctness, and more query cancellations.

In the department of query cancellations, I believe Noah argued
previously that this wasn't really going to cause a problem. And,
indeed, if the master has a mix of inserts, updates, and deletes, then
it seems likely that any recovery conflicts generated this way would
be hitting about the same place in the XID space as the ones caused by
pruning away dead tuples. What will be different, if we go this
route, is that an insert-only master, which right now only generates
conflicts at freezing time, will instead generate conflicts much
sooner, as soon as the relation is vacuumed. I do not use Hot Standby
enough to know whether this is a problem, and I'm not trying to block
this approach, but I do want to make sure that everyone agrees that
it's acceptable before we go do it, because inevitably somebody is
going to get bit. In making our decision, I think we should keep in
mind that we are currently pretty laid-back about marking pages
all-visible, and that's probably something we're going to want to
change over time: for example, we recently discussed the fact that a
page with one dead tuple in it currently needs *two* vacuums to become
all-visible, because only the first vacuum pass is smart enough to
mark things all-visible, and the first heap pass only truncates the
dead tuple to a dead line pointer, so the page isn't all-visible a
that point. When we fix that, which I think we're almost certainly
going to want to do at some point, then that means these conflicts
will occur that much sooner. I think we will likely also want to
consider marking things all-visible on the fly at some point in the
future; for example, if we pull a buffer in for an index-only scan and
dirty it by setting a hint bit, we might want to take the plunge and
mark it all-visible before evicting it, to save effort the next time.
Again, not trying to dissuade anyone, just making sure we've thought
through it enough to avoid being unhappy later. It's worth noting
that none of this will normally matter if hot_standby_feedback=on, so
part of the analysis here may depend on how many people we think have
flipped that switch.

As to correctness, after further review of lazy_scan_heap(), I think
there are some residual bugs here that need to be fixed whatever we
decide to do about the Hot Standby case:

1. I noticed that when we do PageSetAllVisible() today we follow it up
with SetBufferCommitInfoNeedsSave(). PD_ALL_VISIBLE is not merely a
hint, so I think these should be changed to MarkBufferDirty(), which
shouldn't make much difference given current code, but proposals to
handle hint changes differently than non-hint changes abound, so it's
probably not good to rely on those meaning the same thing forever.

2. More seriously, lazy_scan_heap() releases the buffer lock before
setting the all-visible bit on the heap. This looks just plain wrong
(and it's my fault, to be clear). Somebody else can come along after
we've released the buffer lock and mark the page not-all-visible.
That concurrent process will check the visibility map bit, find it
already clear, and go on its merry way. Then, we set the visibility
map bit and go on our merry way. Now we have a not-all-visible page
with the all-visible bit set in the visibility map. Oops. I think
this probably needs to be rewritten so that lazy_scan_heap() acquires
a pin on the visibility map page before locking the buffer for cleanup
and holds onto that pin until either we exit the range of heap pages
covered by that visibility map page or trigger index vac due to
maintenance_work_mem exhaustion. With that infrastructure in place,
we can set the visibility map bit at the same time we set the
page-level bit without risking holding the buffer lock across a buffer
I/O (we might have to hold the buffer lock across a WAL flush in the
worst case, but that hazard exists in a number of other places as well
and fixing it here is well out of scope).

Now, suppose we fix the above bugs and also add logic to generate
query-conflicts as described above. How far can the standby now trust
the visibility information? Setting the visibility map bit is a fully
WAL-logged operation, so anyone for whom seeing the bit as set would
potentially be a problem is certain to be killed before they get the
chance to become confused. And when we first open for read-only
connections after a new base backup, the initial snapshot had better
be able to see all XIDs which have committed prior to that point, and
only such things should be marked all-visible, so AFAICT the door is
nailed shut pretty tight here. I am a little less certain about the
page-level bit. On the master, it's possible for the PD_ALL_VISIBLE
bit to make it to disk before the WAL record that sets the visibility
map bit; the LSN interlock only protects the visibility map page
itself, as part of a complicated strategy to avoid emitting FPWs for
the entire heap when we vacuum an insert-only table. On the standby,
those same bits are going to get set when the xlog records covering
the setting of visibility map bit get replayed (which would be OK, in
the sense that the page-level bit would be trustworthy in this case),
or when a copy of the buffer makes it from master to standby by some
other means (which is the potentially problematic case). This could
happen either as part of a base backup, or via a FPW. I don't think
the base backup case is a problem for the same reasons that it's OK
for the visibility map bit case. If an FPW turns the bit on, then
somehow the page-level bit got set without the visibility map bit
getting set. I think the only for that to happen is:

1. vacuum on master sets the page-level bit and the visibility map bit
2. the heap page with the bit is written to disk BEFORE the WAL entry
generated in (1) gets flushed; this is allowable, since it's not an
error for the page-level bit to be set while the visibility-map bit is
unset, only the other way around
3. crash (still before the WAL entry is flushed)
4. some operation on the master emits an FPW for the page without
first rendering it not-all-visible

At present, I'm not sure there's any real problem here, since normally
an all-visible heap page is only going to get another WAL entry if
somebody does an insert, update, or delete on it, in which case the
visibility map bit is going to get cleared anyway. Freezing the page
might emit a new FPW, but that's going to generate conflicts anyway,
so no problem there. So I think there's no real problem here, but it
doesn't seem totally future-proof - any future type of WAL record that
might modify the page without rendering it not-all-visible would
create a very subtle hazard for Hot Standby. We could dodge the whole
issue by leaving the hack in heapgetpage() intact and just be happy
with making index-only scans work, or maybe somebody has a more clever
idea.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2012-04-13 16:40:09 Re: 9.2 release notes, beta time?
Previous Message Guillaume Lelarge 2012-04-13 16:17:33 Re: Why can't I use pgxs to build a plpgsql plugin?