Quick Links

SSI-related code drift between index_getnext() and heap_hot_search_buffer()

Lists:	pgsql-hackers

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	SSI-related code drift between index_getnext() and heap_hot_search_buffer()
Date:	2011-05-12 20:01:53
Message-ID:	BANLkTikaiW6mkT-Vs0K9wCq4LJ=g1BqEog@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Attempting to rebase on of Heikki's index-only scan patches has led me
to spend most of the day staring at the two functions named in the
subject line. I assume the code in these two functions has a common
origin, as a large amount of it is nearly identical; and one effect of
Heikki's patch is to get rid of the copy of the logic in
index_getnext() and instead call heap_hot_search_buffer(), with some
API changes to make that work. It didn't take me very long to rebase
the code well enough to make the regression tests pass, but the
isolation tests are failing, so I think I've mucked up the predicate
locking or something somewhere in here. Anyhow, that led me to a
careful comparison of the logic in the two above-named functions,
which as you might expect have drifted apart slightly.

In index_getnext(), we have:

if (valid)
{
/*
* If the snapshot is MVCC, we know that it could accept at
* most one member of the HOT chain, so we can skip examining
* any more members. Otherwise, check for continuation of the
* HOT-chain, and set state for next time.
*/
if (IsMVCCSnapshot(scan->xs_snapshot)
&& !IsolationIsSerializable())
scan->xs_next_hot = InvalidOffsetNumber;

I can't help noticing that the code no longer matches the comment
which precedes it. Apparently we can skip examining any more members
only when SSI is not in use, probably because we need to
predicate-lock the remaining tuples in the HOT chain. Another small
difficulty is that the logic in heap_hot_search_buffer() is slightly
different:

/* If it's visible per the snapshot, we must return it */
valid = HeapTupleSatisfiesVisibility(&heapTuple, snapshot, buffer);
CheckForSerializableConflictOut(valid, relation, &heapTuple, buffer);
if (valid)
{
ItemPointerSetOffsetNumber(tid, offnum);
PredicateLockTuple(relation, &heapTuple);
if (all_dead)
*all_dead = false;
if (IsolationIsSerializable())
match_found = true;
else
return true;
}

Here again we're refusing to bypass the remainder of the HOT chain
when the visible tuple is found, if SSI is in use. But the test is
slightly different. Here it applies whenever
IsolationIsSerializable(), whereas in index_getnext() it applies only
when IsolationIsSerializable() *and* we're using an MVCC snapshot.
Does that make any difference? Should we be consistent? Does SSI
care about non-MVCC snapshots?

Another thing I notice is that in heap_hot_search_buffer(), when this
SSI-related change applies, we go through and visit all the remaining
members of the HOT chain and then return the last (and presumably
only) tuple we encountered. But in index_getnext(), we return the
tuple we found immediately and then continue walking the HOT chain on
the next call. This difference seems possibly significant. If it's
necessary for correctness that we predicate-lock the invisible tuples,
then what happens if the scan stops here due to, say, a LIMIT?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To:	"Robert Haas" <robertmhaas(at)gmail(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: SSI-related code drift between index_getnext() and heap_hot_search_buffer()
Date:	2011-05-12 20:44:07
Message-ID:	4DCC0047020000250003D659@gw.wicourts.gov
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> [SSI questions based on breakage trying to resolve code drift]

I'm looking at this and should have answers within an hour or two.

-Kevin

From:	"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To:	"Robert Haas" <robertmhaas(at)gmail(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: SSI-related code drift between index_getnext() and heap_hot_search_buffer()
Date:	2011-05-12 23:02:03
Message-ID:	4DCC209B020000250003D670@gw.wicourts.gov
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> It didn't take me very long to rebase the code well enough to make
> the regression tests pass, but the isolation tests are failing, so
> I think I've mucked up the predicate locking or something
> somewhere in here. Anyhow, that led me to a careful comparison of
> the logic in the two above-named functions, which as you might
> expect have drifted apart slightly.

> In index_getnext()

> I can't help noticing that the code no longer matches the comment
> which precedes it.

Good point. That obviously needs to be fixed.

> Apparently we can skip examining any more members only when SSI is
> not in use

More precisely, it depends on whether the transaction which is
executing here is serializable; but yeah. (Some of our quick
returns are based on whether *any* serializable transaction is
active, and I didn't want anyone to mistake the test you're
describing for that one.)

> probably because we need to predicate-lock the remaining tuples in
> the HOT chain.

I had to think about this a while. If we see a tuple which *is*
visible to us, we need to acquire tuple locks on that and any later
versions of the tuple. When going through the index, we won't get
another shot at later HOT versions of the tuple, so we have to find
them now. If we can avoid it we should not lock any tuples
representing an *earlier* version of the row than we can see, or any
tuples from rows with no visible versions; unnecessarily locking
such tuples could reduce performance but would not allow anomalies.

> the logic in heap_hot_search_buffer() is slightly different

> Here again we're refusing to bypass the remainder of the HOT chain
> when the visible tuple is found, if SSI is in use. But the test
> is slightly different. Here it applies whenever
> IsolationIsSerializable(), whereas in index_getnext() it applies
> only when IsolationIsSerializable() *and* we're using an MVCC
> snapshot.

> Does SSI care about non-MVCC snapshots?

We should only be acquiring a predicate lock on a tuple for the
normal MVCC snapshot used by the serializable transaction. Perhaps
this is a test we should add to our "quick return" macros checked on
entry to most predicate.c functions; it seems burdensome to force
all callers to know such details. In index_getnext() it was doing
the check already, which is fine, but you're right -- we should be
checking this consistently to avoid false positives.

> Another thing I notice is that in heap_hot_search_buffer(), when
> this SSI-related change applies, we go through and visit all the
> remaining members of the HOT chain and then return the last (and
> presumably only) tuple we encountered.

That return value sounds like a bug. It certainly seems like it
should return the tuple which first caused it to set match_found =
true; that is to say, the same one which would be returned for a
non-serializable transaction.

> But in index_getnext(), we return the tuple we found immediately
> and then continue walking the HOT chain on the next call. This
> difference seems possibly significant. If it's necessary for
> correctness that we predicate-lock the invisible tuples, then what
> happens if the scan stops here due to, say, a LIMIT?

You're right -- it should go to the end of the chain for both. In
general a LIMIT can stop the acquisition of predicate locks because
you only need to lock rows which were *seen*. But at some point
after I added these calls it became clear that tuples representing
later versions of a visible and visited row also need to be locked.
I obviously didn't find all the places to fix when I had that
epiphany.

But that raises another question. The HOT updates are easy, because
they're on a page we're already visiting. The page is pinned and
locked already, so chasing down the HOT chain on a serializable
transaction is relatively cheap. What about non-HOT updates,
though? I'm pretty sure we handle those OK for most cases, but a
LIMIT which stops us after reading a visible-but-updated tuple and
before having read the new version of it could lead to a false
negative -- an undetected anomaly. I think this could also be a
factor for indexed min/max functions, semi-joins, and anti-joins.

It's easy to see how this one got missed in testing, because it
takes at least four separate transactions with certain timings to
the overlapping to hit the "next generation" type of anomalies, and
this takes some less frequently used feature to turn up in a
particular place in that set of transactions. Nonetheless, it is a
bug and should be fixed.

I think we need to do something about that, but it's not immediately
obvious to me what the best solution is. Chasing down the pointers
to subsequent versions of the row hardly seems sane. I would hate
to have to unconditionally escalate to a heap relation lock upon
seeing a visible-but-updated tuple, but I don't know whether it's
feasible to detect when we're in a scan which can complete early for
one of the aforementioned reasons.

Any other ideas on that one?

Anyway, I could clean up all but that last issue in the old code.
I'm not sure whether that makes sense if you're refactoring it
anyway. Would you like me to look at the refactored code to suggest
fixes? Would you rather do it yourself based on my answers here?
Do we need to sort out that last issue before proceeding on the
others?

-Kevin

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: SSI-related code drift between index_getnext() and heap_hot_search_buffer()
Date:	2011-05-13 03:44:51
Message-ID:	BANLkTi=7Tm9G3hsBSJ9hO8n12nMgAi+ydA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, May 12, 2011 at 7:02 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> Anyway, I could clean up all but that last issue in the old code.
> I'm not sure whether that makes sense if you're refactoring it
> anyway. Would you like me to look at the refactored code to suggest
> fixes? Would you rather do it yourself based on my answers here?
> Do we need to sort out that last issue before proceeding on the
> others?

First, in contrast to your promise to respond with answers in an hour
or two, that was three hours and one minute. Stop slacking off![1]

Second, I haven't a clue how to fix this. What I was doing was of
course targeted toward 9.2, but I have half a thought that making
index_getnext() call heap_hot_search_buffer() might be a sensible
thing to do in 9.1, because code duplication = more bugs. On the
third hand, at the moment the code that Heikki wrote to do that is
tangled up in a bunch of other things that we almost certainly don't
want to do in 9.1, and it's not obvious that it can be cleanly
untangled, so maybe that's not the right idea after all.

I think a good starting point might be to design a test case that
fails with the current code, and come up with a plan for what to do
about it. I have a very ugly feeling about this problem. I agree
with your feeling that chasing down the update pointers isn't
sensible, but a whole-relation lock seems like it could lead to a lot
more rollbacks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] For the benefit of anyone on the audience who lacks a sense of
humor, this is a joke.

From:	"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To:	"Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc:	<pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: SSI-related code drift between index_getnext() and heap_hot_search_buffer()
Date:	2011-05-13 16:10:56
Message-ID:	4DCD11C0020000250003D6C1@gw.wicourts.gov
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Robert Haas wrote:
> Kevin Grittner wrote:
>> Anyway, I could clean up all but that last issue in the old code.
>> I'm not sure whether that makes sense if you're refactoring it
>> anyway. Would you like me to look at the refactored code to
>> suggest fixes? Would you rather do it yourself based on my
>> answers here? Do we need to sort out that last issue before
>> proceeding on the others?

> I haven't a clue how to fix this. What I was doing was of course
> targeted toward 9.2, but I have half a thought that making
> index_getnext() call heap_hot_search_buffer() might be a sensible
> thing to do in 9.1, because code duplication = more bugs. On the
> third hand, at the moment the code that Heikki wrote to do that is
> tangled up in a bunch of other things that we almost certainly
> don't want to do in 9.1, and it's not obvious that it can be
> cleanly untangled, so maybe that's not the right idea after all.
>
> I think a good starting point might be to design a test case that
> fails with the current code, and come up with a plan for what to
> do about it. I have a very ugly feeling about this problem. I
> agree with your feeling that chasing down the update pointers
> isn't sensible, but a whole-relation lock seems like it could lead
> to a lot more rollbacks.

OK, will work on a test case for this last issue, but it might make
sense to address some of the other points separately first. For one
thing it might allow you to continue on with your 9.2 work with
clean tests. I can't do much on any of it today, as I have to deal
with some other things before being away for a week.

This is such a remote corner case that it would be really good if
we can limit the relation locks to cases where we're somewhere near
that corner. I've been trying to work out how to do that -- not
there yet, but I see some glimmers of how it might be done. The
nice thing about putting together a test case for something this
hard to hit is that it helps clarify the dynamics of the problem,
and solutions sometimes just pop out of it.

FWIW, so far what I know is that it will take an example something
like the one shown here:

http://archives.postgresql.org/pgsql-hackers/2011-02/msg00325.php

with the further requirements that the update in T3 must not be a
HOT update, T1 would still need to acquire a snapshot before T2
committed while moving its current select down past the commit of
T3, and that select would need to be modified so that it would scan
the visible tuple and then stop (e.g., because of a LIMIT) before
reaching the tuple which represents the next version of the row.

-Kevin

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: SSI-related code drift between index_getnext() and heap_hot_search_buffer()
Date:	2011-05-14 14:18:42
Message-ID:	BANLkTik8KxxjJ1KW-pO+WWBdTEAT+80ArQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, May 13, 2011 at 12:10 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> FWIW, so far what I know is that it will take an example something
> like the one shown here:
>
> http://archives.postgresql.org/pgsql-hackers/2011-02/msg00325.php
>
> with the further requirements that the update in T3 must not be a
> HOT update, T1 would still need to acquire a snapshot before T2
> committed while moving its current select down past the commit of
> T3, and that select would need to be modified so that it would scan
> the visible tuple and then stop (e.g., because of a LIMIT) before
> reaching the tuple which represents the next version of the row.

I think I see another problem here. Just before returning each tuple,
index_getnext() records in the IndexScanDesc the offset number of the
next tuple in the HOT chain, and the XMAX of the tuple being returned.
On the next call, it will go on to examine that TID checking, among
other things, whether the XMIN of the tuple at that location matches
the previously stored XMAX. But no buffer content locks is held
across calls. So consider a HOT chain A -> B. After returning A, the
IndexScanDesc will consider that we should next look at B. Now B
rolls back, and a new transaction updates A, so we now have A -> C.
(I believe this is possible.) When the next call to index_getnext()
occurs, it'll look at B and consider that it's reached the end of the
HOT chain - but in reality it has not, because it has never looked at
C.

Now, prior to SSI, I believe this did not matter, because the only
time we traversed the entire HOT chain rather than stopping at the
first visible tuple was when we were using a non-MVCC snapshot.
According to Heikki's submission notes for the patch I was trying to
rebase, the only time that happens is during CLUSTER, at which point
we have an AccessExclusiveLock on the table. But SSI wants to
traverse the whole HOT chain even when using an MVCC snapshot, so now
we (maybe) have a problem.

I think I have an inkling of how to plug this, but first I have to go
buy groceries.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company