Re: serializable lock consistency

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: serializable lock consistency
Date: 2010-12-19 04:24:01
Message-ID: AANLkTinv9MED7f1Nn8zz=yBDLuhrQp6Y7yewQe7MX_4H@mail.gmail.com
Lists: pgsql-hackers

On Wed, Dec 15, 2010 at 9:20 AM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
> On Dec14, 2010, at 15:01 , Robert Haas wrote:
>> On Tue, Dec 14, 2010 at 7:51 AM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
>>>> - serializable lock consistency - I am fairly certain this needs
>>>> rebasing.  I don't have time to deal with it right away.  That sucks,
>>>> because I think this is a really important change.
>>> I can try to find some time to update the patch if it suffers from bit-rot. Would that help?
>>
>> Yes!
>
> I've rebased the patch to the current HEAD, and re-run my FK concurrency test suite,
> available from https://github.com/fgp/fk_concurrency, to verify that things still work.
>
> I've also added asserts to the callers of heap_{update,delete,lock_tuple} to verify (and document)
> that update_xmax may only be InvalidTransactionId if a lockcheck_snapshot is passed to
> heap_{update,delete,lock_tuple}.
>
> Finally, I've improved the explanation in src/backend/executor/README of how row locks and
> REPEATABLE READ transactions interact, and tried to state the guarantees provided by
> FOR SHARE and FOR UPDATE locks more precisely.
>
> I've published my work to https://github.com/fgp/postgres/tree/serializable_lock_consistency,
> and attached an updated patch. I'd be happy to give write access to that GIT repository
> to anyone who wants to help getting this committed.

I am having a hard time understanding this patch. I decided to start
with the README, and I'm still lost. :-(

My understanding of the problem is as follows. Acquiring a lock on a
tuple prevents the tuple from being concurrently updated. You might
take such a lock on a tuple T, make some other modification to the
database M, and commit, in the hope that your lock will prevent a
concurrent transaction from updating T without seeing M. However, it
doesn't work. When your lock on T is released, a concurrent
serializable transaction will still be looking at the database using
its original snapshot, and therefore won't see M. In effect, you
might as well not have bothered obtaining a lock at all. (You'd get
the same wrong answer that way, without unnecessary blocking!)
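To make this concrete, here's one way the interleaving can play out, written
as two sessions (the parent/child tables and the trigger-enforced rule "no
parent may be deleted while children reference it" are hypothetical stand-ins;
T is the parent row, M is the child insert):

-- Session B, SERIALIZABLE:
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT count(*) FROM child WHERE parent_id = 1;  -- snapshot taken: no children

-- Session A:
BEGIN;
SELECT * FROM parent WHERE id = 1 FOR SHARE;     -- lock T
INSERT INTO child (parent_id) VALUES (1);        -- modification M
COMMIT;                                          -- lock on T is released

-- Session B, continued:
DELETE FROM parent WHERE id = 1;   -- nothing blocks this, and B's snapshot
                                   -- still doesn't show M (the child row)
COMMIT;                            -- result: an orphaned child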

To fix this, one must ensure that if a tuple T is locked - either for
update or for share - by transaction A, and if a serializable
transaction B whose snapshot doesn't permit it to see the effects of A
subsequently attempts to update T, it must roll back.

Questions:
1. Is my understanding of the problem correct?
2. Is my understanding of the fix correct?
3. Is that what this patch implements?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: serializable lock consistency
Date: 2010-12-19 09:02:39
Message-ID: 20F4E0FD-F416-477B-936A-C4A7FAFF65DA@phlo.org
Lists: pgsql-hackers

> My understanding of the problem is as follows. Acquiring a lock on a
> tuple prevents the tuple from being concurrently updated. You might
> take such a lock on a tuple T, make some other modification to the
> database M, and commit, in the hope that your lock will prevent a
> concurrent transaction from updating T without seeing M. However, it
> doesn't work. When your lock on T is released, a concurrent
> serializable transaction will still be looking at the database using
> its original snapshot, and therefore won't see M.

Exactly.

> In effect, you
> might as well not have bothered obtaining a lock at all. (You'd get
> the same wrong answer that way, without unnecessary blocking!)

If only serializable transactions are involved and they all commit,
then yes. The only reason for a serializable transaction A to wait for
a lock taken by another transaction B is that B might abort, in which
case A may proceed.

> To fix this, one must ensure that if a tuple T is locked - either for
> update or for share - by transaction A, and if a serializable
> transaction B whose snapshot doesn't permit it to see the effects of A
> subsequently attempts to update T, it must roll back.

Yes. Otherwise, B cannot verify that the database is consistent.

Note that it's sufficient to check if B can see the effects of the
*latest* locker of T. If it can see those, it must also see the
effects of any previous locker. But because of this, B cannot
distinguish different lock strengths on T - even if A locked T
in exclusive mode, some transaction A2 may lock T in shared mode
after A has committed but before B inspects T.
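Spelled out as an interleaving (table t and column val are invented for
illustration; the comments describe the tuple header as I understand it):

-- Session A:
BEGIN;
SELECT * FROM t WHERE id = 1 FOR UPDATE;   -- T's xmax <- A (exclusive)
COMMIT;

-- Session A2:
BEGIN;
SELECT * FROM t WHERE id = 1 FOR SHARE;    -- T's xmax <- A2; A's xid is gone
COMMIT;

-- Session B, whose snapshot predates A's commit:
UPDATE t SET val = val + 1 WHERE id = 1;   -- B can only check A2; but seeing
                                           -- A2's effects implies seeing A's,
                                           -- since A committed before A2
                                           -- locked T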

> Questions:
> 1. Is my understanding of the problem correct?
> 2. Is my understanding of the fix correct?

Seems fine to me.

> 3. Is that what this patch implements?

It is. There is a small additional twist, however.

For serializable transactions, everything is as you explained it. Taking a
lock amounts to saying "If you can't see what I did, leave this tuple alone".
A read-committed transaction, though, sees different things at different
times since it takes a new snapshot for every statement. Since we cannot
raise a serialization error in a read-committed transaction, an UPDATE
or DELETE statement within a read-committed transaction may very well
modify a previously locked row, even if it *doesn't* see the effects of some
concurrent locker. Any snapshot taken after the UPDATE or DELETE hit the
locked row, however, *will* see those changes. This includes the snapshot
taken within any AFTER trigger fired on updating the locked row. Thus,
things for one fine for RI constraints enforced by triggers.
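For example (hypothetical parent/child tables again, with an RI-style
AFTER trigger on parent; a sketch of the timing as described above):

-- Session A:
BEGIN;
SELECT * FROM parent WHERE id = 1 FOR SHARE;   -- lock the parent row
INSERT INTO child (parent_id) VALUES (1);

-- Session B, READ COMMITTED:
DELETE FROM parent WHERE id = 1;               -- blocks on A's lock

-- Session A:
COMMIT;

-- B's DELETE now proceeds without a serialization error, even though its
-- statement snapshot predates A's commit. The AFTER DELETE trigger on
-- parent, however, runs with a snapshot taken after the DELETE reached
-- the row, sees the child row A inserted, and can raise an error.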

best regards,
Florian Pflug


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: serializable lock consistency
Date: 2010-12-19 11:24:27
Message-ID: AANLkTimaNcTysN9ve5HndgXnSMkNJXDajjnhdE98hoaD@mail.gmail.com
Lists: pgsql-hackers

On Sun, Dec 19, 2010 at 4:02 AM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
> Yes. Otherwise, B cannot verify that the database is consistent.
>
> Note that it's sufficient to check if B can see the effects of the
> *latest* locker of T. If it can see those, it must also see the
> effects of any previous locker. But because of this, B cannot
> distinguish different lock strengths on T - even if A locked T
> in exclusive mode, some transaction A2 may lock T in shared mode
> after A has committed but before B inspects T.

This seems to point to a rather serious problem, though. If B sees
that the last locker A1 aborted, correctness requires that it roll
back B, because there might have been some other transaction A2 which
locked T and committed before A1 touched it. Implementing that
behavior could lead to a lot of spurious rollbacks, but NOT
implementing it isn't fully correct.
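Concretely (hypothetical table t again):

-- Session B, SERIALIZABLE:
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT * FROM t WHERE id = 1;              -- snapshot taken before A2 commits

-- Session A2: locks T and commits; B must not update T without seeing this.
BEGIN;
SELECT * FROM t WHERE id = 1 FOR UPDATE;
COMMIT;

-- Session A1: locks T again, overwriting A2's xid in xmax, then aborts.
BEGIN;
SELECT * FROM t WHERE id = 1 FOR UPDATE;
ROLLBACK;

-- Session B: xmax now holds only the aborted A1; A2's committed lock has
-- left no trace, so B has no way of knowing it must fail.
UPDATE t SET val = val + 1 WHERE id = 1;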

> For serializable transactions, everything is as you explained it. Taking a
> lock amounts to saying "If you can't see what I did, leave this tuple alone".
> A read-committed transaction, though, sees different things at different
> times since it takes a new snapshot for every statement. Since we cannot
> raise a serialization error in a read-committed transaction, an UPDATE
> or DELETE statement within a read-committed transaction may very well
> modify a previously locked row, even if it *doesn't* see the effects of some
> concurrent locker. Any snapshot taken after the UPDATE or DELETE hit the
> locked row, however, *will* see those changes. This includes the snapshot
> taken within any AFTER trigger fired on updating the locked row. Thus,
> things for one fine for RI constraints enforced by triggers.

I can't parse the last sentence of this paragraph. I think there is a
word or two wrong.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: "Florian Pflug" <fgp(at)phlo(dot)org>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Florian Pflug" <fgp(at)phlo(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: serializable lock consistency
Date: 2010-12-19 14:12:13
Message-ID: 51207.188.23.189.250.1292767933.squirrel@webmail.phlo.org
Lists: pgsql-hackers

> On Sun, Dec 19, 2010 at 4:02 AM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
>> Note that it's sufficient to check if B can see the effects of the
>> *latest* locker of T. If it can see those, it must also see the
>> effects of any previous locker. But because of this, B cannot
>> distinguish different lock strengths on T - even if A locked T
>> in exclusive mode, some transaction A2 may lock T in shared mode
>> after A has committed but before B inspects T.
>
> This seems to point to a rather serious problem, though. If B sees
> that the last locker A1 aborted, correctness requires that it roll
> back B, because there might have been some other transaction A2 which
> locked T and committed before A1 touched it. Implementing that
> behavior could lead to a lot of spurious rollbacks, but NOT
> implementing it isn't fully correct.

Certainly seems serious. How on earth did I manage to miss that one, I
wonder :-(

If only shared locks are involved, the patch probably works correctly,
though, since they don't remove all traces of previous lockers, they
merely add themselves to the multi-xid holding the lock. That'd also
explain why my concurrency test-suite didn't trip over this. So we may
only need to do what you propose for exclusive locks.

I'll look at this later today in more detail - I'm not currently at home,
so I don't have access to the sources...

>> For serializable transactions, everything is as you explained it. Taking a
>> lock amounts to saying "If you can't see what I did, leave this tuple alone".
>> A read-committed transaction, though, sees different things at different
>> times since it takes a new snapshot for every statement. Since we cannot
>> raise a serialization error in a read-committed transaction, an UPDATE
>> or DELETE statement within a read-committed transaction may very well
>> modify a previously locked row, even if it *doesn't* see the effects of some
>> concurrent locker. Any snapshot taken after the UPDATE or DELETE hit the
>> locked row, however, *will* see those changes. This includes the snapshot
>> taken within any AFTER trigger fired on updating the locked row. Thus,
>> things for one fine for RI constraints enforced by triggers.
>
> I can't parse the last sentence of this paragraph. I think there is a
> word or two wrong.

Oops, sorry for that. s/for one/are - the sentence should have read "Thus,
things are fine for RI constraints enforced by triggers".

best regards,
Florian Pflug


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: fgp(at)phlo(dot)org
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: serializable lock consistency
Date: 2010-12-19 16:01:08
Message-ID: AANLkTinkRccH_8MUGU31SvxncTdgVwJjL67V2PeCPgQc@mail.gmail.com
Lists: pgsql-hackers

On Sun, Dec 19, 2010 at 9:12 AM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
>> On Sun, Dec 19, 2010 at 4:02 AM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
>>> Note that it's sufficient to check if B can see the effects of the
>>> *latest* locker of T. If it can see those, it must also see the
>>> effects of any previous locker. But because of this, B cannot
>>> distinguish different lock strengths on T - even if A locked T
>>> in exclusive mode, some transaction A2 may lock T in shared mode
>>> after A has committed but before B inspects T.
>>
>> This seems to point to a rather serious problem, though.  If B sees
>> that the last locker A1 aborted, correctness requires that it roll
>> back B, because there might have been some other transaction A2 which
>> locked T and committed before A1 touched it.  Implementing that
>> behavior could lead to a lot of spurious rollbacks, but NOT
>> implementing it isn't fully correct.
>
> Certainly seems serious. How on earth did I manage to miss that one, I
> wonder :-(
>
> If only shared locks are involved, the patch probably works correctly,
> though, since they don't remove all traces of previous lockers, they
> merely add themselves to the multi-xid holding the lock. That'd also
> explain why my concurrency test-suite didn't trip over this. So we may
> only need to do what you propose for exclusive locks.

I'd be willing to bet, without checking, that if the previous shared
locker is no longer running by the time the next one comes along, we
don't create a multixact in the first place. And even if it does,
what about this: exclusive lock, commit, shared lock, abort.
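That is, an interleaving like this (hypothetical table t):

-- Session A1: exclusive lock, commit.
BEGIN;
SELECT * FROM t WHERE id = 1 FOR UPDATE;   -- xmax <- A1, no multixact
COMMIT;

-- Session A2: shared lock, abort. A1 is no longer running, so xmax is
-- presumably just overwritten with A2's xid rather than a multixact
-- being created.
BEGIN;
SELECT * FROM t WHERE id = 1 FOR SHARE;
ROLLBACK;

-- A serializable transaction whose snapshot predates A1's commit now
-- finds only the aborted A2 in xmax; A1's committed lock has vanished.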

As unhappy as I am with the present behavior, this cure sounds like it
might be worse than the disease.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: serializable lock consistency
Date: 2010-12-19 18:57:25
Message-ID: AA7DB036-4DDA-42C6-B5A9-DB6AD51B9EBC@phlo.org
Lists: pgsql-hackers

On Dec19, 2010, at 17:01 , Robert Haas wrote:
> On Sun, Dec 19, 2010 at 9:12 AM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
>>> On Sun, Dec 19, 2010 at 4:02 AM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
>>>> Note that it's sufficient to check if B can see the effects of the
>>>> *latest* locker of T. If it can see those, it must also see the
>>>> effects of any previous locker. But because of this, B cannot
>>>> distinguish different lock strengths on T - even if A locked T
>>>> in exclusive mode, some transaction A2 may lock T in shared mode
>>>> after A has committed but before B inspects T.
>>>
>>> This seems to point to a rather serious problem, though. If B sees
>>> that the last locker A1 aborted, correctness requires that it roll
>>> back B, because there might have been some other transaction A2 which
>>> locked T and committed before A1 touched it. Implementing that
>>> behavior could lead to a lot of spurious rollbacks, but NOT
>>> implementing it isn't fully correct.
>>
>> Certainly seems serious. How on earth did I manage to miss that one, I
>> wonder :-(
>>
>> If only shared locks are involved, the patch probably works correctly,
>> though, since they don't remove all traces of previous lockers, they
>> merely add themselves to the multi-xid holding the lock. That'd also
>> explain why my concurrency test-suite didn't trip over this. So we may
>> only need to do what you propose for exclusive locks.
>
> I'd be willing to bet, without checking, that if the previous shared
> locker is no longer running by the time the next one comes along, we
> don't create a multixact in the first place.

And you'd have won that bet. I just checked, and this is exactly what
we do.

> And even if it does,
> what about this: exclusive lock, commit, shared lock, abort.

I figured we could solve that by adding the exclusive locker's xid
to the multi-xid if it was >= GlobalXmin.

But...

Playing games with multi-xid lifetimes helps nothing in case
T1 locks, T1 commits, T2 updates, T2 aborts, all after T0
took its snapshot but before T0 attempts to delete. :-(

Actually, wait a second...

What we're interested in here is the last non-aborted xmax of the tuple.
If that is invisible to some serializable transaction which tries
to modify the tuple, we raise a serialization error.

In the case of updates, deletes and exclusive locks, we always wait
for the xid(s) in xmax (more than one if share-locked) to either commit or
abort, and thus know their fate. We can therefore track the latest
non-aborted xmax in a single additional field "xlast" if we simply
save the old xmax there before overwriting it *if* that xmax did
actually commit. Otherwise we leave xlast as it was.

The same works for share locks. If the tuple was previously locked
exclusively, updated or deleted, we proceed as above. If the tuple
was previously share-locked, some other lockers may still be in
progress. If at least one of them has committed, we again store its
xid in xlast. Otherwise, we again leave xlast as it was.

Note that xlast is never a multi-xid!

When a serializable transaction updates, deletes or exclusively locks
a tuple, we check if either xmax *or* xlast is invisible. If so,
we raise a serialization error.
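To illustrate, the problematic interleaving from above would then play out
like this (table t, column val and the xids are purely illustrative; xlast
is of course the proposed field, not an existing one):

-- Session T0:
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT * FROM t WHERE id = 1;               -- T0's snapshot taken here

-- Session T1:
BEGIN;
SELECT * FROM t WHERE id = 1 FOR UPDATE;    -- xmax = T1, xlast unchanged
COMMIT;

-- Session T2:
BEGIN;
UPDATE t SET val = 0 WHERE id = 1;          -- T1 committed, so xlast = T1;
ROLLBACK;                                   -- xmax = T2 remains, but aborted

-- Session T0:
DELETE FROM t WHERE id = 1;  -- xmax (T2) aborted, but xlast (T1) is invisible
                             -- to T0's snapshot -> serialization error,
                             -- as desired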

If we reuse the legacy field xvac to store xlast, we don't get into
trouble with binary upgrades either. We'd need to find a way to deal
with tuples where HEAP_MOVED_IN or HEAP_MOVED_OUT is set, but that
seems manageable...

Does that sound viable?

> As unhappy as I am with the present behavior, this cure sounds like it
> might be worse than the disease.

I agree. Aborting due to conflicts with aborted transactions really
seems like a bad idea - there are probably cases where that would lead
to one spurious abort causing another, causing another...

But the solution sketched may be a way out...

best regards,
Florian Pflug

PS: Thanks to anyone who put work into this! I'm extremely sorry that
I didn't catch this earlier, and thus caused unnecessary work on your
end :-(


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: serializable lock consistency
Date: 2010-12-20 06:16:34
Message-ID: 4D0EF4C2.7070208@enterprisedb.com
Lists: pgsql-hackers

On 19.12.2010 20:57, Florian Pflug wrote:
> If we reuse the legacy field xvac to store xlast, we don't get into
> trouble with binary upgrades either. We'd need to find a way to deal
> with tuples where HEAP_MOVED_IN or HEAP_MOVED_OUT is set, but that
> seems manageable...

xvac shares the field with command id, and cid is in use while the tuple
is being updated.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: serializable lock consistency
Date: 2010-12-20 11:52:15
Message-ID: 2C95D837-89CB-4286-A80B-2D36A8B4385E@phlo.org
Lists: pgsql-hackers

On Dec20, 2010, at 07:16 , Heikki Linnakangas wrote:
> On 19.12.2010 20:57, Florian Pflug wrote:
>> If we reuse the legacy field xvac to store xlast, we don't get into
>> trouble with binary upgrades either. We'd need to find a way to deal
>> with tuples where HEAP_MOVED_IN or HEAP_MOVED_OUT is set, but that
>> seems manageable...
>
> xvac shares the field with command id, and cid is in use while the tuple is being updated.

Right :-(

Well, that nails this coffin shut pretty tightly, unless we were willing to add
another field to heap tuples.

best regards,
Florian Pflug


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: serializable lock consistency
Date: 2010-12-20 12:13:59
Message-ID: 4D0F4887.20700@enterprisedb.com
Lists: pgsql-hackers

On 20.12.2010 13:52, Florian Pflug wrote:
> On Dec20, 2010, at 07:16 , Heikki Linnakangas wrote:
>> On 19.12.2010 20:57, Florian Pflug wrote:
>>> If we reuse the legacy field xvac to store xlast, we don't get into
>>> trouble with binary upgrades either. We'd need to find a way to deal
>>> with tuples where HEAP_MOVED_IN or HEAP_MOVED_OUT is set, but that
>>> seems manageable...
>>
>> xvac shares the field with command id, and cid is in use while the tuple is being updated.
>
> Right :-(
>
> Well, that nails this coffin shut pretty tightly, unless we were willing to add
> another field to heap tuples.

One way to look at this is that the problem arises because SELECT FOR
UPDATE doesn't create a new tuple like UPDATE does. The problematic case
was:

> T1 locks, T1 commits, T2 updates, T2 aborts, all after T0
> took its snapshot but before T0 attempts to delete. :-(

If T1 does a regular UPDATE, T2 doesn't overwrite the xmax on the
original tuple, but on the tuple that T1 created.

So one way to handle FOR UPDATE would be to lazily turn the lock
operation by T1 into a dummy update, when T2 updates the tuple. You
can't retroactively make a regular update on behalf of the locking
transaction that committed already, or concurrent selects would see the
same row twice, but it might work with some kind of a magic tuple that's
only followed through the ctid from the original one, and only for the
purpose of visibility checks.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: serializable lock consistency
Date: 2010-12-20 14:11:53
Message-ID: B8055ECD-3669-41C5-BFD0-F6AD7BB080DD@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Dec20, 2010, at 13:13 , Heikki Linnakangas wrote:
> One way to look at this is that the problem arises because SELECT FOR UPDATE doesn't create a new tuple like UPDATE does. The problematic case was:
>
>> T1 locks, T1 commits, T2 updates, T2 aborts, all after T0
>> took its snapshot but before T0 attempts to delete. :-(
>
> If T1 does a regular UPDATE, T2 doesn't overwrite the xmax on the original tuple, but on the tuple that T1 created.

> So one way to handle FOR UPDATE would be to lazily turn the lock operation by T1 into a dummy update, when T2 updates the tuple. You can't retroactively make a regular update on behalf of the locking transaction that committed already, or concurrent selects would see the same row twice, but it might work with some kind of a magic tuple that's only followed through the ctid from the original one, and only for the purpose of visibility checks.

In the case of an UPDATE of a recently locked tuple, we could avoid having to insert a dummy tuple by storing the old tuple's xmax in the new tuple's xmax. We'd flag the old tuple, and attempt to restore the xmax of any flagged tuple with an aborted xmax and a ctid != t_self during scanning and vacuuming.
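Roughly, the earlier interleaving would then play out like this (xids
illustrative; exactly how the restore pass fetches the stashed value back
from the new tuple is hand-waved here):

-- Session T1:
BEGIN;
SELECT * FROM t WHERE id = 1 FOR UPDATE;  -- old tuple: xmax = T1
COMMIT;

-- Session T2:
BEGIN;
UPDATE t SET val = 0 WHERE id = 1;  -- new tuple created with xmax = T1 (the
                                    -- stashed value); old tuple: xmax = T2,
                                    -- flagged, ctid -> new tuple
ROLLBACK;

-- Later, a scan or VACUUM sees the old tuple flagged, with an aborted xmax
-- (T2) and ctid != t_self, and restores old.xmax = T1 from the new tuple.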

For DELETEs, that won't work. However, could we maybe abuse the ctid to store the old xmax? It currently contains t_self, but do we actually depend on that?

FOR-SHARE and FOR-UPDATE locks could preserve information about the latest committed locker by creating a multi-xid. For FOR-SHARE locks, we'd just need to ensure that, when cleaning finished transactions out of the multi-xid, we keep the latest committed one. For FOR-UPDATE locks, we'd need to create a multi-xid if the old xmax is >= GlobalXmin, but I guess that's tolerable.

best regards,
Florian Pflug


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: serializable lock consistency
Date: 2010-12-20 16:54:19
Message-ID: AANLkTikXjxGFbfhFg0gaxXJEEeWqdAwSEiauZzgUPqMX@mail.gmail.com
Lists: pgsql-hackers

On Mon, Dec 20, 2010 at 9:11 AM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
> On Dec20, 2010, at 13:13 , Heikki Linnakangas wrote:
>> One way to look at this is that the problem arises because SELECT FOR UPDATE doesn't create a new tuple like UPDATE does. The problematic case was:
>>
>>> T1 locks, T1 commits, T2 updates, T2 aborts, all after T0
>>> took its snapshot but before T0 attempts to delete. :-(
>>
>> If T1 does a regular UPDATE, T2 doesn't overwrite the xmax on the original tuple, but on the tuple that T1 created.
>
>> So one way to handle FOR UPDATE would be to lazily turn the lock operation by T1 into a dummy update, when T2 updates the tuple. You can't retroactively make a regular update on behalf of the locking transaction that committed already, or concurrent selects would see the same row twice, but it might work with some kind of a magic tuple that's only followed through the ctid from the original one, and only for the purpose of visibility checks.
>
> In the case of an UPDATE of a recently locked tuple, we could avoid having to insert a dummy tuple by storing the old tuple's xmax in the new tuple's xmax. We'd flag the old tuple, and attempt to restore the xmax of any flagged tuple with an aborted xmax and a ctid != t_self during scanning and vacuuming.
>
> For DELETEs, that won't work. However, could we maybe abuse the ctid to store the old xmax? It currently contains t_self, but do we actually depend on that?

My first reaction to all of this is that it sounds awfully grotty.

> FOR-SHARE and FOR-UPDATE locks could preserve information about the latest committed locker by creating a multi-xid. For FOR-SHARE locks, we'd just need to ensure that, when cleaning finished transactions out of the multi-xid, we keep the latest committed one. For FOR-UPDATE locks, we'd need to create a multi-xid if the old xmax is >= GlobalXmin, but I guess that's tolerable.

Even in the original version of this patch, there's a non-trivial
overhead here when a multi-xid exists that doesn't exist today: a
serializable transaction has to grovel through the XIDs in the
multi-xact and figure out whether any of them are new enough to be a
problem. I fear that this whole approach is a case of trying to jam a
square peg through a round hole. We're trying to force the on-disk
format that we have to meet a requirement it wasn't designed for, and
it's looking pretty ugly. Kevin Grittner's work is a whole different
approach to this problem, and while that's obviously not fully
debugged and committed yet either, it's often easier to design a new
tool to solve a particular problem than to make an existing tool that
was really meant for something else do some new thing in addition.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Florian Pflug" <fgp(at)phlo(dot)org>
Cc: "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: serializable lock consistency
Date: 2010-12-20 17:39:07
Message-ID: 4D0F405B0200002500038834@gw.wicourts.gov
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> Kevin Grittner's work is a whole different approach to this
> problem, and while that's obviously not fully debugged and
> committed yet either, it's often easier to design a new tool to
> solve a particular problem than to make an existing tool that was
> really meant for something else do some new thing in addition.

I see Florian's patch meeting a real need though, even though it is
an orthogonal solution to the same problem. For those converting to
PostgreSQL from Oracle, there are shops with a lot of code
which counts on the behavior which Florian is trying to implement.
Without Florian's patch, if they do a minimal-effort conversion to
PostgreSQL (without removing SELECT FOR SHARE/UPDATE clauses or
explicit locks) and count on SSI behavior, they will be paying for
the cost of integrity enforcement *twice*. It also may be useful
for those who can't easily structure their code to automatically
retry transactions which experience a serialization failure.

For those shops doing "green field" development or converting from
products with true serializability (DB2, Sybase ASE, InnoDB, etc.)
it will probably be easier for them to just use SSI; and I have
reason to believe that SSI will generally perform better, so
Florian's patch doesn't obviate the need for SSI.

-Kevin


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Florian Pflug <fgp(at)phlo(dot)org>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: serializable lock consistency
Date: 2010-12-20 17:45:26
Message-ID: AANLkTim74H6Jzzh_QPKa4XSJ2CcFwQi_6VwtV8mtEX=9@mail.gmail.com
Lists: pgsql-hackers

On Mon, Dec 20, 2010 at 12:39 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> I see Florian's patch meeting a real need though,

I agree, but that whole approach seems to be foundering on the rocks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: serializable lock consistency
Date: 2010-12-20 17:49:22
Message-ID: AA2EC79C-CC2C-450A-BF4D-6A037E2D0C08@phlo.org
Lists: pgsql-hackers

On Dec20, 2010, at 17:54 , Robert Haas wrote:
> On Mon, Dec 20, 2010 at 9:11 AM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
>> On Dec20, 2010, at 13:13 , Heikki Linnakangas wrote:
>>> One way to look at this is that the problem arises because SELECT FOR UPDATE doesn't create a new tuple like UPDATE does. The problematic case was:
>>>
>>>> T1 locks, T1 commits, T2 updates, T2 aborts, all after T0
>>>> took its snapshot but before T0 attempts to delete. :-(
>>>
>>> If T1 does a regular UPDATE, T2 doesn't overwrite the xmax on the original tuple, but on the tuple that T1 created.
>>
>>> So one way to handle FOR UPDATE would be to lazily turn the lock operation by T1 into a dummy update, when T2 updates the tuple. You can't retroactively make a regular update on behalf of the locking transaction that committed already, or concurrent selects would see the same row twice, but it might work with some kind of a magic tuple that's only followed through the ctid from the original one, and only for the purpose of visibility checks.
>>
>> In the case of an UPDATE of a recently locked tuple, we could avoid having to insert a dummy tuple by storing the old tuple's xmax in the new tuple's xmax. We'd flag the old tuple, and attempt to restore the xmax of any flagged tuple with an aborted xmax and a ctid != t_self during scanning and vacuuming.
>>
>> For DELETEs, that won't work. However, could we maybe abuse the ctid to store the old xmax? It currently contains t_self, but do we actually depend on that?
>
> My first reaction to all of this is that it sounds awfully grotty.

Well, there's precedent for overlaying fields in the tuple header for
space efficiency...

>> FOR-SHARE and FOR-UPDATE locks could preserve information about the latest committed locker by creating a multi-xid. For FOR-SHARE locks, we'd just need to ensure that, when cleaning finished transactions out of the multi-xid, we keep the latest committed one. For FOR-UPDATE locks, we'd need to create a multi-xid if the old xmax is >= GlobalXmin, but I guess that's tolerable.
>
> Even in the original version of this patch, there's a non-trivial
> overhead here when a multi-xid exists that doesn't exist today: a
> serializable transaction has to grovel through the XIDs in the
> multi-xact and figure out whether any of them are new enough to be a
> problem.

It has to grovel through them anyway to see if any one of them is
still running (in the UPDATE/DELETE/FOR-UPDATE case), or needs to
copy them into a new multi-xid (in the FOR-SHARE case). Scanning
it a second time thus won't cause any further IO. If the second scan
is really a concern here, it could probably be folded into the first,
but I doubt very much that it'd be worth the added complexity.

Having to create a multi-xid for FOR-UPDATE locks is a real concern,
though. But if overlaying/abusing ctid proves to be possible in the case
of DELETEs, the same could be done for FOR-SHARE and FOR-UPDATE locks.
Only the UPDATE case would then remain...

> I fear that this whole approach is a case of trying to jam a
> square peg through a round hole. We're trying to force the on-disk
> format that we have to meet a requirement it wasn't designed for, and
> it's looking pretty ugly.

Certainly true. But most of the grotty-ness isn't inherently part
of a solution to the problem at hand. Rather, it's imposed on us
by the requirement for binary upgradability. IMHO, it's the price
we pay for having that.

> Kevin Grittner's work is a whole different
> approach to this problem, and while that's obviously not fully
> debugged and committed yet either, it's often easier to design a new
> tool to solve a particular problem than to make an existing tool that
> was really meant for something else do some new thing in addition.

Kevin's work would only solve the problem at hand if we removed support
for what we call SERIALIZABLE now completely. As it stands, however, what
we now call SERIALIZABLE will still be available as isolation level
REPEATABLE READ, just as it is now, and will retain its quirky behaviour
regarding row-level locks. For some workloads at least, Kevin's approach
also introduces many more reasons for spurious serialization failures.

Don't take me wrong here - I think what Kevin is doing is very important,
and will be a huge leap for postgres, putting us ahead of all commercial
and open-source competitors that I know of.

But having to pay the price for that to be able to simply enforce RI
constraints slightly outside of what FK constraints can do still seems
wrong. For driving a screw into a piece of wood, a screwdriver still beats
even the most powerful hammer, and leaves fewer traces on the wood...

BTW, I realized that preventing UPDATEs, DELETEs and FOR-UPDATE locks
from removing all traces of previous lock owners would also help to solve the
bug cautioned about here:
http://www.postgresql.org/docs/9.0/interactive/sql-select.html#SQL-FOR-UPDATE-SHARE

Briefly, the problem is if you do
BEGIN;
SELECT * FROM mytable WHERE key = 1 FOR UPDATE;  -- outer transaction locks the row
SAVEPOINT s;
UPDATE mytable SET ... WHERE key = 1;            -- subtransaction overwrites xmax
ROLLBACK TO s;                                   -- aborted subxact's xid stays in xmax
the row ends up unlocked, instead of locked FOR UPDATE.

While the current behaviour of SERIALIZABLE transactions regarding row locks
may be categorized not as a bug but as a lack of features, this problem is IMHO
clearly a bug. A bug that it's possible to live with, yes, but still a bug.

For me, this is another very good reason to explore this further. Plus, it
improves the ratio of grotty-ness vs. number-of-problems-solved ;-)

best regards,
Florian Pflug


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: serializable lock consistency
Date: 2010-12-20 17:54:20
Message-ID: AANLkTikgFb5+7FcYEnfFocLL96PdwDP3aTuBd-aWYbKb@mail.gmail.com
Lists: pgsql-hackers

On Mon, Dec 20, 2010 at 12:49 PM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
> For me, this is another very good reason to explore this further. Plus, it
> improves the ratio of grotty-ness vs. number-of-problems-solved ;-)

By all means, look into it further. I fear the boat is filling up
with water, but if you manage to come up with a workable solution I'll
be as happy as anyone, promise!

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: serializable lock consistency
Date: 2010-12-20 22:32:46
Message-ID: F6F43EA1-C3A1-410F-8CAA-837BEC767E1A@phlo.org
Lists: pgsql-hackers

On Dec20, 2010, at 18:54 , Robert Haas wrote:
> On Mon, Dec 20, 2010 at 12:49 PM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
>> For me, this is another very good reason to explore this further. Plus, it
>> improves the ratio of grotty-ness vs. number-of-problems-solved ;-)
>
> By all means, look into it further. I fear the boat is filling up
> with water, but if you manage to come up with a workable solution I'll
> be as happy as anyone, promise!

I'll try to create a detailed proposal. To do that, however, I'll require
some guidance on what's acceptable and what's not.

Here's a summary of the preceding discussion:

To deal with aborted transactions correctly, we need to track the last
locker of a particular tuple that actually committed. If we also want
to fix the bug that causes a row lock to be lost upon doing
lock;savepoint;update;rollback, that "latest committed locker" will
sometimes need to be a set, since it'll need to store the outer
transaction's xid as well as the latest actually committed locker.

As long as no transaction aborts are involved, the tuple's xmax
contains all the information we need. If a transaction updates,
deletes or locks a row, the previous xmax is overwritten. If the
transaction later aborts, we cannot decide whether it has previously
been locked or not.

And these ideas have come up:

A) Transactions that merely lock a row could put the previous
locker's xid (if >= GlobalXmin) *and* their own xid into a multi-xid,
and store that in xmax. For shared locks, this merely means cleaning
out the existing multi-xid a bit less aggressively. There's
no risk of bloat there, since we only need to keep one committed
xid, not all of them. For exclusive locks, we currently never
create a multi-xid. That'd change, we'd need to create one
if we find a previous locker with an xid >= GlobalXmin. This doesn't
solve the UPDATE and DELETE cases. For SELECT-FOR-SHARE this
is probably the best option, since it comes very close to what
we do currently.

B) A transaction that UPDATEs or DELETEs a tuple could create an
intermediate lock-only tuple which'd contain the necessary
information about previous lock holders. We'd only need to do
that if there actually is one with xid >= GlobalXmin. We could
then choose whether to do the same for SELECT-FOR-UPDATE, or
whether we'd prefer to go with (A).

C) The ctid field is only necessary for updated tuples. We could thus
overlay it with a field which stores the last committed locker after
a DELETE. UPDATEs could be handled either as in (B), or by storing the
information in the ctid-overlay in the *new* tuple. SELECT-FOR-UPDATE
could again either also use the ctid overlay or use (A).

D) We could add a new tuple header field xlatest. To support binary
upgrade, we'd need to be able to read tuples without that field
also. We could then either create a new tuple version upon the
first lock request to such a tuple (which would then include the
new header), or we could simply raise a serialization error if
a serializable transaction tried to update a tuple without the
field whose xmax was aborted and >= GlobalXmin.

I have the nagging feeling that (D) will meet quite some resistance. (C)
wasn't too well received either, though I wonder if that'd change if the
grotty-ness was hidden behind an API, much like the xvac/cmin/cmax overlay
is. (B) seems like a lot of overhead, but maybe cleaner. More research is
needed though to check how it'd interact with HOT and how to get the
locking right. (A) is IMHO the best solution for the SELECT-FOR-SHARE case
since it's very close to what we do today.

Any comments? Especially of the "don't you dare" kind?

best regards,
Florian Pflug


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: serializable lock consistency
Date: 2010-12-20 23:08:31
Message-ID: AANLkTiki_JdmgLhqWHtUCA4N92jPSXrCRnOq7NQQ8fEm@mail.gmail.com
Lists: pgsql-hackers

On Mon, Dec 20, 2010 at 5:32 PM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
> On Dec20, 2010, at 18:54 , Robert Haas wrote:
>> On Mon, Dec 20, 2010 at 12:49 PM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
>>> For me, this is another very good reason to explore this further. Plus, it
>>> improves the ratio of grotty-ness vs. number-of-problems-soved ;-)
>>
>> By all means, look into it further.  I fear the boat is filling up
>> with water, but if you manage to come up with a workable solution I'll
>> be as happy as anyone, promise!
>
> I'll try to create a detailed proposal. To do that, however, I'll require
> some guidance on what's acceptable and what's not.
>
> Here's a summary of the preceding discussion:
>
> To deal with aborted transactions correctly, we need to track the last
> locker of a particular tuple that actually committed. If we also want
> to fix the bug that causes a row lock to be lost upon doing
> lock;savepoint;update;rollback, that "latest committed locker" will
> sometimes need to be a set, since it'll need to store the outer
> transaction's xid as well as the latest actually committed locker.
>
> As long as no transaction aborts are involved, the tuple's xmax
> contains all the information we need. If a transaction updates,
> deletes or locks a row, the previous xmax is overwritten. If the
> transaction later aborts, we cannot decide whether it has previously
> been locked or not.
>
> And these ideas have come up:
>
> A) Transactions that merely lock a row could put the previous
>   locker's xid (if >= GlobalXmin) *and* their own xid into a multi-xid,
>   and store that in xmax. For shared locks, this merely means cleaning
>   out the existing multi-xid a bit less aggressively. There's
>   no risk of bloat there, since we only need to keep one committed
>   xid, not all of them. For exclusive locks, we currently never
>   create a multi-xid. That'd change, we'd need to create one
>   if we find a previous locker with an xid >= GlobalXmin. This doesn't
>   solve the UPDATE and DELETE cases. For SELECT-FOR-SHARE this
>   is probably the best option, since it comes very close to what
>   we do currently.
>
> B) A transaction that UPDATEs or DELETEs a tuple could create an
>   intermediate lock-only tuple which'd contain the necessary
>   information about previous lock holders. We'd only need to do
>   that if there actually is one with xid >= GlobalXmin. We could
>   then choose whether to do the same for SELECT-FOR-UPDATE, or
>   whether we'd prefer to go with (A).
>
> C) The ctid field is only necessary for updated tuples. We could thus
>   overlay it with a field which stores the last committed locker after
>   a DELETE. UPDATEs could be handled either as in (B), or by storing the
>   information in the ctid-overlay in the *new* tuple. SELECT-FOR-UPDATE
>   could again either also use the ctid overlay or use (A).
>
> D) We could add a new tuple header field xlatest. To support binary
>   upgrade, we'd need to be able to read tuples without that field
>   also. We could then either create a new tuple version upon the
>   first lock request to such a tuple (which would then include the
>   new header), or we could simply raise a serialization error if
>   a serializable transaction tried to update a tuple without the
>   field whose xmax was aborted and >= GlobalXmin.
>
> I have the nagging feeling that (D) will meet quite some resistance. (C)
> wasn't too well received either, though I wonder if that'd change if the
> grotty-ness was hidden behind an API, much like the xvac/cmin/cmax overlay
> is. (B) seems like a lot of overhead, but maybe cleaner. More research is
> needed though to check how it'd interact with HOT and how to get the
> locking right. (A) is IMHO the best solution for the SELECT-FOR-SHARE case
> since it's very close to what we do today.
>
> Any comments? Especially of the "don't you dare" kind?

I think any solution based on (D) has zero chance of being accepted,
and it wouldn't be that high except that probabilities can't be
negative. Unfortunately, I don't understand (A) or (B) well enough to
comment intelligently.

My previously expressed concern about (C) wasn't based on ugliness,
but rather on my belief that there is likely a whole lot of code
which relies on the CTID being a self-link when no UPDATE has
occurred. We'd have to be confident that all such cases had been
found and fixed, which might not be easy to be confident about.

I have a more "meta" concern, too. Let's suppose (just for example)
that you write some absolutely brilliant code which makes it possible
to overlay CTID and which has the further advantage of being
stunningly obvious, so that we have absolute confidence that it is
correct. The obvious question is - if we can overlay CTID in some
situations, is there a better use for that than this? Just to throw
out something that might be totally impractical, maybe we could get
rid of XMAX. If the tuple hasn't been updated, the CTID field stores
the XMAX; if it HAS been updated, the CTID points to the successor
tuple, whose XMIN is our XMAX. I'm sure there are a bunch of reasons
why that doesn't actually work - I can think of some of them myself -
but the point is that there's an opportunity cost to stealing those
bits. Once they're dedicated to this purpose, they can't ever be
dedicated to any other purpose. Space in the tuple header is
precious, and I am not at all convinced that we should be willing to
surrender any for this. We have to believe not only that this change
is good, but also that it's more good than some other purpose to which
that bit space could potentially be put.

Now obviously, overlaying onto space that's already reserved is better
than allocating new space. But it's not free.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: serializable lock consistency
Date: 2010-12-21 00:19:57
Message-ID: 1DB62900-B7A9-45A8-85DE-6E686C1B4252@phlo.org
Lists: pgsql-hackers

On Dec21, 2010, at 00:08 , Robert Haas wrote:
> My previously expressed concern about (C) wasn't based on ugliness,
> but rather on my belief that there is likely a whole lot of code
> which relies on the CTID being a self-link when no UPDATE has
> occurred. We'd have to be confident that all such cases had been
> found and fixed, which might not be easy to be confident about.

That's certainly a valid concern, and one that a credible proposal
will have to address.

> The obvious question is - if we can overlay CTID in some
> situations, is there a better use for that than this? Just to throw
> out something that might be totally impractical, maybe we could get
> rid of XMAX. If the tuple hasn't been updated, the CTID field stores
> the XMAX; if it HAS been updated, the CTID points to the successor
> tuple, whose XMIN is our XMAX. I'm sure there are a bunch of reasons
> why that doesn't actually work - I can think of some of them myself -
> but the point is that there's an opportunity cost to stealing those
> bits. Once they're dedicated to this purpose, they can't ever be
> dedicated to any other purpose.

Yes, there is an opportunity cost there, I can't argue with that.

> Space in the tuple header is
> precious, and I am not at all convinced that we should be willing to
> surrender any for this.

That's a pretty tight space to maneuver in, though. So tight, in fact,
that I may as well give up, barring some absolutely genius idea, which
I don't even know where to look for at the moment.

Every feature has its price, and if giving up on completely hypothetical
future savings is too costly, then surely anything else I might suggest
is too :-(

> We have to believe not only that this change
> is good, but also that it's more good than some other purpose to which
> that bit space could potentially be put.

Unfortunately, I'm running out of arguments for why this *is* important
and *is* worth paying a price for. I've been forced to simply give up on
making some database-side constraint enforcement 100% waterproof in
the past. That bugged me greatly each time, and each time weakened
my case when I tried to explain to clients why enforcing such things
in the database is a Good Thing. I therefore have a hard time trying
to understand why people mostly seem to regard this as a non-issue
or merely a nice-to-have. Regarding the sub-transaction vs. locking
issue - I haven't been bitten by this personally, since I tend to
avoid using sub-transactions at all. But if I did, I'd feel even
stronger about this, since it's clearly a bug.

best regards,
Florian Pflug


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: serializable lock consistency
Date: 2010-12-21 01:47:57
Message-ID: AANLkTinboWgcxfFpKQNNvXNxGYixKrsW76-ctLFQ8p0G@mail.gmail.com
Lists: pgsql-hackers

On Mon, Dec 20, 2010 at 7:19 PM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
>> Space in the tuple header is
>> precious, and I am not at all convinced that we should be willing to
>> surrender any for this.
>
> That's a pretty tight space to maneuver in, though. So tight, in fact,
> that I may as well give up, barring some absolutely genius idea, which
> I don't even know where to look for at the moment.
>
> Every feature has its price, and if giving up on completely hypothetical
> future savings is too costly, then surely anything else I might suggest
> is too :-(

Of the ideas proposed so far, the idea of somehow making use of the
existing multi-xid machinery to do this seems the most promising to
me. But I haven't yet understood exactly what you're proposing, or
fully thought through the performance implications, which is obviously
something that needs to happen, and if that doesn't pan out, then, as
I said upthread and you said here, yeah, we may need a new idea.

It would be useful if some other folks weighed in on this, too. I
understand what behavior we're trying to get here and why we want
that, but I don't necessarily have the greatest understanding of all
the details of the on-disk format.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company