Re: Serializable Snapshot Isolation

From: Kevin Grittner <grimkg(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, drkp(at)csail(dot)mit(dot)edu, heikki(dot)linnakangas(at)enterprisedb(dot)com
Subject: Re: Serializable Snapshot Isolation
Date: 2010-09-18 18:52:29
Message-ID: AANLkTi=Scs5nsmi+L1+N4VN5+LQ2qA14nrG0ReQLehSC@mail.gmail.com
Lists: pgsql-hackers

[Apologies for not reply-linking this; work email is down so I'm
sending from gmail.]

Based on feedback from Heikki and Tom I've reworked how I find the
top-level transaction. This is in the git repo, and the changes can
be viewed at:

http://git.postgresql.org/gitweb?p=users/kgrittn/postgres.git;a=commitdiff;h=e29927c7966adba2443fdc4f64da9d282f95a05b

-Kevin


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Kevin Grittner <grimkg(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, drkp(at)csail(dot)mit(dot)edu, heikki(dot)linnakangas(at)enterprisedb(dot)com
Subject: Re: Serializable Snapshot Isolation
Date: 2010-09-18 19:38:45
Message-ID: 4C951545.6050604@enterprisedb.com
Lists: pgsql-hackers

On 18/09/10 21:52, Kevin Grittner wrote:
> [Apologies for not reply-linking this; work email is down so I'm
> sending from gmail.]
>
> Based on feedback from Heikki and Tom I've reworked how I find the
> top-level transaction. This is in the git repo, and the changes can
> be viewed at:
>
> http://git.postgresql.org/gitweb?p=users/kgrittn/postgres.git;a=commitdiff;h=e29927c7966adba2443fdc4f64da9d282f95a05b

Thanks, much simpler. Now let's simplify it some more ;-)

ISTM you never search the SerializableXactHash table using a hash key,
except the one call in CheckForSerializableConflictOut, but there you
already have a pointer to the SERIALIZABLEXACT struct. You only re-find
it to make sure it hasn't gone away while you trade the shared lock for
an exclusive one. If we find another way to ensure that, ISTM we don't
need SerializableXactHash at all. My first thought was to forget about
VirtualTransactionId and use TransactionId directly as the hash key for
SERIALIZABLEXACT. The problem is that a transaction doesn't have a
transaction ID when RegisterSerializableTransaction is called. We could
leave the TransactionId blank and only add the SERIALIZABLEXACT struct
to the hash table when an XID is assigned, but there's no provision to
insert an existing struct to a hash table in the current hash table API.

So, I'm not sure of the details yet, but it seems like it could be made
simpler somehow..
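
For readers not familiar with the dynahash API referred to here:
hash_search() allocates each entry inside the hash table itself, so a
struct that already lives elsewhere in shared memory cannot simply be
linked in after the fact. A minimal sketch of the pattern (variable
names are illustrative, not from the patch):

    bool        found;
    SERIALIZABLEXACT *sxact;

    /* HASH_ENTER returns a slot allocated inside the table, keyed by
     * the supplied key; the caller then fills in the payload. */
    sxact = (SERIALIZABLEXACT *) hash_search(SerializableXactHash,
                                             &xid,
                                             HASH_ENTER,
                                             &found);
    if (!found)
    {
        /* initialize the freshly allocated entry here; there is no way
         * to make the table point at a pre-existing struct */
    }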

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Kevin Grittner <grimkg(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, drkp(at)csail(dot)mit(dot)edu
Subject: Re: Serializable Snapshot Isolation
Date: 2010-09-19 13:48:42
Message-ID: AANLkTi=5qe61-MOumCouTqwCCpec_mUZEAafjMmgn+os@mail.gmail.com
Lists: pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:

> ISTM you never search the SerializableXactHash table using a hash
> key, except the one call in CheckForSerializableConflictOut, but
> there you already have a pointer to the SERIALIZABLEXACT struct.
> You only re-find it to make sure it hasn't gone away while you
> trade the shared lock for an exclusive one. If we find another way
> to ensure that, ISTM we don't need SerializableXactHash at all. My
> first thought was to forget about VirtualTransactionId and use
> TransactionId directly as the hash key for SERIALIZABLEXACT. The
> problem is that a transaction doesn't have a transaction ID when
> RegisterSerializableTransaction is called. We could leave the
> TransactionId blank and only add the SERIALIZABLEXACT struct to the
> hash table when an XID is assigned, but there's no provision to
> insert an existing struct to a hash table in the current hash table
> API.
>
> So, I'm not sure of the details yet, but it seems like it could be
> made simpler somehow..

After tossing it around in my head for a bit, the only thing that I
see (so far) which might work is to maintain a *list* of
SERIALIZABLEXACT objects in memory rather than using a hash table.
The recheck after releasing the shared lock and acquiring an
exclusive lock would then go through SerializableXidHash. I think
that can work, although I'm not 100% sure that it's an improvement.
I'll look it over in more detail. I'd be happy to hear your thoughts
on this or any other suggestions.

-Kevin


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Kevin Grittner <grimkg(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, drkp(at)csail(dot)mit(dot)edu
Subject: Re: Serializable Snapshot Isolation
Date: 2010-09-19 18:57:23
Message-ID: 4C965D13.2050806@enterprisedb.com
Lists: pgsql-hackers

On 19/09/10 16:48, Kevin Grittner wrote:
> After tossing it around in my head for a bit, the only thing that I
> see (so far) which might work is to maintain a *list* of
> SERIALIZABLEXACT objects in memory rather than using a hash table.
> The recheck after releasing the shared lock and acquiring an
> exclusive lock would then go through SerializableXidHash. I think
> that can work, although I'm not 100% sure that it's an improvement.

Yeah, also keep in mind that a linked list with only a few items is
faster to scan through than sequentially scanning an almost empty hash
table.

Putting that aside for now, we have one very serious problem with this
algorithm:

> While they [SIREAD locks] are associated with a transaction, they must survive
> a successful COMMIT of that transaction, and remain until all overlapping
> transactions complete.

Long-running transactions are already nasty because they prevent VACUUM
from cleaning up old tuple versions, but this escalates the problem to a
whole new level. If you have one old transaction sitting idle, every
transaction that follows consumes a little bit of shared memory, until
that old transaction commits. Eventually you will run out of shared
memory, and will not be able to start new transactions anymore.

Is there anything we can do about that? Just a thought, but could you
somehow coalesce the information about multiple already-committed
transactions to keep down the shared memory usage? For example, if you
have this:

1. Transaction <slow> begins
2. 100 other transactions begin and commit

Could you somehow group together the 100 committed transactions and
represent them with just one SERIALIZABLEXACT struct?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: <drkp(at)csail(dot)mit(dot)edu>,<pgsql-hackers(at)postgresql(dot)org>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Serializable Snapshot Isolation
Date: 2010-09-20 14:09:51
Message-ID: 4C9724DF02000025000359BA@gw.wicourts.gov
Lists: pgsql-hackers

I wrote:
> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>
>> ISTM you never search the SerializableXactHash table using a hash
>> key, except the one call in CheckForSerializableConflictOut, but
>> there you already have a pointer to the SERIALIZABLEXACT struct.
>> You only re-find it to make sure it hasn't gone away while you
>> trade the shared lock for an exclusive one. If we find another
>> way to ensure that, ISTM we don't need SerializableXactHash at
>> all. My first thought was to forget about VirtualTransactionId
>> and use TransactionId directly as the hash key for
>> SERIALIZABLEXACT. The problem is that a transaction doesn't have
>> a transaction ID when RegisterSerializableTransaction is called.
>> We could leave the TransactionId blank and only add the
>> SERIALIZABLEXACT struct to the hash table when an XID is
>> assigned, but there's no provision to insert an existing struct
>> to a hash table in the current hash table API.
>>
>> So, I'm not sure of the details yet, but it seems like it could
>> be made simpler somehow..
>
> After tossing it around in my head for a bit, the only thing that
> I see (so far) which might work is to maintain a *list* of
> SERIALIZABLEXACT objects in memory rather than using a hash
> table. The recheck after releasing the shared lock and acquiring
> an exclusive lock would then go through SerializableXidHash. I
> think that can work, although I'm not 100% sure that it's an
> improvement. I'll look it over in more detail. I'd be happy to
> hear your thoughts on this or any other suggestions.

I haven't come up with any better ideas. Pondering this one, it
seems to me that a list would be better than a hash table if we had
a list which would automatically allocate and link new entries, and
would maintain a list of available entries for (re)use. I wouldn't
want to sprinkle such an implementation in with predicate locking
and SSI code, but if there is a feeling that such a thing would be
worth having in shmqueue.c or some new file which uses the SHM_QUEUE
structure to provide an API for such functionality, I'd be willing
to write that and use it in the SSI code. Without something like
that, I have so far been unable to envision an improvement along the
lines Heikki is suggesting here.
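
A rough sketch of the sort of SHM_QUEUE-based helper being proposed --
the names and layout are hypothetical, not the eventual implementation:

    typedef struct PoolEntry
    {
        SHM_QUEUE   links;          /* list links; found via offsetof() */
        /* ... payload, e.g. a SERIALIZABLEXACT ... */
    } PoolEntry;

    typedef struct EntryPool
    {
        SHM_QUEUE   availableList;  /* entries free for (re)use */
        SHM_QUEUE   activeList;     /* entries currently in use */
    } EntryPool;

    /* Move an entry from the available list to the active list; the
     * entries themselves are allocated once at startup. */
    static PoolEntry *
    AcquireEntry(EntryPool *pool)
    {
        PoolEntry  *entry;

        if (SHMQueueEmpty(&pool->availableList))
            return NULL;            /* caller decides: error or cancel */

        entry = (PoolEntry *) SHMQueueNext(&pool->availableList,
                                           &pool->availableList,
                                           offsetof(PoolEntry, links));
        SHMQueueDelete(&entry->links);
        SHMQueueInsertBefore(&pool->activeList, &entry->links);
        return entry;
    }

    /* Put an entry back on the available list when it is released. */
    static void
    ReleaseEntry(EntryPool *pool, PoolEntry *entry)
    {
        SHMQueueDelete(&entry->links);
        SHMQueueInsertBefore(&pool->availableList, &entry->links);
    }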

Thoughts?

-Kevin


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Kevin Grittner" <grimkg(at)gmail(dot)com>
Cc: <drkp(at)csail(dot)mit(dot)edu>,<pgsql-hackers(at)postgresql(dot)org>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Serializable Snapshot Isolation
Date: 2010-09-22 18:42:34
Message-ID: 4C9A07CA0200002500035B4C@gw.wicourts.gov
Lists: pgsql-hackers

I wrote:
> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>
>> ISTM you never search the SerializableXactHash table using a hash
>> key, except the one call in CheckForSerializableConflictOut, but
>> there you already have a pointer to the SERIALIZABLEXACT struct.
>> You only re-find it to make sure it hasn't gone away while you
>> trade the shared lock for an exclusive one. If we find another
>> way to ensure that, ISTM we don't need SerializableXactHash at
>> all.

>> it seems like it could be made simpler somehow..
>
> After tossing it around in my head for a bit, the only thing that
> I see (so far) which might work is to maintain a *list* of
> SERIALIZABLEXACT objects in memory rather than using a hash
> table. The recheck after releasing the shared lock and acquiring
> an exclusive lock would then go through SerializableXidHash.

After discussion on a separate thread, I replaced that hash table
with a home-grown shared memory list. I had to create a patch at
that point due to the git migration, so I figured I might as well
post it, too. There have been some non-trivial changes due to
feedback on the prior posting.

-Kevin

Attachment Content-Type Size
serializable-6.patch text/plain 195.5 KB

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Kevin Grittner <grimkg(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, drkp(at)csail(dot)mit(dot)edu
Subject: Re: Serializable Snapshot Isolation
Date: 2010-09-22 18:54:20
Message-ID: 4C9A50DC.4030801@enterprisedb.com
Lists: pgsql-hackers

On 19/09/10 21:57, I wrote:
> Putting that aside for now, we have one very serious problem with this
> algorithm:
>
>> While they [SIREAD locks] are associated with a transaction, they must
>> survive a successful COMMIT of that transaction, and remain until all
>> overlapping transactions complete.
>
> Long-running transactions are already nasty because they prevent VACUUM
> from cleaning up old tuple versions, but this escalates the problem to a
> whole new level. If you have one old transaction sitting idle, every
> transaction that follows consumes a little bit of shared memory, until
> that old transaction commits. Eventually you will run out of shared
> memory, and will not be able to start new transactions anymore.
>
> Is there anything we can do about that? Just a thought, but could you
> somehow coalesce the information about multiple already-committed
> transactions to keep down the shared memory usage? For example, if you
> have this:
>
> 1. Transaction <slow> begins
> 2. 100 other transactions begin and commit
>
> Could you somehow group together the 100 committed transactions and
> represent them with just one SERIALIZABLEXACT struct?

Ok, I think I've come up with a scheme that puts an upper bound on the
amount of shared memory used, wrt. number of transactions. You can still
run out of shared memory if you lock a lot of objects, but that doesn't
worry me as much.

When a transaction commits, its predicate locks must still be held, but
it's not important anymore *who* holds them, as long as they're held for
long enough.

Let's move the finishedBefore field from SERIALIZABLEXACT to
PREDICATELOCK. When a transaction commits, set the finishedBefore field
in all the PREDICATELOCKs it holds, and then release the
SERIALIZABLEXACT struct. The predicate locks stay without an associated
SERIALIZABLEXACT entry until finishedBefore expires.

Whenever there are two predicate locks on the same target that both
belonged to an already-committed transaction, the one with a smaller
finishedBefore can be dropped, because the one with higher
finishedBefore value covers it already.

There. That was surprisingly simple, I must be missing something.
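
In code, the coalescing rule above would amount to something like this
(field and helper names are hypothetical, and finishedBefore is assumed
to be an xid-like value compared with TransactionIdPrecedes()):

    /* Both locks cover the same target and both belonged to
     * transactions which have already committed. */
    static void
    CoalesceCommittedLocks(PREDICATELOCK *a, PREDICATELOCK *b)
    {
        /* keep whichever must be held longer; the other is redundant */
        if (TransactionIdPrecedes(a->finishedBefore, b->finishedBefore))
            RemovePredicateLock(a);     /* hypothetical helper */
        else
            RemovePredicateLock(b);
    }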

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Kevin Grittner <grimkg(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, drkp(at)csail(dot)mit(dot)edu, Kevin(dot)Grittner(at)wicourts(dot)gov
Subject: Re: Serializable Snapshot Isolation
Date: 2010-09-22 23:14:06
Message-ID: AANLkTikkwxs56YkVvW3AzTYxxvjMKj2Nf-nZmxZsfFC0@mail.gmail.com
Lists: pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:

> When a transaction commits, its predicate locks must still be held,
> but it's not important anymore *who* holds them, as long as
> they're held for long enough.
>
> Let's move the finishedBefore field from SERIALIZABLEXACT to
> PREDICATELOCK. When a transaction commits, set the finishedBefore
> field in all the PREDICATELOCKs it holds, and then release the
> SERIALIZABLEXACT struct. The predicate locks stay without an
> associated SERIALIZABLEXACT entry until finishedBefore expires.
>
> Whenever there are two predicate locks on the same target that
> both belonged to an already-committed transaction, the one with a
> smaller finishedBefore can be dropped, because the one with higher
> finishedBefore value covers it already.

I don't think this works. Gory details follow.

The predicate locks only matter when a tuple is being written which
might conflict with one. In the notation often used for the
dangerous structures, the conflict only occurs if TN writes
something which T1 can't read or T1 writes something which T0 can't
read. When you combine this with the fact that you don't have a
problem unless TN commits *first*, then you can't have a problem
with TN looking up a predicate lock of a committed transaction; if
it's still writing tuples after T1's commit, the conflict can't
matter and really should be ignored. If T1 is looking up a
predicate lock for T0 and finds it committed, there are two things
which must be true for this to generate a real conflict: TN must
have committed before T0, and T0 must have overlapped T1 -- T0 must
not have been able to see T1's write. If we have a way to establish
these two facts without keeping transaction level data for committed
transactions, predicate lock *lookup* wouldn't stand in the way of
your proposal.

Since the writing transaction is active, if the xmin of its starting
transaction comes before the finishedBefore value, they must have
overlapped; so I think we have that part covered, and I can't see a
problem with your proposed use of the earliest finishedBefore value.

There is a rub on the other point, though. Without transaction
information you have no way of telling whether TN committed before
T0, so you would need to assume that it did. So on this count,
there is bound to be some increase in false positives leading to
transaction rollback. Without more study, and maybe some tests, I'm
not sure how significant it is. (Actually, we might want to track
commit sequence somehow, so we can determine this with greater
accuracy.)

But wait, the bigger problems are yet to come.

The other way we can detect conflicts is a read by a serializable
transaction noticing that a different and overlapping serializable
transaction wrote the tuple we're trying to read. How do you
propose to know that the other transaction was serializable without
keeping the SERIALIZABLEXACT information? And how do you propose to
record the conflict without it? The wheels pretty much fall off the
idea entirely here, as far as I can see.

Finally, this would preclude some optimizations which I *think* will
pay off, which trade a few hundred kB more of shared memory, and
some additional CPU to maintain more detailed conflict data, for a
lower false positive rate -- meaning fewer transactions rolled back
for hard-to-explain reasons. This more detailed information is also
what Dan S (on another thread) seems to want, in order to log the
information needed to reduce rollbacks.

-Kevin


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Kevin Grittner <grimkg(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, drkp(at)csail(dot)mit(dot)edu, Kevin(dot)Grittner(at)wicourts(dot)gov
Subject: Re: Serializable Snapshot Isolation
Date: 2010-09-23 05:21:01
Message-ID: 4C9AE3BD.6010109@enterprisedb.com
Lists: pgsql-hackers

On 23/09/10 02:14, Kevin Grittner wrote:
> There is a rub on the other point, though. Without transaction
> information you have no way of telling whether TN committed before
> T0, so you would need to assume that it did. So on this count,
> there is bound to be some increase in false positives leading to
> transaction rollback. Without more study, and maybe some tests, I'm
> not sure how significant it is. (Actually, we might want to track
> commit sequence somehow, so we can determine this with greater
> accuracy.)

I'm confused. AFAICS there is no way to tell if TN committed before T0
in the current patch either.

> But wait, the bigger problems are yet to come.
>
> The other way we can detect conflicts is a read by a serializable
> transaction noticing that a different and overlapping serializable
> transaction wrote the tuple we're trying to read. How do you
> propose to know that the other transaction was serializable without
> keeping the SERIALIZABLEXACT information?

Hmm, I see. We could record which transactions were serializable in a
new clog-like structure that wouldn't exhaust shared memory.

> And how do you propose to record the conflict without it?

I thought you just abort the transaction that would cause the conflict
right there. The other transaction is committed already, so you can't do
anything about it anymore.

> Finally, this would preclude some optimizations which I *think* will
> pay off, which trade a few hundred kB more of shared memory, and
> some additional CPU to maintain more detailed conflict data, for a
> lower false positive rate -- meaning fewer transactions rolled back
> for hard-to-explain reasons. This more detailed information is also
> what Dan S (on another thread) seems to want, in order to log the
> information needed to reduce rollbacks.

Ok, I think I'm ready to hear about those optimizations now :-).

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: <drkp(at)csail(dot)mit(dot)edu>,<pgsql-hackers(at)postgresql(dot)org>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Serializable Snapshot Isolation
Date: 2010-09-23 15:08:34
Message-ID: 4C9B27220200002500035BE9@gw.wicourts.gov
Lists: pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:

> On 23/09/10 02:14, Kevin Grittner wrote:
>> There is a rub on the other point, though. Without transaction
>> information you have no way of telling whether TN committed
>> before T0, so you would need to assume that it did. So on this
>> count, there is bound to be some increase in false positives
>> leading to transaction rollback. Without more study, and maybe
>> some tests, I'm not sure how significant it is. (Actually, we
>> might want to track commit sequence somehow, so we can determine
>> this with greater accuracy.)
>
> I'm confused. AFAICS there is no way to tell if TN committed
> before T0 in the current patch either.

Well, we can certainly infer it if the finishedBefore values differ.
And, as I said, if we don't eliminate this structure for committed
transactions, we could add a commitId or some such, with "precedes"
and "follows" tests similar to TransactionId.

>> The other way we can detect conflicts is a read by a serializable
>> transaction noticing that a different and overlapping
>> serializable transaction wrote the tuple we're trying to read.
>> How do you propose to know that the other transaction was
>> serializable without keeping the SERIALIZABLEXACT information?
>
> Hmm, I see. We could record which transactions were serializable
> in a new clog-like structure that wouldn't exhaust shared memory.
>
>> And how do you propose to record the conflict without it?
>
> I thought you just abort the transaction that would cause the
> conflict right there. The other transaction is committed already,
> so you can't do anything about it anymore.

No, it always requires a rw-conflict from T0 to T1 and a rw-conflict
from T1 to TN, as well as TN committing first and (T0 not being READ
ONLY or TN not overlapping T0). The number and complexity of the
conditions which must be met to cause a serialization failure are
what keep the failure rate reasonable. If we start rolling back
transactions every time one transaction simply reads a row modified
by a concurrent transaction I suspect that we'd have such a storm of
serialization failures in most workloads that nobody would want to
use it.
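
Spelled out as a test over hypothetical flags and helpers (none of
these names are from the patch), the condition is roughly:

    /* T0 --rw--> T1 --rw--> TN is only a problem when TN committed
     * first and the READ ONLY exception for T0 does not apply. */
    static bool
    HaveDangerousStructure(SERIALIZABLEXACT *t0,
                           SERIALIZABLEXACT *t1,
                           SERIALIZABLEXACT *tn)
    {
        return HasRWConflictOut(t0, t1)   /* T1 wrote what T0 couldn't read */
            && HasRWConflictOut(t1, tn)   /* TN wrote what T1 couldn't read */
            && CommittedFirst(tn)         /* TN committed before the others */
            && (!t0->readOnly || !Overlaps(tn, t0));
    }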

>> Finally, this would preclude some optimizations which I *think*
>> will pay off, which trade a few hundred kB more of shared memory,
>> and some additional CPU to maintain more detailed conflict data,
>> for a lower false positive rate -- meaning fewer transactions
>> rolled back for hard-to-explain reasons. This more detailed
>> information is also what Dan S (on another thread) seems to want,
>> in order to log the information needed to reduce rollbacks.
>
> Ok, I think I'm ready to hear about those optimizations now :-).

Dan Ports is eager to implement "next key" predicate locking for
indexes, but wants more benchmarks to confirm the benefit. (Most of
the remaining potential optimizations carry some risk of being
counter-productive, so we want to go in with something conservative
and justify each optimization separately.) That one only affects
your proposal to the extent that the chance to consolidate locks on
the same target by committed transactions would likely have fewer
matches to collapse.

One that I find interesting is the idea that we could set a
SERIALIZABLE READ ONLY transaction with some additional property
(perhaps DEFERRED or DEFERRABLE) which would cause it to take a
snapshot and then wait until there were no overlapping serializable
transactions which are not READ ONLY which overlap a running
SERIALIZABLE transaction which is not READ ONLY. At this point it
could make a valid snapshot which would allow it to run without
taking predicate locks or checking for conflicts. It would have no
chance of being rolled back with a serialization failure *or* of
contributing to the failure of any other transaction, yet it would
be guaranteed to see a view of the database consistent with the
actions of all other serializable transactions.

One place I'm particularly interested in using such a feature is in
pg_dump. Without it we have the choice of using a SERIALIZABLE
transaction, which might fail or cause failures (which doesn't seem
good for a backup program) or using REPEATABLE READ (to get current
snapshot isolation behavior), which might capture a view of the data
which contains serialization anomalies. The notion of capturing a
backup which doesn't comply with business rules enforced by
serializable transactions gives me the willies, but it would be
better than not getting a backup reliably, so in the absence of this
feature, I think we need to change pg_dump to use REPEATABLE READ.
I can't see how to do this without keeping information on committed
transactions.

This next paragraph is copied straight from the Wiki page:

It appears that when a pivot is formed where T0 is flagged as a
READ ONLY transaction, and it is concurrent with TN, we can wait to
see whether anything really needs to roll back. If T1 commits before
developing a rw-dependency to another transaction with a commit
early enough to make it visible to T0, the rw-dependency between T0
and T1 can be removed or ignored. It might even be worthwhile to
track whether a serializable transaction *has* written to any
permanent table, so that this optimization can be applied to de
facto READ ONLY transactions (i.e., not flagged as such, but not
having done any writes).

Again, copying from the Wiki "for the record" here:

It seems that we could guarantee that the retry of a transaction
rolled back due to a dangerous structure could never immediately
roll back on the very same conflicts if we always ensure that there
is a successful commit of one of the participating transactions
before we roll back. Is it worth it? It seems like it might be,
because it would ensure that some progress is being made and prevent
the possibility of endless flailing on any set of transactions. We
could be sure of this if we:
* use lists for inConflict and outConflict
* never roll back until we have a pivot with a commit of the
transaction on the "out" side
* never roll back the transaction being committed in the PreCommit
check
* have some way to cause another, potentially idle, transaction to
roll back with a serialization failure SQLSTATE

I'm afraid this would further boost shared memory usage, but the
payoff may well be worth it. At one point I did some "back of an
envelope" calculations, and I think I found that with 200
connections an additional 640kB of shared memory would allow this.
On top of the above optimization, just having the lists would allow
more precise recognition of dangerous structures in heavy load,
leading to fewer false positives even before you get to the above.
Right now, if you have two conflicts with different transactions in
the same direction, it collapses to a self-reference, which precludes
use of optimizations involving TN committing first or T0 being READ
ONLY.
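
With lists, each rw-conflict could become its own small shared-memory
object, linked into the reader's "out" list and the writer's "in" list
(a sketch; these names are not from the patch):

    typedef struct RWConflictData
    {
        SHM_QUEUE           outLink;    /* entry in reader's outConflicts */
        SHM_QUEUE           inLink;     /* entry in writer's inConflicts  */
        SERIALIZABLEXACT   *reader;     /* the transaction that read      */
        SERIALIZABLEXACT   *writer;     /* the transaction that wrote     */
    } RWConflictData;

    /* Each SERIALIZABLEXACT would then carry two SHM_QUEUE list headers,
     * outConflicts and inConflicts, instead of single in/out fields. */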

Also, if we go to these lists, I think we can provide more of the
information Dan S. has been requesting for the error detail. We
could list all transactions which participated in any failure and I
*think* we could show the statement which triggered the failure with
confidence that some relation accessed by that statement was
involved in the conflicts leading to the failure.

Less important than any of the above, but still significant in my
book, I fear that conflict recording and dangerous structure
detection could become very convoluted and fragile if we eliminate
this structure for committed transactions. Conflicts among specific
sets of transactions are the linchpin of this whole approach, and I
think that without an object to represent each one for the duration
for which it is significant is dangerous. Inferring information
from a variety of sources "feels" wrong to me.

-Kevin


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: drkp(at)csail(dot)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Serializable Snapshot Isolation
Date: 2010-09-23 19:30:24
Message-ID: 4C9BAAD0.8070607@enterprisedb.com
Lists: pgsql-hackers

On 23/09/10 18:08, Kevin Grittner wrote:
> Less important than any of the above, but still significant in my
> book, I fear that conflict recording and dangerous structure
> detection could become very convoluted and fragile if we eliminate
> this structure for committed transactions. Conflicts among specific
> sets of transactions are the linchpin of this whole approach, and I
> think that not having an object to represent each one for the duration
> for which it is significant is dangerous. Inferring information
> from a variety of sources "feels" wrong to me.

Ok, so if we assume that we must keep all the information we have now,
let me try again with that requirement. My aim is still to put an upper
bound on the amount of shared memory required, regardless of the number
of committed but still interesting transactions.

Cahill's thesis mentions that the per-transaction information can be
kept in a table like this:

txnID  beginTime  commitTime  inConf  outConf
100    1000       1100        N       Y
101    1000       1500        N       N
102    1200       N/A         Y       N

That maps nicely to a SLRU table, truncated from the top as entries
become old enough, and appended to the end.
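
For concreteness, each SLRU entry might be a small fixed-size struct
along these lines (a sketch only; what stands in for the two "time"
columns in PostgreSQL terms is an open question):

    typedef struct SerXactSummary
    {
        TransactionId   txnId;      /* the (committed) transaction */
        uint32          beginSeq;   /* stand-in for beginTime */
        uint32          commitSeq;  /* stand-in for commitTime */
        bool            inConf;     /* had a rw-conflict in */
        bool            outConf;    /* had a rw-conflict out */
    } SerXactSummary;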

In addition to that, we need to keep track of locks held by each
transaction, in a finite amount of shared memory. For each predicate
lock, we need to store the lock tag, and the list of transactions
holding the lock. The list of transactions is where the problem is,
there is no limit on its size.

Conveniently, we already have a way of representing an arbitrary set of
transactions with a single integer: multi-transactions, in multixact.c.

Now, we have a little issue in that read-only transactions don't have
xids, and can't therefore be part of a multixid, but it could be used as
a model to implement something similar for virtual transaction ids.

Just a thought, not sure what the performance would be like or how much
work such a multixid-like structure would be to implement..
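
The indirection being suggested would look roughly like this: a lock
keeps a single set id instead of a variable-length list of holders,
with the members stored in a multixact-style area (all names here are
hypothetical):

    typedef uint32 VirtualXactSetId;    /* multixact-like id for vxids */

    typedef struct PredicateLockEntry
    {
        PREDICATELOCKTAG    tag;        /* what is locked (rel/page/tuple) */
        VirtualXactSetId    holders;    /* every xact holding this lock */
    } PredicateLockEntry;

    /* On acquisition the set grows by one member, e.g.:
     *     lock->holders = VirtualXactSetAdd(lock->holders, myVirtualXid);
     */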

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: <drkp(at)csail(dot)mit(dot)edu>,<pgsql-hackers(at)postgresql(dot)org>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Serializable Snapshot Isolation
Date: 2010-09-23 20:19:32
Message-ID: 4C9B70040200002500035C52@gw.wicourts.gov
Lists: pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> On 23/09/10 18:08, Kevin Grittner wrote:
>> Less important than any of the above, but still significant in my
>> book, I fear that conflict recording and dangerous structure
>> detection could become very convoluted and fragile if we
>> eliminate this structure for committed transactions. Conflicts
>> among specific sets of transactions are the linchpin of this
>> whole approach, and I think that not having an object to represent
>> each one for the duration for which it is significant is
>> dangerous. Inferring information from a variety of sources
>>"feels" wrong to me.
>
> Ok, so if we assume that we must keep all the information we have
> now, let me try again with that requirement. My aim is still to
> put an upper bound on the amount of shared memory required,
> regardless of the number of committed but still interesting
> transactions.
>
> Cahill's thesis mentions that the per-transaction information can
> be kept in a table like this:
>
> txnID  beginTime  commitTime  inConf  outConf
> 100    1000       1100        N       Y
> 101    1000       1500        N       N
> 102    1200       N/A         Y       N
>
> That maps nicely to a SLRU table, truncated from the top as
> entries become old enough, and appended to the end.

Well, the inConf and outConf were later converted to pointers in
Cahill's work, and our MVCC implementation doesn't let us use times
quite that way -- we're using xmins and such, but I assume the point
holds regardless of such differences. (I mostly mention it to avoid
confusion for more casual followers of the thread.)

> In addition to that, we need to keep track of locks held by each
> transaction, in a finite amount of shared memory. For each
> predicate lock, we need to store the lock tag, and the list of
> transactions holding the lock. The list of transactions is where
> the problem is, there is no limit on its size.
>
> Conveniently, we already have a way of representing an arbitrary
> set of transactions with a single integer: multi-transactions, in
> multixact.c.
>
> Now, we have a little issue in that read-only transactions don't
> have xids, and can't therefore be part of a multixid, but it could
> be used as a model to implement something similar for virtual
> transaction ids.
>
> Just a thought, not sure what the performance would be like or how
> much work such a multixid-like structure would be to implement..

You're pointing toward some code I haven't yet laid eyes on, so it
will probably take me a few days to really digest your suggestion
and formulate an opinion. This is just to let you know I'm working
on it.

I really appreciate your attention to this. Thanks!

-Kevin


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: <drkp(at)csail(dot)mit(dot)edu>,<pgsql-hackers(at)postgresql(dot)org>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Serializable Snapshot Isolation
Date: 2010-09-24 16:17:55
Message-ID: 4C9C88E30200002500035CF0@gw.wicourts.gov
Lists: pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:

> My aim is still to put an upper bound on the amount of shared
> memory required, regardless of the number of committed but still
> interesting transactions.

> That maps nicely to a SLRU table

Well, that didn't take as long to get my head around as I feared.

I think SLRU would totally tank performance if used for this, and
would really not put much of a cap on the memory taken out of
circulation for purposes of caching. Transactions are not
referenced more heavily at the front of the list nor are they
necessarily discarded more or less in order of acquisition. In
transaction mixes where all transactions last about the same length
of time, the upper limit of interesting transactions is about twice
the number of active transactions, so memory demands are pretty
light. The problems come in where you have at least one long-lived
transaction and a lot of concurrent short-lived transactions. Since
all transactions are scanned for cleanup every time a transaction
completes, either they would all be taking up cache space or
performance would drop to completely abysmal levels as it pounded
disk. So SLRU in this case would be a sneaky way to effectively
dynamically allocate shared memory, but about two orders of
magnitude slower, at best.

Here are the things which I think might be done, in some
combination, to address your concern without killing performance:

(1) Mitigate memory demand through more aggressive cleanup. As an
example, a transaction which is READ ONLY (or which hasn't written
to a relevant table as tracked by a flag in the transaction
structure) is not of interest after commit, and can be immediately
cleaned up, unless there is an overlapping non-read-only transaction
which overlaps a committed transaction which wrote data. This is
clearly not a solution to your concern in itself, but it combines
with the other suggestions to make them more effective.
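
A sketch of the kind of commit-time test this describes (the flag and
helper names are made up):

    /* Can this transaction's SSI bookkeeping be freed at commit time? */
    static bool
    CanReleaseAtCommit(SERIALIZABLEXACT *sxact)
    {
        /* a transaction which wrote to a relevant table stays interesting */
        if (sxact->wroteData)
            return false;

        /*
         * READ ONLY, or de facto read-only: clean up right away unless
         * an overlapping non-read-only transaction also overlaps a
         * committed transaction which wrote data.
         */
        return !HasDangerousOverlap(sxact);     /* hypothetical helper */
    }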

(2) Similar to SLRU, allocate pages from shared buffers for lists,
but pin them in memory without ever writing them to disk. A buffer
could be freed when the last list item in it was freed and the
buffer count for the list was above some minimum. This could deal
with the episodic need for larger than typical amounts of RAM
without permanently taking large quantities out of circulation.
Obviously, we would still need some absolute cap, so this by itself
doesn't answer your concern, either -- it just allows the impact to
scale to the need dynamically and within bounds. It has the same
effective impact on memory usage as SLRU for this application
without the same performance penalty.

(3) Here's the meat of it. When the lists hit their maximum, have
some way to gracefully degrade the accuracy of the conflict
tracking. This is similar to your initial suggestion that once a
transaction committed we would not track it in detail, but
implemented "at need" when memory resources for tracking the detail
become exhausted. I haven't worked out all the details, but I have
a rough outline in my head. I wanted to run this set of ideas past
you before I put the work in to fully develop it. This would be an
alternative to just canceling the oldest running serializable
transaction, which is the solution we could use right now to live
within some set limit, possibly with (1) or (2) to help push back
the point at which that's necessary. Rather than deterministically
canceling the oldest active transaction, it would increase the
probability of transactions being canceled because of false
positives, with the chance we'd get through the peak without any
such cancellations.

Thoughts?

-Kevin


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, drkp(at)csail(dot)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Serializable Snapshot Isolation
Date: 2010-09-24 17:06:35
Message-ID: AANLkTi=bS5XeKtn6kywOpET03jNU5zWgA2W_DnBY5r99@mail.gmail.com
Lists: pgsql-hackers

On Fri, Sep 24, 2010 at 12:17 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> Thoughts?

Premature optimization is the root of all evil. I'm not convinced
that we should tinker with any of this before committing it and
getting some real-world experience. It's not going to be perfect in
the first version, just like any other major feature.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: <drkp(at)csail(dot)mit(dot)edu>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, <pgsql-hackers(at)postgresql(dot)org>,"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Serializable Snapshot Isolation
Date: 2010-09-24 17:35:44
Message-ID: 4C9C9B200200002500035D00@gw.wicourts.gov
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Sep 24, 2010 at 12:17 PM, Kevin Grittner
> <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
>> Thoughts?
>
> Premature optimization is the root of all evil. I'm not convinced
> that we should tinker with any of this before committing it and
> getting some real-world experience. It's not going to be perfect
> in the first version, just like any other major feature.

In terms of pure optimization, I totally agree -- that's why I'm
submitting early without a number of potential optimizations. I
think we're better off getting a solid base and then attempting to
prove the merits of each optimization separately. The point Heikki
is on about, however, gets into user-facing behavior issues. The
current implementation will give users an "out of shared memory"
error if they attempt to start a SERIALIZABLE transaction when our
preallocated shared memory for tracking such transactions reaches
its limit. A fairly easy alternative would be to kill running
SERIALIZABLE transactions, starting with the oldest, until a new
request can proceed. The question is whether either of these is
acceptable behavior for an initial implementation, or whether
something fancier is needed up front.

Personally, I'd be fine with "out of shared memory" for an excess of
SERIALIZABLE transactions for now, and leave refinement for later --
I just want to be clear that there is user-visible behavior involved
here.

-Kevin


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: drkp(at)csail(dot)mit(dot)edu, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Serializable Snapshot Isolation
Date: 2010-09-24 17:48:53
Message-ID: AANLkTikydrL0P+2=aCmchwYVyVnd-b8rzOmpdsSBnoRu@mail.gmail.com
Lists: pgsql-hackers

On Fri, Sep 24, 2010 at 1:35 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Fri, Sep 24, 2010 at 12:17 PM, Kevin Grittner
>> <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
>>> Thoughts?
>>
>> Premature optimization is the root of all evil.  I'm not convinced
>> that we should tinker with any of this before committing it and
>> getting some real-world experience.  It's not going to be perfect
>> in the first version, just like any other major feature.
>
> In terms of pure optimization, I totally agree -- that's why I'm
> submitting early without a number of potential optimizations.  I
> think we're better off getting a solid base and then attempting to
> prove the merits of each optimization separately.  The point Heikki
> is on about, however, gets into user-facing behavior issues.  The
> current implementation will give users an "out of shared memory"
> error if they attempt to start a SERIALIZABLE transaction when our
> preallocated shared memory for tracking such transactions reaches
> its limit.  A fairly easy alternative would be to kill running
> SERIALIZABLE transactions, starting with the oldest, until a new
> request can proceed.  The question is whether either of these is
> acceptable behavior for an initial implementation, or whether
> something fancier is needed up front.
>
> Personally, I'd be fine with "out of shared memory" for an excess of
> SERIALIZABLE transactions for now, and leave refinement for later --
> I just want to be clear that there is user-visible behavior involved
> here.

Yeah, I understand, but I think the only changes we should make now
are things that we're sure are improvements. I haven't read the code,
but based on reading the thread so far, we're off into the realm of
speculating about trade-offs, and I'm not sure that's a good place for
us to be.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: <drkp(at)csail(dot)mit(dot)edu>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, <pgsql-hackers(at)postgresql(dot)org>,"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Serializable Snapshot Isolation
Date: 2010-09-24 19:22:48
Message-ID: 4C9CB4380200002500035D10@gw.wicourts.gov
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> I think the only changes we should make now are things that we're
> sure are improvements.

In that vein, anyone who is considering reviewing the patch should
check the latest from the git repo or request an incremental patch.
I've committed a few things since the last patch post, but it
doesn't seem to make sense to repost the whole thing for them. I
fixed a bug in the new shared memory list code, fixed a misleading
hint, and fixed some whitespace and comment issues.

The changes I've committed to the repo so far based on Heikki's
comments are, I feel, clear improvements. It was actually fairly
embarrassing that I didn't notice some of that myself.

> based on reading the thread so far, we're off into the realm of
> speculating about trade-offs

This latest issue seems that way to me. We're talking about
somewhere around 100 kB of shared memory in a 64 bit build with the
default number of connections, with a behavior on exhaustion which
matches what we do on normal locks. This limit is easier to hit,
and we should probably revisit it, but I am eager to get the feature
as a whole in front of people, to see how well it works for them in
other respects.

I'll be quite surprised if we've found all the corner cases, but it
is working, and working well, in a variety of tests. It has been
for months, really; I've been holding back, as requested, to avoid
distracting people from the 9.0 release.

-Kevin


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, drkp(at)csail(dot)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Serializable Snapshot Isolation
Date: 2010-09-25 12:05:17
Message-ID: AANLkTikBZExgAwe5TifX4eMxFYyK9MKzRL8T+QTD+oAB@mail.gmail.com
Lists: pgsql-hackers

On Thu, Sep 23, 2010 at 4:08 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> One place I'm particularly interested in using such a feature is in
> pg_dump. Without it we have the choice of using a SERIALIZABLE
> transaction, which might fail or cause failures (which doesn't seem
> good for a backup program) or using REPEATABLE READ (to get current
> snapshot isolation behavior), which might capture a view of the data
> which contains serialization anomalies.

I'm puzzled how pg_dump could possibly have serialization anomalies.
Snapshot isolation gives pg_dump a view of the database containing all
modifications committed before it started and no modifications which
committed after it started. Since pg_dump makes no database
modifications itself it can always just be taken to occur
instantaneously before any transaction which committed after it
started.

--
greg


From: Nicolas Barbier <nicolas(dot)barbier(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, drkp(at)csail(dot)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Subject: Re: Serializable Snapshot Isolation
Date: 2010-09-25 13:31:17
Message-ID: AANLkTim1yiqQ_ryO3zmvSyUrbuvTq6fGQzA-0zOwF06Y@mail.gmail.com
Lists: pgsql-hackers

[ Forgot the list, resending. ]

2010/9/25 Greg Stark <gsstark(at)mit(dot)edu>:

> On Thu, Sep 23, 2010 at 4:08 PM, Kevin Grittner
> <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
>
>> One place I'm particularly interested in using such a feature is in
>> pg_dump. Without it we have the choice of using a SERIALIZABLE
>> transaction, which might fail or cause failures (which doesn't seem
>> good for a backup program) or using REPEATABLE READ (to get current
>> snapshot isolation behavior), which might capture a view of the data
>> which contains serialization anomalies.
>
> I'm puzzled how pg_dump could possibly have serialization anomalies.
> Snapshot isolation gives pg_dump a view of the database containing all
> modifications committed before it started and no modifications which
> committed after it started. Since pg_dump makes no database
> modifications itself it can always just be taken to occur
> instantaneously before any transaction which committed after it
> started.

I guess that Kevin is referring to [1], where the dump would take the
role of T3. That would mean that the dump itself must be aborted
because it read inconsistent data.

AFAICS, whether that reasoning means that a dump can produce an
"inconsistent" backup is debatable. After restoring, all transactions
that would have been in-flight at the moment the dump took its
snapshot are gone, so none of their effects "happened". We would be in
exactly the same situation as if all running transactions had been
forcibly aborted at the moment that the dump started.

OTOH, if one compares the backup with what really happened,
things may look inconsistent. The dump would show what T3 witnessed
(i.e., the current date is incremented and the receipts table is
empty), although the current state of the database system shows
otherwise (i.e., the current date is incremented and the receipts
table has an entry for the previous date).

IOW, one could say that the backup is consistent only if it is never
compared against the system as it continued running after the dump
took place.

This stuff will probably confuse the hell out of most DBAs :-).

Nicolas

[1] <URL:http://archives.postgresql.org/pgsql-hackers/2010-05/msg01360.php>


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, drkp(at)csail(dot)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Serializable Snapshot Isolation
Date: 2010-09-25 14:45:28
Message-ID: 16766.1285425928@sss.pgh.pa.us
Lists: pgsql-hackers

Greg Stark <gsstark(at)mit(dot)edu> writes:
> On Thu, Sep 23, 2010 at 4:08 PM, Kevin Grittner
> <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
>> One place I'm particularly interested in using such a feature is in
>> pg_dump. Without it we have the choice of using a SERIALIZABLE
>> transaction, which might fail or cause failures (which doesn't seem
>> good for a backup program) or using REPEATABLE READ (to get current
>> snapshot isolation behavior), which might capture a view of the data
>> which contains serialization anomalies.

> I'm puzzled how pg_dump could possibly have serialization anomalies.

At the moment, it can't. If this patch means that it can, that's going
to be a mighty good reason not to apply the patch.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, drkp(at)csail(dot)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Serializable Snapshot Isolation
Date: 2010-09-26 03:15:40
Message-ID: AANLkTinC_Szgzsmwak-yxXptXzAds2sV-PQtF2ftwZqP@mail.gmail.com
Lists: pgsql-hackers

On Sat, Sep 25, 2010 at 10:45 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Greg Stark <gsstark(at)mit(dot)edu> writes:
>> On Thu, Sep 23, 2010 at 4:08 PM, Kevin Grittner
>> <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
>>> One place I'm particularly interested in using such a feature is in
>>> pg_dump. Without it we have the choice of using a SERIALIZABLE
>>> transaction, which might fail or cause failures (which doesn't seem
>>> good for a backup program) or using REPEATABLE READ (to get current
>>> snapshot isolation behavior), which might capture a view of the data
>>> which contains serialization anomalies.
>
>> I'm puzzled how pg_dump could possibly have serialization anomalies.
>
> At the moment, it can't.  If this patch means that it can, that's going
> to be a mighty good reason not to apply the patch.

It certainly can, as can any other read-only transaction. This has
been discussed many times here before with detailed examples, mostly
by Kevin. T0 reads A and writes B. T1 then reads B and writes C. T0
commits. pg_dump runs. T1 commits. What is the fully serial order
of execution consistent with this chronology? Clearly, T1 must be run
before T0, since it doesn't see T0's update to B. But pg_dump sees
the effects of T0 but not T1, so T0 must be run before T1. Oops. Now
you might say that this won't be a problem for most people in
practice, and I think that's true, but it's still unserializable. And
pg_dump is the reason, because otherwise T1 then T0 would be a valid
serialization.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company