Serializable snapshot isolation patch

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Kevin(dot)Grittner(at)wicourts(dot)gov
Subject: Serializable snapshot isolation patch
Date: 2010-10-18 05:53:45
Message-ID: 1287381225.8516.516.camel@jdavis
Lists: pgsql-hackers

This is based on Kevin's git repo at:

git://git.postgresql.org/git/users/kgrittn/postgres.git
SHA1: 729541fa5ea94d66e6f4b22fb65bfef92214cd6b

* Trivial stuff:

I get a compiler warning:

indexfsm.c: In function ‘RecordFreeIndexPage’:
indexfsm.c:55: warning: implicit declaration of function
‘PageIsPredicateLocked’

* Open issues, as I see it:

1. 2PC and SSI don't mix (this may be a known issue, because there's not
really any code in the current implementation to deal with 2PC):

Session1:
BEGIN ISOLATION LEVEL SERIALIZABLE;
select count(*) from a;
insert into a values(1);
PREPARE TRANSACTION 't1';

Session2:
BEGIN ISOLATION LEVEL SERIALIZABLE;
select count(*) from a;
insert into a values(1);
COMMIT;

Session1:
COMMIT PREPARED 't1';

Looks like we need to track information about prepared transactions in
shared memory. I think you'll need to keep the information in the 2PC
state file as well, so that it can be rebuilt after a crash or restart.
It all looks solvable at first glance, but it might be some work.

2. I think there's a GiST bug (illustrating with PERIOD type):

create table foo(p period);
create index foo_idx on foo using gist (p);
insert into foo select period(
'2009-01-01'::timestamptz + g * '1 microsecond'::interval,
'2009-01-01'::timestamptz + (g+1) * '1 microsecond'::interval)
from generate_series(1,2000000) g;

Session1:
begin isolation level serializable;
select * from foo where p && '[2009-01-01, 2009-01-01]'::period;
insert into foo values('[2009-01-01, 2009-01-01]'::period);

Session2:
begin isolation level serializable;
select * from foo where p && '[2009-01-01, 2009-01-01]'::period;
insert into foo values('[2009-01-01, 2009-01-01]'::period);
commit;

Session1:
commit;

In pg_locks (didn't paste here due to formatting), it looks like the
SIRead locks are held on different pages. Can you clarify your
design for GiST and the interaction with page-level locks? It looks like
you're making some assumption about which pages will be visited when
searching for conflicting values which doesn't hold true. However, that
seems odd, because even if the value is actually inserted in one
transaction, the other doesn't seem to find the conflict. Perhaps the
bug is simpler than that? Or perhaps I have some kind of odd bug in
PERIOD's gist implementation?

Also, it appears to be non-deterministic, to a degree at least, so you
may not observe the problem in the exact way that I do.

3. Limited shared memory space to hold information about committed
transactions that are still "interesting". Relevant thread:

http://archives.postgresql.org/pgsql-hackers/2010-09/msg01735.php

It's a challenging problem, however, and the current solution is less
than ideal. Idle transactions can mean that all new serializable
transactions fail until the idle transactions start to terminate. I
don't like that very much, because people expect to have to retry
serializable transactions, but retrying here has no real hope (except
that some time has elapsed, and maybe the other transaction decided to
commit).

A comparison is made (in the aforementioned thread) to the existing
limitation on the number of locks. However, it's much easier to
understand normal locks, and for a given workload usually you can put an
upper bound on the number of locks required (right?).

Does it make sense to kill the existing transactions that are holding
everything up, rather than the new transaction? Or would that just
confuse matters more? This does not necessarily guarantee that progress
can be made, either, but intuitively it seems more likely.

4. A few final details:

a. We should probably have a compatibility GUC that makes SERIALIZABLE
equal to REPEATABLE READ. My opinion is that this should be only for
compatibility, and should default to "off" (i.e. SSI code enabled)
either in 9.1 or soon after.

b. Docs.

* Questions:

1. For TargetTagIsCoveredBy(), why is it impossible for the covering tag
to have an offset?

2. The explanation for SerializablePredicateLockListLock is a little
confusing. It makes it sound like other processes can't walk the list,
but they can modify it?

* Summary

Great patch! I didn't make it through the patch in as much detail as I
would have liked, because the theory behind it is quite complex and it
will take longer for me to absorb. But the implementation looks good and
the use case is very important.

Regards,
Jeff Davis


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Jeff Davis" <pgsql(at)j-davis(dot)com>,<pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Serializable snapshot isolation patch
Date: 2010-10-18 18:26:07
Message-ID: 4CBC4AEF0200002500036AB2@gw.wicourts.gov

First off, thanks for the review! I know that it's a lot of work,
and I do appreciate it.

Jeff Davis <pgsql(at)j-davis(dot)com> wrote:

> * Trivial stuff:
>
> I get a compiler warning:
>
> indexfsm.c: In function *RecordFreeIndexPage*:
> indexfsm.c:55: warning: implicit declaration of function
> *PageIsPredicateLocked*

Oops. Fixed on my git repo now.

> * Open issues, as I see it:
>
> 1. 2PC and SSI don't mix (this may be a known issue, because
> there's not really any code in the current implementation to deal
> with 2PC):

> Looks like we need to track information about prepared
> transactions in shared memory. I think you'll need to keep the
> information in the 2PC state file as well, so that it can be
> rebuilt after a crash or restart. It all looks solvable at first
> glance, but it looks like it might be some work.

Argh! I didn't anticipate an interaction with prepared
transactions, probably because I've never used them or taken a close
look at them. I'll take a look.

> 2. I think there's a GiST bug (illustrating with PERIOD type):

> Can you clarify your design for GiST and the interaction with
> page-level locks? It looks like you're making some assumption
> about which pages will be visited when searching for conflicting
> values which doesn't hold true. However, that seems odd, because
> even if the value is actually inserted in one transaction, the
> other doesn't seem to find the conflict. Perhaps the bug is
> simpler than that? Or perhaps I have some kind of odd bug in
> PERIOD's gist implementation?

My assumptions for GiST were that:

(1) A search for a matching value could bail out at any level in the
tree; there is no requirement for the search to proceed to the leaf
level to get a negative index probe.

(2) An index insertion which would cause a row to be found if an
earlier search was repeated must modify some page which was read by
the earlier search.

(3) Because of the above, it is necessary and sufficient to acquire
an SIRead lock on all pages visited in a search of a GiST index, and
flag a conflict on insertion into a locked page at any level of the
index.
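In toy form, assumptions (2) and (3) amount to something like the
following (all names here are illustrative, not the patch's actual
structures):

```python
# Toy sketch of the page-level predicate-locking rule described above.
# PredicateLockTable and its methods are invented for illustration.

class PredicateLockTable:
    def __init__(self):
        self.siread_locks = {}  # page id -> set of reader transaction ids

    def record_search(self, xact, pages_visited):
        # Rule (3): SIRead-lock every page visited, at any level of the tree.
        for page in pages_visited:
            self.siread_locks.setdefault(page, set()).add(xact)

    def record_insert(self, xact, pages_modified):
        # Rule (2): a conflicting insert must touch some page a prior
        # search read, so flag a rw-conflict for each such reader.
        conflicts = set()
        for page in pages_modified:
            conflicts |= self.siread_locks.get(page, set()) - {xact}
        return conflicts

locks = PredicateLockTable()
locks.record_search("T1", ["root"])   # search bailed out at the root
locks.record_search("T2", ["root"])
print(locks.record_insert("T1", ["root", "leaf1"]))  # -> {'T2'}
```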

That logic still seems sound to me, so if someone else sees a flaw
in it, please point it out. Assuming that logic is sound, I'll poke
around to see where the flaw in implementation may be. If you have
a full self-contained test case to demonstrate the failure here,
could you send it to me?

> Also, it appears to be non-deterministic, to a degree at least, so
> you may not observe the problem in the exact way that I do.

Yeah, I have tested this area without seeing the failure, so that
self-contained example would be a big help.

> 3. Limited shared memory space to hold information about committed
> transactions that are still "interesting". Relevant thread:
>
> http://archives.postgresql.org/pgsql-hackers/2010-09/msg01735.php
>
> It's a challenging problem, however, and the current solution is
> less than ideal.

I'd go further than that and say that it clearly needs to be fixed.
The scale of the issue was somewhat disguised in my testing when I
was using the hash table for managing acquisition of shared memory
for what was essentially an unordered list. Now that I've made it a
proper list with a hard limit on entries, the problem is pretty easy
to provoke, and just reserving space for A Very Large Number Of
Entries is clearly not an adequate solution. :-(

> Idle transactions can mean that all new serializable transactions
> fail until the idle transactions start to terminate. I don't like
> that very much, because people expect to have to retry
> serializable transactions, but retrying here has no real hope
> (except that some time has elapsed, and maybe the other
> transaction decided to commit).

Agreed.

> Does it make sense to kill the existing transactions that are
> holding everything up, rather than the new transaction? Or would
> that just confuse matters more? This does not necessarily
> guarantee that progress can be made, either, but intuitively it
> seems more likely.

Canceling an old transaction to allow new transactions to begin
*might* be better than the current situation, but I think we can and
should do better. The best I have been able to come up with is in
this post:

http://archives.postgresql.org/pgsql-hackers/2010-09/msg01724.php

There's a fair amount of work there, so I was hoping for some
confirmation that this combination of steps was both sane and had
some chance of being considered sufficient and acceptable before
diving in to code it. I also *really* hope to add the "SERIALIZABLE
READ ONLY DEFERRABLE" mode so that pg_dump and other read-only
transactions don't push things into a state where the rollback rate
spikes:

http://archives.postgresql.org/pgsql-hackers/2010-09/msg01771.php

> 4. A few final details:
>
> a. We should probably have a compatibility GUC that makes
> SERIALIZABLE equal to REPEATABLE READ. My opinion is that this
> should be only for compatibility, and should default to "off"
> (i.e. SSI code enabled) either in 9.1 or soon after.

I'm inclined to agree with you, with the strong preference for a 9.1
setting of off. This was previously discussed, and there were
people who felt that we should avoid a "behavior-changing" GUC like
that, so I didn't add it. It wouldn't be hard to put it in, if
that's the community consensus.

> b. Docs.
>
> * Questions:
>
> 1. For TargetTagIsCoveredBy(), why is it impossible for the
> covering tag to have an offset?

Because a "covering" lock is a lock at a coarser granularity than
the "covered" lock, which includes what the finer-grained lock
covers. Since the tuple level is as fine as we take it, and we only
use the offset for tuple locks, a covering lock would not have that
set. (We also check for duplicate locks.) I'll expand that comment
a bit, since it obviously wasn't clear enough.

> 2. The explanation for SerializablePredicateLockListLock is a
> little confusing. It makes it sound like other processes can't
> walk the list, but they can modify it?

That's a bit unconventional, I admit, but it was a big part of
bringing the benchmarks for a fully cached, read-only transaction
load down to running just 1.8% longer than REPEATABLE READ (which is
the same as current SERIALIZABLE). I'm open to other locking
strategies, or an explanation of where this one doesn't work, but
I'd want to benchmark carefully. Since you took it to mean the
right thing, but found that surprising enough to doubt that it was
accurate, do you have any particular suggestions about how to
clarify it? (Maybe just, "Yeah, that's unconventional, but it works
and is faster than the alternatives we've tried."?)

> * Summary
>
> Great patch!

Thanks!

> I didn't make it through the patch in as much detail as I would
> have liked, because the theory behind it is quite complex and it
> will take longer for me to absorb.

Yeah, there are some mind-bending ideas in there. If I recall
correctly, Dan said he spent one week reading the academic papers on
which it was based and another week reading the patch before he felt
comfortable starting to code the parts he wrote. I spent a lot more
time than that reading papers, and I still didn't quite have my
head around some of the concepts until I went back to some of
Fekete's work which led up to this innovation. While many of the
concepts go back to the '80s and '90s, I found Fekete's work
"clicked" with me better than some of the earlier, more theoretical
work, because he was looking at real-world examples of how these
anomalies manifest in actual, working production software and what
techniques
could be used to identify and mitigate the problems. Sometimes the
theory is hard for me to properly digest without such concrete
details.

> But the implementation looks good and the use case is very
> important.

It certainly is for our shop, and I've heard from others who are
very interested in it. I think that it's generally of more
benefit to "big shops" than situations where you have just a few
programmers coding against a schema with just a few dozen tables;
although it's also somewhat a matter of style: I would rather deal
with *one* issue (serialization failures) in one place than scatter
defensive code all over my application codebase, even on a moderate
scale.

Anyway, given the issues raised and the date, it seems reasonable to
mark this as "Returned with Feedback" for CF management purposes.
I'll get fixes out as soon as I can.

-Kevin


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Serializable snapshot isolation patch
Date: 2010-10-18 20:18:41
Message-ID: 1287433121.15261.64.camel@jdavis-ux.asterdata.local

On Mon, 2010-10-18 at 13:26 -0500, Kevin Grittner wrote:
> > 2. I think there's a GiST bug (illustrating with PERIOD type):
>
> My assumptions for GiST were that:
>
> (1) A search for a matching value could bail out at any level in the
> tree; there is no requirement for the search to proceed to the leaf
> level to get a negative index probe.
>
> (2) An index insertion which would cause a row to be found if an
> earlier search was repeated must modify some page which was read by
> the earlier search.
>
> (3) Because of the above, it is necessary and sufficient to acquire
> an SIRead lock on all pages visited in a search of a GiST index, and
> flag a conflict on insertion into a locked page at any level of the
> index.
>
> That logic still seems sound to me, so if someone else sees a flaw
> in it, please point it out.

Looks sound to me, as well.

> > Also, it appears to be non-deterministic, to a degree at least, so
> > you may not observe the problem in the exact way that I do.
>
> Yeah, I have tested this area without seeing the failure, so that
> self-contained example would be a big help.

I assume here that you mean that you _did_ see the failure
(serialization error) and therefore did not see the problem? Also, are
you sure it was using the GiST index for the searches and didn't just
get a full table lock or full index lock?

I'll try to narrow it down. It could be a problem with PERIOD, or maybe
it wasn't using the GiST index for a search when I thought it was (I
didn't run EXPLAIN at every point, so I'll double-check). Clearly, a
problem exists somewhere though.

> > 3. Limited shared memory space to hold information about committed
> > transactions that are still "interesting". Relevant thread:
> >
> > http://archives.postgresql.org/pgsql-hackers/2010-09/msg01735.php
> >
> > It's a challenging problem, however, and the current solution is
> > less than ideal.
>
> I'd go further than that and say that it clearly needs to be fixed.

OK, this will remain an open issue then.

> Canceling an old transaction to allow new transactions to begin
> *might* be better than the current situation, but I think we can and
> should do better. The best I have been able to come up with is in
> this post:
>
> http://archives.postgresql.org/pgsql-hackers/2010-09/msg01724.php

At first I didn't like that approach much, but after re-reading it, I am
more optimistic. What I like most about it is that a transaction won't
be canceled if it doesn't interfere at all (e.g. operating on different
tables). There are still edge cases where it can be canceled, like when
the memory is so exhausted that it can't even track a couple table-level
locks, but that doesn't sound as bad (closer to the current restriction
on the total number of locks).

Ultimately I think this will be one of those problems with no 100% clean
solution. However, I think it's important that we try to minimize these
degenerate cases. Eventually, we may need to keep statistics about the
number of conflicts happening, and start to rate-limit new serializable
transactions (effectively reducing parallelism) to ensure that
reasonable progress can be made (hopefully faster than serial
execution).

> There's a fair amount of work there, so I was hoping for some
> confirmation that this combination of steps was both sane and had
> some chance of being considered sufficient and acceptable before
> diving in to code it. I also *really* hope to add the "SERIALIZABLE
> READ ONLY DEFERRABLE" mode so that pg_dump and other read-only
> transactions don't push things into a state where the rollback rate
> spikes:
>
> http://archives.postgresql.org/pgsql-hackers/2010-09/msg01771.php

One potential issue there is starvation. I don't see how you would
guarantee that there will ever be a point to grab a safe snapshot, even
if all of the other transactions are completing.

> > 4. A few final details:
> >
> > a. We should probably have a compatibility GUC that makes
> > SERIALIZABLE equal to REPEATABLE READ. My opinion is that this
> > should be only for compatibility, and should default to "off"
> > (i.e. SSI code enabled) either in 9.1 or soon after.
>
> I'm inclined to agree with you, with the strong preference for a 9.1
> setting of off. This was previously discussed, and there were
> people who felt that we should avoid a "behavior-changing" GUC like
> that, so I didn't add it. It wouldn't be hard to put it in, if
> that's the community consensus.

I think that was me, but I'm OK with behavior-changing GUCs as long as
they are there for compatibility and we intend to push people toward the
new behavior in the long run (like standard_conforming_strings, but
hopefully not as long-lived).

Maybe it's not necessary at all, because your patch offers strictly
better guarantees (at the cost of more serialization failures), and if
they want the old behavior then REPEATABLE READ is already there.

> > 1. For TargetTagIsCoveredBy(), why is it impossible for the
> > covering tag to have an offset?
>
> Because a "covering" lock is a lock at a coarser granularity than
> the "covered" lock, which includes what the finer-grained lock
> covers. Since the tuple level is as fine as we take it, and we only
> use the offset for tuple locks, a covering lock would not have that
> set. (We also check for duplicate locks.) I'll expand that comment
> a bit, since it obviously wasn't clear enough.

I see. So a lock can't ever cover itself?

> > 2. The explanation for SerializablePredicateLockListLock is a
> > little confusing. It makes it sound like other processes can't
> > walk the list, but they can modify it?
>
> That's a bit unconventional, I admit, but it was a big part of
> bringing the benchmarks for a fully cached, read-only transaction
> load down to running just 1.8% longer than REPEATABLE READ (which is
> the same as current SERIALIZABLE). I'm open to other locking
> strategies, or an explanation of where this one doesn't work, but
> I'd want to benchmark carefully. Since you took it to mean the
> right thing, but found that surprising enough to doubt that it was
> accurate, do you have any particular suggestions about how to
> clarify it? (Maybe just, "Yeah, that's unconventional, but it works
> and is faster than the alternatives we've tried."?)

When using locks in an unconventional way, it would be helpful to
describe the invalid schedules that you're preventing. Perhaps an
example if you think it would be reasonably simple? Also some indication
of how another process is intended to modify the list without walking
it.

> It certainly is for our shop, and I've heard from others who are
> very interested in it. I think that it's generally of more
> benefit to "big shops" than situations where you have just a few
> programmers coding against a schema with just a few dozen tables;
> although it's also somewhat a matter of style: I would rather deal
> with *one* issue (serialization failures) in one place than scatter
> defensive code all over my application codebase, even on a moderate
> scale.

It's a complexity-reducing and correctness-increasing feature, which is
generally good.

Regards,
Jeff Davis


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Kevin(dot)Grittner(at)wicourts(dot)gov
Subject: Re: Serializable snapshot isolation patch
Date: 2010-10-21 05:47:42
Message-ID: 1287640062.8516.598.camel@jdavis

On Sun, 2010-10-17 at 22:53 -0700, Jeff Davis wrote:
> 2. I think there's a GiST bug (illustrating with PERIOD type):
>
> create table foo(p period);
> create index foo_idx on foo using gist (p);
> insert into foo select period(
> '2009-01-01'::timestamptz + g * '1 microsecond'::interval,
> '2009-01-01'::timestamptz + (g+1) * '1 microsecond'::interval)
> from generate_series(1,2000000) g;
>
> Session1:
> begin isolation level serializable;
> select * from foo where p && '[2009-01-01, 2009-01-01]'::period;
> insert into foo values('[2009-01-01, 2009-01-01]'::period);
>
> Session2:
> begin isolation level serializable;
> select * from foo where p && '[2009-01-01, 2009-01-01]'::period;
> insert into foo values('[2009-01-01, 2009-01-01]'::period);
> commit;
>
> Session1:
> commit;
>
> In pg_locks (didn't paste here due to formatting), it looks like the
> SIRead locks are held on different pages. Can you clarify your
> design for GiST and the interaction with page-level locks? It looks like
> you're making some assumption about which pages will be visited when
> searching for conflicting values which doesn't hold true. However, that
> seems odd, because even if the value is actually inserted in one
> transaction, the other doesn't seem to find the conflict. Perhaps the
> bug is simpler than that? Or perhaps I have some kind of odd bug in
> PERIOD's gist implementation?
>
> Also, it appears to be non-deterministic, to a degree at least, so you
> may not observe the problem in the exact way that I do.
>

I have more information on this failure. Everything in GiST actually
looks fine. I modified the example slightly:

T1: begin isolation level serializable;
T2: begin isolation level serializable;
T1: select * from foo where p && '[2009-01-01, 2009-01-01]'::period;
T2: select * from foo where p && '[2009-01-01, 2009-01-01]'::period;
T2: commit;
T1: commit;

The SELECTs only look at the root and the predicate doesn't match. So
each SELECT sets an SIReadLock on block 0 and exits the search. Looks
good so far.

T1 then inserts, and it has to modify page 0, so it does
FlagRWConflict(). That sets writer->inConflict = reader and
reader->outConflict = writer (where writer is T1 and reader is T2); and
T1->outConflict and T2->inConflict remain NULL.

Then T2 inserts, and I didn't catch that part in as much detail in gdb,
but it apparently has no effect on that state, so we still have
T1->inConflict = T2, T1->outConflict = NULL, T2->inConflict = NULL, and
T2->outConflict = T1.
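
In toy form, the state I observed looks like this (a sketch with field
names borrowed from the patch's SERIALIZABLEXACT, not the actual
structures):

```python
# Minimal model of the rw-conflict state described above.

class Sxact:
    def __init__(self, name):
        self.name = name
        self.inConflict = None   # a reader that saw data we later wrote
        self.outConflict = None  # a writer that overwrote data we read

def flag_rw_conflict(reader, writer):
    # FlagRWConflict as observed in gdb: one edge in each direction.
    writer.inConflict = reader
    reader.outConflict = writer

t1, t2 = Sxact("T1"), Sxact("T2")

# T1's insert modifies page 0, which T2 holds an SIRead lock on:
flag_rw_conflict(reader=t2, writer=t1)

# T2's insert touches only an unlocked leaf page, so no second call is
# made -- which is exactly the missing-conflict symptom in the example.
assert t1.inConflict is t2 and t2.outConflict is t1
assert t1.outConflict is None and t2.inConflict is None
```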

That looks like a reasonable state to me, but I'm not sure exactly what
the design calls for. I am guessing that the real problem is in
PreCommit_CheckForSerializationFailure(), where there are 6 conditions
that must be met for an error to be thrown. T2 falls out right away at
condition 1. T1 falls out on condition 4. I don't really understand
condition 4 at all -- can you explain it? And can you explain conditions
5 and 6 too?

Regards,
Jeff Davis


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Jeff Davis" <pgsql(at)j-davis(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Serializable snapshot isolation patch
Date: 2010-10-21 14:27:08
Message-ID: 4CC0076C0200002500036BF8@gw.wicourts.gov

Jeff Davis <pgsql(at)j-davis(dot)com> wrote:

>> Also, it appears to be non-deterministic, to a degree at least,
>> so you may not observe the problem in the exact way that I do.
> The SELECTs only look at the root and the predicate doesn't match.
> So each SELECT sets an SIReadLock on block 0 and exits the search.
> Looks good so far.
>
> T1 then inserts, and it has to modify page 0, so it does
> FlagRWConflict(). That sets writer->inConflict = reader and
> reader->outConflict = writer (where writer is T1 and reader is
> T2); and T1->outConflict and T2->inConflict remain NULL.
>
> Then T2 inserts, and I didn't catch that part in as much detail in
> gdb, but it apparently has no effect on that state, so we still
> have T1->inConflict = T2, T1->outConflict = NULL, T2->inConflict =
> NULL, and T2->outConflict = T1.

I now see where the wheels fall off. The GiST query initially stops
at a high level, so predicate locks only go that deep, and the
*first* insert of a conflicting row must ripple up and modify a
locked page; but *subsequent* inserts may only need to modify the
leaf level. Even though your particular example doesn't involve a
cycle and therefore doesn't require a rollback for correctness
(although it might tend to generate a false positive if index page
locks were working correctly), you've exposed a flaw in the
GiST AM implementation of predicate locks.

On a first glance, it appears that we would need to check for
conflicts as we move down through the index to find the right spot
for an insert, not as we modify pages for the insert. I hope
there's some more subtle technique or some way to qualify it;
otherwise a search which stops at the root page would generate a
conflict out to just about any index insertion from a concurrent
transaction.

I will add this to my list of issues to fix based on your review,
unless it's something you would like to tackle -- I'm not going to
chase away anyone who wants to help with this. :-)

-Kevin


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Jeff Davis" <pgsql(at)j-davis(dot)com>,<pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Serializable snapshot isolation patch
Date: 2010-10-21 15:29:02
Message-ID: 4CC015EE0200002500036C01@gw.wicourts.gov

Jeff Davis <pgsql(at)j-davis(dot)com> wrote:

> That looks like a reasonable state to me, but I'm not sure exactly
> what the design calls for. I am guessing that the real problem is
> in PreCommit_CheckForSerializationFailure(), where there are 6
> conditions that must be met for an error to be thrown. T2 falls
> out right away at condition 1. T1 falls out on condition 4. I
> don't really understand condition 4 at all -- can you explain it?
> And can you explain conditions 5 and 6 too?

Since most transactions are rolled back on a conflict detection
during a read or write attempt, there are only a few very specific
conditions which can "slip through" to where they need to be
detected on commit. Here's the code with the six conditions:

if (MySerializableXact->inConflict != InvalidSerializableXact
    && MySerializableXact->inConflict != MySerializableXact
    && !(MySerializableXact->inConflict->rolledBack)
    && MySerializableXact->inConflict->inConflict != InvalidSerializableXact
    && !SxactIsCommitted(MySerializableXact->inConflict)
    && !SxactIsCommitted(MySerializableXact->inConflict->inConflict))

Condition 4 is testing whether MySerializableXact is on the "out"
side of a pivot -- in the parlance of most examples, is
MySerializableXact TN?

Condition 5 and 6 confirm that neither T0 nor T1 have committed
first; we can only have a problem if TN commits first.

Basically, when we already have a pivot, but no transaction has yet
committed, we wait to see if TN commits first. If so, we have a
problem; if not, we don't. There's probably some room for improving
performance by cancelling T0 or T1 instead of TN, at least some of
the time; but in this pass we are always cancelling the transaction
in whose process we detect the need to cancel something.
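
In pseudo-Python, the check amounts to something like this (illustrative
only; the names and structure are mine, not the actual code):

```python
# Sketch of the commit-time check walked through above: the committing
# transaction aborts only if it is TN, the "out" side of a pivot
# T0 -> T1 -> TN, and would be the first of the three to commit.

class Sxact:
    def __init__(self, name):
        self.name = name
        self.inConflict = None
        self.rolledBack = False
        self.committed = False

def must_abort_at_commit(me):
    t1 = me.inConflict                  # cond 1: someone conflicts in to me
    if t1 is None or t1 is me:          # conds 1 & 2: a real, distinct xact
        return False
    if t1.rolledBack:                   # cond 3: that xact still matters
        return False
    t0 = t1.inConflict                  # cond 4: t1 is itself a pivot,
    if t0 is None:                      # i.e. "me" is TN in T0 -> T1 -> TN
        return False
    if t1.committed or t0.committed:    # conds 5 & 6: neither T0 nor T1
        return False                    # committed first
    return True

t0, t1, tn = Sxact("T0"), Sxact("T1"), Sxact("TN")
t1.inConflict = t0
tn.inConflict = t1
assert must_abort_at_commit(tn) is True   # TN committing first: dangerous
assert must_abort_at_commit(t1) is False  # falls out at condition 4
```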

-Kevin


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Serializable snapshot isolation patch
Date: 2010-10-21 16:47:49
Message-ID: 1287679669.8516.618.camel@jdavis

On Thu, 2010-10-21 at 10:29 -0500, Kevin Grittner wrote:
> Basically, when we already have a pivot, but no transaction has yet
> committed, we wait to see if TN commits first. If so, we have a
> problem; if not, we don't. There's probably some room for improving
> performance by cancelling T0 or T1 instead of TN, at least some of
> the time; but in this pass we are always cancelling the transaction
> in whose process we detect the need to cancel something.

Well, in this case we do clearly have a problem, because the result is
not equal to the serial execution of the transactions in either order.

So the question is: at what point is the logic wrong? It's either:
1. PreCommit_CheckForSerializationFailure() is missing a failure case.
2. The state prior to entering that function (which I believe I
sufficiently described) is wrong.

If it's (2), then what should the state look like, and how is the GiST
code supposed to result in that state?

I know some of these questions are answered in the relevant research,
but I'd like to discuss this concrete example specifically.

Regards,
Jeff Davis


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Jeff Davis" <pgsql(at)j-davis(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Serializable snapshot isolation patch
Date: 2010-10-21 18:37:50
Message-ID: 4CC0422E0200002500036C46@gw.wicourts.gov

Jeff Davis <pgsql(at)j-davis(dot)com> wrote:

> in this case we do clearly have a problem, because the result is
> not equal to the serial execution of the transactions in either
> order.

Yeah, you're right. I misread that example -- newbie with the
PERIOD type.

> So the question is: at what point is the logic wrong? It's either:
> 1. PreCommit_CheckForSerializationFailure() is missing a failure
> case.
> 2. The state prior to entering that function (which I believe I
> sufficiently described) is wrong.

It's (2), for the reasons I described in my previous email. Even
though I misread the specifics of your example, I was close enough to
identify where the problem was. :-/

> If it's (2), then what should the state look like, and how is the
> GiST code supposed to result in that state?

The second insert should create conflicts similar to what the first
did, but in the other direction -- simple write skew. How GiST is
supposed to catch this is the big question. My logic that a
conflicting insert will modify a page read by the other transaction
only holds until someone inserts a conflicting entry. That's why it
wasn't reliably failing until you had an example where both
transactions access the same leaf page.

In your example, session 1's insert creates the leaf entry and
propagates entries up to the root. When session 2 inserts, it
can just modify the leaf, so the conflict is missed. As I said, the
most obvious way to fix this is to look for conflicts while
descending to the leaf for an insert. I'm almost sure we can do
better than that, but I haven't finished thinking it through. A
rough idea might be that when we find a conflict on an insert, we
acquire additional predicate locks on everything between the lowest
point of conflict and the leaf; the rest of the logic would remain
as-is. I haven't finished mulling that over, but it seems likely to
work. If we did that, session 2 would detect the conflict on the
insert to the leaf, and all would be well.
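
A toy sketch of that rough idea (names and structure are mine, purely
to illustrate; not the patch's code):

```python
# When an insert hits a page carrying SIRead locks, extend those
# readers' predicate locks from the point of conflict down to the
# leaf, so a later conflicting insert that only touches the leaf is
# still detected.

siread_locks = {}  # page id -> set of reader transaction ids

def lock(page, xact):
    siread_locks.setdefault(page, set()).add(xact)

def insert(xact, path):
    """path: pages modified by the insert, ordered root -> leaf."""
    conflicts = set()
    for i, page in enumerate(path):
        readers = siread_locks.get(page, set())
        if readers - {xact}:
            conflicts |= readers - {xact}
            # Proposed rule: push every reader's lock down to the leaf.
            for lower in path[i + 1:]:
                for r in readers:
                    lock(lower, r)
    return conflicts

lock("root", "T1")   # both searches stopped at the root
lock("root", "T2")
print(insert("T1", ["root", "leafA"]))  # -> {'T2'}; locks pushed to leafA
print(insert("T2", ["leafA"]))          # -> {'T1'}; write skew now caught
```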

-Kevin


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Jeff Davis" <pgsql(at)j-davis(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Serializable snapshot isolation patch
Date: 2010-10-21 23:33:12
Message-ID: 4CC087680200002500036C93@gw.wicourts.gov

Jeff Davis <pgsql(at)j-davis(dot)com> wrote:

> When using locks in an unconventional way, it would be helpful to
> describe the invalid schedules that you're preventing. Perhaps an
> example if you think it would be reasonably simple? Also some
> indication of how another process is intended to modify the list
> without walking it.

I've just pushed some comment changes intended to address this. Did
I hit the mark?

-Kevin

P.S. Sorry for the delay in responding to such simple requests --
I've been tied up with a family medical crisis; I hope to crank
through much of what you've raised this weekend.