Two-phase commit

From:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Two-phase commit
Date:	2004-02-04 20:22:16
Message-ID:	Pine.OSF.4.58.0402042200330.238747@kosh.hut.fi
Lists:	pgsql-hackers pgsql-patches

I've been very slowly continuing my work on two-phase commits for a couple
months now, and I now have my original patch updated so that it applies to
the current CVS tip, with some improvements.

The patch introduces three new commands, PREPCOMMIT, COMMITPREPARED and
ABORTPREPARED.

To start a 2PC transaction, you first do a BEGIN and your updates as
usual. At the end of the transaction, you call PREPCOMMIT 'foobar' instead
of COMMIT. Now the transaction is in prepared state, ready to commit at a
later time. 'foobar' is the global transaction identifier assigned for the
transaction.

Later, when you want to finish the second phase, you call
COMMITPREPARED 'foobar';

There is a system view pg_prepared_xacts that gives you all transactions
that are in prepared state waiting for COMMITPREPARED or ABORTPREPARED.

I have also done some work on XA-enabling the JDBC drivers, now that we
have what it takes on the server side. I have successfully executed
2PC transactions with JBossMQ and Postgres, using JBoss as the
transaction manager, so the basic stuff seems to be working.

Please have a look and comment, the patches can be found here:
http://www.iki.fi/hlinnaka/pgsql/

What is the schedule for 7.5? Any chance of getting this in?

- Heikki

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Two-phase commit
Date:	2004-02-08 01:38:24
Message-ID:	200402080138.i181cPl15259@candle.pha.pa.us

Heikki Linnakangas wrote:
> I've been very slowly continuing my work on two-phase commits for a couple
> months now, and I now have my original patch updated so that it applies to
> the current CVS tip, with some improvements.
>
> The patch introduces three new commands, PREPCOMMIT, COMMITPREPARED and
> ABORTPREPARED.
>
> To start a 2PC transaction, you first do a BEGIN and your updates as
> usual. At the end of the transaction, you call PREPCOMMIT 'foobar' instead
> of COMMIT. Now the transaction is in prepared state, ready to commit at a
> later time. 'foobar' is the global transaction identifier assigned for the
> transaction.
>
> Later, when you want to finish the second phase, you call
> COMMITPREPARED 'foobar';
>
> There is a system view pg_prepared_xacts that gives you all transactions
> that are in prepared state waiting for COMMITPREPARED or ABORTPREPARED.
>
> I have also done some work on XA-enabling the JDBC drivers, now that we
> have what it takes on the server side. I have successfully executed
> 2PC transactions with JBossMQ and Postgres, using JBoss as the
> transaction manager, so the basic stuff seems to be working.
>
> Please have a look and comment, the patches can be found here:
> http://www.iki.fi/hlinnaka/pgsql/
>
> What is the schedule for 7.5? Any chance of getting this in?

7.5 is certainly possible. We are months away from beta on 7.5 and I
would like to see two-phase commit included. One item that has come up
in past discussions is a way of recording two-phase commit failures to
the administrator in cases where you precommit, get a reply, commit,
then the remote machine disappears.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073

From:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To:	"Jeroen T(dot) Vermeulen" <jtv(at)xs4all(dot)nl>
Cc:	PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Two-phase commit
Date:	2004-02-09 20:09:34
Message-ID:	Pine.OSF.4.58.0402092201460.226205@kosh.hut.fi

On Sun, 8 Feb 2004, Jeroen T. Vermeulen wrote:

> On Wed, Feb 04, 2004 at 10:22:16PM +0200, Heikki Linnakangas wrote:
>
> > There is a system view pg_prepared_xacts that gives you all transactions
> > that are in prepared state waiting for COMMITPREPARED or ABORTPREPARED.
>
> Great to hear that you've gotten so far with this... One question: can I
> check for this view to see if 2PC is supported before issuing the new
> kind of commit? I'm interested in supporting 2PC even for some regular
> transactions to reduce their in-doubt window, but I don't want to issue a
> command at the last moment that may fail (and thereby abort) because the
> backend version I'm connected to doesn't support the new command!

Yes, I suppose that would work. Though you would have to use a query that
wouldn't fail in case the view doesn't exist, otherwise you end up
aborting the transaction anyway. This should work:

SELECT COUNT(*) FROM pg_views
WHERE schemaname = 'pg_catalog' AND viewname = 'pg_prepared_xacts';

If it returns 1, you can do 2PC; if it returns 0, you have to do a regular
commit.

However, if this gets into 7.5, I guess you could just check for the
version of the backend instead with "SELECT version()".
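
A minimal sketch of the version check, written as client-side Python. The
7.5 threshold is only an assumption taken from the hope expressed in this
thread, and the function names are hypothetical; the parsing relies on the
string from "SELECT version()" starting with "PostgreSQL <major>.<minor>".

```python
import re

# Assumption: the thread only hopes that 2PC lands in 7.5; adjust as needed.
ASSUMED_MIN_2PC_VERSION = (7, 5)


def parse_server_version(version_string):
    """Extract (major, minor) from the text returned by SELECT version()."""
    match = re.search(r"PostgreSQL (\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError("unrecognized version string: %r" % version_string)
    return int(match.group(1)), int(match.group(2))


def supports_two_phase_commit(version_string, minimum=ASSUMED_MIN_2PC_VERSION):
    """Return True if the reported server version is at least `minimum`."""
    return parse_server_version(version_string) >= minimum
```

Unlike the pg_views query, this check cannot detect a server that was
built with the patch applied but still reports an older version.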

- Heikki

From:	"Jeroen T(dot) Vermeulen" <jtv(at)xs4all(dot)nl>
To:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc:	PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Two-phase commit
Date:	2004-02-09 20:56:30
Message-ID:	20040209205629.GB13454@xs4all.nl

On Mon, Feb 09, 2004 at 10:09:34PM +0200, Heikki Linnakangas wrote:
>
> However, if this gets into 7.5, I guess you could just check for the
> version of the backend instead with "SELECT version()".

Hey, that works? That's very good news, because I was getting a bit
worried about all the things I want to do in libpqxx that may depend on
the Postgres version...

Thanks!

From:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Two-phase commit
Date:	2004-02-22 02:26:56
Message-ID:	Pine.OSF.4.58.0402220324250.126984@kosh.hut.fi

On Sat, 7 Feb 2004, Bruce Momjian wrote:

> > Please have a look and comment, the patches can be found here:
> > http://www.iki.fi/hlinnaka/pgsql/
> >
> > What is the schedule for 7.5? Any chance of getting this in?
>
> 7.5 is certainly possible. We are months away from beta on 7.5 and I
> would like to see two-phase commit included. One item that has come up
> in past discussions is a way of recording two-phase commit failures to
> the administrator in cases where you precommit, get a reply, commit,
> then the remote machine disappears.

You would resolve this by opening a new session, and checking if the gid
you specified in PREPARE TRANSACTION is still present in the
pg_prepared_xacts view. It could be done manually by the administrator, or
it could be done automatically by an external transaction manager if
there is one.

The XA interface specifies a function called "recover", that gives you a
list of pending transactions. If we some day have an XA implementation,
the recover call would map directly to "SELECT gid FROM
pg_prepared_xacts". The JDBC XA implementation that I'm working on does
that already.
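
The recovery step described above can be sketched in Python as follows. It
assumes the caller has already fetched the gids with "SELECT gid FROM
pg_prepared_xacts" and that the transaction manager keeps its own record
of decisions; both function names are hypothetical. The statements use
the COMMIT PREPARED and ROLLBACK PREPARED spellings of this patch
version, and rolling back any gid without a recorded commit decision is
a policy choice made for this sketch, not something the patch dictates.

```python
def quote_literal(text):
    """Quote a string as an SQL literal by doubling single quotes."""
    return "'" + text.replace("'", "''") + "'"


def plan_recovery(pending_gids, decisions):
    """Build the statements that resolve each still-prepared transaction.

    pending_gids: gids read from pg_prepared_xacts.
    decisions: the transaction manager's own log, gid -> "commit"/"abort".
    """
    statements = []
    for gid in sorted(pending_gids):
        if decisions.get(gid) == "commit":
            statements.append("COMMIT PREPARED " + quote_literal(gid))
        else:
            # No recorded commit decision: assume the transaction is aborted.
            statements.append("ROLLBACK PREPARED " + quote_literal(gid))
    return statements
```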

I have updated my patches, see the URL above. I renamed the commands to
PREPARE TRANSACTION, COMMIT PREPARED and ROLLBACK PREPARED. I think it's
more coherent that way.

I also added documentation entries for the commands, and a basic
regression test.

I went through all the AtCommit_* and AtEOXact* hooks in xact.c to find
any possible problem areas. The following items have not yet been
implemented and throw an error if you try to do 2PC in the same
transaction.

* Notifications (NOTIFY/LISTEN). All pending notifications should be
stored in persistent storage in the prepare phase, and sent in the commit
phase.

* Creation/deletion of relations. I couldn't figure out how the relation
cache invalidation stuff should work with 2PC.

* Modifying GUC variables. I need to study the GUC code more thoroughly
before I can tell what needs to be done.

* Updates to shadow/group files, that is, CREATE USER and friends. Needs
some tricks to delay the writing of pg_pwd/pg_group.

* Large objects. AFAICS, no particular problem here, but I'd like to deal
with them later when the more important stuff is OK.

Plus a couple of minor details:

* Temporary tables. They seem to work somehow, but I haven't tested them
much. I have a feeling that nasty things might happen if you commit the
prepared transaction from another backend etc.

* initdb gives a warning about a missing file. It's harmless, but I
don't see how to detect that you're running under initdb. Also, if you
try to prepare a transaction with a global transaction identifier that's
already in use, you first get a warning and then an error.

I'm going to tackle the above problems later, but I would like to get
this applied to the cvs trunk with the current functionality first, after
discussion of course. The rest are nice to have for the sake of
completeness but probably not necessary for most users.

- Heikki

From:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To:	pgsql-hackers(at)postgresql(dot)org
Cc:	pgsql-patches(at)postgresql(dot)org
Subject:	Re: Two-phase commit
Date:	2004-03-23 16:10:35
Message-ID:	Pine.OSF.4.58.0403231758350.513267@kosh.hut.fi

I have again updated my two-phase commit patches. Only minor
modifications.

I haven't received any comments and there hasn't been any discussion on
the implementation, I suppose that nobody has given it a try. :(

There are still some rough edges, but I think it's good enough as a first
cut. I personally consider it ready to be applied to cvs tip, but then
again I'm still a newbie :). Please take a look!

The patch is also available here: http://www.iki.fi/hlinnaka/pgsql/

- Heikki

Attachment	Content-Type	Size
twophase_20040321.diff.gz	application/x-gzip	23.6 KB

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc:	pgsql-hackers(at)postgresql(dot)org, pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Two-phase commit
Date:	2004-03-23 18:20:49
Message-ID:	200403231820.i2NIKnu26297@candle.pha.pa.us

Agreed. I would like to see two-phase commit in 7.5.

---------------------------------------------------------------------------

Heikki Linnakangas wrote:
> I have again updated my two-phase commit patches. Only minor
> modifications.
>
> I haven't received any comments and there hasn't been any discussion on
> the implementation, I suppose that nobody has given it a try. :(
>
> There are still some rough edges, but I think it's good enough as a first
> cut. I personally consider it ready to be applied to cvs tip, but then
> again I'm still a newbie :). Please take a look!
>
> The patch is also available here: http://www.iki.fi/hlinnaka/pgsql/
>
> - Heikki


From:	Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>
To:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Two-phase commit
Date:	2004-03-23 18:28:43
Message-ID:	20040323182843.GF3863@dcc.uchile.cl

On Tue, Mar 23, 2004 at 06:10:35PM +0200, Heikki Linnakangas wrote:
> I have again updated my two-phase commit patches. Only minor
> modifications.
>
> I haven't received any comments and there hasn't been any discussion on
> the implementation, I suppose that nobody has given it a try. :(

I haven't tried it, but I see it conflicts big time with my
modifications in access/transam/xact.c for subtransactions support.

I am currently writing a proposal for nested transactions which will go
to -hackers, and I will be posting some code to -patches shortly
thereafter which should give you an idea where I am heading. Maybe then
we can have the opinion from the devel community about both things,
whether they should be applied or not.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"El miedo atento y previsor es la madre de la seguridad" (E. Burke)

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Two-phase commit
Date:	2004-10-06 21:46:10
Message-ID:	354.1097099170@sss.pgh.pa.us

Quite some time ago, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> I haven't received any comments and there hasn't been any discussion on
> the implementation, I suppose that nobody has given it a try. :(

I finally got around to taking a close look at this. There's a good bit
undone, as you well know, but it seems like it can be the basis for a
workable feature. I do have a few comments to make.

At the API level, I like the PREPARE/COMMIT/ROLLBACK statements, but I
think you have missed a bet in that it needs to be possible to issue
"COMMIT PREPARED gid" for the same gid several times without error.
Consider a scenario where the transaction monitor crashes during the
commit phase. When it recovers, it will be aware that it had committed
to commit, but it won't know which nodes were successfully committed.
So it will need to resend the COMMIT commands. It would be bad for the
nodes to simply say "yes boss" if they are told to COMMIT a gid they
have no record of. So I think the gid's have to stick around after
COMMIT PREPARED or ROLLBACK PREPARED, and there needs to be a fourth
command (RELEASE PREPARED?) to actually remove the state data when the
transaction monitor is satisfied that everything's done. RELEASE of
an unknown gid is okay to be a no-op.
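
The proposed semantics can be modelled with a small in-memory Python
sketch. Nothing here reflects the patch's code: the class and method
names are hypothetical, and RELEASE PREPARED is only a suggested command
name. It shows that repeating COMMIT PREPARED for a known gid is
harmless, that an unknown gid is an error, and that releasing an unknown
gid is a no-op.

```python
class UnknownGid(Exception):
    """Raised when a node is asked to resolve a gid it has no record of."""


class Node:
    """In-memory model of one database node's prepared-transaction table."""

    def __init__(self):
        self.state = {}  # gid -> "prepared" | "committed" | "aborted"

    def prepare_transaction(self, gid):
        if gid in self.state:
            raise ValueError("gid already in use: " + gid)
        self.state[gid] = "prepared"

    def _resolve(self, gid, outcome):
        current = self.state.get(gid)
        if current is None:
            raise UnknownGid(gid)
        if current not in ("prepared", outcome):
            raise ValueError("gid %s is already %s" % (gid, current))
        # Repeating the same outcome is accepted, so a resend is harmless.
        self.state[gid] = outcome

    def commit_prepared(self, gid):
        self._resolve(gid, "committed")

    def rollback_prepared(self, gid):
        self._resolve(gid, "aborted")

    def release_prepared(self, gid):
        # Releasing an unknown gid is deliberately a no-op.
        self.state.pop(gid, None)
```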

Implementation-wise, I really dislike storing the info in a shared hash
table, because I don't see any reasonable bound on the size of the hash
table (your existing code uses 100 which is about as arbitrary as it
gets). Plus the actual content of each entry is not fixed-size either.
This is not very workable given our fixed-size shared memory mechanism.

The idea that occurs to me instead is to not use WAL or shared memory at
all for keeping the prepared-transaction state info. Instead, suppose
that we store the status information in a file named after the GID,
"$PGDATA/pg_twophase/gid". We could write the file with a CRC similarly
to what's done for pg_control. Once such a file is written and fsync'd,
it's equally as reliable as a WAL record would be, so it seems safe
enough to me to report the PREPARE as done. COMMIT, ROLLBACK, and the
pg_prepared_xacts system view would look into the pg_twophase directory
to find out all about active prepared transactions; RELEASE PREPARED
would simply delete the appropriate file. (Note: commit or rollback
would need to take the transaction XID from the GID file and then look
in pg_clog to find out if the transaction were already committed. These
operations do not change the pg_twophase file, but they do write a
normal transaction-commit or -abort WAL record and update pg_clog.)
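
A minimal Python sketch of the state-file idea, assuming a trailing
CRC-32 over the payload and an fsync before the PREPARE is reported as
done. The record layout and function names are hypothetical and are not
the pg_control format; a real implementation would probably also need to
fsync the containing directory.

```python
import os
import struct
import zlib


def write_state_file(directory, file_name, payload):
    """Write payload plus a trailing CRC-32, then force it to disk."""
    path = os.path.join(directory, file_name)
    checksum = zlib.crc32(payload) & 0xFFFFFFFF
    # "x" mode fails if the file exists, i.e. if the gid is already in use.
    with open(path, "xb") as handle:
        handle.write(payload + struct.pack("<I", checksum))
        handle.flush()
        os.fsync(handle.fileno())
    return path


def read_state_file(path):
    """Return the payload, or raise ValueError if the CRC does not match."""
    with open(path, "rb") as handle:
        record = handle.read()
    if len(record) < 4:
        raise ValueError("state file too short: " + path)
    payload = record[:-4]
    (stored,) = struct.unpack("<I", record[-4:])
    if zlib.crc32(payload) & 0xFFFFFFFF != stored:
        raise ValueError("CRC mismatch in " + path)
    return payload
```

In this model, RELEASE PREPARED would correspond to removing the file
with os.remove().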

I think this would offer better performance as well as being more
scalable, because the implementation you have looks like it would have
some contention for the shared GID hashtable.

I would be inclined to require GIDs to be numbers (probably int8's)
instead of strings, so that we don't have any problems with funny
characters in the file names. That's negotiable though, as we could
certainly uuencode the strings or something to avoid that trap.

You were concerned about how to mark prepared transactions in pg_clog,
given that Alvaro had already commandeered state '11' for
subtransactions. Since only a toplevel transaction can be prepared,
it might work to allow state '11' with a zero pg_subtrans parent link
to mean a prepared transaction. This would imply factoring prepared
XIDs into GlobalXmin (so that pg_subtrans entries don't get recycled
too soon) but we probably have to do that anyway. AFAICS, prepared
but uncommitted XIDs have to be considered still InProgress, so if
they are less than GlobalXmin we'd lose.

regards, tom lane

From:	Oliver Jowett <oliver(at)opencloud(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Two-phase commit
Date:	2004-10-06 22:36:47
Message-ID:	4164737F.90908@opencloud.com

Tom Lane wrote:

> At the API level, I like the PREPARE/COMMIT/ROLLBACK statements, but I
> think you have missed a bet in that it needs to be possible to issue
> "COMMIT PREPARED gid" for the same gid several times without error.
> Consider a scenario where the transaction monitor crashes during the
> commit phase. When it recovers, it will be aware that it had committed
> to commit, but it won't know which nodes were successfully committed.
> So it will need to resend the COMMIT commands. It would be bad for the
> nodes to simply say "yes boss" if they are told to COMMIT a gid they
> have no record of. So I think the gid's have to stick around after
> COMMIT PREPARED or ROLLBACK PREPARED, and there needs to be a fourth
> command (RELEASE PREPARED?) to actually remove the state data when the
> transaction monitor is satisfied that everything's done. RELEASE of
> an unknown gid is okay to be a no-op.

Isn't this usually where the GTM would issue "recover" requests to
determine the state of the individual resources involved in the global
transaction, and then only commit/abort the resources that need it? (I
think the equivalent in Heikki's work is a SELECT of the
pg_prepared_xacts view)

I found the Berkeley DB distributed transaction docs quite useful for
working out how two-phase commit fits together:

http://pybsddb.sourceforge.net/ref/xa/intro.html

> I would be inclined to require GIDs to be numbers (probably int8's)
> instead of strings, so that we don't have any problems with funny
> characters in the file names. That's negotiable though, as we could
> certainly uuencode the strings or something to avoid that trap.

Aren't the GIDs generated externally by the GTM? We need more than an
int8 there. See for example Heikki's JDBC driver patch: it is given a
javax.transaction.xa.Xid by the TM in prepare/commit/etc. The Xid is
basically just a couple of raw bytearrays. The driver base64-encodes
that into a string GID to give to the backend.
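
As an illustration only, here is one way such an encoding could look in
Python. The actual format used by the JDBC patch is not shown in this
thread, so the layout below (format id, then the two byte arrays in
URL-safe base64, joined by dots) and both function names are
assumptions. The URL-safe alphabet avoids "/" and "+", which also keeps
the result usable as a file name.

```python
import base64


def xid_to_gid(format_id, global_transaction_id, branch_qualifier):
    """Encode an XA-style Xid (int plus two byte strings) as a text gid."""
    return ".".join([
        str(format_id),
        base64.urlsafe_b64encode(global_transaction_id).decode("ascii"),
        base64.urlsafe_b64encode(branch_qualifier).decode("ascii"),
    ])


def gid_to_xid(gid):
    """Reverse xid_to_gid; "." never occurs inside the encoded parts."""
    format_id, gtrid, bqual = gid.split(".")
    return (
        int(format_id),
        base64.urlsafe_b64decode(gtrid),
        base64.urlsafe_b64decode(bqual),
    )
```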

-O

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Oliver Jowett <oliver(at)opencloud(dot)com>
Cc:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Two-phase commit
Date:	2004-10-06 22:50:42
Message-ID:	1121.1097103042@sss.pgh.pa.us

Oliver Jowett <oliver(at)opencloud(dot)com> writes:
>> At the API level, I like the PREPARE/COMMIT/ROLLBACK statements, but I
>> think you have missed a bet in that it needs to be possible to issue
>> "COMMIT PREPARED gid" for the same gid several times without error.

> Isn't this usually where the GTM would issue "recover" requests to
> determine the state of the individual resources involved in the global
> transaction, and then only commit/abort the resources that need it? (I
> think the equivalent in Heikki's work is a SELECT of the
> pg_prepared_xacts view)

Well, the question is how long must the individual databases retain
state with which to answer "recover" requests. I don't like "forever",
so I'm proposing that there should be an explicit command to say "you
can forget about this gid".

Note that this is only really necessary because of Heikki's choice to
make the API work in terms of a user-assigned GID. If PREPARE returned
the internal XID and then the COMMIT/ROLLBACK PREPARED statements took
the XID and not a GID, we could answer subsequent "recover" requests for
quite a long time by consulting pg_clog. But then the transaction
monitor would have to maintain the map from its GIDs to per-node XIDs,
so unless it's going to have such functionality anyway, it's reasonable
to ask the database to keep the map. (Either way, you're not going to
want to drop the mapping entry until the transaction monitor knows all
the commits are done.)

regards, tom lane

From:	Oliver Jowett <oliver(at)opencloud(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Two-phase commit
Date:	2004-10-06 22:59:54
Message-ID:	416478EA.9080202@opencloud.com

Tom Lane wrote:
> Oliver Jowett <oliver(at)opencloud(dot)com> writes:
>
>>>At the API level, I like the PREPARE/COMMIT/ROLLBACK statements, but I
>>>think you have missed a bet in that it needs to be possible to issue
>>>"COMMIT PREPARED gid" for the same gid several times without error.
>
>
>>Isn't this usually where the GTM would issue "recover" requests to
>>determine the state of the individual resources involved in the global
>>transaction, and then only commit/abort the resources that need it? (I
>>think the equivalent in Heikki's work is a SELECT of the
>>pg_prepared_xacts view)
>
>
> Well, the question is how long must the individual databases retain
> state with which to answer "recover" requests. I don't like "forever",
> so I'm proposing that there should be an explicit command to say "you
> can forget about this gid".

As I understand it, you don't need to keep state for committed txns,
it's only the prepared-but-not-yet-resolved txns that you have to
respond to a recover request with.

Then it seems like we already have a "forget about this GID" command for
prepared transactions: ROLLBACK PREPARED.

Probably the next question is, do we want a database-side timeout on how
long prepared txns can stay alive before being summarily rolled back?

-O

From:	Rod Taylor <pg(at)rbt(dot)ca>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Oliver Jowett <oliver(at)opencloud(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Two-phase commit
Date:	2004-10-06 23:06:52
Message-ID:	1097104012.31575.132.camel@home

On Wed, 2004-10-06 at 18:50, Tom Lane wrote:
> Oliver Jowett <oliver(at)opencloud(dot)com> writes:
> >> At the API level, I like the PREPARE/COMMIT/ROLLBACK statements, but I
> >> think you have missed a bet in that it needs to be possible to issue
> >> "COMMIT PREPARED gid" for the same gid several times without error.
>
> > Isn't this usually where the GTM would issue "recover" requests to
> > determine the state of the individual resources involved in the global
> > transaction, and then only commit/abort the resources that need it? (I
> > think the equivalent in Heikki's work is a SELECT of the
> > pg_prepared_xacts view)
>
> Well, the question is how long must the individual databases retain
> state with which to answer "recover" requests. I don't like "forever",
> so I'm proposing that there should be an explicit command to say "you
> can forget about this gid".

Isn't this exactly what the "forget" request is for in the
XACoordinator? I think it's standard for Java at the very least.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Oliver Jowett <oliver(at)opencloud(dot)com>
Cc:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Two-phase commit
Date:	2004-10-06 23:07:37
Message-ID:	1299.1097104057@sss.pgh.pa.us

Oliver Jowett <oliver(at)opencloud(dot)com> writes:
> Tom Lane wrote:
>> Well, the question is how long must the individual databases retain
>> state with which to answer "recover" requests.

> As I understand it, you don't need to keep state for committed txns,

I think that's clearly wrong:

TM --> DB: COMMIT PREPARED foo

DB does it and forgets gid foo

TM crashes and restarts

TM --> DB: what's the state of foo?

DB --> TM: go away, never heard of it

I suppose you could code the TM to treat this as meaning "it was
committed" but I think the folly of that is obvious.

> Probably the next question is, do we want a database-side timeout on how
> long prepared txns can stay alive before being summarily rolled back?

Yeah, there's another set of issues there. Personally I always thought
that 2PC was a fundamentally broken concept, because it's got so many
squirrelly cases where the guarantees you thought you were buying with
all this overhead vanish into thin air.

regards, tom lane

From:	Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Two-phase commit
Date:	2004-10-06 23:33:11
Message-ID:	20041006233311.GA8221@dcc.uchile.cl

On Wed, Oct 06, 2004 at 05:46:10PM -0400, Tom Lane wrote:

> You were concerned about how to mark prepared transactions in pg_clog,
> given that Alvaro had already commandeered state '11' for
> subtransactions. Since only a toplevel transaction can be prepared,
> it might work to allow state '11' with a zero pg_subtrans parent link
> to mean a prepared transaction. This would imply factoring prepared
> XIDs into GlobalXmin (so that pg_subtrans entries don't get recycled
> too soon) but we probably have to do that anyway. AFAICS, prepared
> but uncommitted XIDs have to be considered still InProgress, so if
> they are less than GlobalXmin we'd lose.

This seems to work.

I am concerned with a different issue: what issues arise regarding
snapshots? Do concurrent xacts see a prepared one as running? I'm not
sure but I think so. So they have to be able to at least get its Xid,
no?

As soon as you have that stored somewhere, you have to ensure that an
arbitrary number of Xids, or better, snapshots, have to be somewhere.
The "100" concept does not impress me either. So if you can have an
arbitrary number of snapshots, you can as well have an arbitrary number
of WITH HOLD open cursors, without the ugly Materialize node.

Am I right?

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)

From:	Oliver Jowett <oliver(at)opencloud(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Two-phase commit
Date:	2004-10-06 23:38:39
Message-ID:	416481FF.4080600@opencloud.com

Tom Lane wrote:
> Oliver Jowett <oliver(at)opencloud(dot)com> writes:
>
>>Tom Lane wrote:
>>
>>>Well, the question is how long must the individual databases retain
>>>state with which to answer "recover" requests.
>
>>As I understand it, you don't need to keep state for committed txns,
>
> I think that's clearly wrong:
>
> TM --> DB: COMMIT PREPARED foo
>
> DB does it and forgets gid foo
>
> TM crashes and restarts
>
> TM --> DB: what's the state of foo?
>
> DB --> TM: go away, never heard of it
>
> I suppose you could code the TM to treat this as meaning "it was
> committed"

I believe that is exactly what the TM does. Can you take a look at
http://pybsddb.sourceforge.net/ref/xa/build.html (it's fairly brief) and
point out any holes in the logic there?
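
The interpretation under discussion can be sketched in Python as below.
It assumes the TM logged its decision durably before starting the second
phase and that a node never drops a prepared transaction on its own; if
either assumption fails, "not listed" no longer implies "already
resolved". The function name and the data shapes are hypothetical.

```python
def actions_after_tm_restart(logged_decision, gid, prepared_by_node):
    """Decide what to send to each node after the TM restarts.

    logged_decision: "commit" or "abort", read from the TM's own log.
    prepared_by_node: node name -> set of gids that node still reports
        as prepared (its answer to a "recover" request).
    Returns node name -> command to send, or None if nothing is needed.
    """
    if logged_decision == "commit":
        command = "COMMIT PREPARED"
    else:
        command = "ROLLBACK PREPARED"
    actions = {}
    for node, prepared in prepared_by_node.items():
        # A node that no longer lists the gid is presumed to have
        # carried out the logged decision already.
        actions[node] = command if gid in prepared else None
    return actions
```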

> but I think the folly of that is obvious.

It's fragile in the face of badly implemented TMs or resources, but
other than that it seems OK to me. What was the problem you were
thinking of in particular?

> Yeah, there's another set of issues there. Personally I always thought
> that 2PC was a fundamentally broken concept, because it's got so many
> squirrelly cases where the guarantees you thought you were buying with
> all this overhead vanish into thin air.

You can get around some of the issues via 3PC but that has even *more*
overhead. And no one actually implements it as far as I know :)

2PC is OK if you are only expecting it to handle certain failure modes,
e.g. no byzantine failures, no loss of stable storage, no extended
communication failures.

In general, the two-army problem just sucks..

-O

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>
Cc:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Two-phase commit
Date:	2004-10-06 23:46:53
Message-ID:	1660.1097106413@sss.pgh.pa.us

Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl> writes:
> I am concerned with a different issue: what issues arise regarding
> snapshots? Do concurrent xacts see a prepared one as running? I'm not
> sure but I think so. So they have to be able to at least get its Xid,
> no?

Hmm, that's a good point. It seems that uncommitted prepared XIDs
have to be included whenever a Snapshot is manufactured. That means
we need reasonably fast access to that set of XIDs, which is something
I was thinking we wouldn't need to support. It's hard to see how to
handle that without some sort of shared-memory data structure listing
those XIDs.
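
A toy Python model of that requirement, assuming a snapshot is just an
xmin, an xmax and a set of in-progress XIDs, and ignoring XID
wraparound. The function names and the dictionary layout are
hypothetical and much simpler than the backend's real snapshot code.

```python
def take_snapshot(running_xids, prepared_xids, next_xid):
    """Build a snapshot that treats prepared XIDs as still in progress."""
    in_progress = set(running_xids) | set(prepared_xids)
    xmin = min(in_progress) if in_progress else next_xid
    return {"xmin": xmin, "xmax": next_xid, "xip": frozenset(in_progress)}


def considered_in_progress(snapshot, xid):
    """True if the snapshot must not yet see the effects of `xid`."""
    return xid >= snapshot["xmax"] or xid in snapshot["xip"]
```

Because the prepared XID takes part in the minimum, an old prepared
transaction also holds back xmin, which is the GlobalXmin effect
mentioned earlier in the thread.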

(This also blows out of the water the present
preallocated-space-for-the-XID-lists memory allocation in GetSnapshot,
but that was never more than the most marginal hack anyway.)

> As soon as you have that stored somewhere, you have to ensure that an
> arbitrary number of Xids, or better, snapshots, have to be somewhere.
> The "100" concept does not impress me either. So if you can have an
> arbitrary number of snapshots, you can as well have an arbitrary number
> of WITH HOLD open cursors, without the ugly Materialize node.
> Am I right?

Nope. Snapshots are not the reason we have to materialize WITH HOLD
cursors. Locks are. I suppose you could think about treating a WITH
HOLD cursor like an uncommitted prepared transaction, but I'm not sure
it's worth the overhead.

regards, tom lane

From:	Oliver Jowett <oliver(at)opencloud(dot)com>
To:	Rod Taylor <pg(at)rbt(dot)ca>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Two-phase commit
Date:	2004-10-07 00:01:08
Message-ID:	41648744.5090709@opencloud.com

Rod Taylor wrote:
> On Wed, 2004-10-06 at 18:50, Tom Lane wrote:
>
>>Well, the question is how long must the individual databases retain
>>state with which to answer "recover" requests. I don't like "forever",
>>so I'm proposing that there should be an explicit command to say "you
>>can forget about this gid".
>
>
> Isn't this exactly what the "forget" request is for in the
> XACoordinator? I think it's standard for Java at the very least.

I think XAResource.forget() is to do with transactions that have
heuristically completed (completion of a txn without explicit directions
from the TM?), rather than the "normal" case.

-O

From:	Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Two-phase commit
Date:	2004-10-07 06:38:16
Message-ID:	Pine.LNX.4.61.0410070825590.17826@sablons.cri.ensmp.fr
Lists:	pgsql-hackers pgsql-patches

> Implementation-wise, I really dislike storing the info in a shared hash
> table, because I don't see any reasonable bound on the size of the hash
> table (your existing code uses 100 which is about as arbitrary as it
> gets). [...]
>
> The idea that occurs to me instead is to not use WAL or shared memory at
> all for keeping the prepared-transaction state info. Instead, suppose
> that we store the status information in a file named after the GID,
> "$PGDATA/pg_twophase/gid". [...]

Sorry for this stupid general comment, but why couldn't the gid be stored
in some shared system table that would rely on pg infrastructure for
caching, sharing, locking and so on? Moreover, that would allow the
administrator to have a look at them quite simply. Is this just a
performance issue?

--
Fabien.

From:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Two-phase commit
Date:	2004-10-07 11:15:44
Message-ID:	Pine.OSF.4.61.0410071357420.432862@kosh.hut.fi
Lists:	pgsql-hackers pgsql-patches

On Wed, 6 Oct 2004, Tom Lane wrote:

> Quite some time ago, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>> I haven't received any comments and there hasn't been any discussion on
>> the implementation, I suppose that nobody has given it a try. :(
>
> I finally got around to taking a close look at this. There's a good bit
> undone, as you well know, but it seems like it can be the basis for a
> workable feature. I do have a few comments to make.

Great!

Hmm. I don't see a problem with the "yes boss" approach. Some kind of a
warning is appropriate, of course, but I don't see a reason for an
additional step. After all, you would still fall back to the "yes boss"
approach on the RELEASE PREPARED command.

The transaction monitor knows if the 1st phase succeeded or not, so if the
COMMIT PREPARED doesn't find the transaction anymore, the monitor knows
that its previous commit/rollback succeeded.

> Implementation-wise, I really dislike storing the info in a shared hash
> table, because I don't see any reasonable bound on the size of the hash
> table (your existing code uses 100 which is about as arbitrary as it
> gets). Plus the actual content of each entry is not fixed-size either.
> This is not very workable given our fixed-size shared memory mechanism.

I fully agree, I'm very dissatisfied with that part.

> The idea that occurs to me instead is to not use WAL or shared memory at
> all for keeping the prepared-transaction state info. Instead, suppose
> that we store the status information in a file named after the GID,
> "$PGDATA/pg_twophase/gid". We could write the file with a CRC similarly
> to what's done for pg_control. Once such a file is written and fsync'd,
> it's equally as reliable as a WAL record would be, so it seems safe
> enough to me to report the PREPARE as done. COMMIT, ROLLBACK, and the
> pg_prepared_xacts system view would look into the pg_twophase directory
> to find out all about active prepared transactions; RELEASE PREPARED
> would simply delete the appropriate file. (Note: commit or rollback
> would need to take the transaction XID from the GID file and then look
> in pg_clog to find out if the transaction were already committed. These
> operations do not change the pg_twophase file, but they do write a
> normal transaction-commit or -abort WAL record and update pg_clog.)

That sounds like a clever idea! I thought about using a single file
myself, but the multi-file approach is much simpler.
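A rough sketch of the file-per-GID idea, under stated assumptions: the function names are hypothetical, the payload layout is left opaque, the CRC is a plain CRC-32 appended to the payload, and the GID is hex-encoded to get a safe file name (an assumption made here so that arbitrary GID strings work; the proposal itself just names the file after the GID).

```python
import os
import struct
import zlib


def _state_path(directory, gid):
    # Assumption: hex-encode the GID so any string is a valid file name.
    return os.path.join(directory, gid.encode("utf-8").hex())


def write_state_file(directory, gid, payload):
    """Write payload plus a CRC-32 trailer and fsync before reporting success."""
    path = _state_path(directory, gid)
    record = payload + struct.pack("<I", zlib.crc32(payload))
    with open(path, "xb") as f:   # "x": fail if this GID is already prepared
        f.write(record)
        f.flush()
        os.fsync(f.fileno())      # PREPARE is only reported done after this
    return path


def read_state_file(directory, gid):
    """Return the payload, or raise ValueError if the CRC does not match."""
    with open(_state_path(directory, gid), "rb") as f:
        record = f.read()
    if len(record) < 4:
        raise ValueError("state file too short")
    payload, (stored_crc,) = record[:-4], struct.unpack("<I", record[-4:])
    if zlib.crc32(payload) != stored_crc:
        raise ValueError("state file failed CRC check")
    return payload


def release_state_file(directory, gid):
    """Forget the GID by deleting its state file."""
    os.remove(_state_path(directory, gid))
```

The sketch does not fsync the containing directory, which a durable implementation would probably also need, and it says nothing about how the files interact with WAL replay.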

> I think this would offer better performance as well as being more
> scalable, because the implementation you have looks like it would have
> some contention for the shared GID hashtable.

I guess the performance would depend a lot on how good/bad the filesystem
is at creating and deleting a lot of small files.

I'm afraid we have to support arbitrary strings. I think at least the Java
Transaction API requires that, I'm not sure though if that could be
worked around in the JDBC driver.

Yes, they must be considered InProgress. The snapshot code needs to be
modified to handle an arbitrary number of in progress transactions.

I've been wondering whether it would be useful to have the COMMIT
PREPARED/ROLLBACK PREPARED commands under transaction control themselves.
You could for example do "BEGIN; COMMIT PREPARED mygid; COMMIT PREPARED
mygid2; COMMIT;" to atomically commit two already-prepared transactions,
and even chain the 2PC transactions like "BEGIN; COMMIT PREPARED mygid;
PREPARE TRANSACTION mygid2". It seems feasible to implement, just postpone
the actual 2nd phase commit to the end of the commit of the enclosing
transaction.
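As a toy illustration of the postponement described above, assuming a purely in-memory model: the `Session` class and its methods are hypothetical, and nothing here models crashes between the individual second-phase commits. Inside an enclosing transaction, COMMIT PREPARED only queues the GID, and the queued GIDs are finished when the enclosing transaction commits.

```python
class Session:
    """In-memory model of COMMIT PREPARED deferred to the enclosing COMMIT."""

    def __init__(self, prepared):
        self.prepared = prepared   # set of GIDs currently in prepared state
        self.committed = set()     # GIDs whose second phase has completed
        self.pending = None        # queue of GIDs, or None outside a transaction

    def begin(self):
        self.pending = []

    def commit_prepared(self, gid):
        if gid not in self.prepared:
            raise KeyError(gid)
        if self.pending is None:
            self._finish([gid])        # no enclosing transaction: act at once
        else:
            self.pending.append(gid)   # postpone until the enclosing COMMIT

    def commit(self):
        queued = self.pending or []
        self.pending = None
        self._finish(queued)

    def rollback(self):
        # Dropping the queue leaves the transactions in prepared state.
        self.pending = None

    def _finish(self, gids):
        for gid in gids:
            self.prepared.discard(gid)
            self.committed.add(gid)
```

Rolling back the enclosing transaction in this model simply discards the queue, so both GIDs would still show up as prepared afterwards.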

- Heikki

From:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To:	Oliver Jowett <oliver(at)opencloud(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Two-phase commit
Date:	2004-10-07 11:24:01
Message-ID:	Pine.OSF.4.61.0410071417260.432862@kosh.hut.fi
Lists:	pgsql-hackers pgsql-patches

On Thu, 7 Oct 2004, Oliver Jowett wrote:

> Probably the next question is, do we want a database-side timeout on how long
> prepared txns can stay alive before being summarily rolled back?

That sounds very dangerous to me. You could end up breaking global
atomicity if some other resource in the global transaction committed.

The transaction monitor can do timeouts if necessary, and a superuser has
to resolve the in-doubt transactions if the TM crashes non-recoverably.

- Heikki

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Two-phase commit
Date:	2004-10-07 14:33:51
Message-ID:	28489.1097159631@sss.pgh.pa.us
Lists:	pgsql-hackers pgsql-patches

Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> writes:
> Sorry for this stupid general comment, but why couldn't the gid be stored
> in some shared system table that would rely on pg infrastructure for
> caching, sharing, locking and so on?

That would have a number of pitfalls of its own:

* No outside-a-transaction access is possible. This may or may not be
essential, given Heikki's speculations elsewhere about allowing
COMMIT/ROLLBACK PREPARED to be inside transactions, but I think we'd be
foolish to rule it out in a mechanism that is itself transactional
infrastructure.

* We don't have a datatype to represent held locks, nor one for files
slated for deletion. This is fixable in itself, but more work. And do
we really want to commit to developing a datatype for every little bit
of state that may end up being associated with a GID?

* Lots and lots of short-lived entries is not the optimal performance
case for Postgres' tables. It should work well enough in a filesystem
directory though.

regards, tom lane

From:	Oliver Jowett <oliver(at)opencloud(dot)com>
To:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Two-phase commit
Date:	2004-10-08 06:01:18
Message-ID:	41662D2E.8010104@opencloud.com
Lists:	pgsql-hackers pgsql-patches

Heikki Linnakangas wrote:
> On Thu, 7 Oct 2004, Oliver Jowett wrote:
>
>> Probably the next question is, do we want a database-side timeout on
>> how long prepared txns can stay alive before being summarily rolled back?
>
>
> That sounds very dangerous to me. You could end up breaking global
> atomicity if some other resource in the global transaction committed.

Right. You wouldn't enable it lightly..

> The transaction monitor can do timeouts if necessary, and a superuser
> has to resolve the in-doubt transactions if the TM crashes non-recoverably.

Some systems may prefer short-term availability over atomicity. Putting
a human in the loop when doing recovery hurts your availability.

If pg_prepared_xacts had a time-of-preparation column, it would be
possible to put the timeout policy in an external client. Perhaps that's
a better solution?
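A minimal sketch of such an external timeout policy, assuming the client has already fetched (gid, time-of-preparation) pairs from the view. The function names are hypothetical, the quoting is deliberately naive, and a real client would use its driver's literal quoting and would have to accept the atomicity risk discussed above.

```python
from datetime import datetime, timedelta, timezone


def stale_gids(rows, now, max_age):
    """Return the GIDs whose preparation time is older than max_age."""
    return [gid for gid, prepared_at in rows if now - prepared_at > max_age]


def rollback_statement(gid):
    """Build a ROLLBACK PREPARED command for one GID (naive quote doubling)."""
    quoted = gid.replace("'", "''")
    return f"ROLLBACK PREPARED '{quoted}'"


if __name__ == "__main__":
    now = datetime(2004, 10, 8, 12, 0, tzinfo=timezone.utc)
    rows = [
        ("old-gid", now - timedelta(hours=3)),
        ("fresh-gid", now - timedelta(minutes=5)),
    ]
    for gid in stale_gids(rows, now, timedelta(hours=1)):
        print(rollback_statement(gid))
```

Keeping the policy in a client like this leaves the decision about when to give up on a transaction outside the server.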

-O

From:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Two-phase commit patch updated
Date:	2004-10-09 11:48:09
Message-ID:	Pine.OSF.4.61.0410091429320.339489@kosh.hut.fi
Lists:	pgsql-hackers pgsql-patches

I brought the 2PC patch up to date:

http://www.hut.fi/~hlinnaka/pgsql/

There's no new functionality, I just fixed all the bit rot so that it
applies to the current CVS tip.

- Heikki

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Two-phase commit
Date:	2005-06-05 00:45:57
Message-ID:	200506050045.j550jvi02537@candle.pha.pa.us
Lists:	pgsql-hackers pgsql-patches

I have added this thread as a link on the TODO list under TODO.detail.

I know folks are working on this for 8.1 but now they can get the
discussions all in one place.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Oliver Jowett <oliver(at)opencloud(dot)com>
Cc:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Two-phase commit
Date:	2005-06-18 16:13:22
Message-ID:	18565.1119111202@sss.pgh.pa.us
Lists:	pgsql-hackers pgsql-patches

While cleaning out old mail about two-phase commit, I noticed this
thought from Oliver:

Oliver Jowett <oliver(at)opencloud(dot)com> writes:
>>> Probably the next question is, do we want a database-side timeout on
>>> how long prepared txns can stay alive before being summarily rolled back?
>>
>> That sounds very dangerous to me. You could end up breaking global
>> atomicity if some other resource in the global transaction committed.

> Right. You wouldn't enable it lightly..

> If pg_prepared_xacts had a time-of-preparation column, it would be
> possible to put the timeout policy in an external client. Perhaps that's
> a better solution?

This seems like a good idea to me in any case --- barring objections,
I will add this to the data structures and view.

regards, tom lane

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Oliver Jowett <oliver(at)opencloud(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Two-phase commit
Date:	2005-06-18 18:08:55
Message-ID:	200506181808.j5II8tZ13686@candle.pha.pa.us
Lists:	pgsql-hackers pgsql-patches

Tom Lane wrote:
> While cleaning out old mail about two-phase commit, I noticed this
> thought from Oliver:
>
> Oliver Jowett <oliver(at)opencloud(dot)com> writes:
> >>> Probably the next question is, do we want a database-side timeout on
> >>> how long prepared txns can stay alive before being summarily rolled back?
> >>
> >> That sounds very dangerous to me. You could end up breaking global
> >> atomicity if some other resource in the global transaction committed.
>
> > Right. You wouldn't enable it lightly..
>
> > If pg_prepared_xacts had a time-of-preparation column, it would be
> > possible to put the timeout policy in an external client. Perhaps that's
> > a better solution?
>
> This seems like a good idea to me in any case --- barring objections,
> I will add this to the data structures and view.

I am a little confused by the use of the term "prepared" in terms of
2-phase commit vs. prepared queries. Is there a way to make the wording
clearer?