Re: Big 7.4 items

Lists: pgsql-hackers
From: <darren(at)up(dot)hrcoxmail(dot)com>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Jan Wieck <JanWieck(at)Yahoo(dot)com>, shridhar_daithankar(at)persistent(dot)co(dot)in, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Big 7.4 items
Date: 2002-12-13 21:33:34
Message-ID: 20021213213334.USFN25316.lakecmmtao01.coxmail.com@lakecmmtab01

>
>
> Darren, can you clarify this? Why does it send that message? How does
> it allow commits not to wait for ordered writesets?
>

There are two channels. One for total order writesets
(changes to the DB). The other is simple order for
aborts, commits, joins (systems joining the replica), etc.
The simple channel is necessary, because we don't want to
wait for total ordered changes to get an abort message and
so forth. In some cases you might get an abort or a commit
message before you get the writeset it refers to.

Let's say we have systems A, B and C. Each one has some
changes and sends a writeset to the group communication
system (GCS). The total order dictates WS(A), WS(B), and
WS(C), and the writesets are received in that order at
each system. Now C gets WS(A) no conflict, gets WS(B) no
conflict, and receives WS(C). Now C can commit WS(C) even
before the commit messages C(A) or C(B), because there is no
conflict.
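
The two-channel behaviour described above can be sketched roughly as follows. This is a minimal illustration in Python, not Postgres-R code: the `Node` class, its method names, and the rule that a conflict means overlapping keys with an earlier, still-undecided writeset are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical replica that listens on both channels."""
    name: str
    pending: dict = field(default_factory=dict)   # txn -> keys, delivered but undecided
    early: dict = field(default_factory=dict)     # txn -> decision seen before its writeset
    committed: list = field(default_factory=list)
    aborted: list = field(default_factory=list)

    def on_writeset(self, txn, origin, keys):
        """Total-order channel: all nodes receive writesets in the same order."""
        keys = set(keys)
        # Assumed conflict rule: overlap with an earlier, still-undecided writeset.
        conflict = any(keys & earlier for earlier in self.pending.values())
        self.pending[txn] = keys
        if txn in self.early:
            # The commit/abort message overtook the writeset it refers to.
            self._finish(txn, self.early.pop(txn))
            return None
        if origin == self.name:
            # Our own writeset: decide now, without waiting for other nodes' commits.
            decision = "abort" if conflict else "commit"
            self._finish(txn, decision)
            return decision  # would be broadcast on the simple channel
        return None

    def on_decision(self, txn, decision):
        """Simple channel: commits and aborts, not totally ordered with writesets."""
        if txn in self.pending:
            self._finish(txn, decision)
        else:
            self.early[txn] = decision

    def _finish(self, txn, decision):
        del self.pending[txn]
        (self.committed if decision == "commit" else self.aborted).append(txn)


c = Node("C")
c.on_writeset("WS(A)", "A", {"x"})
c.on_writeset("WS(B)", "B", {"y"})
print(c.on_writeset("WS(C)", "C", {"z"}))  # C decides on WS(C) before C(A) or C(B) arrive
```

The `early` dictionary covers the case where an abort or commit message arrives before the writeset it refers to.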

Hope that helps,

Darren


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: darren(at)up(dot)hrcoxmail(dot)com
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, shridhar_daithankar(at)persistent(dot)co(dot)in, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Big 7.4 items
Date: 2002-12-13 21:53:52
Message-ID: 3DFA56F0.B74DEACC@Yahoo.com

darren(at)up(dot)hrcoxmail(dot)com wrote:
>
> >
> >
> > Darren, can you clarify this? Why does it send that message? How does
> > it allow commits not to wait for ordered writesets?
> >
>
> There are two channels. One for total order writesets
> (changes to the DB). The other is simple order for
> aborts, commits, joins (systems joining the replica), etc.
> The simple channel is necessary, because we don't want to
> wait for total ordered changes to get an abort message and
> so forth. In some cases you might get an abort or a commit
> message before you get the writeset it refers to.
>
> Let's say we have systems A, B and C. Each one has some
> changes and sends a writeset to the group communication
> system (GCS). The total order dictates WS(A), WS(B), and
> WS(C), and the writesets are received in that order at
> each system. Now C gets WS(A) no conflict, gets WS(B) no
> conflict, and receives WS(C). Now C can commit WS(C) even
> before the commit messages C(A) or C(B), because there is no
> conflict.

And that is IMHO not synchronous. C does not have to wait for A and B to
finish the same tasks. If now at this very moment two new transactions
query system A and system C (assuming A has not yet committed WS(C)
while C has), they will get different data back (thanks to non-blocking
reads). I think this is pretty asynchronous.

It doesn't lead to inconsistencies, because the transaction on A cannot
do something that is in conflict with the changes made by WS(C), since
its WS(A)2 will come back after WS(C) arrived at A, and thus WS(C)
arriving at A will cause WS(A)2 to roll back (WS is used synonymously with
Xact in this context).

>
> Hope that helps,
>
> Darren

Hope this doesn't add too much confusion :-)

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: darren(at)up(dot)hrcoxmail(dot)com
Cc: Jan Wieck <JanWieck(at)Yahoo(dot)com>, shridhar_daithankar(at)persistent(dot)co(dot)in, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Big 7.4 items
Date: 2002-12-13 22:11:22
Message-ID: 200212132211.gBDMBMU14075@candle.pha.pa.us

darren(at)up(dot)hrcoxmail(dot)com wrote:
> >
> >
> > Darren, can you clarify this? Why does it send that message? How does
> > it allow commits not to wait for ordered writesets?
> >
>
> There are two channels. One for total order writesets
> (changes to the DB). The other is simple order for
> aborts, commits, joins (systems joining the replica), etc.
> The simple channel is necessary, because we don't want to
> wait for total ordered changes to get an abort message and
> so forth. In some cases you might get an abort or a commit
> message before you get the writeset it refers to.
>
> Let's say we have systems A, B and C. Each one has some
> changes and sends a writeset to the group communication
> system (GCS). The total order dictates WS(A), WS(B), and
> WS(C), and the writesets are received in that order at
> each system. Now C gets WS(A) no conflict, gets WS(B) no
> conflict, and receives WS(C). Now C can commit WS(C) even
> before the commit messages C(A) or C(B), because there is no
> conflict.

Oh, so C doesn't apply A's changes until it sees A's commit, but it can
continue with its own changes because there is no conflict?

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Darren Johnson <darren(at)up(dot)hrcoxmail(dot)com>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, shridhar_daithankar(at)persistent(dot)co(dot)in, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Big 7.4 items
Date: 2002-12-14 01:28:24
Message-ID: 3DFA8938.3020601@up.hrcoxmail.com

>
>
>>
>>Let's say we have systems A, B and C. Each one has some
>>changes and sends a writeset to the group communication
>>system (GCS). The total order dictates WS(A), WS(B), and
>>WS(C), and the writesets are received in that order at
>>each system. Now C gets WS(A) no conflict, gets WS(B) no
>>conflict, and receives WS(C). Now C can commit WS(C) even
>>before the commit messages C(A) or C(B), because there is no
>>conflict.
>>
>
>And that is IMHO not synchronous. C does not have to wait for A and B to
>finish the same tasks. If now at this very moment two new transactions
>query system A and system C (assuming A has not yet committed WS(C)
>while C has), they will get different data back (thanks to non-blocking
>reads). I think this is pretty asynchronous.
>

So if we hold WS(C) until we receive commit messages for WS(A) and
WS(B), will that meet your synchronous expectations, or do all the
systems need to commit the WS in the same order and at the exact same
time?

>
>
>It doesn't lead to inconsistencies, because the transaction on A cannot
>do something that is in conflict with the changes made by WS(C), since
>its WS(A)2 will come back after WS(C) arrived at A, and thus WS(C)
>arriving at A will cause WS(A)2 to roll back (WS is used synonymously with
>Xact in this context).
>
Right

>
>Hope this doesn't add too much confusion :-)
>
No, however I guess I need to adjust my slides to include your
definition of synchronous replication. ;-)

Darren

>


From: "Al Sutton" <al(at)alsutton(dot)com>
To: "Darren Johnson" <darren(at)up(dot)hrcoxmail(dot)com>, "Jan Wieck" <JanWieck(at)Yahoo(dot)com>
Cc: "Bruce Momjian" <pgman(at)candle(dot)pha(dot)pa(dot)us>, <shridhar_daithankar(at)persistent(dot)co(dot)in>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Big 7.4 items - Replication
Date: 2002-12-14 12:51:13
Message-ID: 01d101c2a36f$7c529740$0100a8c0@cloud

For live replication, could I propose that we consider the systems A, B, and C
connected to each other independently (i.e. A has links to B and C, B has
links to A and C, and C has links to A and B), and that replication is
handled by the node receiving the write-based transaction.

If we consider a write transaction that arrives at A (called WT(A)), system
A will then send WT(A) to systems B and C via its direct connections.
System A will receive back either an OK response if there are no conflicts,
a NOT_OK response if there are conflicts, or no response if the system is
unavailable.

If system A receives a NOT_OK response from any other node, it begins the
process of rolling back the transaction on all nodes which previously
issued an OK, and the transaction returns a failure code to the client which
submitted WT(A). The other systems (B and C) would track recent transactions,
and there would be a specified timeout after which the transaction is
considered safe and could no longer be rolled back.

Any system not returning an OK or NOT_OK state is assumed to be down, and
error messages are logged to state that the transaction could not be sent to
the system due to its unavailability, and any monitoring system would
alert the administrator that a replicant is faulty.
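
The OK / NOT_OK / no-response handling proposed above could look roughly like this. It is only a sketch under stated assumptions: `Reply`, `replicate`, the callable-per-peer interface, and the log message wording are hypothetical names invented for illustration, and the peers are contacted one after another for simplicity.

```python
from enum import Enum


class Reply(Enum):
    OK = "OK"
    NOT_OK = "NOT_OK"


def replicate(txn, peers, log):
    """Send txn to every peer; return True if the client can be told it succeeded.

    peers: dict mapping a peer name to a callable that returns a Reply,
           or None when the peer does not respond (assumed to be down).
    log:   list collecting the messages a monitoring system would see.
    """
    accepted = []
    for name, send in peers.items():
        reply = send(txn)
        if reply is None:
            # No response: treat the peer as unavailable and carry on.
            log.append(f"{name} unavailable; {txn} not delivered")
            continue
        if reply is Reply.NOT_OK:
            # A conflict anywhere undoes the transaction on every peer that said OK.
            for peer in accepted:
                log.append(f"rollback {txn} on {peer}")
            return False
        accepted.append(name)
    return True


log = []
peers = {"B": lambda txn: Reply.OK, "C": lambda txn: Reply.NOT_OK}
print(replicate("WT(A)", peers, log), log)
```

The timeout after which a transaction can no longer be rolled back is left out of the sketch.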

There would also need to be code developed to ensure that a system could be
brought into sync with the current state of other systems within the group
in order to allow new databases to be added, and faulty databases to be
re-admitted to the group. This code could also be used for non-realtime
replication to allow databases to be synchronised with the live master.

This would give a multi-master solution whereby a write transaction to any
one node would guarantee that all available replicants would also hold the
data once it is completed, and would also provide the code to handle
scenarios where non-realtime data replication is required.

This system assumes that a majority of transactions will be successful (which
should be the case for a well-designed system).

Comments?

Al.


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Al Sutton <al(at)alsutton(dot)com>
Cc: Darren Johnson <darren(at)up(dot)hrcoxmail(dot)com>, Jan Wieck <JanWieck(at)Yahoo(dot)com>, shridhar_daithankar(at)persistent(dot)co(dot)in, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Big 7.4 items - Replication
Date: 2002-12-14 16:59:28
Message-ID: 200212141659.gBEGxS822196@candle.pha.pa.us


This sounds like two-phase commit. While it will work, it is probably
slower than Postgres-R's method.


--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Mathieu Arnold <mat(at)mat(dot)cc>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Big 7.4 items - Replication
Date: 2002-12-14 17:03:18
Message-ID: 7959272.1039888998@cmantatzi.in.t-online.fr

-- On this fine Saturday, 14 December 2002 11:59 -0500,
-- Bruce Momjian wrote with his little fingers:
>
> This sounds like two-phase commit. While it will work, it is probably
> slower than Postgres-R's method.

What exactly is Postgres-R's method?

--
Mathieu Arnold


From: "Al Sutton" <al(at)alsutton(dot)com>
To: "Bruce Momjian" <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: "Darren Johnson" <darren(at)up(dot)hrcoxmail(dot)com>, "Jan Wieck" <JanWieck(at)Yahoo(dot)com>, <shridhar_daithankar(at)persistent(dot)co(dot)in>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [mail] Re: Big 7.4 items - Replication
Date: 2002-12-14 18:18:10
Message-ID: 06b801c2a39d$29915ae0$0100a8c0@cloud

I see it as very difficult to avoid a two-stage process because there will
be the following two parts to any transaction:

1) All databases must agree upon the acceptability of a transaction before
the client can be informed of its success.

2) All databases must be informed as to whether or not the transaction was
accepted by the entire replicant set, and thus whether it should be written
to the database.

If stage 1 is missed then the client application may be informed of a
successful transaction which may fail when it is replicated to other
databases.

If stage 2 is missed then databases may become out of sync because they have
accepted transactions that were rejected by other databases.

From reading the PDF on Postgres-R I can see that one of two things
will occur:

a) There will be a central point of synchronization where conflicts will be
tested and dealt with. This is not desirable because it will leave the
synchronization and replication processing load concentrated in one place,
which will limit scalability as well as leaving a single point of failure.

or

b) The Group Communication blob will consist of a number of processes which
need to talk to all of the others to interrogate them for changes which may
conflict with the current write being handled and then issue the
transaction response. This is basically the two-phase commit solution with
the phases moved into the group communication process.

I can see the possibility of using solution b and having fewer group
communication processes than databases in an attempt to simplify things, but
this would mean the loss of a number of databases if the machine running the
group communication process for the set of databases is lost.

Al.


From: Darren Johnson <darren(at)up(dot)hrcoxmail(dot)com>
To: Al Sutton <al(at)alsutton(dot)com>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Jan Wieck <JanWieck(at)Yahoo(dot)com>, shridhar_daithankar(at)persistent(dot)co(dot)in, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [mail] Re: Big 7.4 items - Replication
Date: 2002-12-14 18:48:06
Message-ID: 3DFB7CE6.6090004@up.hrcoxmail.com

>
>
>
>b) The Group Communication blob will consist of a number of processes which
>need to talk to all of the others to interrogate them for changes which may
>conflict with the current write being handled and then issue the
>transaction response. This is basically the two-phase commit solution with
>the phases moved into the group communication process.
>
>I can see the possibility of using solution b and having fewer group
>communication processes than databases in an attempt to simplify things, but
>this would mean the loss of a number of databases if the machine running the
>group communication process for the set of databases is lost.
>
The group communication system doesn't just run on one system. For
Postgres-R using Spread, there is actually a Spread daemon that runs on
each database server. It has nothing to do with detecting the conflicts.
Its job is to deliver messages in a total order for writesets, or in simple
order for commits, aborts, joins, etc.

The detection of conflicts will be done at the database level, by the
backend processes. The basic concept is: "if all databases get the
writesets (changes) in the exact same order, apply them in a consistent
order, and avoid conflicts, then one-copy serialization is achieved" (one
copy of the database replicated across all databases in the replica).
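
A small sketch may help show why identical delivery order matters. The function below is not how Postgres-R detects conflicts (the thread does not say); the per-key version check, the `apply_in_total_order` name, and the writeset layout are assumptions chosen only to make the conflict test deterministic.

```python
def apply_in_total_order(writesets):
    """Apply writesets in delivery order; return (data, aborted txn ids).

    Each writeset is (txn, based_on, changes):
      based_on: key -> version the transaction read (assumed conflict check)
      changes:  key -> new value
    """
    data, versions, aborted = {}, {}, []
    for txn, based_on, changes in writesets:
        stale = any(versions.get(key, 0) != seen for key, seen in based_on.items())
        if stale:
            aborted.append(txn)   # deterministic: every replica rejects the same one
            continue
        for key, value in changes.items():
            data[key] = value
            versions[key] = versions.get(key, 0) + 1
    return data, aborted


# Two conflicting updates to X, both based on version 0 of X.
delivered = [("T1", {"X": 0}, {"X": "from A"}),
             ("T2", {"X": 0}, {"X": "from B"})]

replica_a = apply_in_total_order(delivered)
replica_b = apply_in_total_order(delivered)
print(replica_a == replica_b, replica_a)
```

Because each replica runs the same deterministic check over the same sequence, no extra round of voting between databases is needed to agree on which writeset loses.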

I hope that explains the group communication system's responsibility.

Darren

>


From: "Al Sutton" <al(at)alsutton(dot)com>
To: "Darren Johnson" <darren(at)up(dot)hrcoxmail(dot)com>
Cc: "Bruce Momjian" <pgman(at)candle(dot)pha(dot)pa(dot)us>, "Jan Wieck" <JanWieck(at)Yahoo(dot)com>, <shridhar_daithankar(at)persistent(dot)co(dot)in>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [mail] Re: Big 7.4 items - Replication
Date: 2002-12-15 10:16:22
Message-ID: 002a01c2a423$050f3ad0$0100a8c0@cloud

Many thanks for the explanation. Could you explain to me what the order of
the writesets would be for the following scenario?

If a transaction takes 50ms to reach one database from another, for a
specific data element (called X), the following timeline occurs:

at 0ms, T1(X) is written to system A.
at 10ms, T2(X) is written to system B.

Where T1(X) and T2(X) conflict.

My concern is that if the Group Communication Daemon (gcd) is operating on
each database, a successful result for T1(X) will be returned to the client
talking to database A because T2(X) has not reached it, and thus no conflict
is known about, and a successful result is returned to the client submitting
T2(X) to database B because it is not aware of T1(X). This would mean that
the two clients believe both T1(X) and T2(X) completed successfully, yet
they cannot both have done so due to the conflict.
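
The timing in this scenario can be worked through in a few lines. The sketch only restates the numbers given above (50ms latency, submissions at 0ms and 10ms); the function names are hypothetical, and it models the worry as stated, namely a node that answers its client using only what it has seen locally at submission time.

```python
LATENCY_MS = 50
submissions = {"T1": ("A", 0), "T2": ("B", 10)}   # txn -> (origin node, submit time in ms)


def arrival(txn, node):
    """Time at which txn becomes visible at node."""
    origin, submitted = submissions[txn]
    return submitted if node == origin else submitted + LATENCY_MS


def known_at_submit(txn):
    """Other transactions already visible at txn's origin when txn is submitted."""
    origin, submitted = submissions[txn]
    return [other for other in submissions
            if other != txn and arrival(other, origin) <= submitted]


for txn in submissions:
    print(txn, "sees", known_at_submit(txn))
```

T1 reaches B at 50ms and T2 reaches A at 60ms, so neither origin knows of the other transaction when its own write is submitted.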

Thanks,

Al.


From: David Walker <pgsql(at)grax(dot)com>
To: "Al Sutton" <al(at)alsutton(dot)com>, "Darren Johnson" <darren(at)up(dot)hrcoxmail(dot)com>
Cc: "Bruce Momjian" <pgman(at)candle(dot)pha(dot)pa(dot)us>, "Jan Wieck" <JanWieck(at)Yahoo(dot)com>, <shridhar_daithankar(at)persistent(dot)co(dot)in>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [MLIST] Re: [mail] Re: Big 7.4 items - Replication
Date: 2002-12-15 14:29:58
Message-ID: 200212150829.58784@grx
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Another concern I have with multi-master systems is what happens if the
network splits in 2 so that 2 master systems are taking commits for 2
separate sets of clients. It seems to me that re-syncing the 2 databases
once the network heals would be a very complex, if not impossible, task.

On Sunday 15 December 2002 04:16 am, Al Sutton wrote:
> Many thanks for the explanation. Could you explain to me where the order of
> the writesets is decided in the following scenario:
>
> If a transaction takes 50ms to reach one database from another, for a
> specific data element (called X), the following timeline occurs:
>
> at 0ms, T1(X) is written to system A.
> at 10ms, T2(X) is written to system B.
>
> Where T1(X) and T2(X) conflict.
>
> My concern is that if the Group Communication Daemon (gcd) is operating on
> each database, a successful result for T1(X) will be returned to the client
> talking to database A because T2(X) has not reached it, and thus no
> conflict is known about, and a successful result is returned to the client
> submitting T2(X) to database B because it is not aware of T1(X). This would
> mean that the two clients believe both T1(X) and T2(X) completed
> successfully, yet they cannot both have done so because of the conflict.
>
> Thanks,
>
> Al.


From: "Al Sutton" <al(at)alsutton(dot)com>
To: "David Walker" <pgsql(at)grax(dot)com>, "Darren Johnson" <darren(at)up(dot)hrcoxmail(dot)com>
Cc: "Bruce Momjian" <pgman(at)candle(dot)pha(dot)pa(dot)us>, "Jan Wieck" <JanWieck(at)Yahoo(dot)com>, <shridhar_daithankar(at)persistent(dot)co(dot)in>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [MLIST] Re: [mail] Re: Big 7.4 items - Replication
Date: 2002-12-15 15:06:04
Message-ID: 00e801c2a44b$8af8d160$0100a8c0@cloud
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

David,

This can be resolved by requiring that for any transaction to succeed the
entrypoint database must receive acknowledgements from n/2 + 0.5 (rounded up
to the nearest integer) databases, where n is the total number in the
replicant set. For example:

Total Number of databases: 2
Number required to accept transaction: 2

Total Number of databases: 3
Number required to accept transaction: 2

Total Number of databases: 4
Number required to accept transaction: 3

Total Number of databases: 5
Number required to accept transaction: 3

Total Number of databases: 6
Number required to accept transaction: 4

Total Number of databases: 7
Number required to accept transaction: 4

Total Number of databases: 8
Number required to accept transaction: 5

This would prevent two replicant sub-sets forming, because it is impossible
for both sets to have over 50% of the databases.
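
The rule above can be sketched in a few lines of Python. This is only an illustration: the function names are made up here, and it assumes the acknowledgement count includes the entrypoint database itself, which the description above does not spell out.

```python
def majority_quorum(n: int) -> int:
    """Smallest number of databases that is strictly more than half of n."""
    if n < 1:
        raise ValueError("replicant set must contain at least one database")
    # n/2 + 0.5 rounded up equals integer division by 2, plus 1.
    return n // 2 + 1


def can_accept(acks: int, n: int) -> bool:
    """True if `acks` acknowledgements are enough to accept a transaction."""
    return acks >= majority_quorum(n)


# Reproduce the cases listed above for 2 to 8 databases.
for n in range(2, 9):
    print(f"Total: {n}  Required: {majority_quorum(n)}")
```

However a set of n databases is split in two, at most one side can satisfy can_accept(), which is the property relied on to stop two sub-sets forming.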

Applications would be able to detect when a database has dropped out of the
replicant set because the database could report a state of "Unable to obtain
majority consensus". This would allow applications to differentiate between a
database out of the set, where writing to other databases in the set could
yield a successful result, and "Unable to commit due to conflict", where
trying other databases is pointless.

Al

----- Original Message -----
From: "David Walker" <pgsql(at)grax(dot)com>
To: "Al Sutton" <al(at)alsutton(dot)com>; "Darren Johnson"
<darren(at)up(dot)hrcoxmail(dot)com>
Cc: "Bruce Momjian" <pgman(at)candle(dot)pha(dot)pa(dot)us>; "Jan Wieck"
<JanWieck(at)Yahoo(dot)com>; <shridhar_daithankar(at)persistent(dot)co(dot)in>;
"PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Sent: Sunday, December 15, 2002 2:29 PM
Subject: Re: [MLIST] Re: [mail] Re: [HACKERS] Big 7.4 items - Replication

> Another concern I have with multi-master systems is what happens if the
> network splits in 2 so that 2 master systems are taking commits for 2
> separate sets of clients. It seems to me that re-syncing the 2 databases
> once the network heals would be a very complex, if not impossible, task.


From: "Al Sutton" <al(at)alsutton(dot)com>
To: "Jonathan Stanton" <jonathan(at)cnds(dot)jhu(dot)edu>
Cc: "Darren Johnson" <darren(at)up(dot)hrcoxmail(dot)com>, "Bruce Momjian" <pgman(at)candle(dot)pha(dot)pa(dot)us>, "Jan Wieck" <JanWieck(at)Yahoo(dot)com>, <shridhar_daithankar(at)persistent(dot)co(dot)in>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [mail] Re: Big 7.4 items - Replication
Date: 2002-12-15 19:42:35
Message-ID: 000701c2a472$1ed24940$0100a8c0@cloud
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jonathan,

How do the group communication daemons on system A and B agree that T2 is
after T1?

As I understand it, the operation is performed locally before being passed on
to the group for replication. When T2 arrives at system B, system B has no
knowledge of T1 and so can perform T2 successfully.

I am guessing that System B performs T2 locally, sends it to the group
communication daemon for ordering, and then receives it back from the group
communication order queue, once its position in the order queue has been
decided, before it is written to the database.
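
The flow guessed at above can be sketched as a small Python simulation of the T1/T2 case. It is a toy model under loud assumptions: a plain list stands in for whatever total order the group communication system delivers (it says nothing about how that order is produced), a writeset is reduced to a set of keys plus a count of how many writesets its origin had seen when the transaction ran, and Writeset, Replica and deliver are hypothetical names, not anything from postgres-r or spread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Writeset:
    txn: str          # transaction name, e.g. "T1"
    origin: str       # replica that ran the transaction locally
    keys: frozenset   # data elements written
    snapshot: int     # writesets already delivered at origin when txn ran


class Replica:
    def __init__(self, name):
        self.name = name
        self.committed = []   # (position, writeset), in delivery order
        self.delivered = 0    # how many writesets have been delivered so far
        self.outcome = {}     # txn name -> "commit" or "abort"

    def deliver(self, ws):
        """Called once per writeset, in the same order on every replica."""
        position = self.delivered
        self.delivered += 1
        # Conflict: a committed writeset the origin had not yet seen
        # touched at least one of the same keys.
        conflict = any(
            pos >= ws.snapshot and (w.keys & ws.keys)
            for pos, w in self.committed
        )
        if conflict:
            self.outcome[ws.txn] = "abort"
        else:
            self.committed.append((position, ws))
            self.outcome[ws.txn] = "commit"


a, b = Replica("A"), Replica("B")
t1 = Writeset("T1", "A", frozenset({"X"}), snapshot=0)
t2 = Writeset("T2", "B", frozenset({"X"}), snapshot=0)

# Assumed outcome of the ordering step: T1 is placed before T2.
total_order = [t1, t2]
for ws in total_order:
    for replica in (a, b):
        replica.deliver(ws)

print(a.outcome)
print(b.outcome)
```

Because every replica runs the same check over the same sequence, each reaches the same decision without asking the others. In this model the origin would answer its client only after its own writeset has come back through deliver().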

This would indicate to me that there is a single central point which decides
that T2 is after T1.

Is this true?

Al.

----- Original Message -----
From: "Jonathan Stanton" <jonathan(at)cnds(dot)jhu(dot)edu>
To: "Al Sutton" <al(at)alsutton(dot)com>
Cc: "Darren Johnson" <darren(at)up(dot)hrcoxmail(dot)com>; "Bruce Momjian"
<pgman(at)candle(dot)pha(dot)pa(dot)us>; "Jan Wieck" <JanWieck(at)Yahoo(dot)com>;
<shridhar_daithankar(at)persistent(dot)co(dot)in>; "PostgreSQL-development"
<pgsql-hackers(at)postgresql(dot)org>
Sent: Sunday, December 15, 2002 5:00 PM
Subject: Re: [mail] Re: [HACKERS] Big 7.4 items - Replication

> The total order provided by the group communication daemons guarantees
> that every member will see the transactions/writesets in the same order.
> So both A and B will see that T1 is ordered before T2 BEFORE writing
> anything back to the client. So for both servers T1 will be completed
> successfully, and T2 will be aborted because of conflicting writesets.
>
> Jonathan
>
> --
> -------------------------------------------------------
> Jonathan R. Stanton jonathan(at)cs(dot)jhu(dot)edu
> Dept. of Computer Science
> Johns Hopkins University
> -------------------------------------------------------
>


From: "Al Sutton" <al(at)alsutton(dot)com>
To: "Jonathan Stanton" <jonathan(at)cnds(dot)jhu(dot)edu>
Cc: "Darren Johnson" <darren(at)up(dot)hrcoxmail(dot)com>, "Bruce Momjian" <pgman(at)candle(dot)pha(dot)pa(dot)us>, "Jan Wieck" <JanWieck(at)Yahoo(dot)com>, <shridhar_daithankar(at)persistent(dot)co(dot)in>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [mail] Re: Big 7.4 items - Replication
Date: 2002-12-15 22:15:40
Message-ID: 001301c2a487$813b69d0$0100a8c0@cloud
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jonathan,

Many thanks for clarifying the situation some more. With token passing, I
have the following concerns:

1) What happens if a server holding the token should die whilst it is in
possession of the token?

2) If I have n servers, and the time to pass the token between each server
is x milliseconds, I may have to wait for up to n times x milliseconds in
order for a transaction to be processed. If a server is limited to a single
transaction per possession of the token (in order to ensure no system hogs
the token), and the server develops a queue of length y, I will have to wait
n times x times y for the transaction to be processed. Both scenarios I
believe would not scale well beyond a small subset of servers with low
network latency between them.

If we consider the following situation I can illustrate why I'm still in
favour of a two phase commit:

Imagine, for example, credit card details about the status of an account
replicated in real time between databases in London, Moscow, Singapore,
Sydney, and New York. If any server can talk to any other server with a
guaranteed packet transfer time of 150ms, a two phase commit could complete
in 600ms in its worst case (assuming that the two phases consist of
request/response pairs, and that each server talks to all the others in
parallel). A token passing system may have to wait for the token to pass
through every other server before reaching the one that has the transaction
committed to it, which could take about 750ms.

If you then expand the network to allow for a primary and disaster recovery
database at each location, the two phase commit still maintains its 600ms
response time, but the token passing system doubles to 1500ms.
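
The arithmetic behind these figures, written out as a rough Python sketch. It only encodes the estimates made in this message (four one-way trips for a two phase commit, one full token circuit per queued transaction) and ignores processing time. The function names are invented for illustration, and none of this is a measurement of any real system.

```python
def two_phase_commit_ms(one_way_ms):
    # Two phases, each a request/response pair sent to all peers in
    # parallel: four one-way trips, whatever the number of servers.
    return 4 * one_way_ms


def token_wait_ms(servers, pass_ms, queue_len=1):
    # Worst case as estimated above: the token makes one full circuit
    # for each transaction queued on the waiting server.
    return servers * pass_ms * queue_len


# Five sites, then five sites with a disaster recovery database each.
for servers in (5, 10):
    print(servers, two_phase_commit_ms(150), token_wait_ms(servers, 150))
```

With a 150ms transfer time this gives 600ms against 750ms for five servers and 600ms against 1500ms for ten. The queue_len argument covers the case where a server may process only one transaction per possession of the token.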

Allowing disjointed segments to continue executing is also a concern because
any split in the replication group could effectively double the accepted
card limit for any card holder should they purchase items from various
locations around the globe.

I can see an idea that the token may be passed to the system with the most
transactions in a wait state, but this would cause low volume databases to
lose out on response times to higher volume ones, which is, again,
undesirable.

Al.

----- Original Message -----
From: "Jonathan Stanton" <jonathan(at)cnds(dot)jhu(dot)edu>
To: "Al Sutton" <al(at)alsutton(dot)com>
Cc: "Darren Johnson" <darren(at)up(dot)hrcoxmail(dot)com>; "Bruce Momjian"
<pgman(at)candle(dot)pha(dot)pa(dot)us>; "Jan Wieck" <JanWieck(at)Yahoo(dot)com>;
<shridhar_daithankar(at)persistent(dot)co(dot)in>; "PostgreSQL-development"
<pgsql-hackers(at)postgresql(dot)org>
Sent: Sunday, December 15, 2002 9:17 PM
Subject: Re: [mail] Re: [HACKERS] Big 7.4 items - Replication

> On Sun, Dec 15, 2002 at 07:42:35PM -0000, Al Sutton wrote:
> > Jonathan,
> >
> > How do the group communication daemons on system A and B agree that T2
> > is after T1?
>
> Let's split this into two separate problems:
>
> 1) How do the daemons totally order a set of messages (abstract
> messages)
>
> 2) How do database transactions get split into writesets that are sent
> as messages through the group communication system.
>
> As to question 1, the set of daemons (usually one running on each
> participating server) run a distributed ordering algorithm, as well as
> distributed algorithms to provide message reliability, fault-detection,
> and membership services. These are completely distributed algorithms, no
> "central" controller node exists, so even if network partitions occur
> the group communication system keeps running and providing ordering and
> reliability guarantees to messages.
>
> A number of different algorithms exist as to how to provide a total
> order on messages. Spread currently uses a token algorithm, that
> involves passing a token between the daemons, and a counter attached to
> each message, but other algorithms exist and we have implemented some
> other ones in our research. You can find lots of details in the papers
> at www.cnds.jhu.edu/publications/ and www.spread.org.
>
> As to question 2, there are several different approaches to how to use
> such a total order for actual database replication. They all use the gcs
> total order to establish a single sequence of "events" that all the
> databases see. Then each database can act on the events as they are
> delivered by the gcs and be guaranteed that no other database will see a
> different order.
>
> In the postgres-R case, the action received from a client is performed
> partially at the originating postgres server, the writesets are then
> sent through the gcs to order them and determine conflicts. Once they
> are delivered back, if no conflicts occurred in the meantime, the
> original transaction is completed and the result returned to the client.
> If a conflict occurred, the original transaction is rolled back and
> aborted, and the abort is returned to the client.
>
> >
> > As I understand it the operation is performed locally before being
> > passed on to the group for replication, when T2 arrives at system B,
> > system B has no knowledge of T1 and so can perform T2 successfully.
> >
> > I am guessing that the System B performs T2 locally, sends it to the
> > group communication daemon for ordering, and then receives it back from
> > the group communication order queue after its position in the order
> > queue has been decided before it is written to the database.
>
> If I understand the above correctly, yes, that is the same as I describe
> above.
>
> >
> > This would indicate to me that there is a single central point which
> > decides that T2 is after T1.
>
> No, there is a distributed algorithm that determines the order. The
> distributed algorithm "emulates" a central controller who decides the
> order, but no single controller actually exists.
>
> Jonathan
>
> --
> -------------------------------------------------------
> Jonathan R. Stanton jonathan(at)cs(dot)jhu(dot)edu
> Dept. of Computer Science
> Johns Hopkins University
> -------------------------------------------------------
>


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Darren Johnson <darren(at)up(dot)hrcoxmail(dot)com>
Cc: Al Sutton <al(at)alsutton(dot)com>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, shridhar_daithankar(at)persistent(dot)co(dot)in, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [mail] Re: Big 7.4 items - Replication
Date: 2002-12-16 03:40:17
Message-ID: 3DFD4B21.B9965A99@Yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Darren Johnson wrote:

> The group communication system doesn't just run on one system. For
> postgres-r using spread

The reason why group communication software is used is simply because
this software is designed with two goals in mind:

1) optimize bandwidth usage

2) make many-to-many communication easy

Number one is done by utilizing things like multicasting where
available.

Number two is done by using global scoped queues.

I add this only to head off any suggestion that pushing some PITR log
snippets via FTP, or worse, over a network would do the same. It did not in
the past, it does not right now, and it will not in the future.

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #