Re: Proposal: Commit timestamp

Lists: pgsql-hackers
From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Proposal: Commit timestamp
Date: 2007-01-25 23:16:05
Message-ID: 45B93A35.8040704@Yahoo.com

For a future multimaster replication system, I will need a couple of
features in the PostgreSQL server itself. I will submit separate
proposals per feature so that discussions can be kept focused on one
feature per thread.

For conflict resolution purposes in an asynchronous multimaster system,
the "last update" definition often comes into play. For this to work,
the system must provide a monotonically increasing timestamp taken at
the commit of a transaction. During replication, the replication process
must be able to provide the remote node's timestamp so that the
replicated data will be "as of the time it was written on the remote
node", and not the current local time of the replica, which, by
definition of "asynchronous", is later.

To provide this data, I would like to add another "log" directory,
pg_tslog. The files in this directory will be similar to the clog, but
contain arrays of timestamptz values. On commit, the current system time
will be taken. As long as this time is lower than or equal to the last
timestamp taken in this PostgreSQL instance, the value will be increased by one
microsecond. The resulting time will be added to the commit WAL record
and written into the pg_tslog file.

If a per-database configurable tslog_priority is given, the timestamp
will be truncated to millisecond precision and the increment logic will
operate on milliseconds. The priority is then added to the timestamp as
its microseconds part. This guarantees that no two commit timestamps
will ever be exactly identical, even across different servers.

The COMMIT syntax will get extended to

COMMIT [TRANSACTION] [WITH TIMESTAMP <timestamptz>];

The extension is limited to superusers and will override the normally
generated commit timestamp. This will be used to give the replicating
transaction on the replica the exact same timestamp it got on the
originating master node.

The pg_tslog segments will be purged like the clog segments, after all
transactions belonging to them have been stamped frozen. A frozen xid by
definition has a timestamp of epoch. To ensure a system using this
timestamp feature has enough time to perform its work, a new GUC
variable defining an interval will prevent vacuum from freezing xids
that are younger than that interval.

A function get_commit_timestamp(xid) returning timestamptz will return
the commit time of a transaction as recorded by this feature.

Comments, changes, additions?

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Neil Conway <neilc(at)samurai(dot)com>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-01-25 23:47:39
Message-ID: 1169768859.5432.9.camel@localhost.localdomain
Lists: pgsql-hackers

On Thu, 2007-01-25 at 18:16 -0500, Jan Wieck wrote:
> For conflict resolution purposes in an asynchronous multimaster system,
> the "last update" definition often comes into play. For this to work,
> the system must provide a monotonically increasing timestamp taken at
> the commit of a transaction.

Do you really need an actual timestamptz derived from the system clock,
or would a monotonically increasing 64-bit counter be sufficient? (The
assumption that the system clock is monotonically increasing seems
pretty fragile, in the presence of manual system clock changes, ntpd,
etc.)

> Comments, changes, additions?

Would this feature have any use beyond the specific project/algorithm
you have in mind?

-Neil


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-01-25 23:49:43
Message-ID: 13999.1169768983@sss.pgh.pa.us
Lists: pgsql-hackers

Jan Wieck <JanWieck(at)Yahoo(dot)com> writes:
> To provide this data, I would like to add another "log" directory,
> pg_tslog. The files in this directory will be similar to the clog, but
> contain arrays of timestamptz values.

Why should everybody be made to pay this overhead?

> The COMMIT syntax will get extended to
> COMMIT [TRANSACTION] [WITH TIMESTAMP <timestamptz>];
> The extension is limited to superusers and will override the normally
> generated commit timestamp. This will be used to give the replicating
> transaction on the replica the exact same timestamp it got on the
> originating master node.

I'm not convinced you've even thought this through. If you do that then
you have no guarantee of commit timestamp monotonicity on the slave
(if it has either multiple masters or any locally generated transactions).
Since this is supposedly for a multi-master system, that seems a rather
fatal objection --- no node in the system will actually have commit
timestamp monotonicity. What are you hoping to accomplish with this?

regards, tom lane


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Neil Conway <neilc(at)samurai(dot)com>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-01-26 00:31:20
Message-ID: 45B94BD8.2000202@Yahoo.com
Lists: pgsql-hackers

On 1/25/2007 6:47 PM, Neil Conway wrote:
> On Thu, 2007-01-25 at 18:16 -0500, Jan Wieck wrote:
>> For conflict resolution purposes in an asynchronous multimaster system,
>> the "last update" definition often comes into play. For this to work,
>> the system must provide a monotonically increasing timestamp taken at
>> the commit of a transaction.
>
> Do you really need an actual timestamptz derived from the system clock,
> or would a monotonically increasing 64-bit counter be sufficient? (The
> assumption that the system clock is monotonically increasing seems
> pretty fragile, in the presence of manual system clock changes, ntpd,
> etc.)

Yes, I do need it to be a timestamp, and one assumption is that all
servers in the multimaster cluster are NTP synchronized. The reason is
that this is for asynchronous multimaster (in my case). Two sequences
running on separate systems don't tell which was the "last update" on a
timeline. On its own, this conflict resolution method is of course
completely inadequate.

>
>> Comments, changes, additions?
>
> Would this feature have any use beyond the specific project/algorithm
> you have in mind?

The tablelog project on pgfoundry currently uses the transaction's start
time, but would be very delighted to have the commit time available instead.

Jan



From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Neil Conway <neilc(at)samurai(dot)com>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-01-26 00:41:48
Message-ID: 15256.1169772108@sss.pgh.pa.us
Lists: pgsql-hackers

Jan Wieck <JanWieck(at)Yahoo(dot)com> writes:
> On 1/25/2007 6:47 PM, Neil Conway wrote:
>> Would this feature have any use beyond the specific project/algorithm
>> you have in mind?

> The tablelog project on pgfoundry currently uses the transactions start
> time but would be very delighted to have the commit time available instead.

BTW, it's not clear to me why you need a new log area for this. (We
don't log transaction start time anywhere, so certainly tablelog's needs
would not include it.) Commit timestamps are available from WAL commit
records in a crash-and-restart scenario, so wouldn't that be enough?

regards, tom lane


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-01-26 00:42:24
Message-ID: 45B94E70.9020803@Yahoo.com
Lists: pgsql-hackers

On 1/25/2007 6:49 PM, Tom Lane wrote:
> Jan Wieck <JanWieck(at)Yahoo(dot)com> writes:
>> To provide this data, I would like to add another "log" directory,
>> pg_tslog. The files in this directory will be similar to the clog, but
>> contain arrays of timestamptz values.
>
> Why should everybody be made to pay this overhead?

It could be made an initdb time option. If you intend to use a product
that requires this feature, you will be willing to pay that price.

>
>> The COMMIT syntax will get extended to
>> COMMIT [TRANSACTION] [WITH TIMESTAMP <timestamptz>];
>> The extension is limited to superusers and will override the normally
>> generated commit timestamp. This will be used to give the replicating
>> transaction on the replica the exact same timestamp it got on the
>> originating master node.
>
> I'm not convinced you've even thought this through. If you do that then
> you have no guarantee of commit timestamp monotonicity on the slave
> (if it has either multi masters or any locally generated transactions).
> Since this is supposedly for a multi-master system, that seems a rather
> fatal objection --- no node in the system will actually have commit
> timestamp monotonicity. What are you hoping to accomplish with this?

Maybe I wasn't clear enough about this. If the commit timestamps on the
local machine are guaranteed to increase by at least one millisecond
(granted, that limits the system to a sustained 1000 commits per second
before it really starts to run ahead of the clock), then no two commits on the
same instance will ever have the same timestamp. If furthermore each
instance in a cluster has a distinct priority (the microsecond part
added to the millisecond-truncated timestamp), each commit timestamp
could even act as a globally unique ID. It does require that all the
nodes in the cluster are configured with a distinct priority.

What I hope to accomplish with this is a very easy, commit time based
"last update wins" conflict resolution for data fields of the overwrite
nature.

The replication system I have in mind will have another field type of
the balance nature, where it will never communicate the current value
but only deltas that get applied regardless of the two timestamps.
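
The two field types just described can be sketched side by side. This is purely illustrative Python under the assumptions of the proposal; the dict layout and function names are invented for the example and are not the replication system's actual code.

```python
# Two conflict-resolution styles for replicated fields (illustrative).

def resolve_overwrite(local, remote):
    """'Last update wins' for overwrite-style fields: the row version
    with the later commit timestamp is kept. Distinct per-node
    priorities guarantee the timestamps never compare equal."""
    return remote if remote["commit_ts"] > local["commit_ts"] else local

def apply_delta(current_value, delta):
    """Balance-style fields: nodes replicate deltas rather than
    current values, so the delta applies regardless of which of the
    two commit timestamps is older."""
    return current_value + delta
```

Because every node compares the same pair of globally unique timestamps, both replicas reach the same verdict no matter how late the replication runs.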

Jan



From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Neil Conway <neilc(at)samurai(dot)com>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-01-26 00:49:46
Message-ID: 45B9502A.1060405@Yahoo.com
Lists: pgsql-hackers

On 1/25/2007 7:41 PM, Tom Lane wrote:
> Jan Wieck <JanWieck(at)Yahoo(dot)com> writes:
>> On 1/25/2007 6:47 PM, Neil Conway wrote:
>>> Would this feature have any use beyond the specific project/algorithm
>>> you have in mind?
>
>> The tablelog project on pgfoundry currently uses the transactions start
>> time but would be very delighted to have the commit time available instead.
>
> BTW, it's not clear to me why you need a new log area for this. (We
> don't log transaction start time anywhere, so certainly tablelog's needs
> would not include it.) Commit timestamps are available from WAL commit
> records in a crash-and-restart scenario, so wouldn't that be enough?

First, I need the timestamp of the original transaction that caused the
data to change, which can be a remote or a local transaction. So the
timestamp currently recorded in the WAL commit record is useless and the
commit record has to be extended by one more timestamp.

Second, I don't think that an API scanning for WAL commit records by xid
would be efficient enough to satisfy the needs of a timestamp based
conflict resolution system, which would have to retrieve the timestamp
for every row's xmin that it is about to update in order to determine if
the old or the new values should be used.

Third, keeping the timestamp information only in the WAL would require
keeping the WAL segments around until they are older than the
admin-chosen minimum freeze age. I hope you don't want to force that penalty
on everyone who intends to use multimaster replication.

Jan



From: Richard Troy <rtroy(at)ScienceTools(dot)com>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-01-26 01:42:32
Message-ID: Pine.LNX.4.33.0701251739390.30496-100000@denzel.in
Lists: pgsql-hackers


On Thu, 25 Jan 2007, Jan Wieck wrote:
>
> For a future multimaster replication system, I will need a couple of
> features in the PostgreSQL server itself. I will submit separate
> proposals per feature so that discussions can be kept focused on one
> feature per thread.

Hmm... "will need" ... Have you prototyped this system yet? ISTM you can
prototype your proposal using "external" components so you can work out
the kinks first.

Richard

--
Richard Troy, Chief Scientist
Science Tools Corporation
510-924-1363 or 202-747-1263
rtroy(at)ScienceTools(dot)com, http://ScienceTools.com/


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Richard Troy <rtroy(at)ScienceTools(dot)com>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-01-26 02:01:29
Message-ID: 45B960F9.4040702@Yahoo.com
Lists: pgsql-hackers

On 1/25/2007 8:42 PM, Richard Troy wrote:
> On Thu, 25 Jan 2007, Jan Wieck wrote:
>>
>> For a future multimaster replication system, I will need a couple of
>> features in the PostgreSQL server itself. I will submit separate
>> proposals per feature so that discussions can be kept focused on one
>> feature per thread.
>
> Hmm... "will need" ... Have you prototyped this system yet? ISTM you can
> prototype your proposal using "external" components so you can work out
> the kinks first.

These details are pretty well drilled down and are needed with the
described functionality. And I will not make the same mistake as with
Slony-I again and develop things that require backend support as totally
external components (look at the catalog corruption mess I created there
and you'll know what I'm talking about).

Jan



From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-01-26 04:41:30
Message-ID: 200701260441.l0Q4fUF05020@momjian.us
Lists: pgsql-hackers

Jan Wieck wrote:
> On 1/25/2007 6:49 PM, Tom Lane wrote:
> > Jan Wieck <JanWieck(at)Yahoo(dot)com> writes:
> >> To provide this data, I would like to add another "log" directory,
> >> pg_tslog. The files in this directory will be similar to the clog, but
> >> contain arrays of timestamptz values.
> >
> > Why should everybody be made to pay this overhead?
>
> It could be made an initdb time option. If you intend to use a product
> that requires this feature, you will be willing to pay that price.

That is going to cut your usage by like 80%. There must be a better
way.

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-01-26 05:41:55
Message-ID: 45B994A3.8090804@Yahoo.com
Lists: pgsql-hackers

On 1/25/2007 11:41 PM, Bruce Momjian wrote:
> Jan Wieck wrote:
>> On 1/25/2007 6:49 PM, Tom Lane wrote:
>> > Jan Wieck <JanWieck(at)Yahoo(dot)com> writes:
>> >> To provide this data, I would like to add another "log" directory,
>> >> pg_tslog. The files in this directory will be similar to the clog, but
>> >> contain arrays of timestamptz values.
>> >
>> > Why should everybody be made to pay this overhead?
>>
>> It could be made an initdb time option. If you intend to use a product
>> that requires this feature, you will be willing to pay that price.
>
> That is going to cut your usage by like 80%. There must be a better
> way.

I'd love to.

But it is a datum that needs to be collected essentially at the moment
the clog entry is made ... I don't think any external module can ever
do that.

You know how long I've been in and out and back into replication again.
The one thing that pops up again and again in all the scenarios is "what
the heck was the commit order?". Now the pure commit order for a single
node could certainly be recorded from a sequence, but that doesn't cover
the multi-node environment I am after. That's why I want it to be a
timestamp with a few fudged bits at the end. If you look at what I've
described, you will notice that as long as all node priorities are
unique, this timestamp will be a globally unique ID in a somewhat
ascending order along a timeline. That is what replication people are
looking for.

Tom fears that the overhead is significant, which I do understand and,
frankly, wonder about myself (actually, I don't even have a vague
estimate). I really think we should make this thing an initdb option and
decide later whether it's on or off by default. We could probably even
implement it in a way that lets one turn it on and off, with a
postmaster restart plus waiting out the desired freeze delay being
sufficient.

What I know for certain is that no async replication system can ever do
without the commit timestamp information. Using the transaction start
time or even the individual statement's timeofday will only lead to
inconsistencies all over the place (I haven't been absent from the
mailing lists for the past couple of months hiding in my closet ... I've
been experimenting and trying to get around all these issues - in my
closet). Slony-I can survive without that information because everything
happens on one node and we record snapshot information for later use.
But look at the cost at which we are dealing with this rather trivial
issue. All we need to know is the serializable commit order, and we have
to issue queries that eventually might exceed address space limits?

Jan



From: Naz Gassiep <naz(at)mira(dot)net>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-01-26 07:37:13
Message-ID: 45B9AFA9.9080601@mira.net
Lists: pgsql-hackers

I would be *very* concerned that system time is not a guaranteed
monotonic entity. Surely a counter or other internally managed mechanism
would be a better solution.

Furthermore, what would be the ramifications of master and slave system
times being out of sync?

Finally, what if system time is rolled forward a few minutes as part of a
correction and there were transactions completed in that time? There is
a chance, albeit small, that two transactions will have the same
timestamp. More importantly, this will throw all kinds of issues in when
the slave sees transactions in the future. Even with regular NTP syncs,
drift can cause a clock to be rolled forward a few milliseconds,
possibly resulting in duplicate transaction IDs.

In summary, I don't think the use of system time has any place in
PostgreSQL's internal consistency mechanisms; it is too unreliable an
environment property. Why can't a counter be used for this instead?

- Naz.

Jan Wieck wrote:
> For a future multimaster replication system, I will need a couple of
> features in the PostgreSQL server itself. I will submit separate
> proposals per feature so that discussions can be kept focused on one
> feature per thread.
>
> For conflict resolution purposes in an asynchronous multimaster
> system, the "last update" definition often comes into play. For this
> to work, the system must provide a monotonically increasing timestamp
> taken at the commit of a transaction. During replication, the
> replication process must be able to provide the remote nodes timestamp
> so that the replicated data will be "as of the time it was written on
> the remote node", and not the current local time of the replica, which
> is by definition of "asynchronous" later.
>
> To provide this data, I would like to add another "log" directory,
> pg_tslog. The files in this directory will be similar to the clog, but
> contain arrays of timestamptz values. On commit, the current system
> time will be taken. As long as this time is lower or equal to the last
> taken time in this PostgreSQL instance, the value will be increased by
> one microsecond. The resulting time will be added to the commit WAL
> record and written into the pg_tslog file.
>
> If a per database configurable tslog_priority is given, the timestamp
> will be truncated to milliseconds and the increment logic is done on
> milliseconds. The priority is added to the timestamp. This guarantees
> that no two timestamps for commits will ever be exactly identical,
> even across different servers.
>
> The COMMIT syntax will get extended to
>
> COMMIT [TRANSACTION] [WITH TIMESTAMP <timestamptz>];
>
> The extension is limited to superusers and will override the normally
> generated commit timestamp. This will be used to give the replicating
> transaction on the replica the exact same timestamp it got on the
> originating master node.
>
> The pg_tslog segments will be purged like the clog segments, after all
> transactions belonging to them have been stamped frozen. A frozen xid
> by definition has a timestamp of epoch. To ensure a system using this
> timestamp feature has enough time to perform its work, a new GUC
> variable defining an interval will prevent vacuum from freezing xid's
> that are younger than that.
>
> A function get_commit_timestamp(xid) returning timestamptz will return
> the commit time of a transaction as recorded by this feature.
>
>
> Comments, changes, additions?
>
> Jan
>


From: Markus Schiltknecht <markus(at)bluegap(dot)ch>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-01-26 10:36:11
Message-ID: 45B9D99B.30309@bluegap.ch
Lists: pgsql-hackers

Hi,

Jan Wieck wrote:
> The replication system I have in mind will have another field type of
> the balance nature, where it will never communicate the current value
> but only deltas that get applied regardless of the two timestamps.

I'd favor a more generally usable conflict resolution function
interface, on top of which you can implement both, the "last update
wins" as well as the "balance" conflict resolution type.

Passing the last common ancestor and the two conflicting heads to the
conflict resolution function (CRF) should be enough. That would easily
allow implementing the "balance" type (as you can calculate both
deltas). And if you want to rely on something as arbitrary as a
timestamp, you'd simply have to add a timestamp column to your table and
let the CRF decide upon that.

This would allow pretty much any type of conflict resolution, for
example higher-priority cleanup transactions, which change lots of
tuples and had better not be aborted later on. Those could be
implemented by adding a priority column and having the CRF respect that
one, too.

To find the last common ancestor tuple, transaction ids and MVCC are
enough. You wouldn't need to add timestamps. You'd only have to make
sure VACUUM doesn't delete tuples you still need.
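
The interface described above could look roughly like this. An illustrative Python sketch only; the CRF signature and the column names (`balance`, `updated_at`) are assumptions made for the example.

```python
# A conflict resolution function (CRF) receives the last common
# ancestor tuple plus the two conflicting heads (illustrative names).

def balance_crf(ancestor, local, remote):
    """Merge a balance column by applying both deltas relative to
    the common ancestor, so neither update is lost."""
    merged = dict(local)
    merged["balance"] = (ancestor["balance"]
                         + (local["balance"] - ancestor["balance"])
                         + (remote["balance"] - ancestor["balance"]))
    return merged

def last_update_crf(ancestor, local, remote):
    """'Last update wins' built on the same interface, deciding on
    an ordinary user-supplied timestamp column."""
    return remote if remote["updated_at"] > local["updated_at"] else local
```

Both resolution types fit the one interface: the ancestor makes deltas computable, and anything timestamp-like is just another column the CRF may consult.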

Regards

Markus


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-01-26 10:54:25
Message-ID: 45B9DDE1.2080802@enterprisedb.com
Lists: pgsql-hackers

Jan Wieck wrote:
> But it is a datum that needs to be collected at the moment where
> basically the clog entry is made ... I don't think any external module
> can do that ever.

How atomic does it need to be? External modules can register callbacks
that get called right after the clog update and the removal of the xid
from the MyProc entry. That's about as close to making the clog entry as
you can get.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Theo Schlossnagle <jesus(at)omniti(dot)com>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Theo Schlossnagle <jesus(at)omniti(dot)com>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>, Naz Gassiep <naz(at)mira(dot)net>
Subject: Re: Proposal: Commit timestamp
Date: 2007-01-26 13:11:23
Message-ID: A34395C2-74F7-4FCE-9EF1-4F00E0CA1EB8@omniti.com
Lists: pgsql-hackers

Jan, et. al.,

On Jan 26, 2007, at 2:37 AM, Naz Gassiep wrote:
> I would be *very* concerned that system time is not a guaranteed
> monotonic entity. Surely a counter or other internally managed
> mechanism would be a better solution.

As you should be concerned. Looking through the last few issues of IEEE
Transactions on Parallel and Distributed Systems on my desk, I see no
time-synch work for clusters of machines that is actually based on
wall-clock time. Almost all rely on something like a Lamport timestamp
or some relaxation thereof; a few are based on a tree-based pulse.
Using actual times is fraught with problems and is typically
inappropriate for cluster synchronization needs.
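
For contrast, here is a minimal sketch of the kind of Lamport-style logical clock that literature relies on: a counter ordered by message exchange rather than wall time (illustrative Python).

```python
class LamportClock:
    """Logical clock: orders events causally, without wall time."""

    def __init__(self):
        self.counter = 0

    def tick(self):
        # A local event (e.g. a commit) advances the counter.
        self.counter += 1
        return self.counter

    def receive(self, remote_counter):
        # On receiving a message, jump past the sender's counter so
        # causally later events always carry larger timestamps.
        self.counter = max(self.counter, remote_counter) + 1
        return self.counter
```

Note that such a clock orders causally related events consistently, but by itself says nothing about which of two concurrent updates was later in wall-clock terms, which is the property the "last update wins" scheme in this thread is after.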

> Furthermore, what would be the ramifications of master and slave
> system times being out of sync?

I'm much more concerned with the overall approach. The algorithm for
replication should be published in theoretic style, with a thorough
analysis of its assumptions and a proof of correctness based on those
assumptions. Databases, and replication therein, are definitely not
"off-the-cuff" technologies; they need rigorous academic discussion and
acceptance before they will get adopted. People generally will not adopt
technologies to store mission-critical data until they are confident
that they will both work as designed and work as implemented -- the
second is far less important, as the weaknesses there are simply bugs.

I'm not implying that this rigorous dissection of replication design
hasn't happened, but I didn't see it referenced anywhere in this
thread. Can you point me to it? I've reviewed many of these papers
and would like to better understand what you are aiming at.

Best regards,

Theo Schlossnagle


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Naz Gassiep <naz(at)mira(dot)net>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-01-26 13:21:40
Message-ID: 45BA0064.2050708@Yahoo.com
Lists: pgsql-hackers

On 1/26/2007 2:37 AM, Naz Gassiep wrote:
> I would be *very* concerned that system time is not a guaranteed
> monotonic entity. Surely a counter or other internally managed mechanism
> would be a better solution.

Such a counter has only "local" relevance. How do you plan to compare
the two separate counters on different machines to tell which
transaction happened last?

Even if the system clock isn't monotonically increasing, the described
increment system guarantees that the timestamp used appears to be.
Granted, this system will not work too well on a platform that doesn't
allow slewing the system clock.

>
> Furthermore, what would be the ramifications of master and slave system
> times being out of sync?

The origin of a transaction must scan all tuples it updates and make
sure that the timestamp it uses for commit appears in the future with
respect to them.
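
That check could be sketched like this (hypothetical Python with an invented function name; the real check would happen in the backend's update and commit paths):

```python
def commit_timestamp_for(updated_row_commit_ts, clock_now):
    """Pick a commit timestamp (integer microseconds) that lies in
    the future with respect to every row the transaction updated,
    falling back to the current clock when it is already ahead."""
    if not updated_row_commit_ts:
        return clock_now
    return max(clock_now, max(updated_row_commit_ts) + 1)
```

So even when the origin's clock lags the timestamps carried by replicated rows, its next commit still sorts after them.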

>
> Finally what if system time is rolled forward a few minutes as part of a
> correction and there were transactions completed in that time? There is
> a change, albeit small, that two transactions will have the same
> timestamp. More importantly, this will throw all kinds of issues in when
> the slave sees transactions in the future. Even with regular NTP syncs,
> drift can cause a clock to be rolled forward a few milliseconds,
> possibly resulting in duplicate transaction IDs.
>
> In summary, I don't think the use of system time has any place in
> PostgreSQL's internal consistency mechanisms, it is too unreliable an
> environment property. Why can't a counter be used for this instead?

This is nothing used for PostgreSQL's consistency. It is a vehicle
intended to be used to synchronize the "last update wins" decision
process of an asynchronous multimaster system. If not with a timestamp,
how would you make sure that the replication processes of two different
nodes will come to the same conclusion as to which update was last?
Especially considering that the replication might take place hours after
the original transaction happened.

Jan



From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Jan Wieck" <JanWieck(at)Yahoo(dot)com>
Cc: "PostgreSQL Development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-01-26 13:26:21
Message-ID: 1169817982.3772.315.camel@silverbirch.site
Lists: pgsql-hackers

On Thu, 2007-01-25 at 18:16 -0500, Jan Wieck wrote:

> To provide this data, I would like to add another "log" directory,
> pg_tslog. The files in this directory will be similar to the clog, but
> contain arrays of timestamptz values. On commit, the current system time
> will be taken. As long as this time is lower or equal to the last taken
> time in this PostgreSQL instance, the value will be increased by one
> microsecond. The resulting time will be added to the commit WAL record
> and written into the pg_tslog file.

A transaction time table/log has other uses as well, so it's fairly
interesting to have this.

> COMMIT [TRANSACTION] [WITH TIMESTAMP <timestamptz>];
>
> The extension is limited to superusers and will override the normally
> generated commit timestamp.

I don't think it's acceptable to override the normal timestamp. That
could lead to non-monotonic time values, which could screw up PITR. My
view is that you still need PITR even when you are using replication,
because the former provides recoverability and the latter provides
availability.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-01-26 13:44:58
Message-ID: 45BA05DA.5070607@Yahoo.com

On 1/26/2007 8:26 AM, Simon Riggs wrote:
> On Thu, 2007-01-25 at 18:16 -0500, Jan Wieck wrote:
>
>> To provide this data, I would like to add another "log" directory,
>> pg_tslog. The files in this directory will be similar to the clog, but
>> contain arrays of timestamptz values. On commit, the current system time
>> will be taken. As long as this time is lower or equal to the last taken
>> time in this PostgreSQL instance, the value will be increased by one
>> microsecond. The resulting time will be added to the commit WAL record
>> and written into the pg_tslog file.
>
> A transaction time table/log has other uses as well, so its fairly
> interesting to have this.
>
>> COMMIT [TRANSACTION] [WITH TIMESTAMP <timestamptz>];
>>
>> The extension is limited to superusers and will override the normally
>> generated commit timestamp.
>
> I don't think its acceptable to override the normal timestamp. That
> could lead to non monotonic time values which could screw up PITR. My
> view is that you still need PITR even when you are using replication,
> because the former provides recoverability and the latter provides
> availability.

Without that it is rendered useless for conflict resolution purposes.

The timestamp used does not necessarily have much to do with the real
time at commit, although I'd like it to be as close as possible. This
timestamp marks the age of the new datum in an update. Since the
replication is asynchronous, the update on the remote systems will
happen later, but the timestamp recorded with that datum must be the
timestamp of the original transaction, not the current time when it is
replicated remotely. All we have to determine that is the xmin in the
row's tuple header, so that xmin must resolve to the original
transaction's timestamp.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Naz Gassiep <naz(at)mira(dot)net>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-01-26 14:38:49
Message-ID: 20070126143849.GP24675@kenobi.snowman.net

* Jan Wieck (JanWieck(at)Yahoo(dot)com) wrote:
> On 1/26/2007 2:37 AM, Naz Gassiep wrote:
> >I would be *very* concerned that system time is not a guaranteed
> >monotonic entity. Surely a counter or other internally managed mechanism
> >would be a better solution.
>
> Such a counter has only "local" relevance. How do you plan to compare
> the two separate counters on different machines to tell which
> transaction happened last?

I'd also suggest you look into Lamport timestamps... Trusting the
system clock just isn't practical, even with NTP. I've developed
(albeit relatively small) systems using Lamport timestamps and would be
happy to talk about it offlist. I've probably got some code I could
share as well.
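For readers who haven't seen one, a classic Lamport clock is just a
counter that is bumped on every local event and fast-forwarded past any
counter value received from another node. A minimal sketch (hypothetical
illustration, not code from any system discussed here):

```python
class LamportClock:
    """Classic Lamport logical clock: a plain counter, no wall time."""

    def __init__(self):
        self.counter = 0

    def local_event(self):
        # Every local event gets a counter one higher than anything seen.
        self.counter += 1
        return self.counter

    def receive(self, remote_counter):
        # Fast-forward past the sender's counter so causality holds:
        # anything stamped after receiving is ordered after the send.
        self.counter = max(self.counter, remote_counter) + 1
        return self.counter
```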

Thanks,

Stephen


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>, Naz Gassiep <naz(at)mira(dot)net>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-01-26 15:10:56
Message-ID: 45BA1A00.2030300@dunslane.net

Stephen Frost wrote:
> I'd also suggest you look into Lamport timestamps... Trusting the
> system clock just isn't practical, even with NTP. I've developed
> (albeit relatively small) systems using Lamport timestamps and would be
> happy to talk about it offlist. I've probably got some code I could
> share as well.
>

That looks like what Oracle RAC uses:
http://www.lc.leidenuniv.nl/awcourse/oracle/rac.920/a96597/coord.htm

cheers

andrew


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>, Naz Gassiep <naz(at)mira(dot)net>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-01-26 15:53:47
Message-ID: 45BA240B.60304@Yahoo.com

On 1/26/2007 9:38 AM, Stephen Frost wrote:
> * Jan Wieck (JanWieck(at)Yahoo(dot)com) wrote:
>> On 1/26/2007 2:37 AM, Naz Gassiep wrote:
>> >I would be *very* concerned that system time is not a guaranteed
>> >monotonic entity. Surely a counter or other internally managed mechanism
>> >would be a better solution.
>>
>> Such a counter has only "local" relevance. How do you plan to compare
>> the two separate counters on different machines to tell which
>> transaction happened last?
>
> I'd also suggest you look into Lamport timestamps... Trusting the
> system clock just isn't practical, even with NTP. I've developed
> (albeit relatively small) systems using Lamport timestamps and would be
> happy to talk about it offlist. I've probably got some code I could
> share as well.

I think the system I described is a slightly modified Lamport generator.
You can consider the maximum timestamp of any row updated in this
transaction as the "counters received from other nodes". Then I make sure
that the next counter (timestamp) is higher than anything I know so far,
and I add a cluster-wide unique tie breaker to that.

Looking closer, I don't even have to check the timestamps of the rows
updated. Since a replicated remote transaction will bump the local
Lamport clock on commit, a local transaction modifying such a row will
have a timestamp later than that remote transaction's, even if my
local clock is limping behind.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Naz Gassiep <naz(at)mira(dot)net>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-01-27 12:26:14
Message-ID: 87k5z8oduh.fsf@stark.xeocode.com


Jan Wieck <JanWieck(at)Yahoo(dot)com> writes:

> I think the system I described is a slightly modified Lamport generator. The
> maximum timestamp of any row updated in this transaction, you can consider that
> the "counters received from other nodes". Then I make sure that the next
> counter (timestamp) is higher than anything I know so far, and I add
> cluster-wide unique tie breaker to that.

If you know all the timestamps in the system then you don't need timestamps at
all, just use a counter that you increment by one each time.

Isn't the whole reason people use timestamps that you don't have to
depend on atomically knowing every timestamp in the system? That way two
transactions can commit simultaneously on different systems and use the
timestamps to resolve conflicts later.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Naz Gassiep <naz(at)mira(dot)net>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-01-27 14:14:02
Message-ID: 45BB5E2A.3070103@Yahoo.com

On 1/27/2007 7:26 AM, Gregory Stark wrote:
> Jan Wieck <JanWieck(at)Yahoo(dot)com> writes:
>
>> I think the system I described is a slightly modified Lamport generator. The
>> maximum timestamp of any row updated in this transaction, you can consider that
>> the "counters received from other nodes". Then I make sure that the next
>> counter (timestamp) is higher than anything I know so far, and I add
>> cluster-wide unique tie breaker to that.
>
> If you know all the timestamps in the system then you don't need timestamps at
> all, just use a counter that you increment by one each time.
>
> Isn't the whole reason people use timestamps is so that you don't have to
> depend on atomically knowing every timestamp in the system? So two
> transactions can commit simultaneously on different systems and use the
> timestamps to resolve conflicts later.

This assumes that you never lose contact with the cluster or, if you do,
instantly stop all update activity, because you are at risk that the
counters diverge. This risk is much higher with a simple counter than
with a system clock that was in sync at the time of disconnect.

With all the disadvantages and the pain factor of an asynchronous
multimaster replication system comes one big advantage. You can continue
autonomously and let conflict resolution figure it out later.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Jim Nasby <decibel(at)decibel(dot)org>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-02 04:23:00
Message-ID: 8E577E67-51DD-4944-B33E-183AA6D081A2@decibel.org

On Jan 25, 2007, at 6:16 PM, Jan Wieck wrote:
> If a per database configurable tslog_priority is given, the
> timestamp will be truncated to milliseconds and the increment logic
> is done on milliseconds. The priority is added to the timestamp.
> This guarantees that no two timestamps for commits will ever be
> exactly identical, even across different servers.

Wouldn't it be better to just store that information separately,
rather than mucking with the timestamp?

Though, there's another issue here... I don't think NTP is good for
any better than a few milliseconds, even on a local network.

How exact does the conflict resolution need to be, anyway? Would it
really be a problem if transaction B committed 0.1 seconds after
transaction A yet the cluster thought it was the other way around?
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Jim Nasby <decibel(at)decibel(dot)org>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-03 20:52:50
Message-ID: 45C4F622.8000906@Yahoo.com

On 2/1/2007 11:23 PM, Jim Nasby wrote:
> On Jan 25, 2007, at 6:16 PM, Jan Wieck wrote:
>> If a per database configurable tslog_priority is given, the
>> timestamp will be truncated to milliseconds and the increment logic
>> is done on milliseconds. The priority is added to the timestamp.
>> This guarantees that no two timestamps for commits will ever be
>> exactly identical, even across different servers.
>
> Wouldn't it be better to just store that information separately,
> rather than mucking with the timestamp?
>
> Though, there's anothe issue here... I don't think NTP is good for
> any better than a few milliseconds, even on a local network.
>
> How exact does the conflict resolution need to be, anyway? Would it
> really be a problem if transaction B committed 0.1 seconds after
> transaction A yet the cluster thought it was the other way around?

Since the timestamp is basically a Lamport counter which is just bumped
by the clock as well, it doesn't need to be too precise.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Theo Schlossnagle <jesus(at)omniti(dot)com>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Theo Schlossnagle <jesus(at)omniti(dot)com>, Jim Nasby <decibel(at)decibel(dot)org>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-03 21:05:20
Message-ID: 4442A7DA-D715-4F15-B8C5-7A6FF8597DF2@omniti.com


On Feb 3, 2007, at 3:52 PM, Jan Wieck wrote:

> On 2/1/2007 11:23 PM, Jim Nasby wrote:
>> On Jan 25, 2007, at 6:16 PM, Jan Wieck wrote:
>>> If a per database configurable tslog_priority is given, the
>>> timestamp will be truncated to milliseconds and the increment
>>> logic is done on milliseconds. The priority is added to the
>>> timestamp. This guarantees that no two timestamps for commits
>>> will ever be exactly identical, even across different servers.
>> Wouldn't it be better to just store that information separately,
>> rather than mucking with the timestamp?
>> Though, there's anothe issue here... I don't think NTP is good
>> for any better than a few milliseconds, even on a local network.
>> How exact does the conflict resolution need to be, anyway? Would
>> it really be a problem if transaction B committed 0.1 seconds
>> after transaction A yet the cluster thought it was the other way
>> around?
>
> Since the timestamp is basically a Lamport counter which is just
> bumped be the clock as well, it doesn't need to be too precise.

Unless I'm missing something, you are _treating_ the counter as a
Lamport timestamp, when in fact it is not and thus does not provide
the semantics of a Lamport timestamp. As such, any algorithms that use
Lamport timestamps as a basis or assumption for the proof of their
correctness will not translate (provably) to this system.

How is your counter semantically equivalent to a Lamport timestamp?

// Theo Schlossnagle
// CTO -- http://www.omniti.com/~jesus/
// OmniTI Computer Consulting, Inc. -- http://www.omniti.com/


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Theo Schlossnagle <jesus(at)omniti(dot)com>
Cc: Jim Nasby <decibel(at)decibel(dot)org>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-03 21:38:51
Message-ID: 45C500EB.6060805@Yahoo.com

On 2/3/2007 4:05 PM, Theo Schlossnagle wrote:
> On Feb 3, 2007, at 3:52 PM, Jan Wieck wrote:
>
>> On 2/1/2007 11:23 PM, Jim Nasby wrote:
>>> On Jan 25, 2007, at 6:16 PM, Jan Wieck wrote:
>>>> If a per database configurable tslog_priority is given, the
>>>> timestamp will be truncated to milliseconds and the increment
>>>> logic is done on milliseconds. The priority is added to the
>>>> timestamp. This guarantees that no two timestamps for commits
>>>> will ever be exactly identical, even across different servers.
>>> Wouldn't it be better to just store that information separately,
>>> rather than mucking with the timestamp?
>>> Though, there's anothe issue here... I don't think NTP is good
>>> for any better than a few milliseconds, even on a local network.
>>> How exact does the conflict resolution need to be, anyway? Would
>>> it really be a problem if transaction B committed 0.1 seconds
>>> after transaction A yet the cluster thought it was the other way
>>> around?
>>
>> Since the timestamp is basically a Lamport counter which is just
>> bumped be the clock as well, it doesn't need to be too precise.
>
> Unless I'm missing something, you are _treating_ the counter as a
> Lamport timestamp, when in fact it is not and thus does not provide
> semantics of a Lamport timestamp. As such, any algorithms that use
> lamport timestamps as a basis or assumption for the proof of their
> correctness will not translate (provably) to this system.
>
> How are your counter semantically equivalent to Lamport timestamps?

Yes, you must be missing something.

The last used timestamp is remembered. When a remote transaction is
replicated, the remembered timestamp is set to max(remembered, remote).
For a local transaction, the remembered timestamp is set to
max(remembered+1ms, systemclock) and that value is used as the
transaction commit timestamp.
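That rule can be sketched in a few lines (Python used purely for
illustration; modelling the timestamps as plain millisecond integers and
injecting the clock are assumptions of this sketch, not part of the
proposal):

```python
import time

class CommitTimestampClock:
    """Sketch of the proposed commit-timestamp generator.

    Timestamps are milliseconds since the epoch (a simplification);
    a per-database tie breaker would be folded in separately.
    """

    def __init__(self, now_ms=None):
        self.remembered = 0
        # Allow injecting a clock for testing; default to the system clock.
        self.now_ms = now_ms or (lambda: int(time.time() * 1000))

    def on_remote_commit(self, remote_ts):
        # Replicating a remote transaction only bumps the remembered value.
        self.remembered = max(self.remembered, remote_ts)

    def on_local_commit(self):
        # A local commit gets max(remembered + 1ms, system clock), so the
        # result is strictly greater than anything seen so far, even if
        # the local system clock is limping behind.
        self.remembered = max(self.remembered + 1, self.now_ms())
        return self.remembered
```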

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Theo Schlossnagle <jesus(at)omniti(dot)com>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Theo Schlossnagle <jesus(at)omniti(dot)com>, Jim Nasby <decibel(at)decibel(dot)org>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-03 21:58:27
Message-ID: 4A2CFDE6-4965-42AB-BD2A-9DAA9B25FC36@omniti.com


On Feb 3, 2007, at 4:38 PM, Jan Wieck wrote:

> On 2/3/2007 4:05 PM, Theo Schlossnagle wrote:
>> On Feb 3, 2007, at 3:52 PM, Jan Wieck wrote:
>>> On 2/1/2007 11:23 PM, Jim Nasby wrote:
>>>> On Jan 25, 2007, at 6:16 PM, Jan Wieck wrote:
>>>>> If a per database configurable tslog_priority is given, the
>>>>> timestamp will be truncated to milliseconds and the increment
>>>>> logic is done on milliseconds. The priority is added to the
>>>>> timestamp. This guarantees that no two timestamps for commits
>>>>> will ever be exactly identical, even across different servers.
>>>> Wouldn't it be better to just store that information
>>>> separately, rather than mucking with the timestamp?
>>>> Though, there's anothe issue here... I don't think NTP is good
>>>> for any better than a few milliseconds, even on a local network.
>>>> How exact does the conflict resolution need to be, anyway?
>>>> Would it really be a problem if transaction B committed 0.1
>>>> seconds after transaction A yet the cluster thought it was the
>>>> other way around?
>>>
>>> Since the timestamp is basically a Lamport counter which is just
>>> bumped be the clock as well, it doesn't need to be too precise.
>> Unless I'm missing something, you are _treating_ the counter as a
>> Lamport timestamp, when in fact it is not and thus does not
>> provide semantics of a Lamport timestamp. As such, any
>> algorithms that use lamport timestamps as a basis or assumption
>> for the proof of their correctness will not translate (provably)
>> to this system.
>> How are your counter semantically equivalent to Lamport timestamps?
>
> Yes, you must be missing something.
>
> The last used timestamp is remembered. When a remote transaction is
> replicated, the remembered timestamp is set to max(remembered,
> remote). For a local transaction, the remembered timestamp is set
> to max(remembered+1ms, systemclock) and that value is used as the
> transaction commit timestamp.

A Lamport clock, IIRC, requires a cluster-wide tick. This seems based
only on activity and is thus only an observational tick, which means
various nodes can have various perspectives at different times.

Given that time skew is prevalent, why is the system clock involved
at all?

As is usual with distributed systems problems, they are very hard to
explain casually and also hard to review from a theoretical angle
without a proof. Are you basing this off a paper? If so which one?
If not, have you written a rigorous proof of correctness for this
approach?

// Theo Schlossnagle
// CTO -- http://www.omniti.com/~jesus/
// OmniTI Computer Consulting, Inc. -- http://www.omniti.com/


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Theo Schlossnagle <jesus(at)omniti(dot)com>
Cc: Jim Nasby <decibel(at)decibel(dot)org>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-03 22:09:00
Message-ID: 45C507FC.4010405@Yahoo.com

On 2/3/2007 4:58 PM, Theo Schlossnagle wrote:
> On Feb 3, 2007, at 4:38 PM, Jan Wieck wrote:
>
>> On 2/3/2007 4:05 PM, Theo Schlossnagle wrote:
>>> On Feb 3, 2007, at 3:52 PM, Jan Wieck wrote:
>>>> On 2/1/2007 11:23 PM, Jim Nasby wrote:
>>>>> On Jan 25, 2007, at 6:16 PM, Jan Wieck wrote:
>>>>>> If a per database configurable tslog_priority is given, the
>>>>>> timestamp will be truncated to milliseconds and the increment
>>>>>> logic is done on milliseconds. The priority is added to the
>>>>>> timestamp. This guarantees that no two timestamps for commits
>>>>>> will ever be exactly identical, even across different servers.
>>>>> Wouldn't it be better to just store that information
>>>>> separately, rather than mucking with the timestamp?
>>>>> Though, there's anothe issue here... I don't think NTP is good
>>>>> for any better than a few milliseconds, even on a local network.
>>>>> How exact does the conflict resolution need to be, anyway?
>>>>> Would it really be a problem if transaction B committed 0.1
>>>>> seconds after transaction A yet the cluster thought it was the
>>>>> other way around?
>>>>
>>>> Since the timestamp is basically a Lamport counter which is just
>>>> bumped be the clock as well, it doesn't need to be too precise.
>>> Unless I'm missing something, you are _treating_ the counter as a
>>> Lamport timestamp, when in fact it is not and thus does not
>>> provide semantics of a Lamport timestamp. As such, any
>>> algorithms that use lamport timestamps as a basis or assumption
>>> for the proof of their correctness will not translate (provably)
>>> to this system.
>>> How are your counter semantically equivalent to Lamport timestamps?
>>
>> Yes, you must be missing something.
>>
>> The last used timestamp is remembered. When a remote transaction is
>> replicated, the remembered timestamp is set to max(remembered,
>> remote). For a local transaction, the remembered timestamp is set
>> to max(remembered+1ms, systemclock) and that value is used as the
>> transaction commit timestamp.
>
> A Lamport clock, IIRC, require a cluster wide tick. This seems based
> only on activity and is thus an observational tick only which means
> various nodes can have various perspectives at different times.
>
> Given that time skew is prevalent, why is the system clock involved
> at all?

This question was already answered.

> As is usual distributed systems problems, they are very hard to
> explain casually and also hard to review from a theoretical angle
> without a proof. Are you basing this off a paper? If so which one?
> If not, have you written a rigorous proof of correctness for this
> approach?

I don't have any such paper and the proof of concept will be the
implementation of the system. I do, however, see enough resistance
against this proposal to withdraw the commit timestamp at this time. The
new replication system will therefore require the installation of a
patched, non-standard PostgreSQL version, compiled from sources
cluster-wide in order to be used. I am aware that this will dramatically
reduce its popularity, but it is impossible to develop this essential
feature as an external module.

I thank everyone for their attention.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Theo Schlossnagle <jesus(at)omniti(dot)com>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Theo Schlossnagle <jesus(at)omniti(dot)com>, Jim Nasby <decibel(at)decibel(dot)org>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-03 22:19:15
Message-ID: A6D9F343-F4D7-4A0A-8794-FE42235A39A8@omniti.com


On Feb 3, 2007, at 5:09 PM, Jan Wieck wrote:

> On 2/3/2007 4:58 PM, Theo Schlossnagle wrote:
>
> I don't have any such paper and the proof of concept will be the
> implementation of the system. I do however see enough resistance
> against this proposal to withdraw the commit timestamp at this
> time. The new replication system will therefore require the
> installation of a patched, non-standard PostgreSQL version,
> compiled from sources cluster wide in order to be used. I am aware
> that this will dramatically reduce it's popularity but it is
> impossible to develop this essential feature as an external module.
>
> I thank everyone for their attention.

Actually, I believe the commit timestamp stuff would be very useful
in general. I would certainly like to see rigorous proofs of any
multi-master replication technology built on top of them. I believe
that while your replication stuff might rely on the commit
timestamps, the commit timestamps rely on nothing else (except the work
that you have been churning on).

Using commit timestamps, one can easily implement cross-vendor
database replication. These can be used to implement something like
trigger-selective redo logs. I think they can be used to produce DML
logs that will require a lot less accounting to manage replicating
tables from PostgreSQL into another database (like Oracle or MySQL).

// Theo Schlossnagle
// CTO -- http://www.omniti.com/~jesus/
// OmniTI Computer Consulting, Inc. -- http://www.omniti.com/


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Theo Schlossnagle <jesus(at)omniti(dot)com>, Jim Nasby <decibel(at)decibel(dot)org>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-03 22:20:45
Message-ID: 200702032220.l13MKjo25268@momjian.us

Jan Wieck wrote:
> I don't have any such paper and the proof of concept will be the
> implementation of the system. I do however see enough resistance against
> this proposal to withdraw the commit timestamp at this time. The new
> replication system will therefore require the installation of a patched,
> non-standard PostgreSQL version, compiled from sources cluster wide in
> order to be used. I am aware that this will dramatically reduce it's
> popularity but it is impossible to develop this essential feature as an
> external module.
>
> I thank everyone for their attention.

Going and working on it on your own doesn't seem like the proper
solution. I don't see people objecting to adding it, but they want it
to work, which I am sure you want too. You have to show how it will work
and convince others of that, and then you have a higher chance it will
work and be in the PostgreSQL codebase.

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Theo Schlossnagle <jesus(at)omniti(dot)com>, Jim Nasby <decibel(at)decibel(dot)org>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-04 05:04:54
Message-ID: 45C56976.7060508@Yahoo.com

On 2/3/2007 5:20 PM, Bruce Momjian wrote:
> Jan Wieck wrote:
>> I don't have any such paper and the proof of concept will be the
>> implementation of the system. I do however see enough resistance against
>> this proposal to withdraw the commit timestamp at this time. The new
>> replication system will therefore require the installation of a patched,
>> non-standard PostgreSQL version, compiled from sources cluster wide in
>> order to be used. I am aware that this will dramatically reduce it's
>> popularity but it is impossible to develop this essential feature as an
>> external module.
>>
>> I thank everyone for their attention.
>
> Going and working on it on your own doesn't seem like the proper
> solution. I don't see people objecting to adding it, but they want it
> work, which I am sure you want too. You have to show how it will work
> and convince others of that, and then you have a higher chance it will
> work, and be in the PostgreSQL codebase.

Bruce,

I think I have explained in sufficient detail how this Lamport
timestamp will be unique and ever increasing, with the node's ID being
used as a tie breaker. The only thing important for "last update wins"
conflict resolution is that whatever timestamp you have associated with
a row, the update you do to it must be associated with a later timestamp
so that all other nodes will overwrite the data. If a third node gets
the two updates out of order, it will apply the second node's update,
and since the row it has then carries a later timestamp than the first
update arriving late, it will throw away that information. All nodes are
in sync again.
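The out-of-order scenario reduces to a tiny last-update-wins merge,
which can be sketched as follows (a hypothetical toy illustration, not
code from any replication system; rows are modelled as (timestamp,
value) pairs):

```python
def apply_update(row, update):
    """Last-update-wins: keep whichever version has the later timestamp.

    Timestamps are the globally unique commit timestamps, so exact
    ties cannot occur.
    """
    return update if update[0] > row[0] else row

# Two updates to the same row reach a third node in either order:
u1 = (100, "from node 1")   # earlier commit
u2 = (200, "from node 2")   # later commit

row_in_order = apply_update(apply_update((0, None), u1), u2)
row_reversed = apply_update(apply_update((0, None), u2), u1)

# Both orders converge on the later update; the nodes end up in sync.
assert row_in_order == row_reversed == u2
```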

This is all that is needed for last update wins resolution. And as said
before, the only reason the clock is involved in this is so that nodes
can continue autonomously when they lose connection without conflict
resolution going crazy later on, which it would do if they were simple
counters. It doesn't require microsecond synchronized clocks and the
system clock isn't just used as a Lamport timestamp.

The problem, it seems to me, is that people want a full-scale proof of
concept for the whole multimaster replication system I'm planning instead
of thinking in isolation about this one aspect, its intended use case,
and other possible uses for it (like table logging). And we all know that
that discussion will take us way beyond the 8.3 feature freeze date, so
the whole thing will never get done.

I don't want to work on this on my own and I sure would prefer it to be
a default PostgreSQL feature. As said, I have learned some things from
Slony-I. One of them is that I will not go through any more ugly
workarounds in order to avoid requiring a patched backend. If the features I
really need aren't going to be in the default codebase, people will have
to install from patched sources.

Finally, again, Slony-I could have well used this feature. With a
logical commit timestamp, I would have never even thought about that
other wart called xxid. It would have all been sooo much easier.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Jan Wieck <JanWieck(at)yahoo(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Theo Schlossnagle <jesus(at)omniti(dot)com>, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-04 08:16:42
Message-ID: 200702040916.44380.peter_e@gmx.net
Lists: pgsql-hackers

Jan Wieck wrote:
> This is all that is needed for last update wins resolution. And as
> said before, the only reason the clock is involved in this is so that
> nodes can continue autonomously when they lose connection without
> conflict resolution going crazy later on, which it would do if they
> were simple counters. It doesn't require microsecond synchronized
> clocks and the system clock isn't just used as a Lamport timestamp.

Earlier you said that "one assumption is that all servers in the
multimaster cluster are ntp synchronized", which already rang alarm
bells for me. Now that I read this, you appear to require
synchronization not at the microsecond level, but still at some level. I
think that would be pretty hard for an administrator to manage, seeing
that NTP typically cannot provide such guarantees.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org, Bruce Momjian <bruce(at)momjian(dot)us>, Theo Schlossnagle <jesus(at)omniti(dot)com>, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-04 15:06:27
Message-ID: 45C5F673.2070307@Yahoo.com
Lists: pgsql-hackers

On 2/4/2007 3:16 AM, Peter Eisentraut wrote:
> Jan Wieck wrote:
>> This is all that is needed for last update wins resolution. And as
>> said before, the only reason the clock is involved in this is so that
>> nodes can continue autonomously when they lose connection without
>> conflict resolution going crazy later on, which it would do if they
>> were simple counters. It doesn't require microsecond synchronized
>> clocks and the system clock isn't just used as a Lamport timestamp.
>
> Earlier you said that "one assumption is that all servers in the
> multimaster cluster are ntp synchronized", which already rung the alarm
> bells in me. Now that I read this you appear to require
> synchronization not on the microsecond level but on some level. I
> think that would be pretty hard to manage for an administrator, seeing
> that NTP typically cannot provide such guarantees.

Synchronization to some degree is wanted to avoid totally unexpected
behavior. The conflict resolution algorithm itself can live perfectly
well with plain counters, but I guess you wouldn't want the result.
Suppose you update a record on one node, then 10 minutes later update
the same record on another node. Unfortunately, the nodes had no
communication, and because the first node is much busier, its counter is
way ahead ... this would mean the update made 10 minutes later would get
lost in conflict resolution when the nodes reestablish communication.
They would have the same data in the end, just not what any sane person
would expect.

This behavior kicks in whenever the cross-node conflicting updates
happen close enough together that the time difference between the clocks
can affect the outcome. So if you update the logically same row on two
nodes within a tenth of a second, and the clocks are more than that
apart, conflict resolution can result in the older row surviving. Clock
synchronization is simply used to minimize this.

The system clock is used only to keep the counters somewhat synchronized
in the case of connection loss to retain some degree of "last update"
meaning. Without that, continuing autonomously during a network outage
is just not practical.
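The original proposal describes the commit timestamp rule that makes this work: take the system time at commit, and if it is lower than or equal to the last value handed out on this instance, bump by one microsecond instead. A minimal sketch of that rule, using integer microseconds for brevity (the class and method names are invented):

```python
class CommitClock:
    """Sketch of the proposed rule: commit timestamps track the system
    clock but are forced to be strictly monotonically increasing."""

    def __init__(self):
        self.last = 0  # last timestamp handed out, in integer microseconds

    def next_commit_ts(self, system_time_us):
        # Use the system time unless it would not advance past the last
        # value; in that case advance by one microsecond.
        ts = max(system_time_us, self.last + 1)
        self.last = ts
        return ts

clock = CommitClock()
# A burst of commits within the same microsecond stays strictly increasing:
stamps = [clock.next_commit_ts(1_000_000) for _ in range(3)]
assert stamps == [1_000_000, 1_000_001, 1_000_002]
# Once the system clock moves ahead, the counter snaps back to real time:
assert clock.next_commit_ts(2_000_000) == 2_000_000
```

This is exactly the "system clock as a lower bound for a counter" behavior: during normal operation the stamps are wall-clock time, and under bursts or clock stalls they degrade gracefully into a counter.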

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Theo Schlossnagle <jesus(at)omniti(dot)com>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Theo Schlossnagle <jesus(at)omniti(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Bruce Momjian <bruce(at)momjian(dot)us>, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-04 15:53:32
Message-ID: 051DCCC6-7D83-4AA1-B6F6-4035836E56A4@omniti.com
Lists: pgsql-hackers


On Feb 4, 2007, at 10:06 AM, Jan Wieck wrote:

> On 2/4/2007 3:16 AM, Peter Eisentraut wrote:
>> Jan Wieck wrote:
>>> This is all that is needed for last update wins resolution. And as
>>> said before, the only reason the clock is involved in this is so
>>> that
>>> nodes can continue autonomously when they lose connection without
>>> conflict resolution going crazy later on, which it would do if they
>>> were simple counters. It doesn't require microsecond synchronized
>>> clocks and the system clock isn't just used as a Lamport timestamp.
>> Earlier you said that "one assumption is that all servers in the
>> multimaster cluster are ntp synchronized", which already rung the
>> alarm bells in me. Now that I read this you appear to require
>> synchronization not on the microsecond level but on some level. I
>> think that would be pretty hard to manage for an administrator,
>> seeing that NTP typically cannot provide such guarantees.
>
> Synchronization to some degree is wanted to avoid totally
> unexpected behavior. The conflict resolution algorithm itself can
> perfectly fine live with counters, but I guess you wouldn't want
> the result of it. If you update a record on one node, then 10
> minutes later you update the same record on another node.
> Unfortunately, the nodes had no communication and because the first
> node is much busier, its counter is way advanced ... this would
> mean the 10 minutes later update would get lost in the conflict
> resolution when the nodes reestablish communication. They would
> have the same data at the end, just not what any sane person would
> expect.
>
> This behavior will kick in whenever the cross node conflicting
> updates happen close enough so that the time difference between the
> clocks can affect it. So if you update the logical same row on two
> nodes within a tenth of a second, and the clocks are more than that
> apart, the conflict resolution can result in the older row to
> survive. Clock synchronization is simply used to minimize this.
>
> The system clock is used only to keep the counters somewhat
> synchronized in the case of connection loss to retain some degree
> of "last update" meaning. Without that, continuing autonomously
> during a network outage is just not practical.

A Lamport clock addresses this. It relies on a cluster-wide clock
tick. While it could be based on the system clock, it would not be
based on more than one clock. The point of the Lamport clock is that
there is _a_ clock, not multiple ones.

One concept is to have a universal clock that ticks forward (say,
every second) and have each node order all of its transactions inside
the second-granular tick. Each commit would then be stamped {node,
clocksecond, txn#}, and each time the clock ticks forward, txn# is
reset to zero. This gives you ordered txns, windowed within some
cluster-wide acceptable window (1 second). However, this is totally
broken with NTP as the tick source, as NTP is entirely insufficient for
this purpose because of a variety of forms of clock skew. As such, the
timestamp should be incremented via cluster consensus (a token ring or
the pulse generated by the leader of the current cluster membership
quorum).
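The {node, clocksecond, txn#} scheme can be sketched as follows. Note that for a total order the consensus tick has to sort before the node id; all names here are hypothetical, and the consensus mechanism itself is stubbed out:

```python
# Sketch of a per-node commit stamp driven by a cluster-wide tick.
# on_cluster_tick() would be called from consensus (token ring / leader
# pulse), not from the local system clock.

class TickClock:
    def __init__(self, node_id):
        self.node_id = node_id
        self.tick = 0   # cluster-wide second-granular tick
        self.txn = 0    # per-node counter, reset on every tick

    def on_cluster_tick(self, tick):
        self.tick = tick
        self.txn = 0

    def stamp_commit(self):
        # Tick first so that stamps order by window, then node, then txn#.
        stamp = (self.tick, self.node_id, self.txn)
        self.txn += 1
        return stamp

a, b = TickClock("A"), TickClock("B")
a.on_cluster_tick(1); b.on_cluster_tick(1)
s1, s2, s3 = a.stamp_commit(), a.stamp_commit(), b.stamp_commit()
a.on_cluster_tick(2)
s4 = a.stamp_commit()
# Commits are totally ordered, windowed within one cluster-wide tick:
assert sorted([s4, s3, s2, s1]) == [(1, "A", 0), (1, "A", 1), (1, "B", 0), (2, "A", 0)]
```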

As the clock must be incremented cluster-wide, the need for it to be
in sync with the system clock (on any or all of the systems) is
obviated. In fact, the inability to guarantee that synchronicity means
it can be confusing -- one expects a time-based clock to be accurate to
the time. A counter-based clock carries no such expectations.

// Theo Schlossnagle
// CTO -- http://www.omniti.com/~jesus/
// OmniTI Computer Consulting, Inc. -- http://www.omniti.com/


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Theo Schlossnagle" <jesus(at)omniti(dot)com>
Cc: "Jan Wieck" <JanWieck(at)Yahoo(dot)com>, "Peter Eisentraut" <peter_e(at)gmx(dot)net>, <pgsql-hackers(at)postgresql(dot)org>, "Bruce Momjian" <bruce(at)momjian(dot)us>, "Jim Nasby" <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-04 16:34:00
Message-ID: 87y7nd287b.fsf@stark.xeocode.com
Lists: pgsql-hackers

"Theo Schlossnagle" <jesus(at)omniti(dot)com> writes:

> As the clock must be incremented clusterwide, the need for it to be insync with
> the system clock (on any or all of the systems) is obviated. In fact, as you
> can't guarantee the synchronicity means that it can be confusing -- one
> expects a time-based clock to be accurate to the time. A counter-based clock
> has no such expectations.

So if the nodes get split they can keep operating independently but clients
can see that there's no guarantee of ordering against transactions from other
nodes because the clock isn't advancing?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Theo Schlossnagle <jesus(at)omniti(dot)com>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Bruce Momjian <bruce(at)momjian(dot)us>, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-04 18:36:03
Message-ID: 45C62793.5000607@Yahoo.com
Lists: pgsql-hackers

On 2/4/2007 10:53 AM, Theo Schlossnagle wrote:
> As the clock must be incremented clusterwide, the need for it to be
> insync with the system clock (on any or all of the systems) is
> obviated. In fact, as you can't guarantee the synchronicity means
> that it can be confusing -- one expects a time-based clock to be
> accurate to the time. A counter-based clock has no such expectations.

For the fourth time: the clock is in the mix to allow nodes to continue
during a network outage. All your arguments seem to assume 100% network
uptime. There will be no cluster-wide clock or cluster-wide increment
when you lose connectivity. How does your idea cope with that?

Obviously the counters will immediately drift apart, based on the
transaction load of the nodes, as soon as the network goes down. And in
order to avoid this "clock" confusion and wrong expectation, you'd
rather have a system with such a simple, non-clock based counter and
accept that it starts behaving totally wonky when the cluster reconnects
after a network outage? I'd rather confuse a few people than have a last
update wins conflict resolution that basically rolls dice to determine
"last".

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Theo Schlossnagle <jesus(at)omniti(dot)com>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Theo Schlossnagle <jesus(at)omniti(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Bruce Momjian <bruce(at)momjian(dot)us>, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-04 19:04:27
Message-ID: E481BDD9-11A3-406B-8B3C-E04DE26A3350@omniti.com
Lists: pgsql-hackers


On Feb 4, 2007, at 1:36 PM, Jan Wieck wrote:

> On 2/4/2007 10:53 AM, Theo Schlossnagle wrote:
>> As the clock must be incremented clusterwide, the need for it to
>> be insync with the system clock (on any or all of the systems)
>> is obviated. In fact, as you can't guarantee the synchronicity
>> means that it can be confusing -- one expects a time-based clock
>> to be accurate to the time. A counter-based clock has no such
>> expectations.
>
> For the fourth time, the clock is in the mix to allow to continue
> during a network outage. All your arguments seem to assume 100%
> network uptime. There will be no clusterwide clock or clusterwide
> increment when you lose connection. How does your idea cope with that?

That's exactly what a quorum algorithm is for.

> Obviously the counters will immediately drift apart based on the
> transaction load of the nodes as soon as the network goes down. And
> in order to avoid this "clock" confusion and wrong expectation,
> you'd rather have a system with such a simple, non-clock based
> counter and accept that it starts behaving totally wonky when the
> cluster reconnects after a network outage? I rather confuse a few
> people than having a last update wins conflict resolution that
> basically rolls dice to determine "last".

If your cluster partitions and you have hours of independent activity,
and upon merge you apply a conflict resolution algorithm that undoes
large portions of the last several hours of work on the nodes, you
wouldn't call that "wonky"?

For sane disconnected (or more generally, partitioned) operation in
multi-master environments, a quorum for the dataset must be
established. Now, one can consider the "database" to be the
dataset. So, on network partitions those in "the" quorum are allowed
to progress with data modification and others only read. However,
there is no reason why the dataset _must_ be the database and that
multiple datasets _must_ share the same quorum algorithm. You could
easily classify certain tables or schema or partitions into a
specific dataset and apply a suitable quorum algorithm to that and a
different quorum algorithm to other disjoint data sets.

// Theo Schlossnagle
// CTO -- http://www.omniti.com/~jesus/
// OmniTI Computer Consulting, Inc. -- http://www.omniti.com/


From: "Zeugswetter Andreas ADI SD" <ZeugswetterA(at)spardat(dot)at>
To: "Theo Schlossnagle" <jesus(at)omniti(dot)com>
Cc: "Peter Eisentraut" <peter_e(at)gmx(dot)net>, <pgsql-hackers(at)postgresql(dot)org>, "Bruce Momjian" <bruce(at)momjian(dot)us>, "Jim Nasby" <decibel(at)decibel(dot)org>, "Jan Wieck" <JanWieck(at)Yahoo(dot)com>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-05 11:20:25
Message-ID: E1539E0ED7043848906A8FF995BDA57901B7D76A@m0143.s-mxs.net
Lists: pgsql-hackers


> One concept is to have a univeral clock that ticks forward (like
> every second) and each node orders all their transactions inside the
> second-granular tick. Then each commit would be like: {node,
> clocksecond, txn#} and each time the clock ticks forward, txn# is
> reset to zero. This gives you ordered txns that windowed in some
> cluster-wide acceptable window (1 second). However, this is totally
> broken as NTP is entirely insufficient for this purpose because of a
> variety of forms of clock skew. As such, the timestamp should be
> incremented via cluster consensus (one token ring or the pulse
> generated by the leader of the current cluster membership quorom).

I think you are completely ignoring practicability. Or are you saying
that such a system exists and works for, e.g., a loosely connected group
of laptop field agents that only sporadically have a connection to the
cluster?

I think Jan's definition gives a pragmatic solution to the problem,
and will be able to give "good" automatic conflict resolution.

It has the downsides he stated, and cannot guarantee 100% correct
automatic conflict resolution in case of connection loss, but I am quite
sure you are not able to do better without losing yourself in theory.

E.g., assuming all clocks vary by no more than 30 seconds when
disconnected, you can require manual (or rule-based) resolution of all
conflicts whose timestamps are less than 1 minute apart.
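That rule of thumb is easy to state as code: with a known skew bound, only auto-resolve conflicts whose timestamps are clearly farther apart than the bound, and escalate the rest. A hedged sketch; the constants follow the example above, and all names are invented:

```python
# Assumption: clocks drift by at most MAX_SKEW_S seconds while
# disconnected, so "last" is only meaningful beyond that window.

MAX_SKEW_S = 30
MANUAL_THRESHOLD_S = 2 * MAX_SKEW_S  # conflicts closer than this escalate

def resolve_or_escalate(ts_a, ts_b):
    """Return 'a', 'b', or 'manual' for a conflicting pair of commits."""
    if abs(ts_a - ts_b) < MANUAL_THRESHOLD_S:
        return "manual"  # within possible skew: ordering is unreliable
    return "a" if ts_a > ts_b else "b"

assert resolve_or_escalate(100, 130) == "manual"  # 30 s apart: ambiguous
assert resolve_or_escalate(100, 200) == "b"       # clearly later wins
```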

Andreas


From: Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-05 22:38:17
Message-ID: 20070205223817.GC24413@phlogiston.dyndns.org
Lists: pgsql-hackers

On Sun, Feb 04, 2007 at 01:36:03PM -0500, Jan Wieck wrote:
> For the fourth time, the clock is in the mix to allow to continue during
> a network outage. All your arguments seem to assume 100% network uptime.
> There will be no clusterwide clock or clusterwide increment when you
> lose connection. How does your idea cope with that?

I'm wondering whether a combined approach is needed. This makes
things more complicated, but what if you somehow coordinate local
counters with shared clock ticks? When your talk to the shared clock
fails, you regard yourself as being in some sort of failure mode (you're
going to need softfails and that sort of thing, and yes, I'm flapping my
hands in the air at the moment). At rejoin to the cluster, you need some
way to publish "here's the counter and the last global time I had" and
"here's my current counter". You can publish local time with this too, I
guess, to solve the conflict cases, but that seems like the sort of
decision that needs to be pushed down to the policy level.

A

--
Andrew Sullivan | ajs(at)crankycanuck(dot)ca
The whole tendency of modern prose is away from concreteness.
--George Orwell


From: Markus Schiltknecht <markus(at)bluegap(dot)ch>
To: Theo Schlossnagle <jesus(at)omniti(dot)com>
Cc: Jan Wieck <JanWieck(at)Yahoo(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Bruce Momjian <bruce(at)momjian(dot)us>, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-06 07:36:31
Message-ID: 45C82FFF.1010605@bluegap.ch
Lists: pgsql-hackers

Hi,

Theo Schlossnagle wrote:
> On Feb 4, 2007, at 1:36 PM, Jan Wieck wrote:
>> Obviously the counters will immediately drift apart based on the
>> transaction load of the nodes as soon as the network goes down. And in
>> order to avoid this "clock" confusion and wrong expectation, you'd
>> rather have a system with such a simple, non-clock based counter and
>> accept that it starts behaving totally wonky when the cluster
>> reconnects after a network outage? I rather confuse a few people than
>> having a last update wins conflict resolution that basically rolls
>> dice to determine "last".
>
> If your cluster partition and you have hours of independent action and
> upon merge you apply a conflict resolution algorithm that has enormous
> effect undoing portions of the last several hours of work on the nodes,
> you wouldn't call that "wonky?"

You are talking about different things. Async replication, as Jan is
planning to do it, is per se "wonky", because you have to cope with
conflicts by definition. And you have to resolve them by late-aborting a
transaction (i.e. after a commit). Or to put it another way: async MM
replication means continuing in disconnected mode (w/o a quorum or some
such) and trying to reconcile later on. It should not matter if the
delay is just some milliseconds of network latency or three days (except
of course that you probably have more data to reconcile).

> For sane disconnected (or more generally, partitioned) operation in
> multi-master environments, a quorum for the dataset must be
> established. Now, one can consider the "database" to be the dataset.
> So, on network partitions those in "the" quorum are allowed to progress
> with data modification and others only read.

You can do this to *prevent* conflicts, but that clearly belongs to the
world of sync replication. I'm doing this in Postgres-R: in case of
network partitioning, only a primary partition may continue to process
writing transactions. For async replication, it does not make sense to
prevent conflicts while disconnected; async is meant to cope with
conflicts, so as to be independent of network latency.

> However, there is no
> reason why the dataset _must_ be the database and that multiple datasets
> _must_ share the same quorum algorithm. You could easily classify
> certain tables or schema or partitions into a specific dataset and apply
> a suitable quorum algorithm to that and a different quorum algorithm to
> other disjoint data sets.

I call that partitioning (among nodes). And it's applicable to sync as
well as async replication, while it makes more sense in sync replication.

What I'm more concerned about, with Jan's proposal, is the assumption
that you always want to resolve conflicts by time (except for balances,
for which we don't have much information, yet). I'd rather say that time
does not matter much if your nodes are disconnected. And (especially in
async replication) you should prevent your clients from committing to
one node and then reading from another, expecting to find your data
there. So why resolve by time? It only makes the user think you could
guarantee that order, but you certainly cannot.

Regards

Markus


From: "Zeugswetter Andreas ADI SD" <ZeugswetterA(at)spardat(dot)at>
To: "Markus Schiltknecht" <markus(at)bluegap(dot)ch>, "Theo Schlossnagle" <jesus(at)omniti(dot)com>
Cc: "Jan Wieck" <JanWieck(at)Yahoo(dot)com>, "Peter Eisentraut" <peter_e(at)gmx(dot)net>, <pgsql-hackers(at)postgresql(dot)org>, "Bruce Momjian" <bruce(at)momjian(dot)us>, "Jim Nasby" <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-06 09:25:29
Message-ID: E1539E0ED7043848906A8FF995BDA57901C12E1A@m0143.s-mxs.net
Lists: pgsql-hackers


> What I'm more concerned about, with Jan's proposal, is the assumption
> that you always want to resolve conflicts by time (except for
> balances,
> for which we don't have much information, yet). I'd rather

Um, I think the proposal was only for backend functionality beneficial
to replication in general and to time-based conflict resolution. And
"time based" is surely one of the important conflict resolution methods
for async MM replication.

Sure, there are others, like "rule based" or "priority based", but I
think you don't need additional backend functionality for those.

Andreas


From: Markus Schiltknecht <markus(at)bluegap(dot)ch>
To: Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>
Cc: Theo Schlossnagle <jesus(at)omniti(dot)com>, Jan Wieck <JanWieck(at)Yahoo(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Bruce Momjian <bruce(at)momjian(dot)us>, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-06 16:44:33
Message-ID: 45C8B071.8090401@bluegap.ch
Lists: pgsql-hackers

Hi,

Zeugswetter Andreas ADI SD wrote:
> And "time based"
> is surely one of the important conflict resolution methods for async MM
> replication.

That's what I'm questioning. Wouldn't any other deterministic, but
seemingly random, abort decision be as clever as time-based conflict
resolution? It would then be clear to the user that it's random, and not
some "in most cases time based, but not in others, and only if..." thing.

> Sure there are others, like "rule based" "priority based" but I think
> you don't need additional backend functionality for those.

Got the point, yes. I'm impatient, sorry.

Nevertheless, I'm questioning whether it is worth adding backend
functionality for that. And given that this probably is the most wanted
resolution method, the question might be "heretical". You could also
see it as a sort of user-education question: don't favor time-based
resolution if that's the resolution method with the most traps.

Regards

Markus


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Markus Schiltknecht <markus(at)bluegap(dot)ch>
Cc: Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>, Theo Schlossnagle <jesus(at)omniti(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Bruce Momjian <bruce(at)momjian(dot)us>, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-07 00:29:11
Message-ID: 45C91D57.3020403@Yahoo.com
Lists: pgsql-hackers

On 2/6/2007 11:44 AM, Markus Schiltknecht wrote:
> Hi,
>
> Zeugswetter Andreas ADI SD wrote:
>> And "time based"
>> is surely one of the important conflict resolution methods for async MM
>> replication.
>
> That's what I'm questioning. Wouldn't any other deterministic, but
> seemingly random abort decision be as clever as time based conflict
> resolution? It would then be clear to the user that it's random and not
> some "in most cases time based, but no in others and only if..." thing.
>
>> Sure there are others, like "rule based" "priority based" but I think
>> you don't need additional backend functionality for those.
>
> Got the point, yes. I'm impatient, sorry.
>
> Neither the less, I'm questioning if is it worth adding backend
> functionality for that. And given this probably is the most wanted
> resolution method, this question might be "heretical". You could also
> see it as sort of an user educating question: don't favor time based
> resolution if that's the one resolution method with the most traps.

These are all very good suggestions for additional conflict
resolution mechanisms that solve one problem or another. As we have
said for years now, one size will not fit all. What I am after for the
moment is a system that by default supports last update wins on the
row level, where "last update" certainly is a little fuzzy, but not by
minutes. Plus balance type columns. A balance column is not propagated
as a new value, but as a delta between the old and the new value. All
replicas will apply the delta to that column regardless of whether the
replication info is newer or older than the existing row. That way,
literal value type columns (like an address) will cluster-wide maintain
the value of the last update to the row, while balance type columns will
cluster-wide maintain the sum of all changes.
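The two column behaviors can be sketched side by side. This is an illustration of the described semantics only (the row encoding and function names are invented): delta columns apply unconditionally and commute, while literal columns follow last update wins.

```python
# Sketch: "balance type" columns replicate as deltas and always apply,
# so every node converges on the sum of all changes; "literal" columns
# only apply when the incoming change is newer than the local row.

def apply_change(row, change):
    for col, delta in change["deltas"].items():
        row[col] += delta                  # deltas ignore ordering
    if change["ts"] > row["ts"]:           # literals: last update wins
        row.update(change["literals"])
        row["ts"] = change["ts"]
    return row

row = {"ts": 0, "address": "old", "balance": 100}
c1 = {"ts": 2, "literals": {"address": "new"}, "deltas": {"balance": -30}}
c2 = {"ts": 1, "literals": {"address": "stale"}, "deltas": {"balance": +50}}

# Applied out of order: the late-arriving older change still contributes
# its delta, but does not clobber the newer address.
apply_change(row, c1)
apply_change(row, c2)
assert row == {"ts": 2, "address": "new", "balance": 120}
```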

Whatever strategy one uses, in an async multimaster there are always
cases that can be resolved by rules (last update wins being one of
them), and some that I can't even imagine solving so far. I guess some
of the cases will simply boil down to "the application has to make sure
that ... never occurs". Think of a multi-item order, created on one
node, while another node is deleting the long-unused item (which would
now have to be backordered). While those two nodes figure out what to do
to make this consistent again, a third node does a partial shipment of
that order. The solution is simple: reinsert the deleted item ... only
that there were rather nasty ON DELETE CASCADEs on that item that
removed all the consumer reviews, product descriptions, data sheets and
what not. It's going to be an awful lot of undo.

I haven't really made up my mind about a user-defined, rule-based
conflict resolution interface yet. I do plan to have a unique and
foreign key constraint based, synchronous advisory locking system on top
of my system in a later version (advisory key locks would stay in place
until the transaction that placed them replicates).

I guess you see by now why I wanted to keep the discussion about the
individual, rather generic support features in the backend separate from
the particular features I plan to implement in the replication system.
Everyone has different needs and consequently an async multi-master
"must" do a whole range of mutually exclusive things altogether ...
because Postgres can never accept a partial solution. We want the egg
laying milk-wool-pig or nothing.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Jim Nasby <decibel(at)decibel(dot)org>
To: Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>
Cc: "Theo Schlossnagle" <jesus(at)omniti(dot)com>, "Peter Eisentraut" <peter_e(at)gmx(dot)net>, <pgsql-hackers(at)postgresql(dot)org>, "Bruce Momjian" <bruce(at)momjian(dot)us>, "Jan Wieck" <JanWieck(at)Yahoo(dot)com>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-07 03:31:55
Message-ID: D08371E0-2C8E-40C1-B680-C493E3FAE903@decibel.org
Lists: pgsql-hackers

Something worth noting... in the only places I've actually seen MM
replication implemented, each master was in fact still responsible
for its own set of data. It was essentially something that you could
really do with Slony, if you could tolerate the extreme complexity
that would be involved. It might well be worth focusing on that case
first, before trying to come up with a perfect last-committed mechanism.

On Feb 5, 2007, at 5:20 AM, Zeugswetter Andreas ADI SD wrote:

> I think you are completely ignoring practicability. Or are you saying,
> that such a system exists and works for e.g. a loosly connected
> group of
> laptop field agents that only sporadically have a connection to the
> cluster.
>
> I think Jan's definition gives a pragmatic solution to the problem,
> and will be able to give "good" automatic conflict resolution.
>
> It has downsides he stated, and cannot guarantee 100% correct
> automatic
> conflict
> resolution in case of connection loss, but I am quite sure you are not
> able to do
> better, without loosing yourself in theory.
>
> e.g. assume all clocks vary by no more than 30 seconds when
> disconnected, you can
> require manual (or rule based) resolution to all conflicts that
> vary by
> less than
> 1 minute.

--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: Markus Schiltknecht <markus(at)bluegap(dot)ch>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>, Theo Schlossnagle <jesus(at)omniti(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Bruce Momjian <bruce(at)momjian(dot)us>, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-07 07:37:59
Message-ID: 45C981D7.7090908@bluegap.ch
Lists: pgsql-hackers

Hi,

Jan Wieck wrote:
> Whatever strategy one will use, in an async multimaster there are always
> cases that can be resolved by rules (last update being one of them), and
> some that I can't even imagine solving so far. I guess some of the cases
> will simply boil down to "the application has to make sure that ...
> never occurs". Think of a multi-item order, created on one node, while
> another node is deleting the long unused item (which would have to be
> backordered). Now while those two nodes figure out what to do to make
> this consistent again, a third node does a partial shipment of that
> order.

It helps to categorize these conflict types. There basically are:

* data conflicts: simple row data, i.e. update - update conflicts.

* uniqueness conflicts: two rows conflict because they'd violate a
uniquenes constraint, i.e. insert - insert, update - insert or update -
update.

* visibility conflicts: basically the remaining update - delete and
delete - delete cases. But also SELECT FOR UPDATE candidates, etc...
Everything having to do with a rows not yet or no longer being visible
to a transaction.
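Just to illustrate, the three buckets could be told apart roughly like this (a Python sketch, names invented, not anything from Postgres-R):

```python
def classify_conflict(op_a, op_b, same_row, same_key):
    """Roughly bucket a pair of conflicting operations.
    op_a/op_b are 'insert', 'update' or 'delete'; same_row says whether
    both touch the same row, same_key whether they compete for one
    unique key value."""
    ops = {op_a, op_b}
    if 'delete' in ops:
        # update - delete and delete - delete: one side no longer
        # sees the row at all
        return 'visibility'
    if same_key and (not same_row or 'insert' in ops):
        # two different rows competing for one unique key value
        return 'uniqueness'
    if ops == {'update'} and same_row:
        # plain row data disagreement
        return 'data'
    return 'none'
```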

Your example certainly involves a visibility conflict (update - delete).
Not even (sync) Postgres-R can guarantee consistency on the visibility
level, i.e. a first transaction's SELECT FOR UPDATE might not see a
just recently committed transaction's newly inserted rows (because that
one isn't replayed yet on that node, thus the transaction is working on
an 'old' snapshot of the database state). Another, simpler example:
Postgres-R doesn't raise a serialization error on delete-delete
conflicts; it simply deletes the row once, even if two committed
transactions each deleted the row.

Luckily, most applications don't need that anyway, though.

> The solution is simple, reinsert the deleted item ...

..at which point timestamps certainly won't help :-) Sorry, couldn't
resist...

> only that
> there were rather nasty ON DELETE CASCADEs on that item that removed
> all the consumer reviews, product descriptions, data sheets and what
> not. It's going to be an awful lot of undo.

Huh? Are you planning on aborting *parts* of a transaction? I didn't
think about that, but my gut feeling is that you don't want to do that.

> I haven't really made up my mind about a user defined rule based
> conflict resolution interface yet. I do plan to have a unique and
> foreign key constraint based, synchronous advisory locking system on top
> of my system in a later version (advisory key locks would stay in place
> until the transaction that placed them replicates).

You'd have to elaborate on that...

> I guess you see by now why I wanted to keep the discussion about the
> individual, rather generic support features in the backend separate from
> the particular features I plan to implement in the replication system.

Sure. I know, discussions about replication can get endless, probably
even are so by definition ;-) But hey, they're fun!

> Everyone has different needs and consequently an async multi-master
> "must" do a whole range of mutually exclusive things altogether ...
> because Postgres can never accept a partial solution. We want the egg
> laying milk-wool-pig or nothing.

Like the one which would result from a merge of such an async
replication with a sync one? Imagine being able to choose between sync
and async per transaction...

Regards

Markus


From: José Orlando Pereira <jop(at)lsd(dot)di(dot)uminho(dot)pt>
To: pgsql-hackers(at)postgresql(dot)org, Alfranio Correia <alfranio(at)lsd(dot)di(dot)uminho(dot)pt>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Jan Wieck <JanWieck(at)yahoo(dot)com>, Theo Schlossnagle <jesus(at)omniti(dot)com>, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-07 12:13:36
Message-ID: 200702071213.38463.jop@lsd.di.uminho.pt
Lists: pgsql-hackers

On Saturday 03 February 2007, Bruce Momjian wrote:
> Jan Wieck wrote:
> > I don't have any such paper and the proof of concept will be the
> > implementation of the system. I do however see enough resistance against
> > this proposal to withdraw the commit timestamp at this time. The new
> > replication system will therefore require the installation of a patched,
> > non-standard PostgreSQL version, compiled from sources cluster wide in
> > order to be used. I am aware that this will dramatically reduce its
> > popularity but it is impossible to develop this essential feature as an
> > external module.
> >
> > I thank everyone for their attention.
>
> Going and working on it on your own doesn't seem like the proper
> solution. I don't see people objecting to adding it, but they want it to
> work, which I am sure you want too. You have to show how it will work
> and convince others of that, and then you have a higher chance it will
> work, and be in the PostgreSQL codebase.

Hi,

Would it be possible to solve the problem using the GORDA on-commit hook?

Jan would be able to reliably obtain a commit timestamp with the desired
semantics and store it in a regular table within transaction boundaries.

PostgreSQL would not have to commit to specific timestamp semantics, and
the patch is quite small.

Regards,

--
Jose Orlando Pereira


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Markus Schiltknecht <markus(at)bluegap(dot)ch>
Cc: Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>, Theo Schlossnagle <jesus(at)omniti(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Bruce Momjian <bruce(at)momjian(dot)us>, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-07 16:54:28
Message-ID: 45CA0444.1010201@Yahoo.com
Lists: pgsql-hackers

On 2/7/2007 2:37 AM, Markus Schiltknecht wrote:
> Hi,
>
> Jan Wieck wrote:
>> Whatever strategy one will use, in an async multimaster there are always
>> cases that can be resolved by rules (last update being one of them), and
>> some that I can't even imagine solving so far. I guess some of the cases
>> will simply boil down to "the application has to make sure that ...
>> never occurs". Think of a multi-item order, created on one node, while
>> another node is deleting the long unused item (which would have to be
>> backordered). Now while those two nodes figure out what to do to make
>> this consistent again, a third node does a partial shipment of that
>> order.
>
> It helps to categorize these conflict types. There basically are:

Are we still discussing if the Postgres backend may provide support for
a commit timestamp, that follows the rules for Lamport timestamps in a
multi-node cluster? It seems more like we are drifting into what type of
replication system I should design to please most people.

Jan

>
> * data conflicts: simple row data, i.e. update - update conflicts.
>
> * uniqueness conflicts: two rows conflict because they'd violate a
> uniqueness constraint, i.e. insert - insert, update - insert or update -
> update.
>
> * visibility conflicts: basically the remaining update - delete and
> delete - delete cases. But also SELECT FOR UPDATE candidates, etc...
> Everything having to do with a row not yet or no longer being visible
> to a transaction.
>
> Your example certainly involves a visibility conflict (update - delete).
> Not even (sync) Postgres-R can guarantee consistency on the visibility
> level, i.e. a first transaction's SELECT FOR UPDATE might not see a
> just recently committed transaction's newly inserted rows (because that
> one isn't replayed yet on that node, thus the transaction is working on
> an 'old' snapshot of the database state). Another, simpler example:
> Postgres-R doesn't raise a serialization error on delete-delete
> conflicts; it simply deletes the row once, even if two committed
> transactions each deleted the row.
>
> Luckily, most applications don't need that anyway, though.
>
>> The solution is simple, reinsert the deleted item ...
>
> ..at which point timestamps certainly won't help :-) Sorry, couldn't
> resist...
>
>> only that
>> there were rather nasty ON DELETE CASCADEs on that item that removed
>> all the consumer reviews, product descriptions, data sheets and what
>> not. It's going to be an awful lot of undo.
>
> Huh? Are you planning on aborting *parts* of a transaction? I didn't
> think about that, but my gut feeling is that you don't want to do that.
>
>> I haven't really made up my mind about a user defined rule based
>> conflict resolution interface yet. I do plan to have a unique and
>> foreign key constraint based, synchronous advisory locking system on top
>> of my system in a later version (advisory key locks would stay in place
>> until the transaction that placed them replicates).
>
> You'd have to elaborate on that...
>
>> I guess you see by now why I wanted to keep the discussion about the
>> individual, rather generic support features in the backend separate from
>> the particular features I plan to implement in the replication system.
>
> Sure. I know, discussions about replication can get endless, probably
> even are so by definition ;-) But hey, they're fun!
>
>> Everyone has different needs and consequently an async multi-master
>> "must" do a whole range of mutually exclusive things altogether ...
>> because Postgres can never accept a partial solution. We want the egg
>> laying milk-wool-pig or nothing.
>
> Like the one which would result from a merge of such an async
> replication with a sync one? Imagine being able to choose between sync
> and async per transaction...
>
> Regards
>
> Markus
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
> http://archives.postgresql.org

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Markus Schiltknecht <markus(at)bluegap(dot)ch>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>, Theo Schlossnagle <jesus(at)omniti(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Bruce Momjian <bruce(at)momjian(dot)us>, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-07 17:54:56
Message-ID: 45CA1270.5030107@bluegap.ch
Lists: pgsql-hackers

Hi,

Jan Wieck wrote:
> Are we still discussing if the Postgres backend may provide support for
> a commit timestamp, that follows the rules for Lamport timestamps in a
> multi-node cluster?

No. And I think you know my opinion about that by now. ;-)

> It seems more like we are drifting into what type of
> replication system I should design to please most people.

Nobody is telling you what you should do. You're free to do whatever you
want to.

I'm only trying to get a discussion going, because a) I'm interested in
how you plan to solve these problems and b) in the past, most people
were complaining that all the different replication efforts didn't try
to work together. I'm slowly trying to open up and discuss what I'm
doing with Postgres-R on the lists.

Just yesterday at the SFPUG meeting, I've experienced how confusing it
is for the users to have such a broad variety of (existing and upcoming)
replication solutions. And I'm all for working together and probably
even for merging different replication solutions.

Regards

Markus


From: Richard Troy <rtroy(at)ScienceTools(dot)com>
To: Markus Schiltknecht <markus(at)bluegap(dot)ch>
Cc: Jan Wieck <JanWieck(at)Yahoo(dot)com>, Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>, Theo Schlossnagle <jesus(at)omniti(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, <pgsql-hackers(at)postgresql(dot)org>, Bruce Momjian <bruce(at)momjian(dot)us>, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-07 19:15:31
Message-ID: Pine.LNX.4.33.0702071046230.23745-100000@denzel.in
Lists: pgsql-hackers


> Jan Wieck wrote:
> > Are we still discussing if the Postgres backend may provide support for
> > a commit timestamp, that follows the rules for Lamport timestamps in a
> > multi-node cluster?

...I thought you said in this thread that you haven't and weren't going to
work on any kind of logical proof of its correctness, saw no value in
prototyping your way to a clear (convincing) argument, and were
withdrawing the proposal due to all the issues others raised which were,
in light of this, unanswerable beyond conjecture. I thought that the
thread was continuing because other people saw value in the kernel of the
idea, would support it if it could be shown to be correct/useful, were
disappointed you'd leave it at that, and wanted to continue to see if
something positive might come of the dialogue. So, the thread weaved
around a bit. I think that if you want to nail this down, people here are
willing to be convinced, but that hasn't happened yet.

On Wed, 7 Feb 2007, Markus Schiltknecht wrote:
> I'm only trying to get a discussion going, because a) I'm interested in
> how you plan to solve these problems and b) in the past, most people
> were complaining that all the different replication efforts didn't try
> to work together. I'm slowly trying to open up and discuss what I'm
> doing with Postgres-R on the lists.
>
> Just yesterday at the SFPUG meeting, I've experienced how confusing it
> is for the users to have such a broad variety of (existing and upcoming)
> replication solutions. And I'm all for working together and probably
> even for merging different replication solutions.

In support of that idea, I offer this; When Randy Eash wrote the world's
first replication system for Ingres circa 1990, his work included ideas
and features that are right now in the Postgres world fragmented among
several existing replication / replication-related products, along with
some things that are only now in discussion in this group. As discussed at
the SFPUG meeting last night, real-world use cases are seldom if ever
completely satisfied with a one-size-fits-all replication strategy. For
example, a manufacturing company might want all factories to be capable of
being autonomous but both report activities and take direction from
corporate headquarters. To do this without having multiple databases at
each site, a single database instance would likely be both a master and
slave, but for differing aspects of the business's needs. Business
decisions would resolve the conflicts, say, the manufacturing node always
wins when it comes to data that pertains to its own work, rather than
something like a timestamp based, last-serialized-update-wins rule.

Like Markus, I would like to see the various replication efforts merged as
best they can be because even if the majority of users don't use a little
bit of everything, surely the more interesting cases would like to and the
entire community is better served if the various "solutions" are in
harmony.

Richard

--
Richard Troy, Chief Scientist
Science Tools Corporation
510-924-1363 or 202-747-1263
rtroy(at)ScienceTools(dot)com, http://ScienceTools.com/


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Markus Schiltknecht <markus(at)bluegap(dot)ch>
Cc: Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>, Theo Schlossnagle <jesus(at)omniti(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Bruce Momjian <bruce(at)momjian(dot)us>, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-08 01:13:10
Message-ID: 45CA7926.6070905@Yahoo.com
Lists: pgsql-hackers

On 2/7/2007 12:54 PM, Markus Schiltknecht wrote:
> Hi,
>
> Jan Wieck wrote:
>> Are we still discussing if the Postgres backend may provide support for
>> a commit timestamp, that follows the rules for Lamport timestamps in a
>> multi-node cluster?
>
> No. And I think you know my opinion about that by now. ;-)

Then let me give you a little puzzle just for the fun of it.

A database containing customer contact information (among other things)
is a two node multimaster system. One is serving the customer web
portal, the other is used by the company staff including the call
center. At 13:45 the two servers lose connectivity to each other, yet
the internal staff can access the internal server while the web portal
is accessible from the outside. At 13:50 customer A updates their credit
card information through the web portal, while customer B does the same
through the call center. At 13:55 both customers change their mind to
use yet another credit card, now customer A phones the call center while
customer B does it via the internet.

At 14:00 the two servers reconnect and go through the conflict
resolution. How do you intend to solve both conflicts without using any
"clock"? That word seems to be a stopword causing instant rejection
of whatever you propose. Needless to say, both customers will be
dissatisfied if you charge the "wrong" credit card during your next
billing cycle.
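With guaranteed-unique, monotonically increasing commit timestamps (as proposed), the resolution itself becomes mechanical; a Python sketch with made-up data:

```python
from datetime import datetime

def last_update_wins(local, remote):
    """Return the row version with the later commit timestamp.
    Each version is a (commit_timestamp, row_dict) pair.  Ties cannot
    occur if commit timestamps are guaranteed unique across nodes,
    which is exactly what the proposal provides."""
    ts_l, row_l = local
    ts_r, row_r = remote
    return row_r if ts_r > ts_l else row_l

# Customer A: web update at 13:50, then the call center correction at 13:55
web_update   = (datetime(2007, 1, 25, 13, 50), {'card': '4111 (first card)'})
phone_update = (datetime(2007, 1, 25, 13, 55), {'card': '5500 (second card)'})
```

Whichever order the two nodes exchange their changes in at 14:00, the 13:55 correction wins, and both customers get billed on the card they asked for last.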

>
>> It seems more like we are drifting into what type of
>> replication system I should design to please most people.
>
> Nobody is telling you what you should do. You're free to do whatever you
> want to.
>
> I'm only trying to get a discussion going, because a) I'm interested in
> how you plan to solve these problems and b) in the past, most people
> were complaining that all the different replication efforts didn't try
> to work together. I'm slowly trying to open up and discuss what I'm
> doing with Postgres-R on the lists.

Which is a good discussion, because one of the reasons why I stopped
looking into Postgres-R is the fact that it is based on the idea of pushing
all the replication information through a system that generates a global
serialized message queue. That by itself isn't the problem, but the fact
that implementing a global serialized message queue has serious
throughput issues that are (among other details) linked to the speed of
light.

I am trying to start with a system that doesn't rely on such a
mechanism for everything. I do intend to add an option later that
allows declaring a UNIQUE NOT NULL constraint to be synchronous. What
that means is that any INSERT, UPDATE, DELETE and SELECT FOR UPDATE
will require the node to currently be a member of the (quorum or
priority defined) majority of the cluster. An advisory lock system,
based on a total order group communication, will grant the lock to the
unique key values on a first-come, first-served basis. Every node in the
cluster will keep those keys as "locked" until the asynchronous
replication stream reports the locking transaction as ended. If another
remote transaction in the meantime requires updating such a key, the
incoming stream from that node will be on hold until the lock is
cleared. This is to protect against node B replicating a transaction from
node A and a later update on node B arriving on C before C got the first
event from A. A node that got disconnected from the cluster must rebuild
the current advisory lock list upon reconnecting to the cluster.
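As a toy model of that advisory key lock list (a Python sketch with invented names; the real thing would of course sit on top of the group communication layer, not a dict):

```python
class AdvisoryKeyLocks:
    """Tracks unique-key values granted through the total-order channel.
    A key stays locked until the async replication stream reports the
    granting transaction as ended; remote streams touching a locked key
    from another node must be held back until then."""

    def __init__(self):
        self.locked = {}   # key value -> node that holds the lock

    def grant(self, key, node):
        # first come, first served in the total order
        if key in self.locked:
            return False
        self.locked[key] = node
        return True

    def transaction_ended(self, key):
        # the async stream reported the locking transaction as ended
        self.locked.pop(key, None)

    def may_apply(self, key, node):
        # an incoming stream may touch the key only if it is unlocked
        # or the lock belongs to the stream's own origin node
        return self.locked.get(key, node) == node
```

So if node A grabs a key, node B's incoming stream is held at `may_apply()` until A's transaction is reported as replicated.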

I think that this will be a way to overcome Postgres-R's communication
bottleneck, as well as allowing limited update activity even during a
completely disconnected state of a node. Synchronous or group
communication messages are reduced to the cases where the application
cannot be implemented in a conflict-free way, like allocating a natural
primary key. There is absolutely no need to synchronize, for example,
creating a sales order. An application can use globally unique IDs for
the order number. And everything possibly referenced by an order (items,
customers, ...) is stored in a way that the references are never
updated. Deletes to those possibly referenced objects are implemented in
a two-step process, where they are first marked obsolete, and later on
things that have been marked obsolete for longer than X are deleted. A REPLICA
TRIGGER on inserting an order will simply reset the obsolete flag of
referenced objects. If a node is disconnected longer than X, you have a
problem - hunt down the guy who defined X.
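The two-step delete could be sketched like this (Python, all names and the grace period X are placeholders, not anything implemented):

```python
# "X": grace period in seconds before an obsolete item is really deleted
OBSOLETE_AFTER = 60 * 60 * 24

catalog = {}   # item id -> {'obsolete_since': None or a timestamp}

def mark_obsolete(item_id, now):
    # step one: never delete a referenced object outright, only flag it
    catalog[item_id]['obsolete_since'] = now

def on_replica_insert_order(item_id):
    # replica trigger: an incoming order simply resurrects the item
    catalog[item_id]['obsolete_since'] = None

def reap(now):
    # step two: physically delete what has been obsolete longer than X
    for item_id in [i for i, row in catalog.items()
                    if row['obsolete_since'] is not None
                    and now - row['obsolete_since'] > OBSOLETE_AFTER]:
        del catalog[item_id]
```

An order replicated in from another node resets the flag, so the item survives the next reap; only items nobody referenced for longer than X ever disappear.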

> Just yesterday at the SFPUG meeting, I've experienced how confusing it
> is for the users to have such a broad variety of (existing and upcoming)
> replication solutions. And I'm all for working together and probably
> even for merging different replication solutions.

Merging certain ideas to come up with an async/sync hybrid? Seems to me
we have similar enough ideas to need conflict resolution, because we had
them simultaneously but communicated them asynchronously.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Markus Schiltknecht <markus(at)bluegap(dot)ch>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>, Theo Schlossnagle <jesus(at)omniti(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Bruce Momjian <bruce(at)momjian(dot)us>, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-08 02:27:52
Message-ID: 45CA8AA8.9070805@bluegap.ch
Lists: pgsql-hackers

Hi,

Jan Wieck wrote:
> Then let me give you a little puzzle just for the fun of it.
>
> A database containing customer contact information (among other things)
> is a two node multimaster system. One is serving the customer web
> portal, the other is used by the company staff including the call
> center. At 13:45 the two servers lose connectivity to each other, yet
> the internal staff can access the internal server while the web portal
> is accessible from the outside. At 13:50 customer A updates their credit
> card information through the web portal, while customer B does the same
> through the call center. At 13:55 both customers change their mind to
> use yet another credit card, now customer A phones the call center while
> customer B does it via the internet.

Phew, a mind twister... one customer would already be enough to trigger
that sort of conflict...

> At 14:00 the two servers reconnect and go through the conflict
> resolution. How do you intend to solve both conflicts without using any
> "clock"? That word seems to be a stopword causing instant rejection
> of whatever you propose. Needless to say, both customers will be
> dissatisfied if you charge the "wrong" credit card during your next
> billing cycle.

Correct. But do these cases justify storing timestamps for each and every
transaction you do? That's what I doubt, not the usefulness of time
based conflict resolution for certain cases.

You can always add time based conflict resolution by adding a
timestamp column and deciding upon that one. I'd guess that the overall
costs are lower that way.

But you've withdrawn that proposal already, so...

> Which is a good discussion, because one of the reasons why I stopped
> looking into Postgres-R is the fact that it is based on the idea of pushing
> all the replication information through a system that generates a global
> serialized message queue. That by itself isn't the problem, but the fact
> that implementing a global serialized message queue has serious
> throughput issues that are (among other details) linked to the speed of
> light.

Agreed. Nevertheless, there are use cases for such systems, because they
put fewer limitations on the application. One could even argue that your
above example would be one ;-)

> I am trying to start with a system that doesn't rely on such a
> mechanism for everything. I do intend to add an option later that
> allows declaring a UNIQUE NOT NULL constraint to be synchronous. What
> that means is that any INSERT, UPDATE, DELETE and SELECT FOR UPDATE
> will require the node to currently be a member of the (quorum or
> priority defined) majority of the cluster.

Sounds reasonable.

> An advisory lock system,
> based on a total order group communication, will grant the lock to the
> unique key values on a first-come, first-served basis. Every node in the
> cluster will keep those keys as "locked" until the asynchronous
> replication stream reports the locking transaction as ended. If another
> remote transaction in the meantime requires updating such a key, the
> incoming stream from that node will be on hold until the lock is
> cleared. This is to protect against node B replicating a transaction from
> node A and a later update on node B arriving on C before C got the first
> event from A. A node that got disconnected from the cluster must rebuild
> the current advisory lock list upon reconnecting to the cluster.

Yeah, this is a convenient way to replicate sequences via a GCS.

> I think that this will be a way to overcome Postgres-R's communication
> bottleneck, as well as allowing limited update activity even during a
> completely disconnected state of a node. Synchronous or group
> communication messages are reduced to the cases where the application
> cannot be implemented in a conflict-free way, like allocating a natural
> primary key. There is absolutely no need to synchronize, for example,
> creating a sales order.

Agreed, such cases can easily be optimized. But you have to be aware of
he limitations these optimizations cause. Postgres-R is much more
targeted at very general use cases.

> An application can use globally unique IDs for
> the order number. And everything possibly referenced by an order (items,
> customers, ...) is stored in a way that the references are never
> updated. Deletes to those possibly referenced objects are implemented in
> a two-step process, where they are first marked obsolete, and later on
> things that have been marked obsolete for longer than X are deleted. A REPLICA
> TRIGGER on inserting an order will simply reset the obsolete flag of
> referenced objects. If a node is disconnected longer than X, you have a
> problem - hunt down the guy who defined X.

Yeah, that's another very nice optimization. Again, as long as you know
the limitations, that's all well and fine.

> Merging certain ideas to come up with an async/sync hybrid? Seems to me
> we have similar enough ideas to need conflict resolution, because we had
> them simultaneously but communicated them asynchronously.

Huh? Sorry, I didn't get what you're trying to say here.

Regards

Markus


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Richard Troy <rtroy(at)sciencetools(dot)com>
Cc: Markus Schiltknecht <markus(at)bluegap(dot)ch>, Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>, Theo Schlossnagle <jesus(at)omniti(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Bruce Momjian <bruce(at)momjian(dot)us>, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-08 03:13:22
Message-ID: 45CA9552.1020206@Yahoo.com
Lists: pgsql-hackers

On 2/7/2007 2:15 PM, Richard Troy wrote:
>> Jan Wieck wrote:
>> > Are we still discussing if the Postgres backend may provide support for
>> > a commit timestamp, that follows the rules for Lamport timestamps in a
>> > multi-node cluster?
>
> ...I thought you said in this thread that you haven't and weren't going to
> work on any kind of logical proof of it's correctness, saw no value in
> prototyping your way to a clear (convincing) argument, and were
> withdrawing the proposal [...]

I said I don't have any such documents. I was asked to continue this
discussion in order to find people willing to help discover potential
problems. I am prepared to continue this development in isolation, although
I wouldn't like to.

The PostgreSQL developers community used to be good at throwing out
ideas, brainstorming about the possibilities, adding more to them and
coming up with unique and flexible solutions. I am a little
disappointed that much of that got lost over the years and please
forgive me if I sound a little grumpy over that. The statement to
withdraw the proposal was certainly premature - consider it not
withdrawn at this time. However, comparing what used to be our process
to what I see today, I must say that something like TOAST would never
have happened. It was the result of a global brainstorming, that I
simply translated into C code. Many details and features of the
implementation are purely mine, but the really big sparks, that got it
to what it is, I'm not claiming for myself. Most importantly, "give me
proof of concept before we can talk about changing backend code" was not
part of the process at all. We were pretty eager to change things back
then, when we needed to get better in almost every way possible ... are
we so good at replication that we need to be conservative in that
respect now? We are certainly good at some things and have to be
conservative with respect to them, but replication in my not so very
humble opinion isn't one of them.

I do understand that we have a codebase used in production these days.
And because of that we have to maintain code and syntax stability to a
degree we didn't have back in the glory days of introducing EXCEPT and
INTERSECT (whose first incarnation was committed to the code base while
completely destroying my entire work of fixing the rewriter). Maybe we
need to introduce something entirely different, like the concept of an
experimental feature. Something that we add to the code but that is
explicitly flagged as not final, not stable, not guaranteed to stay or
work in this or any other form. This requires that the feature has very
limited interference with other parts of the system, like (or especially
like) the query parser. If it turns out to be a problem in x.y.0, it
will be backed out and gone in x.y.1. Or in a different way, like we
create an experimental CVS branch off of every major release. That way,
developers can more easily share experimental code, and if things settle
there, they will be considered for adoption into HEAD.

> Like Markus, I would like to see the various replication efforts merged as
> best they can be because even if the majority of users don't use a little
> bit of everything, surely the more interesting cases would like to and the
> entire community is better served if the various "solutions" are in
> harmony.

No doubt about that, and I was the one organizing the Afilias sponsored
meeting in Toronto back then, where my reversed Postgres-R idea was
taken apart because it wouldn't scale due to the gigantic amount of
synchronized group communication it would require. Again, it might be
that experimental features will cause more of the efforts to converge by
using the same base as a compromise instead of having each and every
support feature designed completely independently.

I still have a hard time understanding why someone would object to
adding a feature, however useless it might seem to them, as long as it
doesn't cost them anything. Admittedly, any feature causes maintenance
costs on the side of the PostgreSQL development community (mainly those,
who actually contribute and maintain the code - fortunately that is a
finite number - everyone please ask themselves if they are part of
that). But aside from that, would anyone who is questioning the commit
timestamp as I proposed it likewise vehemently object to yet another
procedural language, or adding another log tuning switch? I don't think
so. As long as it doesn't cost you unless you turn it on, why would you
even care if it serves my purpose or not? The thing that kicked off this
emotional spin was that multimaster replication is something so many people
want, but nobody has a universal solution for. Everyone wants to see
"their" problem solved "as well", or the solution isn't good. Tell you
what, I can live with my problem solved even if it doesn't solve yours.
Can you tell me what I have to modify in order to solve your problem as
well, or are you asking me to not implement anything unless "I" find a
way to solve everyones problems in one, big, universal solution?

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Markus Schiltknecht <markus(at)bluegap(dot)ch>
Cc: Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>, Theo Schlossnagle <jesus(at)omniti(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Bruce Momjian <bruce(at)momjian(dot)us>, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-08 03:35:02
Message-ID: 45CA9A66.3090105@Yahoo.com
Lists: pgsql-hackers

On 2/7/2007 9:27 PM, Markus Schiltknecht wrote:
> Hi,
>
> Jan Wieck wrote:
>> Then let me give you a little puzzle just for the fun of it.
>>
>> A database containing customer contact information (among other things)
>> is a two node multimaster system. One is serving the customer web
>> portal, the other is used by the company staff including the call
>> center. At 13:45 the two servers lose connectivity to each other, yet
>> the internal staff can access the internal server while the web portal
>> is accessible from the outside. At 13:50 customer A updates their credit
>> card information through the web portal, while customer B does the same
>> through the call center. At 13:55 both customers change their mind to
>> use yet another credit card, now customer A phones the call center while
>> customer B does it via the internet.
>
> Phew, a mind twister... one customer would already be enough to trigger
> that sort of conflict...
>
>> At 14:00 the two servers reconnect and go through the conflict
>> resolution. How do you intend to solve both conflicts without using any
>> "clock", because that seems to be a stopword causing instant rejection
>> of whatever you propose. Needless to say, both customers will be
>> dissatisfied if you charge the "wrong" credit card during your next
>> billing cycle.
>
> Correct. But do these cases satisfy storing timestamps to each and every
> transaction you do? That's what I doubt, not the usefulness of time
> based conflict resolution for certain cases.
>
> You can always add a time based conflict resolution, by adding a
> timestamp column and decide upon that one. I'd guess that the overall
> costs are lower that way.

Yes, yes, and yes ... but aside from the problem that you use the very
ambiguous word "timestamp" (which somehow suggests using a "clock" of
some sort), isn't the "begin" timestamp of a long running transaction
worse than the "commit" timestamp, when all its work got visible to the
outside world instantaneously?
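The monotonic rule from the original proposal (take the current system time; if it is not strictly greater than the last value this node issued, bump the last value by one microsecond) can be sketched like this. The integer-microsecond representation and names are illustrative only, not the proposed backend implementation:

```python
def next_commit_timestamp(now_us, last_us):
    """Strictly increasing commit timestamp, in microseconds.

    If the system clock stalls or steps backwards, force progress by
    issuing last + 1 instead of the raw clock reading.
    """
    if now_us <= last_us:
        return last_us + 1
    return now_us

last = 0
stamps = []
for now in [100, 100, 99, 250]:   # simulated clock readings
    last = next_commit_timestamp(now, last)
    stamps.append(last)
# stamps is now [100, 101, 102, 250]: strictly increasing even though
# the simulated clock repeated a value and then stepped backwards
```

The tslog_priority variant in the proposal works the same way, except that the increment logic runs on milliseconds and the per-database priority is added, so no two nodes can ever issue identical values.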

>
> But you've withdrawn that proposal already, so...
>
>> Which is a good discussion because one of the reasons why I stopped
>> looking into Postgres-R is the fact that is based on the idea to push
>> all the replication information through a system that generates a global
>> serialized message queue. That by itself isn't the problem, but the fact
>> that implementing a global serialized message queue has serious
>> throughput issues that are (among other details) linked to the speed of
>> light.
>
> Agreed. Nevertheless, there are use cases for such systems, because they
> put less limitations to the application. One could even argue, that your
> above example would be one ;-)

Now we're in sync :-)

>
>> I am trying to start with a system, that doesn't rely on such a
>> mechanism for everything. I do intend to add an option later, that
>> allows to declare a UNIQUE NOT NULL constraint to be synchronous. What
>> that means is, that any INSERT, UPDATE, DELETE and SELECT FOR UPDATE
>> will require the node to currently be a member of the (quorum or
>> priority defined) majority of the cluster.
>
> Sounds reasonable.
>
>> An advisory lock system,
>> based on a total order group communication, will grant the lock to the
>> unique key values on a first come, first serve base. Every node in the
>> cluster will keep those keys as "locked" until the asynchronous
>> replication stream reports the locking transaction as ended. If another
>> remote transaction in the meantime requires updating such key, the
>> incoming stream from that node will be on hold until the lock is
> cleared. This is to protect against node B replicating a transaction from
>> node A and a later update on node B arrives on C before C got the first
>> event from A. A node that got disconnected from the cluster must rebuild
>> the current advisory lock list upon reconnecting to the cluster.
>
> Yeah, this is a convenient way to replicate sequences via a GCS.
>
>> I think that this will be a way to overcome Postgres-R's communication
>> bottleneck, as well as allowing limited update activity even during a
>> completely disconnected state of a node. Synchronous or group
>> communication messages are reduced to the cases, where the application
>> cannot be implemented in a conflict free way, like allocating a natural
>> primary key. There is absolutely no need to synchronize for example
>> creating a sales order.
>
> Agreed, such cases can easily be optimized. But you have to be aware of
> the limitations these optimizations cause. Postgres-R is much more
> targeted at very general use cases.

I am, if for no other reason than that I have been familiar with the
concepts underneath Postgres-R for more than three years. What I realized
is that the "general use" case (for arbitrarily complex applications) is
very likely to be in conflict with any kind of "good default performance"
case.

>
>> An application can use global unique ID's for
>> the order number. And everything possibly referenced by an order (items,
>> customers, ...) is stored in a way that the references are never
>> updated. Deletes to those possibly referenced objects are implemented in
>> a two step process, where they are first marked obsolete, and later on
>> things that have been marked obsolete for X long are deleted. A REPLICA
>> TRIGGER on inserting an order will simply reset the obsolete flag of
>> referenced objects. If a node is disconnected longer than X, you have a
>> problem - hunt down the guy who defined X.
>
> Yeah, that's another very nice optimization. Again, as long as you know
> the limitations, that's all well and fine.
>
>> Merging certain ideas to come up with an async/sync hybrid? Seems to me
>> we have similar enough ideas to need conflict resolution, because we had
>> them simultaneously but communicate them asynchronously.
>
> Huh? Sorry, I didn't get what you're trying to say here.

Out of sync again ... we'll get there tomorrow ... unless your clock is
way back and tomorrow will never come.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Richard Troy <rtroy(at)sciencetools(dot)com>, Markus Schiltknecht <markus(at)bluegap(dot)ch>, Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>, Theo Schlossnagle <jesus(at)omniti(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-08 03:35:50
Message-ID: 200702080335.l183ZoW25529@momjian.us
Lists: pgsql-hackers


I find the term "logical proof of its correctness" too restrictive. It
sounds like some formal academic process that really doesn't work well
for us.

What I did want to hear is a layout of how the system would work, and an
exchange of ideas until almost everyone was happy.

Also, I saw the trigger patch with no explanation of why it was
important or who would use it --- that also isn't going to fly well.

So, to add something, the community needs to hear how it is going to
help users, because every code addition has cost, and we don't want to
add things unless it has general utility. If someone can't explain the
utility of an addition, I question whether the person has fully thought
through where they are going.

As far as adding a language, no, we would not just add any language. We
would judge whether the language has usefulness to our users. I think
APL would be cool, but I am not sure it is usable, so there is a hurdle
even there.

As far as TOAST, there is no question in my mind that TOAST development
would happen the same way today as it did when we did it in 2001 --- we
have a problem, how can we fix it.

---------------------------------------------------------------------------

Jan Wieck wrote:
> On 2/7/2007 2:15 PM, Richard Troy wrote:
> >> Jan Wieck wrote:
> >> > Are we still discussing if the Postgres backend may provide support for
> >> > a commit timestamp, that follows the rules for Lamport timestamps in a
> >> > multi-node cluster?
> >
> > ...I thought you said in this thread that you haven't and weren't going to
> > work on any kind of logical proof of it's correctness, saw no value in
> > prototyping your way to a clear (convincing) argument, and were
> > withdrawing the proposal [...]
>
> I said I don't have any such documents. I was asked to continue this
> discussion in order to find people willing to help discover potential
> problems. I am prepared to continue this development isolated, although
> I wouldn't like to.
>
> The PostgreSQL developers community used to be good at throwing out
> ideas, brainstorming about the possibilities, adding more to them and
> coming up with very unique and flexible solutions. I am a little
> disappointed that much of that got lost over the years and please
> forgive me if I sound a little grumpy over that. The statement to
> withdraw the proposal was certainly premature - consider it not
> withdrawn at this time. However, comparing what used to be our process
> to what I see today, I must say that something like TOAST would never
> have happened. It was the result of a global brainstorming, that I
> simply translated into C code. Many details and features of the
> implementation are purely mine, but the really big sparks, that got it
> to what it is, I'm not claiming for myself. Most importantly, "give me
> proof of concept before we can talk about changing backend code" was not
> part of the process at all. We were pretty eager to change things back
> then, when we needed to get better in almost every way possible ... are
> we so good at replication that we need to be conservative in that
> respect now? We are certainly good at some things and have to be
> conservative with respect to them, but replication in my not so very
> humble opinion isn't one of them.
>
> I do understand that we have a codebase used in production these days.
> And because of that we have to maintain code and syntax stability to a
> degree, we didn't have back in the glory days of introducing EXCEPT and
> INTERSECT (whose first incarnation was committed to the code base while
> completely destroying my entire work of fixing the rewriter). Maybe we
> need to introduce something entirely different, like the concept of an
> experimental feature. Something that we add to the code but that is
> explicitly flagged as not final, not stable, not guaranteed to stay or
> work in this or any other form. This requires that the feature has very
> limited interference with other parts of the system, like (or especially
> like) the query parser. If it turns out to be a problem in x.y.0, it
> will be backed out and gone in x.y.1. Or in a different way, like we
> create an experimental CVS branch off of every major release. That way,
> developers can easier share experimental code and if things settle
> there, they will be considered to be adopted into HEAD.
>
> > Like Markus, I would like to see the various replication efforts merged as
> > best they can be because even if the majority of users don't use a little
> > bit of everything, surely the more interesting cases would like to and the
> > entire community is better served if the various "solutions" are in
> > harmony.
>
> No doubt about that and I was the one organizing the Afilias sponsored
> meeting in Toronto back then, where my reversed Postgres-R idea was
> taken apart because it won't scale due to the gigantic amount of
> synchronized group communication it would require. Again, it might be
> that experimental features will cause more of the efforts to converge by
> using the same base as a compromise instead of having each and every
> support feature being designed completely independent.
>
> I still have a hard time understanding why someone would object to
> adding a feature, however useless it might seem to them, as long as it
> doesn't cost them anything. Admitted, any feature causes maintenance
> costs on the side of the PostgreSQL development community (mainly those,
> who actually contribute and maintain the code - fortunately that is a
> finite number - everyone please ask themselves if they are part of
> that). But aside from that, would anyone, who is questioning the commit
> timestamp as I proposed it, likewise vehemently object to yet another
> procedural language, or adding another log tuning switch? I don't think
> so. As long as it doesn't cost you unless you turn it on, why would you
> even care if it serves my purpose or not? The thing that kicked off this
> emotional spin was that multimaster replication is what so many people
> want, but nobody has a universal solution for. Everyone wants to see
> "their" problem solved "as well", or the solution isn't good. Tell you
> what, I can live with my problem solved even if it doesn't solve yours.
> Can you tell me what I have to modify in order to solve your problem as
> well, or are you asking me to not implement anything unless "I" find a
> way to solve everyones problems in one, big, universal solution?
>
>
> Jan
>
> --
> #======================================================================#
> # It's easier to get forgiveness for being wrong than for being right. #
> # Let's break this rule - forgive me. #
> #================================================== JanWieck(at)Yahoo(dot)com #
>

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Richard Troy <rtroy(at)sciencetools(dot)com>, Markus Schiltknecht <markus(at)bluegap(dot)ch>, Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>, Theo Schlossnagle <jesus(at)omniti(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-08 04:01:31
Message-ID: 45CAA09B.7070204@Yahoo.com
Lists: pgsql-hackers

On 2/7/2007 10:35 PM, Bruce Momjian wrote:
> I find the term "logical proof of it's correctness" too restrictive. It
> sounds like some formal academic process that really doesn't work well
> for us.

Thank you.

> Also, I saw the trigger patch with no explaination of why it was
> important or who would use it --- that also isn't going to fly well.

You didn't respond to my explanation of how the current Slony
implementation could improve and evolve using it. Are you missing
something? I am discussing this very issue with our own QA department,
and thus far, I think I have a majority of "would use a pg_trigger
backpatched PostgreSQL" vs. "No, I prefer a system that knows exactly
how it corrupted my system catalog".

> As far as TOAST, there is no question in my mind that TOAST development
> would happen the same way today as it did when we did it in 2001 --- we
> have a problem, how can we fix it.

Looking at what did happen back then and what happens in this case, I do
see a difference. There were concerns about the compression algorithm
used ... yet what we use today is still that first incarnation, and
nobody ever bothered to even investigate whether there could possibly be
anything better. Do you think lzcompress is the best we can come up
with? I don't! So why is it still the thing used? Maybe it is good
enough?

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Richard Troy <rtroy(at)sciencetools(dot)com>, Markus Schiltknecht <markus(at)bluegap(dot)ch>, Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>, Theo Schlossnagle <jesus(at)omniti(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-08 04:12:28
Message-ID: 200702080412.l184CSb06518@momjian.us
Lists: pgsql-hackers

Jan Wieck wrote:
> On 2/7/2007 10:35 PM, Bruce Momjian wrote:
> > I find the term "logical proof of it's correctness" too restrictive. It
> > sounds like some formal academic process that really doesn't work well
> > for us.
>
> Thank you.
>
> > Also, I saw the trigger patch with no explaination of why it was
> > important or who would use it --- that also isn't going to fly well.
>
> You didn't respond to my explanation how the current Slony
> implementation could improve and evolve using it. Are you missing
> something? I am discussing this very issue with our own QA department,
> and thus far, I think I have a majority of "would use a pg_trigger
> backpatched PostgreSQL" vs. "No, I prefer a system that knows exactly
> how it corrupted my system catalog".

No, I _now_ understand the use case, but when the patch was posted, the
use case was missing. I would like to see a repost with the patch, and
a description of its use so we can all move forward on that.

> > As far as TOAST, there is no question in my mind that TOAST development
> > would happen the same way today as it did when we did it in 2001 --- we
> > have a problem, how can we fix it.
>
> Looking at what did happen back then and what happens in this case, I do
> see a difference. There were concerns about the compression algorithm
> used ... it still is today what was the first incarnation and nobody
> ever bothered to even investigate if there could possibly be any better
> thing. Do you think lzcompress is the best we can come up with? I don't!
> So why is it still the thing used? Maybe it is good enough?

It is simple/stupid enough, I would say, and the compression space is a
minefield of patents.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: "Zeugswetter Andreas ADI SD" <ZeugswetterA(at)spardat(dot)at>
To: "Jan Wieck" <JanWieck(at)Yahoo(dot)com>, "Markus Schiltknecht" <markus(at)bluegap(dot)ch>
Cc: "Theo Schlossnagle" <jesus(at)omniti(dot)com>, "Peter Eisentraut" <peter_e(at)gmx(dot)net>, <pgsql-hackers(at)postgresql(dot)org>, "Bruce Momjian" <bruce(at)momjian(dot)us>, "Jim Nasby" <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-08 09:35:47
Message-ID: E1539E0ED7043848906A8FF995BDA57901C13135@m0143.s-mxs.net
Lists: pgsql-hackers


> Yes, yes, and yes ... but aside from the problem that you use the very
> ambiguous word "timestamp" (which somehow suggests using a "clock" of
> some sort), isn't the "begin" timestamp of a long running transaction

imho a begin timestamp is near useless

> worse than the "commit" timestamp, when all its work got visible to the
> outside world instantaneously?

This is one of the areas I am still worried about. Is one commit Lamport
timestamp enough? I think for some conflict resolutions we need to look
at the row level, and resolve conflicts per row and not per transaction
(yes, this means that a tx might get partially replicated).

What I am trying to get at is: maybe an infrastructure to produce wieck
lamport timestamps, that can be used in different places like commit
hooks and column defaults, would be of more general use. Maybe such a
column could be a system column that is not visible with "select *" for
those cases where commit is not enough. And a commit hook could insert
it into clog-like storage.
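The Lamport-style behaviour under discussion can be sketched as a toy clock, assuming a node folds every replicated remote commit timestamp into its own clock so that later local commits sort after it. The class, names, and integer representation are made up for the sketch, not proposed backend code:

```python
class LamportClock:
    """Toy model of a per-node commit-timestamp clock."""

    def __init__(self):
        self.last = 0  # last timestamp issued or observed by this node

    def local_commit(self, now):
        # monotonic rule: never reuse or go below the last issued value
        self.last = now if now > self.last else self.last + 1
        return self.last

    def apply_remote(self, remote_ts):
        # Lamport rule: advance the local clock past the remote commit
        if remote_ts > self.last:
            self.last = remote_ts

node = LamportClock()
a = node.local_commit(10)   # normal case: wall clock wins -> 10
node.apply_remote(50)       # a remote commit stamped 50 is replicated
b = node.local_commit(12)   # wall clock lags the remote event -> 51
# a == 10 and b == 51: the later local commit sorts after the remote one
```

The same fold-in step could just as well feed a column default or a commit hook, which is the generalization suggested above.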

Andreas


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Richard Troy <rtroy(at)sciencetools(dot)com>, Markus Schiltknecht <markus(at)bluegap(dot)ch>, Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>, Theo Schlossnagle <jesus(at)omniti(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-08 19:14:18
Message-ID: 45CB768A.8080001@Yahoo.com
Lists: pgsql-hackers

On 2/7/2007 11:12 PM, Bruce Momjian wrote:
> Jan Wieck wrote:
>> On 2/7/2007 10:35 PM, Bruce Momjian wrote:
>> > I find the term "logical proof of it's correctness" too restrictive. It
>> > sounds like some formal academic process that really doesn't work well
>> > for us.
>>
>> Thank you.
>>
>> > Also, I saw the trigger patch with no explaination of why it was
>> > important or who would use it --- that also isn't going to fly well.
>>
>> You didn't respond to my explanation how the current Slony
>> implementation could improve and evolve using it. Are you missing
>> something? I am discussing this very issue with our own QA department,
>> and thus far, I think I have a majority of "would use a pg_trigger
>> backpatched PostgreSQL" vs. "No, I prefer a system that knows exactly
>> how it corrupted my system catalog".
>
> No, I _now_ understand the use case, but when the patch was posted, the
> use case was missing. I would like to see a repost with the patch, and
> a description of its use so we can all move forward on that.

Is this a new policy that after discussion, all patches must be
resubmitted with a summary and conclusions of the discussion? I can
certainly do that for you, but just tell me if you are going to ask the
same from everyone.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Richard Troy <rtroy(at)sciencetools(dot)com>, Markus Schiltknecht <markus(at)bluegap(dot)ch>, Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>, Theo Schlossnagle <jesus(at)omniti(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-08 19:38:36
Message-ID: 200702081938.l18JcaZ16220@momjian.us
Lists: pgsql-hackers

Jan Wieck wrote:
> On 2/7/2007 11:12 PM, Bruce Momjian wrote:
> > Jan Wieck wrote:
> >> On 2/7/2007 10:35 PM, Bruce Momjian wrote:
> >> > I find the term "logical proof of it's correctness" too restrictive. It
> >> > sounds like some formal academic process that really doesn't work well
> >> > for us.
> >>
> >> Thank you.
> >>
> >> > Also, I saw the trigger patch with no explaination of why it was
> >> > important or who would use it --- that also isn't going to fly well.
> >>
> >> You didn't respond to my explanation how the current Slony
> >> implementation could improve and evolve using it. Are you missing
> >> something? I am discussing this very issue with our own QA department,
> >> and thus far, I think I have a majority of "would use a pg_trigger
> >> backpatched PostgreSQL" vs. "No, I prefer a system that knows exactly
> >> how it corrupted my system catalog".
> >
> > No, I _now_ understand the use case, but when the patch was posted, the
> > use case was missing. I would like to see a repost with the patch, and
> > a description of its use so we can all move forward on that.
>
> Is this a new policy that after discussion, all patches must be
> resubmitted with a summary and conclusions of the discussion? I can
> certainly do that for you, but just tell me if you are going to ask the
> same from everyone.

No, I am asking only this time because I feel there was so much
disconnect between the patch and the extensive replication discussion
that few community members would see the connection.

I would also like to know what your new feature does for each supported
option. I have not seen that spelled out yet at all.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Jan Wieck <JanWieck(at)Yahoo(dot)com>, Richard Troy <rtroy(at)sciencetools(dot)com>, Markus Schiltknecht <markus(at)bluegap(dot)ch>, Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>, Theo Schlossnagle <jesus(at)omniti(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-08 19:59:37
Message-ID: 20070208195937.GB24069@alvh.no-ip.org
Lists: pgsql-hackers

Bruce Momjian wrote:
> Jan Wieck wrote:
> > On 2/7/2007 11:12 PM, Bruce Momjian wrote:
> > > Jan Wieck wrote:
> > >> On 2/7/2007 10:35 PM, Bruce Momjian wrote:
> > >>
> > >> > Also, I saw the trigger patch with no explaination of why it was
> > >> > important or who would use it --- that also isn't going to fly well.
> > >>
> > >> You didn't respond to my explanation how the current Slony
> > >> implementation could improve and evolve using it. Are you missing
> > >> something? I am discussing this very issue with our own QA department,
> > >> and thus far, I think I have a majority of "would use a pg_trigger
> > >> backpatched PostgreSQL" vs. "No, I prefer a system that knows exactly
> > >> how it corrupted my system catalog".
> > >
> > > No, I _now_ understand the use case, but when the patch was posted, the
> > > use case was missing. I would like to see a repost with the patch, and
> > > a description of its use so we can all move forward on that.
> >
> > Is this a new policy that after discussion, all patches must be
> > resubmitted with a summary and conclusions of the discussion? I can
> > certainly do that for you, but just tell me if you are going to ask the
> > same from everyone.
>
> No, I am asking only this time because I feel there was too much
> disconnect between the patch and the extensive replication discussion
> that few community members would see the connection.

FYI, in my opinion the trigger addition is clearly useful to Mammoth
Replicator as well. In fact, it's so obviously useful that I didn't see
a need to state that in the original thread where it was discussed.

Not sure about the timestamp stuff, because Replicator is not
multi-master, so there's no conflict resolution to take care of.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Jan Wieck <JanWieck(at)Yahoo(dot)com>, Richard Troy <rtroy(at)sciencetools(dot)com>, Markus Schiltknecht <markus(at)bluegap(dot)ch>, Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>, Theo Schlossnagle <jesus(at)omniti(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-08 20:32:41
Message-ID: 200702082032.l18KWfr01501@momjian.us
Lists: pgsql-hackers

Alvaro Herrera wrote:
> > > Is this a new policy that after discussion, all patches must be
> > > resubmitted with a summary and conclusions of the discussion? I can
> > > certainly do that for you, but just tell me if you are going to ask the
> > > same from everyone.
> >
> > No, I am asking only this time because I feel there was too much
> > disconnect between the patch and the extensive replication discussion
> > that few community members would see the connection.
>
> FYI, in my opinion the trigger addition is clearly useful to Mammoth
> Replicator as well. In fact, it's so obviously useful that I didn't see
> a need to state that in the original thread where it was discussed.

Right, I know it is useful too, but I would like a layout of what it
does and why so everyone is clear on it.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Richard Troy <rtroy(at)sciencetools(dot)com>, Markus Schiltknecht <markus(at)bluegap(dot)ch>, Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>, Theo Schlossnagle <jesus(at)omniti(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-09 00:28:28
Message-ID: 45CBC02C.9020308@Yahoo.com
Lists: pgsql-hackers

On 2/8/2007 3:32 PM, Bruce Momjian wrote:
> Alvaro Herrera wrote:
>> > > Is this a new policy that after discussion, all patches must be
>> > > resubmitted with a summary and conclusions of the discussion? I can
>> > > certainly do that for you, but just tell me if you are going to ask the
>> > > same from everyone.
>> >
>> > No, I am asking only this time because I feel there was too much
>> > disconnect between the patch and the extensive replication discussion
>> > that few community members would see the connection.
>>
>> FYI, in my opinion the trigger addition is clearly useful to Mammoth
>> Replicator as well. In fact, it's so obviously useful that I didn't see
>> a need to state that in the original thread where it was discussed.
>
> Right, I know it is useful too, but I would like a layout of what it
> does and why so everyone is clear on it.

I have no clue what got you into what you are doing here. But that shall
not be my real concern. If you feel the need to do this sort of thing,
be my guest. I will add the remaining changes to pg_rewrite, including
the new support commands and changes to psql as well as pg_dump, and
resubmit the new patch with explanations that will hopefully help you
comprehend how this relatively small and fully backward compatible
change to the trigger and rule firing mechanism will work and what
existing problems it will solve.

Regards,
Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Richard Troy <rtroy(at)sciencetools(dot)com>, Markus Schiltknecht <markus(at)bluegap(dot)ch>, Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>, Theo Schlossnagle <jesus(at)omniti(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-09 00:36:20
Message-ID: 200702090036.l190aKj23355@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jan Wieck wrote:
> On 2/8/2007 3:32 PM, Bruce Momjian wrote:
> > Alvaro Herrera wrote:
> >> > > Is this a new policy that after discussion, all patches must be
> >> > > resubmitted with a summary and conclusions of the discussion? I can
> >> > > certainly do that for you, but just tell me if you are going to ask the
> >> > > same from everyone.
> >> >
> >> > No, I am asking only this time because I feel there was too much
> >> > disconnect between the patch and the extensive replication discussion
> >> > that few community members would see the connection.
> >>
> >> FYI, in my opinion the trigger addition is clearly useful to Mammoth
> >> Replicator as well. In fact, it's so obviously useful that I didn't see
> >> a need to state that in the original thread where it was discussed.
> >
> > Right, I know it is useful too, but I would like a layout of what it
> > does and why so everyone is clear on it.
>
> I have no clue what got you into what you are doing here. But that shall
> not be my real concern. If you feel the need to do this sort of thing,
> be my guest. I will add the remaining changes to pg_rewrite, including
> the new support commands and changes to psql as well as pg_dump and
> resubmit the new patch with explanations that will hopefully help you to
> comprehend what and how this relatively small and fully backward
> compatible change in the trigger and rule firing mechanism will work and
> what existing problems it will solve.

Yep, that's what I want everyone to see. This is standard procedure for
everyone in the community, and core is not immune.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Richard Troy <rtroy(at)sciencetools(dot)com>, Markus Schiltknecht <markus(at)bluegap(dot)ch>, Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>, Theo Schlossnagle <jesus(at)omniti(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-09 03:44:02
Message-ID: 45CBEE02.60909@commandprompt.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jan Wieck wrote:
> On 2/8/2007 3:32 PM, Bruce Momjian wrote:
>> Alvaro Herrera wrote:
>>> > > Is this a new policy that after discussion, all patches must be
>>> > > resubmitted with a summary and conclusions of the discussion? I can
>>> > > certainly do that for you, but just tell me if you are going to ask the
>>> > > same from everyone.
>>> >
>>> > No, I am asking only this time because I feel there was too much
>>> > disconnect between the patch and the extensive replication discussion
>>> > that few community members would see the connection.
>>>
>>> FYI, in my opinion the trigger addition is clearly useful to Mammoth
>>> Replicator as well. In fact, it's so obviously useful that I didn't see
>>> a need to state that in the original thread where it was discussed.
>>
>> Right, I know it is useful too, but I would like a layout of what it
>> does and why so everyone is clear on it.

Well how deep are we talking here? My understanding of what Jan wants to
do is simple.

Be able to declare which triggers are fired depending on the state of
the cluster.

In Jan's terms, the Origin or Subscriber. In Replicator terms the Master
or Slave.

This is useful because I may have a trigger on the Master and the same
trigger on the Slave. You do not want the trigger to fire on the Slave
because we are doing data replication. In short, we replicate the
result, not the action.

However, you may want triggers that are on the Slave to fire separately.
A reporting server that generates materialized views is a good example.
Don't tie up the Master with what a Slave can do.

Sincerely,

Joshua D. Drake

>
> I have no clue what got you into what you are doing here. But that shall
> not be my real concern. If you feel the need to do this sort of thing,
> be my guest. I will add the remaining changes to pg_rewrite, including
> the new support commands and changes to psql as well as pg_dump and
> resubmit the new patch with explanations that will hopefully help you to
> comprehend what and how this relatively small and fully backward
> compatible change in the trigger and rule firing mechanism will work and
> what existing problems it will solve.
>
>
> Regards,
> Jan
>

--

=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc: Jan Wieck <JanWieck(at)Yahoo(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Richard Troy <rtroy(at)sciencetools(dot)com>, Markus Schiltknecht <markus(at)bluegap(dot)ch>, Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>, Theo Schlossnagle <jesus(at)omniti(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-09 03:49:52
Message-ID: 200702090349.l193nqv01986@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


I just want an outline of what each option is supposed to control. If
that information is in a documentation patch, then fine, he can just
post that and tell people to read the patch documentation.

---------------------------------------------------------------------------

Joshua D. Drake wrote:
> Jan Wieck wrote:
> > On 2/8/2007 3:32 PM, Bruce Momjian wrote:
> >> Alvaro Herrera wrote:
> >>> > > Is this a new policy that after discussion, all patches must be
> >>> > > resubmitted with a summary and conclusions of the discussion? I can
> >>> > > certainly do that for you, but just tell me if you are going to ask the
> >>> > > same from everyone.
> >>> >
> >>> > No, I am asking only this time because I feel there was too much
> >>> > disconnect between the patch and the extensive replication discussion
> >>> > that few community members would see the connection.
> >>>
> >>> FYI, in my opinion the trigger addition is clearly useful to Mammoth
> >>> Replicator as well. In fact, it's so obviously useful that I didn't see
> >>> a need to state that in the original thread where it was discussed.
> >>
> >> Right, I know it is useful too, but I would like a layout of what it
> >> does and why so everyone is clear on it.
>
> Well how deep are we talking here? My understanding of what Jan wants to
> do is simple.
>
> Be able to declare which triggers are fired depending on the state of
> the cluster.
>
> In Jan's terms, the Origin or Subscriber. In Replicator terms the Master
> or Slave.
>
> This is useful because I may have a trigger on the Master and the same
> trigger on the Slave. You do not want the trigger to fire on the Slave
> because we are doing data replication. In short, we replicate the
> result, not the action.
>
> However, you may want triggers that are on the Slave to fire separately.
> A reporting server that generates materialized views is a good example.
> Don't tie up the Master with what a Slave can do.
>
> Sincerely,
>
> Joshua D. Drake
>
>
>
> >
> > I have no clue what got you into what you are doing here. But that shall
> > not be my real concern. If you feel the need to do this sort of thing,
> > be my guest. I will add the remaining changes to pg_rewrite, including
> > the new support commands and changes to psql as well as pg_dump and
> > resubmit the new patch with explanations that will hopefully help you to
> > comprehend what and how this relatively small and fully backward
> > compatible change in the trigger and rule firing mechanism will work and
> > what existing problems it will solve.
> >
> >
> > Regards,
> > Jan
> >
>
>
> --
>
> === The PostgreSQL Company: Command Prompt, Inc. ===
> Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
> Providing the most comprehensive PostgreSQL solutions since 1997
> http://www.commandprompt.com/
>
> Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
> PostgreSQL Replication: http://www.commandprompt.com/products/
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Richard Troy <rtroy(at)ScienceTools(dot)com>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc: Jan Wieck <JanWieck(at)Yahoo(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Markus Schiltknecht <markus(at)bluegap(dot)ch>, Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>, Theo Schlossnagle <jesus(at)omniti(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, <pgsql-hackers(at)postgresql(dot)org>, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-09 04:41:11
Message-ID: Pine.LNX.4.33.0702082034370.13334-100000@denzel.in
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On Thu, 8 Feb 2007, Joshua D. Drake wrote:
>
> Well how deep are we talking here? My understanding of what Jan wants to
> do is simple.
>
> Be able to declare which triggers are fired depending on the state of
> the cluster.
>
> In Jan's terms, the Origin or Subscriber. In Replicator terms the Master
> or Slave.
>
> This is useful because I may have a trigger on the Master and the same
> trigger on the Slave. You do not want the trigger to fire on the Slave
> because we are doing data replication. In short, we replicate the
> result, not the action.
>
> However, you may want triggers that are on the Slave to fire separately.
> A reporting server that generates materialized views is a good example.
> Don't tie up the Master with what a Slave can do.
>

It'd be great if Jan considers the blending of replication; any given DB
instance shouldn't be only a master/originator or only a slave/subscriber.
A solution that lets you blend replication strategies in a single db is,
from my point of view, very important.

> > I have no clue what got you into what you are doing here.

Jan, some sleep now and then might be helpful to your public disposition.
-smile-

peace,
Richard

--
Richard Troy, Chief Scientist
Science Tools Corporation
510-924-1363 or 202-747-1263
rtroy(at)ScienceTools(dot)com, http://ScienceTools.com/


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Richard Troy <rtroy(at)sciencetools(dot)com>
Cc: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Markus Schiltknecht <markus(at)bluegap(dot)ch>, Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>, Theo Schlossnagle <jesus(at)omniti(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-09 05:38:37
Message-ID: 45CC08DD.6000302@Yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2/8/2007 11:41 PM, Richard Troy wrote:
> On Thu, 8 Feb 2007, Joshua D. Drake wrote:
>>
>> Well how deep are we talking here? My understanding of what Jan wants to
>> do is simple.
>>
>> Be able to declare which triggers are fired depending on the state of
>> the cluster.
>>
>> In Jan's terms, the Origin or Subscriber. In Replicator terms the Master
>> or Slave.
>>
>> This is useful because I may have a trigger on the Master and the same
>> trigger on the Slave. You do not want the trigger to fire on the Slave
>> because we are doing data replication. In short, we replicate the
>> result, not the action.
>>
>> However, you may want triggers that are on the Slave to fire separately.
>> A reporting server that generates materialized views is a good example.
>> Don't tie up the Master with what a Slave can do.
>>
>
> It'd be great if Jan considers the blending of replication;

Please elaborate. I would really like to get all you can contribute.

>
>> > I have no clue what got you into what you are doing here.
>
> Jan, some sleep now and then might be helpful to your public disposition.

Richard, don't embarrass Bruce. He doesn't need your help.

I have been with this project and know Bruce Momjian for more than 10
years. Every now and then, Bruce and I get into some sort of eventually
publicly visible dispute that doesn't really mean much. I'll probably
spend next Christmas with him and his family again, play a few rounds of
backgammon with Wilma (who I really owe a revenge), hopefully don't
interfere too much with Christine's work (especially when it involves
handling food over a white carpet) and none of us will even remember
this crap. Our friendship has been through some real tests. Any real
problem we would have, we'd never discuss here. We would just meet and
talk.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: José Orlando Pereira <jop(at)lsd(dot)di(dot)uminho(dot)pt>
Cc: pgsql-hackers(at)postgresql(dot)org, Alfranio Correia <alfranio(at)lsd(dot)di(dot)uminho(dot)pt>, Bruce Momjian <bruce(at)momjian(dot)us>, Theo Schlossnagle <jesus(at)omniti(dot)com>, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-09 06:04:13
Message-ID: 45CC0EDD.2020900@Yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2/7/2007 7:13 AM, José Orlando Pereira wrote:
> On Saturday 03 February 2007, Bruce Momjian wrote:
>> Jan Wieck wrote:
>> > I don't have any such paper and the proof of concept will be the
>> > implementation of the system. I do however see enough resistance against
>> > this proposal to withdraw the commit timestamp at this time. The new
>> > replication system will therefore require the installation of a patched,
>> > non-standard PostgreSQL version, compiled from sources cluster wide in
>> > order to be used. I am aware that this will dramatically reduce its
>> > popularity but it is impossible to develop this essential feature as an
>> > external module.
>> >
>> > I thank everyone for their attention.
>>
>> Going and working on it on your own doesn't seem like the proper
>> solution. I don't see people objecting to adding it, but they want it
>> work, which I am sure you want too. You have to show how it will work
>> and convince others of that, and then you have a higher chance it will
>> work, and be in the PostgreSQL codebase.
>
> Hi,
>
> Would it be possible to solve the problem using the GORDA on-commit hook?
>
> Jan would be able to reliably obtain a commit timestamp with the desired
> semantics and store it in a regular table within transaction boundaries.

I am not sure, I would have to look at what exactly that hook provides.
The key to a Lamport timestamp is that it is advancing in commit order
(plus some other things ... of course). If the hook can guarantee that
the calls are made always in commit order, serialized without any race
condition possible, it would probably be suitable.
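
The monotonic-advance property at the heart of the proposal (take the
system clock at commit; if it has not moved past the last issued value,
advance by one microsecond) can be sketched roughly like this; the class
and method names are invented for illustration, not taken from any patch:

```python
import time

class CommitTimestampGenerator:
    """Rough sketch of the proposed Lamport-style commit timestamp:
    strictly increasing even when the system clock stalls or steps
    backwards. All names here are illustrative, not from the patch."""

    def __init__(self):
        self.last_us = 0  # last issued timestamp, in microseconds

    def next_commit_timestamp(self):
        now_us = int(time.time() * 1_000_000)
        # If the clock has not advanced past the last issued value,
        # bump by one microsecond instead of repeating or going back.
        if now_us <= self.last_us:
            now_us = self.last_us + 1
        self.last_us = now_us
        return now_us
```

As noted above, this only has the Lamport property if the calls are
serialized in commit order; any race between two committing backends
would have to be excluded by the caller (or by the hook itself).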

Jan

>
> PostgreSQL would not have to commit to a specific timestamp semantics and the
> patch is quite small.
>
> Regards,
>

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Kris Jurka <books(at)ejurka(dot)com>
To: Richard Troy <rtroy(at)ScienceTools(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-09 06:24:08
Message-ID: 45CC1388.3040206@ejurka.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Richard Troy wrote:

> It'd be great if Jan considers the blending of replication; any given DB
> instance shouldn't be only a master/originator or only a slave/subscriber.
> A solution that lets you blend replication strategies in a single db is,
> from my point of view, very important.
>

Perhaps if more people read Jan's posts he wouldn't be so frustrated.

http://archives.postgresql.org/pgsql-hackers/2007-01/msg01302.php

He clearly describes that the master/slave setting is per session, not
per database.

Kris Jurka


From: "J(dot) Andrew Rogers" <jrogers(at)neopolitan(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Richard Troy <rtroy(at)ScienceTools(dot)com>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-09 07:17:19
Message-ID: 2EF9A1CC-C878-4A9D-AC5E-072846CD9F9F@neopolitan.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Feb 8, 2007, at 8:41 PM, Richard Troy wrote:
> It'd be great if Jan considers the blending of replication; any
> given DB
> instance shouldn't be only a master/originator or only a slave/
> subscriber.
> A solution that lets you blend replication strategies in a single
> db is,
> from my point of view, very important.

It might be constructive to define what a minimal "complete" set of
replication primitives actually is in addition to which ones should
be implemented. In addition to master/slave models, you have Paxos
algorithms and dynamic reconfiguration models in literature that can
utilize many of the same primitives but which are very different in
implementation. I see the value of Jan's proposal, but perhaps it
would be better to step back and make some assertions about the
nature of the core capabilities that will be supported in some
broader picture. Having a theoretically (mostly) complete set of
usable primitives would be an incredibly powerful feature set.

Cheers,

J. Andrew Rogers
jrogers(at)neopolitan(dot)com


From: "Andrew Hammond" <andrew(dot)george(dot)hammond(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: "Jan Wieck" <JanWieck(at)yahoo(dot)com>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-09 19:19:41
Message-ID: 1171048781.114081.207840@v33g2000cwv.googlegroups.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Feb 7, 8:12 pm, b(dot)(dot)(dot)(at)momjian(dot)us (Bruce Momjian) wrote:
> Jan Wieck wrote:
> > On 2/7/2007 10:35 PM, Bruce Momjian wrote:
>> > > I find the term "logical proof of its correctness" too restrictive. It
> > > sounds like some formal academic process that really doesn't work well
> > > for us.
>
> > Thank you.

My intuition is that it might be possible to prove that _nothing_ can
provide guaranteed ordering when there is disconnected operation.
However, I think that the clock based ordering Jan has described could
provide _probable_ ordering under disconnected operation. I can see
three variables in the equation that would determine the probability
of correctness for the ordering.
1) clock drift rate between disconnected clusters
2) disconnection time
3) transaction rate on the tables, or even rows involved
There are probably more. I think that if Jan implements what he's
described then a very interesting follow-up would be to do the
statistical analysis necessary to quantify the risk of incorrect
ordering while disconnected. (I've got x ms/ms relative clock drift,
and y tps. How long can I run disconnected before falling under
99.999% probability of correctly ordered transactions?)
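
Andrew's back-of-the-envelope question can be turned into a toy model.
Assume updates to a contended row arrive as a Poisson process at `tps`
per second, and that a conflicting pair gets misordered when its true
inter-commit gap is smaller than the accumulated relative drift
`drift_rate * t`. Under those (strong) assumptions the longest safe
disconnection time is easy to compute; this is a sketch of the
statistical analysis he suggests, not anything from the thread or patch:

```python
import math

def max_disconnect_time(drift_rate, tps, target_ok=0.99999):
    """How long can two nodes run disconnected before the chance of
    misordering a conflicting update pair exceeds 1 - target_ok?

    drift_rate: relative clock drift, in seconds of drift per second.
    tps: update rate on the contended row (Poisson assumption).
    Toy model only: real drift is not linear and real workloads are
    not Poisson.
    """
    # For exponential inter-arrival gaps, P(gap < d) = 1 - exp(-tps * d).
    # Require 1 - exp(-tps * drift_rate * t) <= 1 - target_ok.
    d_max = -math.log(target_ok) / tps  # largest tolerable drift window
    return d_max / drift_rate           # time until drift reaches d_max
```

For example, with 10 microseconds of drift per second and 100 updates
per second on one row, the model allows only about a hundredth of a
second of disconnection at five nines, which illustrates why a
flag-and-review scheme is more realistic than a guarantee.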

> No, I _now_ understand the use case, but when the patch was posted, the
> use case was missing. I would like to see a repost with the patch, and
> a description of its use so we can all move forward on that.

An additional use case for an on-commit timestamp is in the analysis
of billing transactions in highly concurrent systems. For example,
imagine your billing period is monthly and you have transactions which
start before and end after the "end-of-month". Having the on-commit
timestamp for these transactions may help when attempting to reconcile
between transactions and account activities.

Andrew


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Richard Troy <rtroy(at)ScienceTools(dot)com>
Cc: Jan Wieck <JanWieck(at)Yahoo(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Markus Schiltknecht <markus(at)bluegap(dot)ch>, Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>, Theo Schlossnagle <jesus(at)omniti(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-09 19:23:03
Message-ID: 45CCCA17.4060604@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Richard Troy wrote:
> In more specific terms, and I'm just brainstorming in public here, perhaps
> we can use the power of Schemas within a database to manage such
> divisions; commands which pertain to replication can/would include a
> schema specifier and elements within the schema can be replicated one way
> or another, at the whim of the DBA / Architect. For backwards
> compatibility, if a schema isn't specified, it indicates that the command
> pertains to the entire database.
>
>

I understand that you're just thinking aloud, but overloading namespaces
in this way strikes me as awful. Applications and extensions, which are
the things that have need of namespaces, should not have to care about
replication. If we have to design them for replication we'll be on a
fast track to nowhere IMNSHO.

cheers

andrew


From: Richard Troy <rtroy(at)ScienceTools(dot)com>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Markus Schiltknecht <markus(at)bluegap(dot)ch>, Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>, Theo Schlossnagle <jesus(at)omniti(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, <pgsql-hackers(at)postgresql(dot)org>, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-09 19:27:39
Message-ID: Pine.LNX.4.33.0702091100280.13334-100000@denzel.in
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On Fri, 9 Feb 2007, Jan Wieck wrote:
> > [ I wrote ]
> > It'd be great if Jan considers the blending of replication;
>
> Please elaborate. I would really like to get all you can contribute.

Thanks Jan,

prefaced that I really haven't read everything you've written on this (or
what other people are doing, either), and that I've got a terrible flu
right now (fever, etc), I'll give it a go - hopefully it's actually
helpful. To wit:

In general terms, "blending of replication [techniques]" means to me that
one can have a single database instance serve as a master and as a slave
(to use only one set of terminology), and as a multi-master, too, all
simultaneously, letting the DBA / Architect choose which portions serve
which roles (purposes). All replication features would respect the
boundaries of such choices automatically, as it's all blended.

In more specific terms, and I'm just brainstorming in public here, perhaps
we can use the power of Schemas within a database to manage such
divisions; commands which pertain to replication can/would include a
schema specifier and elements within the schema can be replicated one way
or another, at the whim of the DBA / Architect. For backwards
compatibility, if a schema isn't specified, it indicates that the command
pertains to the entire database.

At the very least, a schema division strategy for replication leverages
an existing DB-component binding/dividing mechanism that most everyone is
familiar with. While there are/may be database-wide, nay,
installation-wide constructs as in your Commit Timestamp proposal, I
don't see that there's any conflict - at least, from what I understand of
existing systems and proposals to date.

HTH,
Richard

--
Richard Troy, Chief Scientist
Science Tools Corporation
510-924-1363 or 202-747-1263
rtroy(at)ScienceTools(dot)com, http://ScienceTools.com/


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Richard Troy <rtroy(at)ScienceTools(dot)com>
Cc: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Markus Schiltknecht <markus(at)bluegap(dot)ch>, Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>, Theo Schlossnagle <jesus(at)omniti(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-09 20:01:23
Message-ID: 45CCD313.8040306@Yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2/9/2007 2:27 PM, Richard Troy wrote:
> In general terms, "blending of replication [techniques]" means to me that
> one can have a single database instance serve as a master and as a slave
> (to use only one set of terminology), and as a multi-master, too, all
> simultaneously, letting the DBA / Architect choose which portions serve
> which roles (purposes). All replication features would respect the
> boundaries of such choices automatically, as it's all blended.

That is specifically what the changes to pg_trigger and pg_rewrite take
into account. However much you blend different techniques, a single
transaction on one server will always fall into one of three categories:
1) the original operation done by the client application, 2) the actions
performed by the replication engine to replay a remote transaction, or
3) an administrative operation that must not be propagated at all.

No matter how many different models you have in parallel, one single
transaction will be either a master, a slave or an isolated local thing.
The proposed changes allow telling the session which of these three
roles it is playing, and triggers and rules can be configured to fire
during the master/local role, the slave role, always, or never. That
functionality will work for master-slave as well as multi-master.
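
The firing matrix described above is small enough to write down. The
sketch below simulates it; the role and mode names are invented for
illustration (the proposal itself only extends pg_trigger and
pg_rewrite):

```python
# A session plays exactly one of three roles; each trigger or rule is
# configured with one of four firing modes. Names are illustrative.
SESSION_ROLES = {"origin", "replica", "local"}
TRIGGER_MODES = {"origin", "replica", "always", "never"}

def trigger_fires(trigger_mode, session_role):
    """True if a trigger with the given mode fires in a session
    playing the given role."""
    assert session_role in SESSION_ROLES and trigger_mode in TRIGGER_MODES
    if trigger_mode == "always":
        return True
    if trigger_mode == "never":
        return False
    if trigger_mode == "origin":
        # Normal triggers: fire for original client work and for
        # administrative local sessions, but not while replaying.
        return session_role in ("origin", "local")
    # trigger_mode == "replica": fire only while replaying remote work.
    return session_role == "replica"
```

Presumably the default mode would match today's behavior (fire in
ordinary origin sessions), which is what would make an unconfigured
trigger fully backward compatible.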

Although my current plan isn't to create such a blended system, the
proposed trigger and rule changes are designed to support exactly that
in a 100% backward compatible way.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Andrew Hammond <andrew(dot)george(dot)hammond(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-09 20:16:42
Message-ID: 45CCD6AA.2090409@Yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2/9/2007 2:19 PM, Andrew Hammond wrote:
> On Feb 7, 8:12 pm, b(dot)(dot)(dot)(at)momjian(dot)us (Bruce Momjian) wrote:
>> Jan Wieck wrote:
>> > On 2/7/2007 10:35 PM, Bruce Momjian wrote:
>> > > I find the term "logical proof of its correctness" too restrictive. It
>> > > sounds like some formal academic process that really doesn't work well
>> > > for us.
>>
>> > Thank you.
>
> My intuition is that it might be possible to prove that _nothing_ can
> provide guaranteed ordering when there is disconnected operation.

As a matter of physics, for two events happening outside of the event
horizon of each other, the question which happened first is pointless.

> However, I think that the clock based ordering Jan has described could
> provide _probable_ ordering under disconnected operation. I can see
> three variables in the equation that would determine the probability
> of correctness for the ordering.

That is precisely the intended functionality. And I can describe exactly
when two conflicting actions will result in the "wrong" row persisting.
This will happen when the second update to the logically same row is
performed on the server whose Lamport timestamp is lagging behind by
more than the time between the two conflicting commits. Example: User
fills out a form, submits, hits back button, corrects input and submits
again within 3 seconds. Load balancing sends both requests to different
servers and the first server is 3.0001 seconds ahead ... the user's typo
will be the winner.

My Lamport timestamp conflict resolution will not be able to solve this
problem. However, when this happens, one thing is guaranteed. The update
from the second server, arriving on the first for replication, will be
ignored because a locally generated row is newer. This fact can be used
as an indicator that there is a possible conflict that was resolved
using the wrong data (business process wise). All nodes in the cluster
will end up using the same wrong row, so at least they are consistently
wrong. Nevertheless, being able to identify possible problem cases this
way makes it possible to initiate further action, including but not
limited to human intervention.

If this is not an acceptable risk for the application, other resolution
methods will be needed. But I think in many cases, this form of default
resolution will be "good enough".
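
The default resolution outlined above (apply a replicated row only if
its commit timestamp beats the local one; record an ignored remote
update as a possible conflict worth reviewing) can be sketched as
follows; the data structures are invented for illustration:

```python
def apply_remote_update(local_rows, key, remote_data, remote_ts, conflicts):
    """Last-update-wins resolution as described above (a sketch).
    Returns True if the remote row was applied. An ignored remote
    update is recorded as a possible conflict for later review.
    The proposal's per-node priority component would rule out exact
    timestamp ties; here a tie favors the local row."""
    local = local_rows.get(key)
    if local is None or remote_ts > local["ts"]:
        local_rows[key] = {"data": remote_data, "ts": remote_ts}
        return True
    # The local row is newer. Every node resolves the pair the same
    # way, so the cluster stays consistent, but the business-level
    # outcome may still be wrong; flag it.
    conflicts.append((key, remote_ts, local["ts"]))
    return False
```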

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Richard Troy <rtroy(at)ScienceTools(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Jan Wieck <JanWieck(at)Yahoo(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Markus Schiltknecht <markus(at)bluegap(dot)ch>, Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>, Theo Schlossnagle <jesus(at)omniti(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, <pgsql-hackers(at)postgresql(dot)org>, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-09 20:20:55
Message-ID: Pine.LNX.4.33.0702091205080.13334-100000@denzel.in
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On Fri, 9 Feb 2007, Andrew Dunstan wrote:
> Richard Troy wrote:
> > In more specific terms, and I'm just brainstorming in public here, perhaps
> > we can use the power of Schemas within a database to manage such
> > divisions; commands which pertain to replication can/would include a
> > schema specifier and elements within the schema can be replicated one way
> > or another, at the whim of the DBA / Architect. For backwards
> > compatibility, if a schema isn't specified, it indicates that the command
> > pertains to the entire database.
>
> I understand that you're just thinking aloud, but overloading namespaces
> in this way strikes me as awful. Applications and extensions, which are
> the things that have need of namespaces, should not have to care about
> replication. If we have to design them for replication we'll be on a
> fast track to nowhere IMNSHO.

Well, Andrew, replication _is_ an application. Or, you could think of
replication as an extension to an application. I was under the impression
that _users_ decide to put tables in schema spaces based upon _user_ need,
and that Postgres developers' use of them for other purposes was
encroaching on user choices, not the other way around. Either way,
claiming "need" like this strikes me as stuck-in-a-rut or dogmatic
thinking. Besides, don't we have schema nesting to help resolve any such
"care?" And, what do you mean by "design them for replication?"

While I'm in no way stuck on blending replication strategies via schemas,
it does strike me as an appropriate concept and I'd prefer to have it
evaluated based on technical merit - possibly citing workarounds or
solutions to technical issues, which is what I gather has been the
tradition of this group: Use case first, technical merit second... Other
alternatives, ISTM, will have virtually the same look/feel as a schema
from an external perspective, and the more I think of it the more I think
using schemas is a sound, clean approach. That it offends someone's sense
of aesthetics STM a poor rationale for not choosing it. Another question
might be: What's lacking in the implementation of schemas that makes this
a poor choice, and what could be done about it without much effort?

Regards,
Richard

--
Richard Troy, Chief Scientist
Science Tools Corporation
510-924-1363 or 202-747-1263
rtroy(at)ScienceTools(dot)com, http://ScienceTools.com/


From: Richard Troy <rtroy(at)ScienceTools(dot)com>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Markus Schiltknecht <markus(at)bluegap(dot)ch>, Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>, Theo Schlossnagle <jesus(at)omniti(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, <pgsql-hackers(at)postgresql(dot)org>, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-09 20:25:06
Message-ID: Pine.LNX.4.33.0702091223150.13334-100000@denzel.in
Lists: pgsql-hackers


On Fri, 9 Feb 2007, Jan Wieck wrote:
>
> No matter how many different models you have in parallel, one single
> transaction will be either a master, a slave or an isolated local thing.
> The proposed changes allow the session to be told which of these three
> roles it is playing and the triggers and rules can be configured to fire
> during master/local role, slave role, always or never. That
> functionality will work for master-slave as well as multi-master.
>
> Although my current plan isn't creating such a blended system, the
> proposed trigger and rule changes are designed to support exactly that
> in a 100% backward compatible way.
>
> Jan

Fantastic! ...At some point you'll be thinking of the management end -
turning it on or off, etc. That might be where these other points come
more into play.

Richard

--
Richard Troy, Chief Scientist
Science Tools Corporation
510-924-1363 or 202-747-1263
rtroy(at)ScienceTools(dot)com, http://ScienceTools.com/


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Richard Troy <rtroy(at)ScienceTools(dot)com>
Cc: Jan Wieck <JanWieck(at)Yahoo(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Markus Schiltknecht <markus(at)bluegap(dot)ch>, Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>, Theo Schlossnagle <jesus(at)omniti(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-09 20:25:27
Message-ID: 45CCD8B7.9090504@dunslane.net
Lists: pgsql-hackers

Richard Troy wrote:
> On Fri, 9 Feb 2007, Andrew Dunstan wrote:
>
>> Richard Troy wrote:
>>
>>> In more specific terms, and I'm just brainstorming in public here, perhaps
>>> we can use the power of Schemas within a database to manage such
>>> divisions; commands which pertain to replication can/would include a
>>> schema specifier and elements within the schema can be replicated one way
>>> or another, at the whim of the DBA / Architect. For backwards
>>> compatibility, if a schema isn't specified, it indicates that command
>>> pertains to the entire database.
>>>
>> I understand that you're just thinking aloud, but overloading namespaces
>> in this way strikes me as awful. Applications and extensions, which are
>> the things that have need of namespaces, should not have to care about
>> replication. If we have to design them for replication we'll be on a
>> fast track to nowhere IMNSHO.
>>
>
> Well, Andrew, replication _is_ an application. Or, you could think of
> replication as an extension to an application.

No, I don't think of it as either. It's a utility, more an extension of
the DBMS than of the application. You don't replicate for the sake of
replicating.

> I was under the impression
> that _users_ decide to put tables in schema spaces based upon _user_ need,
> and that Postgres developers' use of them for other purposes was
> encroaching on user choices, not the other way around.

That's exactly what you would be doing with this proposal, encroaching
on what I regard as user space.

> Either way,
> claiming "need" like this strikes me as stuck-in-a-rut or dogmatic
> thinking. Besides, don't we have schema nesting to help resolve any such
> "care?"

No. We do now have schema nesting, for this or any other purpose. Where
did you get that idea? If we did I would not be so resistant to using
them for this purpose, but as it is, if you hijack schemas for
replication segregation you will detract from their more obvious use in
name segregation.

cheers

andrew


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Richard Troy <rtroy(at)ScienceTools(dot)com>
Cc: Jan Wieck <JanWieck(at)Yahoo(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Markus Schiltknecht <markus(at)bluegap(dot)ch>, Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>, Theo Schlossnagle <jesus(at)omniti(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-09 20:37:02
Message-ID: 45CCDB6E.8060309@dunslane.net
Lists: pgsql-hackers

I wrote:
> We do now have schema nesting, for this or any other purpose.

s/now/not/ (of course)

cheers

andrew


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Richard Troy <rtroy(at)ScienceTools(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Markus Schiltknecht <markus(at)bluegap(dot)ch>, Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>, Theo Schlossnagle <jesus(at)omniti(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-09 22:49:07
Message-ID: 45CCFA63.5090502@Yahoo.com
Lists: pgsql-hackers

On 2/9/2007 3:25 PM, Andrew Dunstan wrote:
> Richard Troy wrote:
>> On Fri, 9 Feb 2007, Andrew Dunstan wrote:
>>
>>> Richard Troy wrote:
>>>
>>>> In more specific terms, and I'm just brainstorming in public here, perhaps
>>>> we can use the power of Schemas within a database to manage such
>>>> divisions; commands which pertain to replication can/would include a
>>>> schema specifier and elements within the schema can be replicated one way
>>>> or another, at the whim of the DBA / Architect. For backwards
>>>> compatibility, if a schema isn't specified, it indicates that command
>>>> pertains to the entire database.
>>>>
>>> I understand that you're just thinking aloud, but overloading namespaces
>>> in this way strikes me as awful. Applications and extensions, which are
>>> the things that have need of namespaces, should not have to care about
>>> replication. If we have to design them for replication we'll be on a
>>> fast track to nowhere IMNSHO.
>>>
>>
>> Well, Andrew, replication _is_ an application. Or, you could think of
>> replication as an extension to an application.
>
> No, I don't think of it as either. It's a utility, more an extension of
> the DBMS than of the application. You don't replicate for the sake of
> replicating.
>
>> I was under the impression
>> that _users_ decide to put tables in schema spaces based upon _user_ need,
>> and that Postgres developers' use of them for other purposes was
>> encroaching on user choices, not the other way around.
>
> That's exactly what you would be doing with this proposal, encroaching
> on what I regard as user space.

I'd never use a schema for that. Look at Slony-I. It uses a
user-selectable schema for the stuff it needs to store in the database, to
keep Slony objects separate from user objects. But it organizes the user
tables into what it calls sets. A set can be any combination of tables
and sequences from any number of namespaces. If I were to do it again, I
would use names instead of numbers to identify sets, and I would allow
for summary sets containing groups of simple sets.
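The set model described above could be sketched roughly as follows. This is purely illustrative: the class and set names are hypothetical and do not reflect Slony-I's actual catalog or API, only the idea of named simple sets spanning namespaces, grouped by summary sets.

```python
# Hypothetical sketch: named replication sets that group tables and
# sequences from any schema, plus summary sets that group simple sets.

class ReplicationSet:
    def __init__(self, name, objects=()):
        self.name = name
        self.objects = set(objects)   # qualified names like "public.invoices"

    def members(self):
        return set(self.objects)

class SummarySet:
    def __init__(self, name, sets=()):
        self.name = name
        self.sets = list(sets)

    def members(self):
        # A summary set replicates the union of its member simple sets.
        return set().union(*(s.members() for s in self.sets))

billing = ReplicationSet("billing", ["public.invoices", "audit.invoice_log"])
crm = ReplicationSet("crm", ["sales.customers", "sales.leads"])
everything = SummarySet("all", [billing, crm])
print(sorted(everything.members()))
# ['audit.invoice_log', 'public.invoices', 'sales.customers', 'sales.leads']
```

Identifying sets by name rather than number, as suggested, mainly buys readability in administration commands; the grouping semantics stay the same.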

On the other issue, replication is part of the overall infrastructure
and thereby part of the solution to a business problem. Like the
hardware, database and application, it solves a piece of the puzzle that
none of the others is particularly good at. It will perform well, or
eventually not at all, depending on how much consideration was given to
the distributed nature of the business model when the application was
designed. Applications that "happened by accident" rather than being
designed usually don't allow any kind of distribution.

And yes, there are people who replicate for the sake of it. They are the
kind of people who ask for sync multi-master no matter what their actual
problem might be, because it sounds most prestigious.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: José Orlando Pereira <jop(at)lsd(dot)di(dot)uminho(dot)pt>
To: Jan Wieck <JanWieck(at)yahoo(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Alfranio Correia <alfranio(at)lsd(dot)di(dot)uminho(dot)pt>, Bruce Momjian <bruce(at)momjian(dot)us>, Theo Schlossnagle <jesus(at)omniti(dot)com>, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Proposal: Commit timestamp
Date: 2007-02-21 16:26:32
Message-ID: 200702211626.33673.jop@lsd.di.uminho.pt
Lists: pgsql-hackers

On Friday 09 February 2007, Jan Wieck wrote:
> I am not sure, I would have to look at what exactly that hook provides.
> The key to a Lamport timestamp is that it is advancing in commit order
> (plus some other things ... of course). If the hook can guarantee that
> the calls are made always in commit order, serialized without any race
> condition possible, it would probably be suitable.

Actually what we do is a bit stronger. We use the commit hook to enforce an
externally defined commit order. In our case, this is defined by a group
communication protocol, which is even allowed to reorder a pair of
transactions originating from the same replica. Therefore, achieving a commit
order that is consistent with a local clock should be straightforward.

Regards,

--
Jose Orlando Pereira