Exposing the Xact commit order to the user

Lists: pgsql-hackers
From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Exposing the Xact commit order to the user
Date: 2010-05-23 20:21:58
Message-ID: 4BF98E66.9000703@Yahoo.com

In some systems (data warehousing, replication), the order of commits is
important, since that is the order in which changes have become visible.
This information could theoretically be extracted from the WAL, but
scanning the entire WAL just to extract this tidbit of information would
be excruciatingly painful.

The following is based on ideas that emerged during last week's PGCon.
Consider it an implementation proposal, if you like.

We introduce a new set of files. The files represent segments of an
infinite array of structures. The present segments are the available
"window" of data. Similar to CLOG files, the individual file name will
represent the high bits of a "serial" number, the offset of the record
inside the file represents the low bits of the "serial".
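The high-bits/low-bits split can be sketched in C as follows. The layout constants (32-byte records, 256 kB segment files) and the hex file-name scheme are assumptions modeled on CLOG, not part of the proposal:

```c
#include <stdint.h>

/* Assumed layout constants, modeled on CLOG; not from the proposal. */
#define XCI_RECORD_SIZE   32u
#define XCI_SEGMENT_SIZE  (256u * 1024u)
#define XCI_PER_SEGMENT   (XCI_SEGMENT_SIZE / XCI_RECORD_SIZE)   /* 8192 */

/* High bits of the serial select the segment file (whose name would be
 * the segment number printed in hex, as with CLOG files)... */
uint64_t
xci_segment(uint64_t serial)
{
    return serial / XCI_PER_SEGMENT;
}

/* ...and the low bits give the record's byte offset inside that file. */
uint64_t
xci_offset(uint64_t serial)
{
    return (serial % XCI_PER_SEGMENT) * XCI_RECORD_SIZE;
}
```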

The system will have postgresql.conf options for enabling/disabling the
whole shebang, how many shared buffers to allocate for managing access
to the data and to define the retention period of the data based on data
volume and/or age of the commit records.

Each record of the Transaction Commit Info consists of

txid xci_transaction_id
timestamptz xci_begin_timestamp
timestamptz xci_commit_timestamp
int64 xci_total_rowcount

32 bytes total.
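As a C sketch, the record packs into four 8-byte fields with no padding. The typedefs are stand-ins: PostgreSQL's TimestampTz is a 64-bit integer (microseconds) with integer datetimes, and the txid here is assumed to be the 8-byte epoch-extended form:

```c
#include <stdint.h>

typedef uint64_t txid;        /* assumed: 8-byte epoch-extended xid  */
typedef int64_t  TimestampTz; /* microseconds, as in PostgreSQL     */

/* One Transaction Commit Info record: 4 x 8 = 32 bytes, no padding. */
typedef struct XactCommitInfo
{
    txid        xci_transaction_id;
    TimestampTz xci_begin_timestamp;
    TimestampTz xci_commit_timestamp;
    int64_t     xci_total_rowcount;
} XactCommitInfo;
```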

CommitTransaction() inside of xact.c will call a function that inserts
a new record into this array. Most of the time, the operation will be
nothing more than taking a spinlock and adding the record to shared
memory. All the data for the record is readily available, does not
require further locking and can be collected locally before taking the
spinlock. The begin_timestamp is the transaction's idea of
CURRENT_TIMESTAMP, the commit_timestamp is what CommitTransaction()
just decided to write into the WAL commit record, and the
total_rowcount is the sum of inserted, updated and deleted heap tuples
during the transaction, which should be easily available from the
statistics collector; if row stats are disabled, the datum would be zero.
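The commit-path fast path can be sketched like this. All names are hypothetical, and a pthread mutex stands in for PostgreSQL's spinlock; the point is that the record is assembled locally first, so the lock is held only to bump the counter and copy 32 bytes:

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical sketch of the commit-path insert described above. */
typedef struct { int64_t xid, begin_ts, commit_ts, rowcount; } XciRecord;

#define XCI_BUFSZ 1024
static XciRecord       xci_shared[XCI_BUFSZ];            /* "shared memory" */
static uint64_t        xci_next_serial = 0;
static pthread_mutex_t xci_lock = PTHREAD_MUTEX_INITIALIZER;

uint64_t
xci_insert(const XciRecord *rec)        /* rec was filled in lock-free */
{
    uint64_t serial;

    pthread_mutex_lock(&xci_lock);      /* "take the spinlock"         */
    serial = xci_next_serial++;         /* assign the commit-order slot */
    xci_shared[serial % XCI_BUFSZ] = *rec;  /* copy prebuilt record    */
    pthread_mutex_unlock(&xci_lock);

    return serial;                      /* goes into the WAL commit rec */
}
```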

The function will return the "sequence" number, which CommitTransaction()
in turn will record in the WAL commit record together with the
begin_timestamp. While both the begin and the commit timestamps are
crucial to determine what data a particular transaction should have
seen, the row count is not and will not be recorded in WAL.

Checkpoint handling will call a function to flush the shared buffers.
Together with this, the information from WAL records will be sufficient
to recover this data (except for row counts) during crash recovery.

Exposing the data will be done via a set returning function. The SRF
takes two arguments: the maximum number of rows to return and the last
serial number processed by the reader. The advantage of such an SRF is
that the result can be used in a query that right away delivers audit or
replication log information in transaction commit order. The SRF can
return an empty set if no further transactions have committed since, or
an error if data segments needed to answer the request have already been
purged.
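The empty-set vs. error distinction reduces to where the reader's position falls relative to the retained window. A plain C sketch, with all names and the oldest/newest bookkeeping assumed for illustration:

```c
#include <stdint.h>

typedef enum { XCI_ROWS, XCI_EMPTY, XCI_PURGED } XciReadStatus;

/* Given the reader's last processed serial and the window of serials
 * still on disk [oldest_kept, next_unused), decide what the SRF does. */
XciReadStatus
xci_read_status(uint64_t last_processed, uint64_t oldest_kept,
                uint64_t next_unused)
{
    uint64_t next = last_processed + 1;

    if (next >= next_unused)
        return XCI_EMPTY;   /* no further transactions committed since */
    if (next < oldest_kept)
        return XCI_PURGED;  /* needed segments already dropped: error  */
    return XCI_ROWS;        /* return up to max_rows records from here */
}
```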

Purging of the data will be possible in several different ways.
Autovacuum will call a function that drops segments of the data falling
outside the postgresql.conf limits for maximum age or data volume.
There will also be a function reserved for superusers to explicitly
purge the data up to a certain serial number.

Comments, suggestions?

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


From: "Greg Sabino Mullane" <greg(at)turnstep(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-05-23 20:48:19
Message-ID: 95a6a1132b8673c6d0cb6cab25fe441e@biglumber.com
Lists: pgsql-hackers



> Exposing the data will be done via a set returning function. The SRF
> takes two arguments. The maximum number of rows to return and the last
> serial number processed by the reader. The advantage of such SRF is that
> the result can be used in a query that right away delivers audit or
> replication log information in transaction commit order. The SRF can
> return an empty set if no further transactions have committed since, or
> an error if data segments needed to answer the request have already been
> purged.

In light of the proposed purging scheme, how would it be able to distinguish
between those two cases (nothing there yet vs. was there but purged)?

--
Greg Sabino Mullane greg(at)turnstep(dot)com
End Point Corporation http://www.endpoint.com/
PGP Key: 0x14964AC8 201005231646
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jan Wieck <JanWieck(at)yahoo(dot)com>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-05-24 00:38:14
Message-ID: AANLkTinwZYcmTNpu-v8sbLyqPR3EKfNGTBeWagdPx7Od@mail.gmail.com
Lists: pgsql-hackers

On Sun, May 23, 2010 at 4:21 PM, Jan Wieck <JanWieck(at)yahoo(dot)com> wrote:
> The system will have postgresql.conf options for enabling/disabling the
> whole shebang, how many shared buffers to allocate for managing access
> to the data and to define the retention period of the data based on data
> volume and/or age of the commit records.

It would be nice if this could just be managed out of shared_buffers
rather than needing to configure a separate pool just for this
feature. But, I'm not sure how much work that is, and if it turns out
to be too ugly then I'd say it's not a hard requirement. In general,
I think we talked during the meeting about the desirability of folding
specific pools into shared_buffers rather than managing them
separately, but I'm not aware that we have any cases where we do that
today so it might be hard (or not).

> Each record of the Transaction Commit Info consists of
>
>     txid          xci_transaction_id
>     timestamptz   xci_begin_timestamp
>     timestamptz   xci_commit_timestamp
>     int64         xci_total_rowcount
>
> 32 bytes total.

Are we sure it's worth including the row count? I wonder if we ought
to leave that out and let individual clients of the mechanism track
that if they're so inclined, especially since it won't be reliable
anyway.

> CommitTransaction() inside of xact.c will call a function, that inserts
> a new record into this array. The operation will for most of the time be
> nothing than taking a spinlock and adding the record to shared memory.
> All the data for the record is readily available, does not require
> further locking and can be collected locally before taking the spinlock.

What happens when you need to switch pages?

> The function will return the "sequence" number which CommitTransaction()
> in turn will record in the WAL commit record together with the
> begin_timestamp. While both, the begin as well as the commit timestamp
> are crucial to determine what data a particular transaction should have
> seen, the row count is not and will not be recorded in WAL.

It would certainly be better if we didn't have to bloat the commit xlog
records to do this. Is there any way to avoid that?

> Checkpoint handling will call a function to flush the shared buffers.
> Together with this, the information from WAL records will be sufficient
> to recover this data (except for row counts) during crash recovery.

Right.

> Exposing the data will be done via a set returning function. The SRF
> takes two arguments. The maximum number of rows to return and the last
> serial number processed by the reader. The advantage of such SRF is that
> the result can be used in a query that right away delivers audit or
> replication log information in transaction commit order. The SRF can
> return an empty set if no further transactions have committed since, or
> an error if data segments needed to answer the request have already been
> purged.
>
> Purging of the data will be possible in several different ways.
> Autovacuum will call a function that drops segments of the data that are
>  outside the postgresql.conf configuration with respect to maximum age
> or data volume. There will also be a function reserved for superusers to
> explicitly purge the data up to a certain serial number.

Dunno if autovacuuming this is the right way to go. Seems like that
could lead to replication breaks, and it's also more work than not
doing that. I'd just say that if you turn this on you're responsible
for pruning it, full stop.

> Anyone who trades liberty for security deserves neither
> liberty nor security. -- Benjamin Franklin

+1.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Greg Sabino Mullane <greg(at)turnstep(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-05-24 01:18:39
Message-ID: 4BF9D3EF.7000902@Yahoo.com
Lists: pgsql-hackers

On 5/23/2010 4:48 PM, Greg Sabino Mullane wrote:
>> Exposing the data will be done via a set returning function. The SRF
>> takes two arguments. The maximum number of rows to return and the last
>> serial number processed by the reader. The advantage of such SRF is that
>> the result can be used in a query that right away delivers audit or
>> replication log information in transaction commit order. The SRF can
>> return an empty set if no further transactions have committed since, or
>> an error if data segments needed to answer the request have already been
>> purged.
>
> In light of the proposed purging scheme, how would it be able to distinguish
> between those two cases (nothing there yet vs. was there but purged)?

There is a difference between an empty result set and an exception.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-05-24 01:44:27
Message-ID: 4BF9D9FB.7020606@Yahoo.com
Lists: pgsql-hackers

On 5/23/2010 8:38 PM, Robert Haas wrote:
> On Sun, May 23, 2010 at 4:21 PM, Jan Wieck <JanWieck(at)yahoo(dot)com> wrote:
>> The system will have postgresql.conf options for enabling/disabling the
>> whole shebang, how many shared buffers to allocate for managing access
>> to the data and to define the retention period of the data based on data
>> volume and/or age of the commit records.
>
> It would be nice if this could just be managed out of shared_buffers
> rather than needing to configure a separate pool just for this
> feature. But, I'm not sure how much work that is, and if it turns out
> to be too ugly then I'd say it's not a hard requirement. In general,
> I think we talked during the meeting about the desirability of folding
> specific pools into shared_buffers rather than managing them
> separately, but I'm not aware that we have any cases where we do that
> today so it might be hard (or not).

I'm not sure the retention policies of the shared buffer cache, the WAL
buffers, CLOG buffers and every other thing we try to cache are that
easy to fold into one single set of logic. But I'm all ears.

>
>> Each record of the Transaction Commit Info consists of
>>
>> txid xci_transaction_id
>> timestamptz xci_begin_timestamp
>> timestamptz xci_commit_timestamp
>> int64 xci_total_rowcount
>>
>> 32 bytes total.
>
> Are we sure it's worth including the row count? I wonder if we ought
> to leave that out and let individual clients of the mechanism track
> that if they're so inclined, especially since it won't be reliable
> anyway.

Nope, we (my belly and I) are not sure about the absolute worth of the
row count. It would be a convenient number to have there, but I can live
without it.

>
>> CommitTransaction() inside of xact.c will call a function, that inserts
>> a new record into this array. The operation will for most of the time be
>> nothing than taking a spinlock and adding the record to shared memory.
>> All the data for the record is readily available, does not require
>> further locking and can be collected locally before taking the spinlock.
>
> What happens when you need to switch pages?

Then the code will have to grab another free buffer or evict one.

>
>> The function will return the "sequence" number which CommitTransaction()
>> in turn will record in the WAL commit record together with the
>> begin_timestamp. While both, the begin as well as the commit timestamp
>> are crucial to determine what data a particular transaction should have
>> seen, the row count is not and will not be recorded in WAL.
>
> It would certainly be better if we didn't to bloat the commit xlog
> records to do this. Is there any way to avoid that?

If you can tell me how a crash recovering system can figure out what the
exact "sequence" number of the WAL commit record at hand should be,
let's rip it out.

>
>> Checkpoint handling will call a function to flush the shared buffers.
>> Together with this, the information from WAL records will be sufficient
>> to recover this data (except for row counts) during crash recovery.
>
> Right.
>
>> Exposing the data will be done via a set returning function. The SRF
>> takes two arguments. The maximum number of rows to return and the last
>> serial number processed by the reader. The advantage of such SRF is that
>> the result can be used in a query that right away delivers audit or
>> replication log information in transaction commit order. The SRF can
>> return an empty set if no further transactions have committed since, or
>> an error if data segments needed to answer the request have already been
>> purged.
>>
>> Purging of the data will be possible in several different ways.
>> Autovacuum will call a function that drops segments of the data that are
>> outside the postgresql.conf configuration with respect to maximum age
>> or data volume. There will also be a function reserved for superusers to
>> explicitly purge the data up to a certain serial number.
>
> Dunno if autovacuuming this is the right way to go. Seems like that
> could leave to replication breaks, and it's also more work than not
> doing that. I'd just say that if you turn this on you're responsible
> for pruning it, full stop.

It is an option. "Keep it until I tell you" is a perfectly valid
configuration option. One you probably don't want to forget about, but
valid nonetheless.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


From: "Greg Sabino Mullane" <greg(at)turnstep(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-05-24 13:30:01
Message-ID: fca84a02124e1ffc7879dd0e5c7c2c7e@biglumber.com
Lists: pgsql-hackers



> In light of the proposed purging scheme, how would it be able to distinguish
> between those two cases (nothing there yet vs. was there but purged)?

> There is a difference between an empty result set and an exception.

No, I meant how will the *function* know, if a superuser and/or some
background process can purge records at any time?

--
Greg Sabino Mullane greg(at)turnstep(dot)com
End Point Corporation http://www.endpoint.com/
PGP Key: 0x14964AC8 201005240928
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jan Wieck <JanWieck(at)yahoo(dot)com>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-05-24 13:37:38
Message-ID: AANLkTilZT1Z4YsWxjgxEptruaT3xWEaSJaJi6iO8s9MQ@mail.gmail.com
Lists: pgsql-hackers

On Sun, May 23, 2010 at 9:44 PM, Jan Wieck <JanWieck(at)yahoo(dot)com> wrote:
> I'm not sure the retention policies of the shared buffer cache, the WAL
> buffers, CLOG buffers and every other thing we try to cache are that easy to
> fold into one single set of logic. But I'm all ears.

I'm not sure either, although it seems like LRU ought to be good
enough for most things. I'm more worried about things like whether
the BufferDesc abstraction is going to get in the way.

>>> CommitTransaction() inside of xact.c will call a function, that inserts
>>> a new record into this array. The operation will for most of the time be
>>> nothing than taking a spinlock and adding the record to shared memory.
>>> All the data for the record is readily available, does not require
>>> further locking and can be collected locally before taking the spinlock.
>>
>> What happens when you need to switch pages?
>
> Then the code will have to grab another free buffer or evict one.

Hopefully not while holding a spin lock. :-)

>>> The function will return the "sequence" number which CommitTransaction()
>>> in turn will record in the WAL commit record together with the
>>> begin_timestamp. While both, the begin as well as the commit timestamp
>>> are crucial to determine what data a particular transaction should have
>>> seen, the row count is not and will not be recorded in WAL.
>>
>> It would certainly be better if we didn't to bloat the commit xlog
>> records to do this.  Is there any way to avoid that?
>
> If you can tell me how a crash recovering system can figure out what the
> exact "sequence" number of the WAL commit record at hand should be, let's
> rip it.

Hmm... could we get away with WAL-logging the next sequence number
just once per checkpoint? When you replay the checkpoint record, you
update the control file with the sequence number. Then all the
commits up through the next checkpoint just use consecutive numbers
starting at that value.
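Robert's scheme can be stated as a tiny sketch: replaying a checkpoint record resets the counter from the value logged there, and each subsequent replayed commit simply takes the next number, so no per-commit WAL field is needed. Function names are hypothetical:

```c
#include <stdint.h>

/* Recovery-side state: the next commit serial to hand out.  The
 * checkpoint record is assumed to carry this value once per cycle. */
static uint64_t next_commit_serial;

void
replay_checkpoint(uint64_t serial_in_checkpoint)
{
    next_commit_serial = serial_in_checkpoint;  /* reset from checkpoint */
}

uint64_t
replay_commit(void)
{
    return next_commit_serial++;  /* consecutive numbering until next ckpt */
}
```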

> It is an option. "Keep it until I tell you" is a perfectly valid
> configuration option. One you probably don't want to forget about, but valid
> none the less.

As Tom is fond of saying, if it breaks, you get to keep both pieces.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jan Wieck <janwieck(at)yahoo(dot)com>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-05-24 19:07:22
Message-ID: 1274727835-sup-4649@alvh.no-ip.org
Lists: pgsql-hackers

Excerpts from Robert Haas's message of Sun May 23 20:38:14 -0400 2010:
> On Sun, May 23, 2010 at 4:21 PM, Jan Wieck <JanWieck(at)yahoo(dot)com> wrote:
> > The system will have postgresql.conf options for enabling/disabling the
> > whole shebang, how many shared buffers to allocate for managing access
> > to the data and to define the retention period of the data based on data
> > volume and/or age of the commit records.
>
> It would be nice if this could just be managed out of shared_buffers
> rather than needing to configure a separate pool just for this
> feature.

FWIW we've talked about this for years -- see old discussions about how
pg_subtrans becomes a bottleneck in certain cases and you want to
enlarge the number of buffers allocated to it (probably easy to find by
searching posts from Jignesh). I'm guessing the new notify code would
benefit from this as well.

It'd be nice to have as a side effect, but if not, IMHO this proposal
could simply use a fixed buffer pool like all other slru.c callers until
someone gets around to fixing that. Adding more GUC switches for this
strikes me as overkill.

--
Álvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>


From: Dan Ports <drkp(at)csail(dot)mit(dot)edu>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-05-24 19:10:21
Message-ID: 20100524191021.GA53044@csail.mit.edu
Lists: pgsql-hackers

On Sun, May 23, 2010 at 04:21:58PM -0400, Jan Wieck wrote:
> In some systems (data warehousing, replication), the order of commits is
> important, since that is the order in which changes have become visible.
> This information could theoretically be extracted from the WAL, but
> scanning the entire WAL just to extract this tidbit of information would
> be excruciatingly painful.

This is very interesting to me as I've been doing some (research --
nowhere near production-level) work on building a transactional
application-level (i.e. memcached-like) cache atop Postgres. One of the
features I needed to support it was basically what you describe.

Without getting too far into the details of what I'm doing, I needed to
make it clear to a higher layer which commits were visible to a given
query. That is, I wanted to know both the order of commits and where
particular snapshots fit into this ordering. (A SnapshotData struct
obviously contains the visibility information, but a representation in
terms of the commit ordering is both more succinct and allows for easy
ordering comparisons).

Something you might want to consider, then, is adding an interface to
find out the timestamp of the current transaction's snapshot, i.e. the
timestamp of the most recent committed transaction visible to it. I
wouldn't expect this to be difficult to implement as transaction
completion/visibility is already synchronized via ProcArrayLock.

> Each record of the Transaction Commit Info consists of
>
> txid xci_transaction_id
> timestamptz xci_begin_timestamp
> timestamptz xci_commit_timestamp
> int64 xci_total_rowcount

Another piece of information that seems useful to provide here would be
the logical timestamp of the transaction, i.e. a counter that's
incremented by one for each transaction. But maybe that's implicit in
the log ordering?

I'm not clear on why the total rowcount is useful, but perhaps I'm
missing something obvious.

I've actually implemented some semblance of this on Postgres 8.2, but it
sounds like what you're interested in is more sophisticated. In
particular, I wasn't at all concerned with durability or WAL stuff, and
I had some specific requirements about when it was OK to purge the
data. Because of this (and very limited development time), I just
threw something together with a simple shared buffer.

I don't think I have any useful code to offer, but let me know if
there's some way I can help out.

Dan

--
Dan R. K. Ports MIT CSAIL http://drkp.net/


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Jan Wieck <janwieck(at)yahoo(dot)com>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-05-24 19:12:09
Message-ID: AANLkTinmiaGBI95OGnBWZUiyivAEeVhSLTZEMwpQwLMT@mail.gmail.com
Lists: pgsql-hackers

On Mon, May 24, 2010 at 3:07 PM, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> wrote:
> It'd be nice to have as a side effect, but if not, IMHO this proposal
> could simply use a fixed buffer pool like all other slru.c callers until
> someone gets around to fixing that.  Adding more GUC switches for this
> strikes me as overkill.

I agree.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Dan Ports <drkp(at)csail(dot)mit(dot)edu>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-05-24 21:53:50
Message-ID: 4BFAF56E.3090005@Yahoo.com
Lists: pgsql-hackers

On 5/24/2010 3:10 PM, Dan Ports wrote:
> I'm not clear on why the total rowcount is useful, but perhaps I'm
> missing something obvious.

It is a glimpse into the future. Several years of pain doing replication
work have taught me that knowing approximately how much work the next
chunk will be "before you select it all" is a really useful thing.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-05-25 16:03:25
Message-ID: 1274803405.6203.2008.camel@ebony
Lists: pgsql-hackers

On Sun, 2010-05-23 at 16:21 -0400, Jan Wieck wrote:

> In some systems (data warehousing, replication), the order of commits is
> important, since that is the order in which changes have become visible.
> This information could theoretically be extracted from the WAL, but
> scanning the entire WAL just to extract this tidbit of information would
> be excruciatingly painful.

I think it would be quite simple to read WAL. WALSender reads the WAL
file after it's been flushed, so it would be simple for it to read a blob
of WAL and then extract the commit order from it.

Overall though, it would be easier and more efficient to *add* info to
WAL and then do all this processing *after* WAL has been transported
elsewhere. Extracting info with DDL triggers, normal triggers, commit
order and everything else seems like too much work to me. Every other
RDBMS has moved away from trigger-based replication and we should give
that serious consideration also.

--
Simon Riggs www.2ndQuadrant.com


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Greg Sabino Mullane <greg(at)turnstep(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-05-25 19:58:14
Message-ID: 4BFC2BD6.704@Yahoo.com
Lists: pgsql-hackers

On 5/24/2010 9:30 AM, Greg Sabino Mullane wrote:
>> In light of the proposed purging scheme, how would it be able to distinguish
>> between those two cases (nothing there yet vs. was there but purged)?
>
>> There is a difference between an empty result set and an exception.
>
> No, I meant how will the *function* know, if a superuser and/or some
> background process can purge records at any time?

The data contains timestamps which are supposedly taken in commit order.
Checking the age of the last entry in a file should be simple enough to
determine whether that segment exceeds the "max age" configuration (if
set). In the case of a superuser deciding what to purge, he would just
call a function with a serial number (identifying the obsolete segments).
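The two purge triggers described above can be sketched as one predicate. The GUC names are not part of the proposal; timestamps are assumed to be microsecond counts as with PostgreSQL's integer datetimes:

```c
#include <stdbool.h>
#include <stdint.h>

typedef int64_t TimestampTz;  /* microseconds, as in PostgreSQL */

/* A segment is droppable when a superuser has explicitly purged past
 * its last serial, or when even its newest (last) commit timestamp is
 * older than the assumed "max age" retention setting. */
bool
xci_segment_droppable(TimestampTz seg_newest_commit_ts,
                      TimestampTz now, int64_t max_age_usec,
                      uint64_t seg_last_serial, uint64_t purged_up_to)
{
    if (seg_last_serial < purged_up_to)
        return true;                    /* explicit superuser purge */
    if (max_age_usec > 0 && seg_newest_commit_ts < now - max_age_usec)
        return true;                    /* past the retention window */
    return false;
}
```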

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Greg Sabino Mullane <greg(at)turnstep(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-05-25 20:16:08
Message-ID: 28952.1274818568@sss.pgh.pa.us
Lists: pgsql-hackers

Jan Wieck <JanWieck(at)Yahoo(dot)com> writes:
>> No, I meant how will the *function* know, if a superuser and/or some
>> background process can purge records at any time?

> The data contains timestamps which are supposedly taken in commit order.

You can *not* rely on the commit timestamps to be in exact order.
(Perhaps approximate ordering is good enough for what you want here,
but just be careful to not fall into the trap of assuming that they're
exactly ordered.)

regards, tom lane


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-05-25 20:41:03
Message-ID: 4BFC35DF.7060103@Yahoo.com
Lists: pgsql-hackers

On 5/25/2010 12:03 PM, Simon Riggs wrote:
> On Sun, 2010-05-23 at 16:21 -0400, Jan Wieck wrote:
>
>> In some systems (data warehousing, replication), the order of commits is
>> important, since that is the order in which changes have become visible.
>> This information could theoretically be extracted from the WAL, but
>> scanning the entire WAL just to extract this tidbit of information would
>> be excruciatingly painful.
>
> I think it would be quite simple to read WAL. WALSender reads the WAL
> file after its been flushed, so it would be simple for it to read a blob
> of WAL and then extract the commit order from it.
>
> Overall though, it would be easier and more efficient to *add* info to
> WAL and then do all this processing *after* WAL has been transported
> elsewhere. Extracting info with DDL triggers, normal triggers, commit
> order and everything else seems like too much work to me. Every other
> RDBMS has moved away from trigger-based replication and we should give
> that serious consideration also.

Reading the entire WAL just to find all COMMIT records, then go back to
the origin database to get the actual replication log you're looking for
is simpler and more efficient? I don't think so.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-05-25 20:50:21
Message-ID: 1274820621.6203.2815.camel@ebony
Lists: pgsql-hackers

On Tue, 2010-05-25 at 16:41 -0400, Jan Wieck wrote:
> On 5/25/2010 12:03 PM, Simon Riggs wrote:
> > On Sun, 2010-05-23 at 16:21 -0400, Jan Wieck wrote:
> >
> >> In some systems (data warehousing, replication), the order of commits is
> >> important, since that is the order in which changes have become visible.
> >> This information could theoretically be extracted from the WAL, but
> >> scanning the entire WAL just to extract this tidbit of information would
> >> be excruciatingly painful.
> >
> > I think it would be quite simple to read WAL. WALSender reads the WAL
> > file after its been flushed, so it would be simple for it to read a blob
> > of WAL and then extract the commit order from it.
> >
> > Overall though, it would be easier and more efficient to *add* info to
> > WAL and then do all this processing *after* WAL has been transported
> > elsewhere. Extracting info with DDL triggers, normal triggers, commit
> > order and everything else seems like too much work to me. Every other
> > RDBMS has moved away from trigger-based replication and we should give
> > that serious consideration also.
>
> Reading the entire WAL just to find all COMMIT records, then go back to
> the origin database to get the actual replication log you're looking for
> is simpler and more efficient? I don't think so.

Agreed, but I think I've not explained myself well enough.

I proposed two completely separate ideas; the first one was this:

If you must get commit order, get it from WAL on *origin*, using the
exact same code that the current WALSender provides, plus some logic to
read through the WAL records and extract commits/aborts. That seems much
simpler than the proposal you outlined and, as SR shows, it's low latency
as well since commits write to WAL. No need to generate event ticks
either, just use XLogRecPtrs as WALSender already does.
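A minimal sketch of that extraction, assuming an already-decoded record stream (the type and field names below are illustrative stand-ins, not the actual xlog structures):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-ins; real WAL records carry much more information. */
typedef enum { REC_HEAP_CHANGE, REC_COMMIT, REC_ABORT, REC_CHECKPOINT } RecType;

typedef struct {
    uint64_t lsn;   /* stands in for XLogRecPtr */
    uint32_t xid;
    RecType  type;
} WalRec;

/*
 * Scan a chunk of decoded records and copy out the XIDs of commit records.
 * Because WAL is written in LSN order, the result is the commit order.
 */
size_t extract_commit_order(const WalRec *recs, size_t n,
                            uint32_t *commit_xids, size_t max)
{
    size_t found = 0;

    for (size_t i = 0; i < n && found < max; i++)
        if (recs[i].type == REC_COMMIT)
            commit_xids[found++] = recs[i].xid;
    return found;
}
```

No event ticks are needed in this scheme; the LSN of each commit record already provides a restart position, which is the point about XLogRecPtrs above.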

I see no problem with integrating that into core, technically or
philosophically.

--
Simon Riggs www.2ndQuadrant.com


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-05-25 21:10:05
Message-ID: 4BFC3CAD.8050908@Yahoo.com

On 5/25/2010 4:50 PM, Simon Riggs wrote:
> On Tue, 2010-05-25 at 16:41 -0400, Jan Wieck wrote:
>> On 5/25/2010 12:03 PM, Simon Riggs wrote:
>> > On Sun, 2010-05-23 at 16:21 -0400, Jan Wieck wrote:
>> >
>> >> In some systems (data warehousing, replication), the order of commits is
>> >> important, since that is the order in which changes have become visible.
>> >> This information could theoretically be extracted from the WAL, but
>> >> scanning the entire WAL just to extract this tidbit of information would
>> >> be excruciatingly painful.
>> >
>> > I think it would be quite simple to read WAL. WALSender reads the WAL
>> > file after its been flushed, so it would be simple for it to read a blob
>> > of WAL and then extract the commit order from it.
>> >
>> > Overall though, it would be easier and more efficient to *add* info to
>> > WAL and then do all this processing *after* WAL has been transported
>> > elsewhere. Extracting info with DDL triggers, normal triggers, commit
>> > order and everything else seems like too much work to me. Every other
>> > RDBMS has moved away from trigger-based replication and we should give
>> > that serious consideration also.
>>
>> Reading the entire WAL just to find all COMMIT records, then go back to
>> the origin database to get the actual replication log you're looking for
>> is simpler and more efficient? I don't think so.
>
> Agreed, but I think I've not explained myself well enough.
>
> I proposed two completely separate ideas; the first one was this:
>
> If you must get commit order, get it from WAL on *origin*, using exact
> same code that current WALSender provides, plus some logic to read
> through the WAL records and extract commit/aborts. That seems much
> simpler than the proposal you outlined and as SR shows, its low latency
> as well since commits write to WAL. No need to generate event ticks
> either, just use XLogRecPtrs as WALSender already does.
>
> I see no problem with integrating that into core, technically or
> philosophically.
>

Which means that if I want to allow a consumer of that commit-order data
to go offline for three days or so to replicate the 5 requested,
low-volume tables, the origin needs to hang on to the entire WAL from
all 100 other high-volume tables?

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Sabino Mullane <greg(at)turnstep(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-05-26 02:44:23
Message-ID: 4BFC8B07.6080006@Yahoo.com

On 5/25/2010 4:16 PM, Tom Lane wrote:
> Jan Wieck <JanWieck(at)Yahoo(dot)com> writes:
>>> No, I meant how will the *function* know, if a superuser and/or some
>>> background process can purge records at any time?
>
>> The data contains timestamps which are supposedly taken in commit order.
>
> You can *not* rely on the commit timestamps to be in exact order.
> (Perhaps approximate ordering is good enough for what you want here,
> but just be careful to not fall into the trap of assuming that they're
> exactly ordered.)

I am well aware of the fact that commit timestamps within the WAL can go
backwards and that the serial numbers of this proposed implementation of
commit order can even be different from what the timestamps AND the WAL
are saying.

As long as the serial number (record position inside of a segment) is
determined while the transaction still holds all its locks, this is
going to be good enough for what async replication users are used to
today. Again, it will not magically make it possible to determine, post
mortem, a serializable order of actions performed by transactions
running in read committed isolation level. I don't even think that is
possible at all.

And I don't think anyone proposed a solution for that problem anyway.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Jan Wieck <JanWieck(at)yahoo(dot)com>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-05-26 14:04:21
Message-ID: AANLkTinqlFtvpS-45F354qVSBJ49MHjBRG1ucMc75HZo@mail.gmail.com

On Sun, May 23, 2010 at 9:21 PM, Jan Wieck <JanWieck(at)yahoo(dot)com> wrote:
> Each record of the Transaction Commit Info consists of
>
>     txid          xci_transaction_id
>     timestamptz   xci_begin_timestamp
>     timestamptz   xci_commit_timestamp
>     int64         xci_total_rowcount
>

So I think you're going about this backwards.

Instead of discussing implementation I think you should start with the
API the replication system needs. In particular I'm not sure you
really want a server-side query at all. I'm wondering if you wouldn't
be better off with a public machine-parsable text format version of
the WAL. I.e., at the same time as writing out all the nitty gritty to
the binary WAL, we would write out a summary of the public data to an
XML version containing just the parts of the data stream that we can
promise won't change, such as transaction id, LSN, and timestamp.

--
greg


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-05-26 15:11:51
Message-ID: 4BFD3A37.3070104@Yahoo.com

On 5/26/2010 10:04 AM, Greg Stark wrote:
> On Sun, May 23, 2010 at 9:21 PM, Jan Wieck <JanWieck(at)yahoo(dot)com> wrote:
>> Each record of the Transaction Commit Info consists of
>>
>> txid xci_transaction_id
>> timestamptz xci_begin_timestamp
>> timestamptz xci_commit_timestamp
>> int64 xci_total_rowcount
>>
>
> So I think you're going about this backwards.
>
> Instead of discussing implementation I think you should start with the
> API the replication system needs. In particular I'm not sure you
> really want a server-side query at all. I'm wondering if you wouldn't
> be better off with a public machine-parsable text format version of
> the WAL. Ie, at the same time as writing out all the nitty gritty to
> the binary wal we would write out a summary of public data to an xml
> version containing just parts of the data stream that we can promise
> won't change, such as transaction id, lsn, timestamp.

Since the actual row-level change information and other event data is
found inside of regular tables, identified by TXID and sequence number,
I am pretty sure I want that data in a server-side query. What you are
proposing is to read the XIDs and timestamps with an external process
that would then have to reside on the DB server itself (neither
Londiste nor Slony has that requirement as of today), then bring the
data back into the DB, at least inside the WHERE clause of a query.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


From: Steve Singer <ssinger(at)ca(dot)afilias(dot)info>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-05-26 15:43:45
Message-ID: 4BFD41B1.8030803@ca.afilias.info

Jan Wieck wrote:

> Since the actual row level change information and other event data is
> found inside of regular tables, identified by TXID and sequence number,
> I am pretty sure I want that data in a server-side query. What you are
> proposing is to read the xid's and timestamps with an external process,
> that now forcibly needs to reside on the DB server itself (neither
> Londiste nor Slony have that requirement as of today), then bring it
> back into the DB at least inside the WHERE clause of a query.

It depends on how you approach the problem.

If you had a process that could scan WAL files (or a platform/version
independent representation of these WAL files) you could run that
process on any server (the origin server, a replica, or some third
server with the software installed). Where you run it involves making
trade-offs among the costs of storing, transferring, and processing the
files, and would ideally be configurable.

You could then have a process that transfers all of the data logged by
the triggers to the replicas as soon as it is committed, basically
saying 'copy any rows in sl_log from the origin to the replica that we
haven't already sent to that replica'.

You could then move the work of figuring out the commit order onto the
replica where you would combine the output of the WAL scanning process
with the transaction data that has been copied to the replica.

>
>
> Jan
>

--
Steve Singer
Afilias Canada
Data Services Developer
416-673-1142


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Steve Singer <ssinger(at)ca(dot)afilias(dot)info>
Cc: Jan Wieck <JanWieck(at)yahoo(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-05-26 16:03:42
Message-ID: AANLkTim96qA0juGeSLo4qYDrkq3SOknIbqrC4aUtxhX7@mail.gmail.com

On Wed, May 26, 2010 at 11:43 AM, Steve Singer <ssinger(at)ca(dot)afilias(dot)info> wrote:
> Jan Wieck wrote:
>
>> Since the actual row level change information and other event data is
>> found inside of regular tables, identified by TXID and sequence number, I am
>> pretty sure I want that data in a server-side query. What you are proposing
>> is to read the xid's and timestamps with an external process, that now
>> forcibly needs to reside on the DB server itself (neither Londiste nor Slony
>> have that requirement as of today), then bring it back into the DB at least
>> inside the WHERE clause of a query.
>
>
> It depends on how you approach the problem.
>
> If you had a process that could scan WAL files (or a platform/version
> independent representation of these WAL files) you could run that process on
> any server (the origin server, a replica, or some third server with the
> software installed).  Where you run it involves making trade-offs on the
> costs of storing transferring and processing the files and would ideally be
> configurable.
>
> You could then have a process that transfers all of the data logged by the
> triggers to the replicas as soon as it is committed.  Basically saying 'copy
> any rows in sl_log from the origin to the replica that we haven't already
> sent to that replica'
>
> You could then move the work of figuring out the commit order onto the
> replica where you would combine the output of the WAL scanning process with
> the transaction data that has been copied to the replica.

I'm sure it's possible to make this work however you want to do it,
but I don't really see what advantage Greg Stark's proposal has over
Jan's original proposal. Recording the commits in one extra place at
commit time is practically free, especially compared to the overall
cost of replication. Rescanning the WAL seems likely to be much more
expensive and potentially introduces more failure paths.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-05-26 16:10:18
Message-ID: 4BFD47EA.7090307@Yahoo.com

On 5/26/2010 10:04 AM, Greg Stark wrote:
> Instead of discussing implementation I think you should start with the
> API the replication system needs.

... but to answer that request, actually I don't even think we should be
discussing API specifics.

During PGCon, Marko Kreen, Jim Nasby and I discussed what the
requirements of a unified message queue, shared by Londiste and Slony,
might look like. Some use cases of pgq have no interest in user table
changes at all; they simply need a reliable, database-backed message
passing system.

Today both systems use an "agreeable" order of changes, selected by
rather expensive queries based on serializable snapshot information and
a global, non-cacheable serial number.

This could be replaced with logic based on the actual commit order of
the transactions. This order does not need to be 100% accurate. As long
as the order is recorded after all user actions have been performed
(trigger queue shut down) and while the transaction is still holding
onto its locks, that order is good enough. It will not allow a
conflicting transaction, waiting on locks to be released, to appear to
have committed before the lock-conflict winner.

It is obvious that in use cases where only small portions, or even
none, of the user table changes are needed, holding on to or even
parsing the ENTIRE WAL sounds suboptimal.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Jan Wieck <JanWieck(at)yahoo(dot)com>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-05-26 16:38:51
Message-ID: AANLkTilUumngFqYITjmLUVCP3orXSnX311mthDapXBNr@mail.gmail.com

On Wed, May 26, 2010 at 5:10 PM, Jan Wieck <JanWieck(at)yahoo(dot)com> wrote:
> ... but to answer that request, actually I don't even think we should be
> discussing API specifics.
>

How about just API generalities? Like, where do you need this data, on
the master or on the slave? Would PGXC like it on the transaction
coordinator?

What question do you need to answer? Do you need to pull out sets of
commits in certain ranges, or look up specific transaction ids and find
out when they committed? Or do you only need to answer which of two
transaction ids committed first?

--
greg


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-05-26 17:17:31
Message-ID: 4BFD57AB.3060306@enterprisedb.com

Could you generate the commit-order log by simply registering a commit
hook (RegisterXactCallback(XACT_EVENT_COMMIT)) that writes such a log
somewhere in the data directory? That would work with older versions
too, no server changes required.

It would not get called during recovery, but I believe that would be
sufficient for Slony. You could always batch commits whose exact commit
order you don't know as if they had committed simultaneously.
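A self-contained mock of that idea, for illustration only: the real entry point is RegisterXactCallback() and the event of interest would be XACT_EVENT_COMMIT, but the names below are stand-ins, and a real hook would append to a file in the data directory rather than to an in-memory array:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for XactEvent / XactCallback / RegisterXactCallback. */
typedef enum { MOCK_EVENT_COMMIT, MOCK_EVENT_ABORT } MockXactEvent;
typedef void (*MockXactCallback)(MockXactEvent event, void *arg);

#define MAX_CALLBACKS 8
static MockXactCallback callbacks[MAX_CALLBACKS];
static void *callback_args[MAX_CALLBACKS];
static int ncallbacks = 0;

static void mock_register_xact_callback(MockXactCallback cb, void *arg)
{
    callbacks[ncallbacks] = cb;
    callback_args[ncallbacks] = arg;
    ncallbacks++;
}

/* Invoked by the (mock) commit path while the transaction still holds
 * its locks, which is what makes the recorded order trustworthy. */
static void mock_commit_transaction(void)
{
    for (int i = 0; i < ncallbacks; i++)
        callbacks[i](MOCK_EVENT_COMMIT, callback_args[i]);
}

/* The replication hook: append the committing XID to the order log. */
static int commit_log[64];
static int commit_log_len = 0;

static void commit_order_hook(MockXactEvent event, void *arg)
{
    if (event == MOCK_EVENT_COMMIT)
        commit_log[commit_log_len++] = *(int *) arg;
}
```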

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-05-26 18:43:34
Message-ID: 4BFD6BD6.7010001@Yahoo.com

On 5/26/2010 1:17 PM, Heikki Linnakangas wrote:
> Could you generate the commit-order log by simply registering a commit
> hook (RegisterXactCallback(XACT_EVENT_COMMIT)) that writes such a log
> somewhere in the data directory? That would work with older versions
> too, no server changes required.
>

That would work, as it seems that the backend keeps holding on to its
locks until after calling the callbacks.

> It would not get called during recovery, but I believe that would be
> sufficient for Slony. You could always batch commits that you don't know
> when they committed as if they committed simultaneously.

Here you are mistaken. If the origin crashes but recovers transactions
whose commit-order records were not yet flushed, then the consumer has
no idea about the order of those commits, which throws us back to the
point where we require a non-cacheable global sequence to replay the
individual actions of those "now batched" transactions in an agreeable
order.

The commit order data needs to be covered by crash recovery.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-05-26 19:16:45
Message-ID: 4BFD739D.6010703@enterprisedb.com

On 26/05/10 21:43, Jan Wieck wrote:
> On 5/26/2010 1:17 PM, Heikki Linnakangas wrote:
>> It would not get called during recovery, but I believe that would be
>> sufficient for Slony. You could always batch commits that you don't
>> know when they committed as if they committed simultaneously.
>
> Here you are mistaken. If the origin crashes but can recover not yet
> flushed to xlog-commit-order transactions, then the consumer has no idea
> about the order of those commits, which throws us back to the point
> where we require a non cacheable global sequence to replay the
> individual actions of those "now batched" transactions in an agreeable
> order.
>
> The commit order data needs to be covered by crash recovery.

Perhaps I'm missing something, but I thought that Slony currently uses a
heartbeat, and all transactions committed between two beats are banged
together and committed as one in the slave so that their relative commit
order doesn't matter. Can we not do the same for commits missing from
the commit-order log?

I'm thinking that the commit-order log would contain two kinds of records:

a) Transaction with XID X committed
b) All transactions with XID < X committed

During normal operation we write the 1st kind of record at every commit.
After crash recovery (perhaps at the first commit after recovery or when
the slon daemon first polls the server, as there's no hook for
end-of-recovery), we write the 2nd kind of record.
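A sketch of how a consumer might interpret such a log (record layout hypothetical): transactions covered only by a record of the second kind all map to that record's position, i.e. they form one batch whose internal order is unknown:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The two record kinds described above (hypothetical layout). */
typedef enum {
    REC_XID_COMMITTED,        /* a) transaction with XID x committed   */
    REC_ALL_BELOW_COMMITTED   /* b) all transactions with XID < x done */
} LogKind;

typedef struct {
    LogKind  kind;
    uint32_t xid;
} CommitLogRec;

/*
 * Return the log position at which xid is known to have committed,
 * or -1 if the log says nothing about it.  Two XIDs that map to the
 * same REC_ALL_BELOW_COMMITTED position must be applied as one batch.
 */
long commit_position(const CommitLogRec *order_log, size_t n, uint32_t xid)
{
    for (size_t i = 0; i < n; i++) {
        if (order_log[i].kind == REC_XID_COMMITTED && order_log[i].xid == xid)
            return (long) i;
        if (order_log[i].kind == REC_ALL_BELOW_COMMITTED && xid < order_log[i].xid)
            return (long) i;
    }
    return -1;
}
```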

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Jan Wieck <JanWieck(at)Yahoo(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-05-26 20:11:50
Message-ID: m2iq6a1kzd.fsf@hi-media.com

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> Perhaps I'm missing something, but I thought that Slony currently uses a
> heartbeat, and all transactions committed between two beats are banged
> together and committed as one in the slave so that their relative commit
> order doesn't matter.

I guess Slony does the same as pgq here: all events of all those
transactions between two given ticks are batched together in the order
of the event commits. (In fact the batches are made at the consumer's
request, so they can span more than two ticks at a time.)

If you skip that event ordering (within transactions), you can't
maintain foreign keys on the slaves, among other things.

The idea of this proposal is to be able to get this commit order
directly from where the information is maintained, rather than use some
sort of user sequence for that.

So even ordering the txids and txid_snapshots with respect to WAL
commit time (LSN) won't be the whole story: for any given transaction
containing more than one event, we also need to have the events in
order. I know Jan didn't forget about it, so it must either be in the
proposal or easily derived; too tired to recheck.

Regards,
--
dim


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Jan Wieck <JanWieck(at)yahoo(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-05-26 20:19:42
Message-ID: AANLkTil0w_wT4mAnoKrgcX1nbcGzDYh0AXh7-tB0_1Mu@mail.gmail.com

On Wed, May 26, 2010 at 4:11 PM, Dimitri Fontaine
<dfontaine(at)hi-media(dot)com> wrote:
> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>> Perhaps I'm missing something, but I thought that Slony currently uses a
>> heartbeat, and all transactions committed between two beats are banged
>> together and committed as one in the slave so that their relative commit
>> order doesn't matter.
>
> I guess Slony does the same as pgq here: all events of all those
> transactions between two given ticks are batched together in the order
> of the event commits. (In fact the batches are made at the consumer
> request, so possibly spreading more than 2 ticks at a time).
>
> If you skip that event ordering (within transactions), you can't
> maintain foreign keys on the slaves, among other things.
>
> The idea of this proposal is to be able to get this commit order
> directly from where the information is maintained, rather than use some
> sort of user sequence for that.

Exactly.

> So even ordering the txid and txid_snapshots with respect to WAL commit
> time (LSN) won't be the whole story, for any given transaction
> containing more than one event we also need to have them in order. I
> know Jan didn't forget about it so it must either be in the proposal or
> easily derived, too tired to recheck.

Right, so the point is - with this proposal, he can switch to using a
LOCAL sequence number to order events within the session and then
order the sessions using the commit ordering. Right now, he has to
use a GLOBAL sequence number because there's no way to know the commit
order.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-05-26 20:32:53
Message-ID: 4BFD8575.30406@Yahoo.com

On 5/26/2010 3:16 PM, Heikki Linnakangas wrote:
> On 26/05/10 21:43, Jan Wieck wrote:
>> On 5/26/2010 1:17 PM, Heikki Linnakangas wrote:
>>> It would not get called during recovery, but I believe that would be
>>> sufficient for Slony. You could always batch commits that you don't
>>> know when they committed as if they committed simultaneously.
>>
>> Here you are mistaken. If the origin crashes but can recover not yet
>> flushed to xlog-commit-order transactions, then the consumer has no idea
>> about the order of those commits, which throws us back to the point
>> where we require a non cacheable global sequence to replay the
>> individual actions of those "now batched" transactions in an agreeable
>> order.
>>
>> The commit order data needs to be covered by crash recovery.
>
> Perhaps I'm missing something,

Apparently, more about that at the end.

> I'm thinking that the commit-order log would contain two kinds of records:
>
> a) Transaction with XID X committed
> b) All transactions with XID < X committed

If that were true, then long-running transactions would delay all
commits for transactions that started after them. Do they?

>
> During normal operation we write the 1st kind of record at every commit.
> After crash recovery (perhaps at the first commit after recovery or when
> the slon daemon first polls the server, as there's no hook for
> end-of-recovery), we write the 2nd kind of record.

I think the callback is also called during backend startup, which means
that it could record the first XID to come, which is known from the
control file; at that point, all lower XIDs have either committed or
aborted.

Which leads us to your missing piece above: the need for the global,
non-cacheable sequence.

Consider two transactions A and B that, due to transaction batching
between snapshots, get applied together. Let the order of actions be

1. A starts
2. B starts
3. B selects a row for update, then updates the row
4. A tries to do the same and blocks
5. B commits
6. A gets the lock, the row, does the update
7. A commits

If Slony (or Londiste) did not record the exact order of those
individual row actions, then it would have no idea whether, within that
batch, the action of B (the higher XID) actually came first. Without
that knowledge there is a 50/50 chance of getting your replica out of
sync with that simple conflict.
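That requirement can be sketched as follows: if each logged row action carries its transaction's commit position plus a per-backend local sequence number (both names illustrative), sorting the batch by that pair replays B's update before A's, regardless of which XID is higher:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One logged row action (names illustrative). */
typedef struct {
    int commit_order;   /* position of the owning xact in the commit log  */
    int local_seq;      /* per-backend counter ordering actions in a xact */
    int new_value;      /* value this action writes to the disputed row   */
} RowAction;

static int cmp_action(const void *pa, const void *pb)
{
    const RowAction *a = pa;
    const RowAction *b = pb;

    if (a->commit_order != b->commit_order)
        return a->commit_order - b->commit_order;
    return a->local_seq - b->local_seq;
}

/* Replay a batch against a single row in (commit_order, local_seq)
 * order and return the final row value. */
int replay(RowAction *actions, size_t n)
{
    int row = 0;

    qsort(actions, n, sizeof(RowAction), cmp_action);
    for (size_t i = 0; i < n; i++)
        row = actions[i].new_value;
    return row;
}
```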

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-05-26 20:36:43
Message-ID: 4BFD865B.90406@Yahoo.com

On 5/26/2010 4:11 PM, Dimitri Fontaine wrote:
> So even ordering the txid and txid_snapshots with respect to WAL commit
> time (LSN) won't be the whole story, for any given transaction
> containing more than one event we also need to have them in order. I
> know Jan didn't forget about it so it must either be in the proposal or
> easily derived, too tired to recheck.

No, that detail is actually not explained in the proposal. When applying
all changes in transaction commit order, there is no need for a global
sequence. A local counter per backend is sufficient, because the total
order of <xact-commit-order>, <local-xact-seq> yields a similarly
agreeable order of actions.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-05-26 20:45:10
Message-ID: 4BFD8856.8000608@enterprisedb.com

On 26/05/10 23:32, Jan Wieck wrote:
> Consider two transactions A and B that due to transaction batching
> between snapshots get applied together. Let the order of actions be
>
> 1. A starts
> 2. B starts
> 3. B selects a row for update, then updates the row
> 4. A tries to do the same and blocks
> 5. B commits
> 6. A gets the lock, the row, does the update
> 7. A commits
>
> If Slony (or Londiste) would not record the exact order of those
> individual row actions, then it would not have any idea if within that
> batch the action of B (higher XID) actually came first. Without that
> knowledge there is a 50/50 chance of getting your replica out of sync
> with that simple conflict.

Hmm, I don't see how even a fully reliable WAL-logged commit-order log
would save you then. It seems that you need to not only know the
relative order of commits, but the order of commits relative to actions
within the transactions. I.e. in the above example it's not enough to
know that B committed before A, you also have to know that A updated the
row only after B committed.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-05-26 20:49:06
Message-ID: 4BFD8942.7070207@Yahoo.com

On 5/26/2010 12:38 PM, Greg Stark wrote:
> On Wed, May 26, 2010 at 5:10 PM, Jan Wieck <JanWieck(at)yahoo(dot)com> wrote:
>> ... but to answer that request, actually I don't even think we should be
>> discussing API specifics.
>>
>
> How about just API generalities? Like, where do you need this data, on
> the master or on the slave? Would PGXC like it on the transaction
> coordinator?
>
> What question do you need to answer, do you need to pull out sets of
> commits in certain ranges or look up specific transaction ids and find
> out when they committed? Or do you only need to answer which of two
> transaction ids committed first?

The question I want answered is

"what was the order and XID of the next 0..n transactions that
committed after transaction X?"

Preferably I would avoid scanning the entire available WAL just to get
the next n XIDs to process.

The proposal assigned a unique serial number (driven by file segment
and position) to each XID and used that both for the ordering and for
identifying the last known transaction. That is certainly a premature
implementation detail.
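For illustration, under hypothetical sizing (32-byte records in 256 kB segment files, following the CLOG-style layout of the original proposal), the serial-to-file mapping could look like:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical sizing: 32-byte records in 256 kB segment files, so each
 * segment holds 8192 records.  As with CLOG, the segment number supplies
 * the high bits of the serial and the in-file offset the low bits. */
#define RECORD_SIZE      32
#define SEGMENT_BYTES    (256 * 1024)
#define RECS_PER_SEGMENT (SEGMENT_BYTES / RECORD_SIZE)

static uint64_t serial_to_segment(uint64_t serial)
{
    return serial / RECS_PER_SEGMENT;
}

static uint64_t serial_to_offset(uint64_t serial)
{
    return (serial % RECS_PER_SEGMENT) * RECORD_SIZE;
}

/* Render the segment file name from the serial's high bits. */
static void serial_to_filename(uint64_t serial, char *buf, size_t buflen)
{
    snprintf(buf, buflen, "%08llX",
             (unsigned long long) serial_to_segment(serial));
}
```

Finding "the next n transactions after serial X" is then a sequential read starting at that segment and offset, with no WAL scan involved.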

In this implementation it wouldn't even matter if a recorded
transaction never actually made it because the server crashed before
the WAL flush. It would be reported by this "commit order" feature, but
there would be no traces of whatever it did inside the DB, so that
anomaly is harmless.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-05-26 20:52:35
Message-ID: 4BFD8A13.7080106@enterprisedb.com

On 26/05/10 23:45, Heikki Linnakangas wrote:
> On 26/05/10 23:32, Jan Wieck wrote:
>> Consider two transactions A and B that due to transaction batching
>> between snapshots get applied together. Let the order of actions be
>>
>> 1. A starts
>> 2. B starts
>> 3. B selects a row for update, then updates the row
>> 4. A tries to do the same and blocks
>> 5. B commits
>> 6. A gets the lock, the row, does the update
>> 7. A commits
>>
>> If Slony (or Londiste) did not record the exact order of those
>> individual row actions, then it would not have any idea whether within
>> that batch the action of B (higher XID) actually came first. Without that
>> knowledge there is a 50/50 chance of getting your replica out of sync
>> with that simple conflict.
>
> Hmm, I don't see how even a fully reliable WAL-logged commit-order log
> would save you then. It seems that you need to not only know the
> relative order of commits, but the order of commits relative to actions
> within the transactions. I.e. in the above example it's not enough to
> know that B committed before A, you also have to know that A updated the
> row only after B committed.

Ok, I think I understand it now. The commit order is enough, because
replaying the actions in the order "all actions of B, then all actions
of A" yields the same result.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-05-26 21:12:26
Message-ID: 4BFD8EBA.70201@enterprisedb.com

On 26/05/10 23:49, Jan Wieck wrote:
> In this implementation it wouldn't even matter if a transaction that was
> recorded actually never made it because it crashed before the WAL flush.
> It would be reported by this "commit order" feature, but there would be
> no traces of whatever it did to be found inside the DB, so that anomaly
> is harmless.

Hmm, I think it would also not matter if the reported commit order
doesn't match exactly the order of the commit records, as long as
there's no dependency between the two transactions.

What I'm getting at is that I think it would be enough to establish the
commit order using deferred triggers that are fired during pre-commit
processing. The trigger could get a number from a global sequence to
establish the commit order, and write it to a table. So you still need a
global sequence, but it's only needed once per commit.

(You have to handle deferred triggers that fire after the commit-order
trigger, perhaps by getting another number from the global sequence and
replacing the previous number with it.)
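
The renumbering in that parenthetical could look roughly like this (a
toy model only; the in-memory queue and counter stand in for
PostgreSQL's deferred-trigger queue and a global SQL sequence):

```python
# Sketch of keeping the commit-order number "last": whenever another
# deferred trigger fires after the commit-order trigger has taken its
# number, a fresh number is taken to replace the stale one.

import itertools

global_seq = itertools.count(1)   # stand-in for a global sequence

def pre_commit(deferred_triggers):
    """Drain the deferred-trigger queue; return the final commit-order
    number, renumbering as long as more triggers keep firing."""
    commit_order = next(global_seq)
    while deferred_triggers:
        trigger = deferred_triggers.pop(0)   # may enqueue further work
        trigger()
        commit_order = next(global_seq)      # replace the previous number
    return commit_order
```

A transaction whose later deferred triggers keep firing simply ends up
with a later number, so the recorded order still matches commit order.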

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-05-26 21:13:11
Message-ID: 4BFD8EE7.1080701@Yahoo.com

On 5/26/2010 4:52 PM, Heikki Linnakangas wrote:

> Ok, I think I understand it now. The commit order is enough, because
> replaying the actions in the order "all actions of B, then all actions
> of A" yields the same result.

Precisely.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-05-26 23:09:25
Message-ID: 4BFDAA25.1030202@Yahoo.com

On 5/26/2010 5:12 PM, Heikki Linnakangas wrote:
> On 26/05/10 23:49, Jan Wieck wrote:
>> In this implementation it wouldn't even matter if a transaction that was
>> recorded actually never made it because it crashed before the WAL flush.
>> It would be reported by this "commit order" feature, but there would be
>> no traces of whatever it did to be found inside the DB, so that anomaly
>> is harmless.
>
> Hmm, I think it would also not matter if the reported commit order
> doesn't match exactly the order of the commit records, as long as
> there's no dependency between the two transactions.
>
> What I'm after is that I think it would be enough to establish the
> commit order using deferred triggers that are fired during pre-commit
> processing. The trigger could get a number from a global sequence to
> establish the commit order, and write it to a table. So you still need a
> global sequence, but it's only needed once per commit.

You're not trying to derail this thread into yet another of our famous
"commit trigger" battles, are you?

>
> (you have to handle deferred triggers that fire after the commit-order
> trigger. perhaps by getting another number from the global sequence and
> replacing the previous number with it)

I could imagine a commit trigger as a special case that is fired AFTER
the trigger queue has been shut down, so any operation that causes any
further triggers to fire would automatically abort the transaction. A
draconian, but reasonable restriction.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Jan Wieck <JanWieck(at)yahoo(dot)com>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-05-27 13:59:47
Message-ID: AANLkTikgs0WS2zipDRhiDDWzlfmxIrycHTrDmKGuyipw@mail.gmail.com

On Wed, May 26, 2010 at 5:38 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> How about just API generalities? Like, where do you need this data, on
> the master or on the slave? Would PGXC like it on the transaction
> coordinator?
>
> What question do you need to answer, do you need to pull out sets of
> commits in certain ranges or look up specific transaction ids and find
> out when they committed? Or do you only need to answer which of two
> transaction ids committed first?
>

This thread has been hard to follow for me. Were any of these
questions answered?

--
greg


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-05-27 16:01:05
Message-ID: 4BFE9741.40708@Yahoo.com

On 5/27/2010 9:59 AM, Greg Stark wrote:
> On Wed, May 26, 2010 at 5:38 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
>> How about just API generalities? Like, where do you need this data, on
>> the master or on the slave? Would PGXC like it on the transaction
>> coordinator?
>>
>> What question do you need to answer, do you need to pull out sets of
>> commits in certain ranges or look up specific transaction ids and find
>> out when they committed? Or do you only need to answer which of two
>> transaction ids committed first?
>>
>
> This thread has been hard to follow for me. Were any of these
> questions answered?
>

Yes.

On 5/26/2010 4:49 PM, Jan Wieck wrote:
> On 5/26/2010 12:38 PM, Greg Stark wrote:
>> > On Wed, May 26, 2010 at 5:10 PM, Jan Wieck <JanWieck(at)yahoo(dot)com> wrote:
>>> >> ... but to answer that request, actually I don't even think we should be
>>> >> discussing API specifics.
>>> >>
>> >
>> > How about just API generalities? Like, where do you need this data, on
>> > the master or on the slave? Would PGXC like it on the transaction
>> > coordinator?
>> >
>> > What question do you need to answer, do you need to pull out sets of
>> > commits in certain ranges or look up specific transaction ids and find
>> > out when they committed? Or do you only need to answer which of two
>> > transaction ids committed first?
>
> The question I want answered is
>
> "what was the order and xid of the next 0..n transactions, that
> committed after transaction X?"
>
> Preferably I would avoid scanning the entire available WAL just to get
> the next n xid's to process.
>
> The proposal assigned a unique serial number (file segment and position
> driven) to each xid and used that for the ordering as well as
> identification of the last known transaction. That is certainly a
> premature implementation detail.
>
> In this implementation it wouldn't even matter if a transaction that was
> recorded actually never made it because it crashed before the WAL flush.
> It would be reported by this "commit order" feature, but there would be
> no traces of whatever it did to be found inside the DB, so that anomaly
> is harmless.
>
>
> Jan
>

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-05-27 21:02:59
Message-ID: 4BFEDE03.9010208@Yahoo.com

On 5/27/2010 12:01 PM, Jan Wieck wrote:
> On 5/27/2010 9:59 AM, Greg Stark wrote:
>> This thread has been hard to follow for me. Were any of these
>> questions answered?
>>
>
> Yes.

The thing missing is any sort of answer to that problem description.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-05-28 23:19:54
Message-ID: 201005282319.o4SNJt314194@momjian.us

Jan Wieck wrote:
> >> Reading the entire WAL just to find all COMMIT records, then go back to
> >> the origin database to get the actual replication log you're looking for
> >> is simpler and more efficient? I don't think so.
> >
> > Agreed, but I think I've not explained myself well enough.
> >
> > I proposed two completely separate ideas; the first one was this:
> >
> > If you must get commit order, get it from WAL on *origin*, using exact
> > same code that current WALSender provides, plus some logic to read
> > through the WAL records and extract commit/aborts. That seems much
> > simpler than the proposal you outlined and as SR shows, its low latency
> > as well since commits write to WAL. No need to generate event ticks
> > either, just use XLogRecPtrs as WALSender already does.
> >
> > I see no problem with integrating that into core, technically or
> > philosophically.
> >
>
> Which means that if I want to allow a consumer of that commit order data
> to go offline for three days or so to replicate the 5 requested, low
> volume tables, the origin needs to hang on to the entire WAL log from
> all 100 other high volume tables?

I suggest writing an external tool that strips out what you need and
can be run at any time, rather than creating a new data format and
overhead for this use case.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Jan Wieck <JanWieck(at)Yahoo(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-05-29 01:07:51
Message-ID: 8B1E26F3-DD1E-4794-A878-582ED9338D4C@gmail.com

On May 28, 2010, at 7:19 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> Jan Wieck wrote:
>>>> Reading the entire WAL just to find all COMMIT records, then go
>>>> back to
>>>> the origin database to get the actual replication log you're
>>>> looking for
>>>> is simpler and more efficient? I don't think so.
>>>
>>> Agreed, but I think I've not explained myself well enough.
>>>
>>> I proposed two completely separate ideas; the first one was this:
>>>
>>> If you must get commit order, get it from WAL on *origin*, using
>>> exact
>>> same code that current WALSender provides, plus some logic to read
>>> through the WAL records and extract commit/aborts. That seems much
>>> simpler than the proposal you outlined and as SR shows, its low
>>> latency
>>> as well since commits write to WAL. No need to generate event ticks
>>> either, just use XLogRecPtrs as WALSender already does.
>>>
>>> I see no problem with integrating that into core, technically or
>>> philosophically.
>>>
>>
>> Which means that if I want to allow a consumer of that commit order
>> data
>> to go offline for three days or so to replicate the 5 requested, low
>> volume tables, the origin needs to hang on to the entire WAL log from
>> all 100 other high volume tables?
>
> I suggest writing an external tool that strips out what you need that
> can be run at any time, rather than creating a new data format and
> overhead for this usecase.

That would be FAR more complex, less robust, and less performant -
whereas doing what Jan has proposed is pretty straightforward and
should have minimal impact on performance - or none when not enabled.

...Robert


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-06-01 15:06:05
Message-ID: 4C0521DD.40806@Yahoo.com

On 5/28/2010 7:19 PM, Bruce Momjian wrote:
> Jan Wieck wrote:
>> >> Reading the entire WAL just to find all COMMIT records, then go back to
>> >> the origin database to get the actual replication log you're looking for
>> >> is simpler and more efficient? I don't think so.
>> >
>> > Agreed, but I think I've not explained myself well enough.
>> >
>> > I proposed two completely separate ideas; the first one was this:
>> >
>> > If you must get commit order, get it from WAL on *origin*, using exact
>> > same code that current WALSender provides, plus some logic to read
>> > through the WAL records and extract commit/aborts. That seems much
>> > simpler than the proposal you outlined and as SR shows, its low latency
>> > as well since commits write to WAL. No need to generate event ticks
>> > either, just use XLogRecPtrs as WALSender already does.
>> >
>> > I see no problem with integrating that into core, technically or
>> > philosophically.
>> >
>>
>> Which means that if I want to allow a consumer of that commit order data
>> to go offline for three days or so to replicate the 5 requested, low
>> volume tables, the origin needs to hang on to the entire WAL log from
>> all 100 other high volume tables?
>
> I suggest writing an external tool that strips out what you need that
> can be run at any time, rather than creating a new data format and
> overhead for this usecase.
>

Stripping it out from what?

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-06-01 15:09:06
Message-ID: 201006011509.o51F96U24923@momjian.us

Jan Wieck wrote:
> >> > I see no problem with integrating that into core, technically or
> >> > philosophically.
> >> >
> >>
> >> Which means that if I want to allow a consumer of that commit order data
> >> to go offline for three days or so to replicate the 5 requested, low
> >> volume tables, the origin needs to hang on to the entire WAL log from
> >> all 100 other high volume tables?
> >
> > I suggest writing an external tool that strips out what you need that
> > can be run at any time, rather than creating a new data format and
> > overhead for this usecase.
> >
>
> Stripping it out from what?

Stripping it from the WAL. Your system seems to require double-writes
on a commit, which is something we have avoided in the past.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ None of us is going to be here forever. +


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-06-01 16:21:57
Message-ID: 4C0533A5.1080302@Yahoo.com

On 6/1/2010 11:09 AM, Bruce Momjian wrote:
> Jan Wieck wrote:
>> >> > I see no problem with integrating that into core, technically or
>> >> > philosophically.
>> >> >
>> >>
>> >> Which means that if I want to allow a consumer of that commit order data
>> >> to go offline for three days or so to replicate the 5 requested, low
>> >> volume tables, the origin needs to hang on to the entire WAL log from
>> >> all 100 other high volume tables?
>> >
>> > I suggest writing an external tool that strips out what you need that
>> > can be run at any time, rather than creating a new data format and
>> > overhead for this usecase.
>> >
>>
>> Stripping it out from what?
>
> Stripping it from the WAL. Your system seems to require double-writes
> on a commit, which is something we have avoided in the past.
>

Your suggestion seems to be based on several false assumptions. This
neither requires additional physical writes on commit, nor is consuming
the entire WAL just to filter out commit records anything even remotely
desirable for systems like Londiste or Slony.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


From: Chris Browne <cbbrowne(at)acm(dot)org>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-06-02 17:45:34
Message-ID: 87eigpcoqp.fsf@ca.afilias.info

drkp(at)csail(dot)mit(dot)edu (Dan Ports) writes:
> I'm not clear on why the total rowcount is useful, but perhaps I'm
> missing something obvious.

It would make it easy to conclude:

"This next transaction did 8328194 updates. Maybe we should do
some kind of checkpoint (e.g. - commit transaction or such) before
working on it."

versus

"This transaction we're thinking of working on had 7 updates. No
big deal..."
--
(reverse (concatenate 'string "ofni.secnanifxunil" "@" "enworbbc"))
http://linuxfinances.info/info/finances.html
Rules of the Evil Overlord #189. "I will never tell the hero "Yes I
was the one who did it, but you'll never be able to prove it to that
incompetent old fool." Chances are, that incompetent old fool is
standing behind the curtain." <http://www.eviloverlord.com/>


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Chris Browne <cbbrowne(at)acm(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-06-02 23:49:00
Message-ID: AANLkTilWwMb2Q-fAFTS7KK5Ep8t7Szndd5bvTkItswEk@mail.gmail.com

On Wed, Jun 2, 2010 at 6:45 PM, Chris Browne <cbbrowne(at)acm(dot)org> wrote:
> It would make it easy to conclude:
>
>   "This next transaction did 8328194 updates.  Maybe we should do
>   some kind of checkpoint (e.g. - commit transaction or such) before
>   working on it."
>
>    versus
>
>   "This transaction we're thinking of working on had 7 updates.  No
>   big deal..."

I'm puzzled how you would define this value. How do you add 7 inserts,
7 deletes, and 7 updates? Is that 21 rows modified? Why are the 7
inserts and 7 deletes worth twice as much as the 7 updates when
they're basically the same thing? What if the inserts fired triggers
which inserted 7 more rows, is that 14? What if the 7 updates modified
2 TB of TOAST data but the 8238194 updates were all to the same record
and they were all HOT updates so all it did was change 8kB?

In any case you'll have all the actual data from your triggers or
hooks or whatever so what value does having the system keep track of
this add?

--
greg


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Chris Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-06-03 19:50:29
Message-ID: 4C080785.8040808@Yahoo.com

On 6/2/2010 7:49 PM, Greg Stark wrote:
> On Wed, Jun 2, 2010 at 6:45 PM, Chris Browne <cbbrowne(at)acm(dot)org> wrote:
>> It would make it easy to conclude:
>>
>> "This next transaction did 8328194 updates. Maybe we should do
>> some kind of checkpoint (e.g. - commit transaction or such) before
>> working on it."
>>
>> versus
>>
>> "This transaction we're thinking of working on had 7 updates. No
>> big deal..."
>
> I'm puzzled how you would define this value. How do you add 7 inserts,
> 7 deletes, and 7 updates? Is that 21 rows modified? Why are the 7
> inserts and 7 deletes worth twice as much as the 7 updates when
> they're basically the same thing? What if the inserts fired triggers
> which inserted 7 more rows, is that 14? What if the 7 updates modified
> 2 TB of TOAST data but the 8238194 updates were all to the same record
> and they were all HOT updates so all it did was change 8kB?
>
> In any case you'll have all the actual data from your triggers or
> hooks or whatever so what value does having the system keep track of
> this add?

The point is not that we don't have that information now. The point is
having a hint BEFORE wading through possibly gigabytes of WAL or log data.

If getting that information requires reading all the log data twice or
reading gigabytes of otherwise useless WAL data (as per Bruce's
suggestion), we had better not get it at all and just keep doing what we
are doing now.

I actually have a hard time understanding why people are so opposed to a
feature that has zero impact at all unless a DBA actually turns it ON.
What is the problem with exposing the commit order of transactions?

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Chris Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-06-03 20:04:51
Message-ID: 201006032004.o53K4pg28453@momjian.us

Jan Wieck wrote:
> The point is not that we don't have that information now. The point is
> having a hint BEFORE wading through possibly gigabytes of WAL or log data.
>
> If getting that information requires reading all the log data twice or
> reading gigabytes of otherwise useless WAL data (as per Bruce's
> suggestion), we had better not get it at all and just keep doing what we
> are doing now.
>
> I actually have a hard time understanding why people are so opposed to a
> feature that has zero impact at all unless a DBA actually turns it ON.
> What is the problem with exposing the commit order of transactions?

If you want to fork Postgres and add it, go ahead, but if the community
has to maintain the code and document it, we care.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ None of us is going to be here forever. +


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Greg Stark" <gsstark(at)mit(dot)edu>,"Jan Wieck" <JanWieck(at)Yahoo(dot)com>
Cc: "Chris Browne" <cbbrowne(at)acm(dot)org>,<pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-06-03 20:11:49
Message-ID: 4C07C6350200002500031ECD@gw.wicourts.gov

Jan Wieck <JanWieck(at)Yahoo(dot)com> wrote:

> I actually have a hard time understanding why people are so
> opposed to a feature that has zero impact at all unless a DBA
> actually turns it ON. What is the problem with exposing the
> commit order of transactions?

FWIW, once I came to understand the use case, it seems to me a
perfectly reasonable and useful thing to have. It does strike me
that there may be value in adding one more xid to support certain
types of integrity for some use cases, but that's certainly
something which could be added later, if at all. Once I realized
that, I just dropped out of the discussion; perhaps I should have
bowed out with an endorsement.

Unless my memory is failing me worse than usual, Dan Ports, who is
working on the serializable implementation so he can use the
predicate locking with a transaction-aware caching feature, needs
the ability to track commit order of transactions by xid; so the use
cases go beyond Slony and Londiste.

-Kevin


From: Chris Browne <cbbrowne(at)acm(dot)org>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-06-03 20:21:35
Message-ID: 871vcnx3xs.fsf@cbbrowne-laptop.afilias-int.info

gsstark(at)mit(dot)edu (Greg Stark) writes:
> On Wed, Jun 2, 2010 at 6:45 PM, Chris Browne <cbbrowne(at)acm(dot)org> wrote:
>> It would make it easy to conclude:
>>
>>   "This next transaction did 8328194 updates.  Maybe we should do
>>   some kind of checkpoint (e.g. - commit transaction or such) before
>>   working on it."
>>
>>    versus
>>
>>   "This transaction we're thinking of working on had 7 updates.  No
>>   big deal..."
>
> I'm puzzled how you would define this value. How do you add 7 inserts,
> 7 deletes, and 7 updates? Is that 21 rows modified? Why are the 7
> inserts and 7 deletes worth twice as much as the 7 updates when
> they're basically the same thing? What if the inserts fired triggers
> which inserted 7 more rows, is that 14? What if the 7 updates modified
> 2 TB of TOAST data but the 8238194 updates were all to the same record
> and they were all HOT updates so all it did was change 8kB?

The presence of those questions (and their ambiguity) is the reason
why there's a little squirming as to whether this is super-useful and
super-necessary.

What this offers is *SOME* idea of how much updating work a particular
transaction did. It's a bit worse than you suggest:

- If replication triggers have captured tuples, those would get
counted.

- TOAST updates might lead to extra updates being counted.

But back to where you started, I'd anticipate 7 inserts, 7 deletes,
and 7 updates being counted as something around 21 updates.

And if that included 5 TOAST changes, it might bump up to 26.

If there were replication triggers in place, that might bump the count
up to 45 (which I chose arbitrarily).

> In any case you'll have all the actual data from your triggers or
> hooks or whatever so what value does having the system keep track of
> this add?

This means that when we'd pull the list of transactions to consider,
we'd get something like:

select * from next_transactions('4218:23', 50);

[list of 50 transactions returned, each with...
-> txid
-> START timestamp
-> COMMIT timestamp
-> Approximate # of updates

Then, for each of the 50, I'd pull replication log data for the
corresponding transaction.

If I have the approximate # of updates, that might lead me to stop
short, and say:

"That next update looks like a doozy! I'm going to stop and commit
what I've got before doing that one."

It's not strictly necessary, but would surely be useful for flow
control.
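
That flow-control decision can be sketched in a few lines (illustrative
only; the threshold and the (txid, update count) pairs are made up, not
part of any proposed API):

```python
# Toy batching sketch: walk the commit-ordered transaction list and cut
# the replication batch short before a transaction whose approximate
# update count would blow past the threshold.

def plan_batch(transactions, max_updates=10000):
    """Return the txids to apply before the next COMMIT on the replica."""
    batch, total = [], 0
    for txid, n_updates in transactions:
        if batch and total + n_updates > max_updates:
            break                  # "that next update looks like a doozy"
        batch.append(txid)
        total += n_updates
    return batch

print(plan_batch([(101, 7), (102, 12), (103, 8328194)]))   # -> [101, 102]
```

The first transaction is always taken, even when huge, so the replica
still makes progress; the count only decides where a batch ends.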
--
select 'cbbrowne' || '@' || 'cbbrowne.com';
http://cbbrowne.com/info/internet.html
"MS apparently now has a team dedicated to tracking problems with
Linux and publicizing them. I guess eventually they'll figure out
this back fires... ;)" -- William Burrow <aa126(at)DELETE(dot)fan(dot)nb(dot)ca>


From: Chris Browne <cbbrowne(at)acm(dot)org>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-06-03 20:50:49
Message-ID: 87wrufvo0m.fsf@cbbrowne-laptop.afilias-int.info

bruce(at)momjian(dot)us (Bruce Momjian) writes:
> Jan Wieck wrote:
>> The point is not that we don't have that information now. The point is
>> having a hint BEFORE wading through possibly gigabytes of WAL or log data.
>>
>> If getting that information requires reading all the log data twice or
>> reading gigabytes of otherwise useless WAL data (as per Bruce's
>> suggestion), we had better not get it at all and just keep doing what we
>> are doing now.
>>
>> I actually have a hard time understanding why people are so opposed to a
>> feature that has zero impact at all unless a DBA actually turns it ON.
>> What is the problem with exposing the commit order of transactions?
>
> If you want to fork Postgres and add it, go ahead, but if the community
> has to maintain the code and document it, we care.

Are you "caring" or "opposing"? It seems rather uncharitable to imply
that Jan doesn't care.

I know *I'm* not interested in a forked Postgres for this - I would
prefer to find out what things could be done that don't involve gross
amounts of WAL file grovelling for data that mayn't necessarily even
be available.
--
select 'cbbrowne' || '@' || 'cbbrowne.com';
http://cbbrowne.com/info/internet.html
"MS apparently now has a team dedicated to tracking problems with
Linux and publicizing them. I guess eventually they'll figure out
this back fires... ;)" -- William Burrow <aa126(at)DELETE(dot)fan(dot)nb(dot)ca>


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Chris Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-06-03 21:07:51
Message-ID: 4C0819A7.5060409@Yahoo.com

On 6/3/2010 4:04 PM, Bruce Momjian wrote:
> If you want to fork Postgres and add it, go ahead, but if the community
> has to maintain the code and document it, we care.

That comment was rather unprofessional. I think the rest of us are still
trying to find the best solution to the problem, not to kill the
discussion. You may want to rejoin that effort.

I care about an efficient, low-overhead way to get certain information
that is otherwise extremely difficult, expensive and version-dependent
to obtain.

I care about cleaning up more of the mistakes made in the original
development of Slony, namely using hacks and kluges to implement details
not supported by the then-current version of PostgreSQL. Londiste and
Slony made a good leap on that with the txid data type. Slony made
another step like that with 2.0, switching to the native trigger
configuration (developed and contributed for that very purpose) instead
of hacking system catalogs. This would be another step in that
direction, and we would be able to unify Londiste's and Slony's
transport mechanisms and eliminate the tick/sync kluge.

Care to explain what exactly you care about?

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Jan Wieck <JanWieck(at)yahoo(dot)com>
Cc: Chris Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-06-03 21:58:19
Message-ID: AANLkTilUsoN34UmSrYhYRiMHyOyu1lmVvNjQhX3uTjcZ@mail.gmail.com

On Thu, Jun 3, 2010 at 8:50 PM, Jan Wieck <JanWieck(at)yahoo(dot)com> wrote:
>> I'm puzzled how you would define this value. How do you add 7 inserts,
>> 7 deletes, and 7 updates? Is that 21 rows modified?
>
> I actually have a hard time understanding why people are so opposed to a
> feature that has zero impact at all unless a DBA actually turns it ON. What
> is the problem with exposing the commit order of transactions?

The post you were responding to was regarding the meaninglessness of
the "number of records" attribute you wanted. Your response is a non
sequitur.

I think the commit order of transactions would be a good thing to
expose though I've asked repeatedly what kind of interface you need
and never gotten answers to all the questions.

--
greg


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Greg Stark" <gsstark(at)mit(dot)edu>
Cc: "Chris Browne" <cbbrowne(at)acm(dot)org>,<pgsql-hackers(at)postgresql(dot)org>, "Jan Wieck" <JanWieck(at)Yahoo(dot)com>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-06-03 22:18:17
Message-ID: 4C07E3D90200002500031EF0@gw.wicourts.gov

Greg Stark <gsstark(at)mit(dot)edu> wrote:

> what kind of interface you need

For the potential uses I can see, it would be great to have an SRF
which took two parameters: the xid of the last known commit and a limit
on how many commits past that to return. Perhaps a negative number could
move earlier in time, if that seems reasonable to others. I think
that's also consistent with Jan's posts. A GUC to enable it and
some way to specify retention (or force cleanup) are the only other
user-facing features which come to mind for me. (Not sure what form
that last should take, but didn't Jan say something about both of
these early in the thread?)

Do you see a need for something else (besides, obviously, docs)?

-Kevin


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Greg Stark" <gsstark(at)mit(dot)edu>, "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: "Chris Browne" <cbbrowne(at)acm(dot)org>,<pgsql-hackers(at)postgresql(dot)org>, "Jan Wieck" <JanWieck(at)Yahoo(dot)com>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-06-03 22:24:46
Message-ID: 4C07E55E0200002500031EF8@gw.wicourts.gov

"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> Greg Stark <gsstark(at)mit(dot)edu> wrote:
>
>> what kind of interface you need
>
> For the potential uses I can see, it would be great to have a SRF
> which took two parameters: xid of last known commit and a limit
> how many commits past that to return.

Jan's very first post had it right; my idea was flawed:

| Exposing the data will be done via a set returning function. The
| SRF takes two arguments. The maximum number of rows to return and
| the last serial number processed by the reader. The advantage of
| such SRF is that the result can be used in a query that right away
| delivers audit or replication log information in transaction
| commit order. The SRF can return an empty set if no further
| transactions have committed since, or an error if data segments
| needed to answer the request have already been purged.
|
| Purging of the data will be possible in several different ways.
| Autovacuum will call a function that drops segments of the data
| that are outside the postgresql.conf configuration with respect to
| maximum age or data volume. There will also be a function reserved
| for superusers to explicitly purge the data up to a certain serial
| number.
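The semantics Jan laid out can be modeled in miniature. The following is a toy Python sketch of the proposal only (class, method, and field names are all hypothetical, not actual PostgreSQL code): an in-memory list stands in for the segment files, the two-argument accessor mirrors the SRF, returning an empty set when the reader is caught up and raising an error when the requested serials have already been purged.

```python
class CommitInfo:
    """Toy model of the proposed commit-order SRF semantics (hypothetical names)."""

    def __init__(self):
        self.records = []      # (serial, xid, commit_ts) tuples, in commit order
        self.purged_below = 0  # lowest serial number still retained

    def append(self, xid, commit_ts):
        # Serial numbers are assigned in commit order and never reused.
        serial = self.purged_below + len(self.records)
        self.records.append((serial, xid, commit_ts))

    def get_commits(self, last_serial, max_rows):
        """Mirror of the proposed SRF: up to max_rows records past last_serial.

        Returns an empty list if nothing newer has committed; raises if the
        needed segments were already purged. Pass -1 for a fresh reader.
        """
        if last_serial + 1 < self.purged_below:
            raise LookupError("commit-info segments already purged")
        start = last_serial + 1 - self.purged_below
        return self.records[start:start + max_rows]

    def purge_upto(self, serial):
        """Drop records with serial < serial (the explicit superuser purge)."""
        drop = max(0, serial - self.purged_below)
        self.records = self.records[drop:]
        self.purged_below += drop
```

A reader would loop calling `get_commits` with the last serial it processed, exactly as the quoted sketch describes; retention by age or volume would simply call `purge_upto` from a background task.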

Apologies for not looking back to the start of the thread before
that last post. It was all laid out right at the start.

-Kevin


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Chris Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-06-03 22:29:13
Message-ID: 201006032229.o53MTDd18318@momjian.us

Jan Wieck wrote:
> On 6/3/2010 4:04 PM, Bruce Momjian wrote:
> > If you want to fork Postgres and add it, go ahead, but if the community
> > has to maintain the code and document it, we care.
>
> That comment was rather unprofessional. I think the rest of us still try
> to find the best solution for the problem, not kill the discussion. You
> may want to rejoin that effort.
>
> I care about an efficient, low overhead way to get a certain
> information, that is otherwise extremely difficult, expensive and
> version dependent to get.
>
> I care about cleaning up more of the mistakes, made in the original
> development of Slony. Namely using hacks and kluges to implement
> details, not supported by a current version of PostgreSQL. Londiste and
> Slony made a good leap on that with the txid data type. Slony made
> another step like that with 2.0, switching to the (for that very purpose
> developed and contributed) native trigger configuration instead of
> hacking system catalogs. This would be another step in that direction
> and we would be able to unify Londiste's and Slony's transport mechanism
> and eliminating the tick/sync kluge.
>
> Care to explain what exactly you care about?

Here is what I was replying to:

> >> I actually have a hard time understanding why people are so opposed to a
> >> feature that has zero impact at all unless a DBA actually turns it ON.
> >> What is the problem with exposing the commit order of transactions?

Jan's comment is why should others care what he wants because it has
zero impact? I am saying the community cares because we have to
maintain the code. I stand by my comment.

I remember a dismissive comment by Jan when 'session_replication_role'
was added, and a similar strong comment from me at that time as well.
It seems we are doing this again.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ None of us is going to be here forever. +


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Jan Wieck <JanWieck(at)Yahoo(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Chris Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-06-03 22:31:25
Message-ID: 201006032231.o53MVPR18801@momjian.us

Bruce Momjian wrote:
> Jan Wieck wrote:
> > On 6/3/2010 4:04 PM, Bruce Momjian wrote:
> > > If you want to fork Postgres and add it, go ahead, but if the community
> > > has to maintain the code and document it, we care.
> >
> > That comment was rather unprofessional. I think the rest of us still try
> > to find the best solution for the problem, not kill the discussion. You
> > may want to rejoin that effort.
> >
> > I care about an efficient, low overhead way to get a certain
> > information, that is otherwise extremely difficult, expensive and
> > version dependent to get.
> >
> > I care about cleaning up more of the mistakes, made in the original
> > development of Slony. Namely using hacks and kluges to implement
> > details, not supported by a current version of PostgreSQL. Londiste and
> > Slony made a good leap on that with the txid data type. Slony made
> > another step like that with 2.0, switching to the (for that very purpose
> > developed and contributed) native trigger configuration instead of
> > hacking system catalogs. This would be another step in that direction
> > and we would be able to unify Londiste's and Slony's transport mechanism
> > and eliminating the tick/sync kluge.
> >
> > Care to explain what exactly you care about?
>
> Here is what I was replying to:
>
> > >> I actually have a hard time understanding why people are so opposed to a
> > >> feature that has zero impact at all unless a DBA actually turns it ON.
> > >> What is the problem with exposing the commit order of transactions?
>
> Jan's comment is why should others care what he wants because it has
> zero impact? I am saying the community cares because we have to
> maintain the code. I stand by my comment.
>
> I remember a dismissive comment by Jan when 'session_replication_role'
> was added, and a similar strong comment from me at that time as well.
> It seems we are doing this again.

Of course, if I am misinterpreting what Jan said, please let me know.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ None of us is going to be here forever. +


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Chris Browne <cbbrowne(at)acm(dot)org>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-06-03 23:11:05
Message-ID: 1275606522-sup-7982@alvh.no-ip.org

Excerpts from Chris Browne's message of Thu Jun 03 16:21:35 -0400 2010:

> What this offers is *SOME* idea of how much updating work a particular
> transaction did. It's a bit worse than you suggest:
>
> - If replication triggers have captured tuples, those would get
> counted.
>
> - TOAST updates might lead to extra updates being counted.
>
> But back to where you started, I'd anticipate 7 inserts, 7 deletes,
> and 7 updates being counted as something around 21 updates.
>
> And if that included 5 TOAST changes, it might bump up to 26.
>
> If there were replication triggers in place, that might bump the count
> up to 45 (which I chose arbitrarily).

Why not send separate numbers of tuple inserts/updates/deletes, which we
already have from pgstats?

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Chris Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-06-03 23:48:04
Message-ID: 4C083F34.6010109@Yahoo.com

On 6/3/2010 5:58 PM, Greg Stark wrote:
> On Thu, Jun 3, 2010 at 8:50 PM, Jan Wieck <JanWieck(at)yahoo(dot)com> wrote:
>>> I'm puzzled how you would define this value. How do you add 7 inserts,
>>> 7 deletes, and 7 updates? Is that 21 rows modified?
>>
>> I actually have a hard time understanding why people are so opposed to a
>> feature that has zero impact at all unless a DBA actually turns it ON. What
>> is the problem with exposing the commit order of transactions?
>
> The post you were responding to was regarding the meaninglessness of
> the "number of records" attribute you wanted. Your response is a non
> sequitur.

I never proposed a "number of records" attribute. I proposed a sum of
the row counts in the statistics collector. That row count would be a
mix of insert, update, delete and toast operations. It's not an exact
indicator of anything, but a good enough hint of how much data may come
down the pipe if I were to select all replication data belonging to that
transaction.

>
> I think the commit order of transactions would be a good thing to
> expose though I've asked repeatedly what kind of interface you need
> and never gotten answers to all the questions.

In the original email that started this whole thread I wrote:

> Exposing the data will be done via a set returning function. The SRF
> takes two arguments. The maximum number of rows to return and the last
> serial number processed by the reader. The advantage of such SRF is that
> the result can be used in a query that right away delivers audit or
> replication log information in transaction commit order. The SRF can
> return an empty set if no further transactions have committed since, or
> an error if data segments needed to answer the request have already been
> purged.

Did that not answer your question?

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Chris Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-06-03 23:49:27
Message-ID: 4C083F87.4000907@Yahoo.com

On 6/3/2010 6:24 PM, Kevin Grittner wrote:

> Apologies for not looking back to the start of the thread before
> that last post. It was all laid out right at the start.

No need to apologize. Happens.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Chris Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-06-03 23:52:19
Message-ID: 4C084033.1080401@Yahoo.com

On 6/3/2010 7:11 PM, Alvaro Herrera wrote:
> Excerpts from Chris Browne's message of Thu Jun 03 16:21:35 -0400 2010:
>
>> What this offers is *SOME* idea of how much updating work a particular
>> transaction did. It's a bit worse than you suggest:
>>
>> - If replication triggers have captured tuples, those would get
>> counted.
>>
>> - TOAST updates might lead to extra updates being counted.
>>
>> But back to where you started, I'd anticipate 7 inserts, 7 deletes,
>> and 7 updates being counted as something around 21 updates.
>>
>> And if that included 5 TOAST changes, it might bump up to 26.
>>
>> If there were replication triggers in place, that might bump the count
>> up to 45 (which I chose arbitrarily).
>
> Why not send separate numbers of tuple inserts/updates/deletes, which we
> already have from pgstats?
>

We only have them for the entire database. The purpose of this is just a
guesstimate about what data volume to expect if I were to select all log
from a particular transaction.

This datum isn't critical, just handy for the overall feature to be useful.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Jan Wieck <JanWieck(at)yahoo(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Chris Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-06-04 01:32:56
Message-ID: AANLkTil1uhbS9B2etz6C4fon9WjAKSpvRZRKiznuozWX@mail.gmail.com

On Thu, Jun 3, 2010 at 6:29 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> Jan Wieck wrote:
>> On 6/3/2010 4:04 PM, Bruce Momjian wrote:
>> > If you want to fork Postgres and add it, go ahead, but if the community
>> > has to maintain the code and document it, we care.
>>
>> That comment was rather unprofessional. I think the rest of us still try
>> to find the best solution for the problem, not kill the discussion. You
>> may want to rejoin that effort.
>>
>> I care about an efficient, low overhead way to get a certain
>> information, that is otherwise extremely difficult, expensive and
>> version dependent to get.
>>
>> I care about cleaning up more of the mistakes, made in the original
>> development of Slony. Namely using hacks and kluges to implement
>> details, not supported by a current version of PostgreSQL. Londiste and
>> Slony made a good leap on that with the txid data type. Slony made
>> another step like that with 2.0, switching to the (for that very purpose
>> developed and contributed) native trigger configuration instead of
>> hacking system catalogs. This would be another step in that direction
>> and we would be able to unify Londiste's and Slony's transport mechanism
>> and eliminating the tick/sync kluge.
>>
>> Care to explain what exactly you care about?
>
> Here is what I was replying to:
>
>> >> I actually have a hard time understanding why people are so opposed to a
>> >> feature that has zero impact at all unless a DBA actually turns it ON.
>> >> What is the problem with exposing the commit order of transactions?
>
> Jan's comment is why should others care what he wants because it has
> zero impact?  I am saying the community cares because we have to
> maintain the code.  I stand by my comment.
>
> I remember a dismissive comment by Jan when 'session_replication_role'
> was added, and a similar strong comment from me at that time as well.
> It seems we are doing this again.

I think it's entirely legitimate and proper for us to make a decision
about whether this feature is worth including in core PostgreSQL. We
are obviously not in the business of adding random features solely for
the benefit of third-party applications. That having been said, there
are several reasons why I believe that this particular feature is an
excellent candidate for inclusion in core.

1. It solves a problem for which there is no easy workaround.
Rereading all the WAL to extract the commit records is not an easy
workaround, nor is what Slony and Londiste are doing now.

2. It is usable by multiple projects, not just one. It may well have
applications beyond replication (e.g. distributed transactions), but
at a very minimum it is usable by and useful to multiple replication
solutions.

3. It has a clear specification which can be easily understood even by
people who do not fully understand how replication solutions will make
use of it, which makes code maintenance much less burdensome.
Obviously, Jan's original email on this topic was just a sketch, but I
find it to be pretty clear.

4. We have an existing precedent of being willing to add limited
support into core to allow replication solutions to do their thing
(session_replication_role, ALTER TABLE ... ENABLE REPLICA TRIGGER,
etc). Even though we now have built-in replication via HS and SR,
there is still a BIG use case for Slony, Londiste, and other add-on
tools. Making those tools more successful and performant is good for
PostgreSQL.

5. It does not involve highly invasive changes to the core code.

6. It can be turned off for users who don't want it.

I find the skeptical attitude on this thread altogether unwarranted.
Jan made his case and, at least IMHO, presented it pretty clearly. He
then answered, multiple times, numerous questions which were already
addressed in the original email, as well as various others. I think
we should be very careful about assuming that we understand
replication and its needs better than someone who has spent many years
developing one of the major PostgreSQL replication solutions.
Independent of Jan's qualifications, there are clearly several people
on this thread who understand why this is useful and valuable,
including me. I am obviously not in a position to insist that we
accept this feature (assuming Jan produces a patch rather than getting
discouraged and giving up) but I would like us to think very, very
carefully before rejecting it, and not to do so unless we have a DARN
good reason. Most patches add code, and therefore require code
maintenance - that is not, by itself, a reason to reject them.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Jan Wieck <JanWieck(at)Yahoo(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Chris Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-06-04 01:42:15
Message-ID: 201006040142.o541gFo27486@momjian.us

Bruce Momjian wrote:
> > I care about cleaning up more of the mistakes, made in the original
> > development of Slony. Namely using hacks and kluges to implement
> > details, not supported by a current version of PostgreSQL. Londiste and
> > Slony made a good leap on that with the txid data type. Slony made
> > another step like that with 2.0, switching to the (for that very purpose
> > developed and contributed) native trigger configuration instead of
> > hacking system catalogs. This would be another step in that direction
> > and we would be able to unify Londiste's and Slony's transport mechanism
> > and eliminating the tick/sync kluge.
> >
> > Care to explain what exactly you care about?
>
> Here is what I was replying to:
>
> > >> I actually have a hard time understanding why people are so opposed to a
> > >> feature that has zero impact at all unless a DBA actually turns it ON.
> > >> What is the problem with exposing the commit order of transactions?
>
> Jan's comment is why should others care what he wants because it has
> zero impact? I am saying the community cares because we have to
> maintain the code. I stand by my comment.
>
> I remember a dismissive comment by Jan when 'session_replication_role'
> was added, and a similar strong comment from me at that time as well.
> It seems we are doing this again.

FYI, I talked to Jan on the phone and we have resolved this issue. :-)

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ None of us is going to be here forever. +


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Jan Wieck <JanWieck(at)yahoo(dot)com>, Chris Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-06-04 14:44:33
Message-ID: AANLkTin0JMjjLROYUAWOoRHaxh9l7prgr311yjrCh3kq@mail.gmail.com

On Fri, Jun 4, 2010 at 2:32 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I find the skeptical attitude on this thread altogether unwarranted.
> Jan made his case and, at least IMHO, presented it pretty clearly.

Just to be clear I think the idea of exposing commit order is a
no-brainer. The specific interface is what I was questioning.

A function which takes a starting xid and a number of transactions to
return seems very tied to one particular application. I could easily
see other systems such as a multi-master system instead only wanting
to compare two transactions to find out which committed first. Or
non-replication applications where you have an LSN and want to know
whether a given transaction had committed by that time.

So one possible interface would be to do something like
xids_committed_between(lsn_start, lsn_end) -- and yes, possibly with
an optional argument to limit the number of records returned.

So you could do:

old := select pg_current_xlog_location();
while (1)
{
    sleep 60s;
    new := select pg_current_xlog_location();
    process(select xids_committed_between(old, new));
    old := new;
}

This might be more useful for PITR recovery for example where you want
to find out what transactions committed between now and some known
point of corruption.

I could also see it being useful to have a function
pg_xlog_location_of_commit(xid). That would let you run recovery until
a particular transaction committed or test whether your replica is
caught up to a particular commit. It could be useful for monitoring
Hot Standby slaves.

>  He
> then answered, multiple times, numerous questions which were already
> addressed in the original email, as well as various others.

I think I did miss some of the original description. That might have
caused some of the difficulty as I was asking questions about
something he assumed he had already answered.

>  I think
> we should be very careful about assuming that we understand
> replication and its needs better than someone who has spent many years
> developing one of the major PostgreSQL replication solutions.

Well the flip side of that is that we want an interface that's useful
for more than just one replication system. This is something basic
enough that I think it will be useful for more than just replication
if we design it generally enough. It should be useful for
backup/restore processes and monitoring as well as various forms of
replication including master-slave trigger based systems but also
including PITR-based replication, log-parsing systems, multi-master
trigger based systems, 2PC-based systems, etc.

--
greg


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Jan Wieck <JanWieck(at)yahoo(dot)com>, Chris Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-06-04 15:14:50
Message-ID: AANLkTilVW00VQBPxBpQ6n3_XknZRI_W6H0LXkiqk7EUA@mail.gmail.com

On Fri, Jun 4, 2010 at 10:44 AM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> On Fri, Jun 4, 2010 at 2:32 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> I find the skeptical attitude on this thread altogether unwarranted.
>> Jan made his case and, at least IMHO, presented it pretty clearly.
>
> Just to be clear I think the idea of exposing commit order is a
> no-brainer.  The specific interface is what I was questioning.

OK, thanks for that clarification.

> A function which takes a starting xid and a number of transactions to
> return seems very tied to one particular application. I could easily
> see other systems such as a multi-master system instead only wanting
> to compare two transactions to find out which committed first. Or
> non-replication applications where you have an LSN and want to know
> whether a given transaction had committed by that time.
>
> So one possible interface would be to do something like
> xids_committed_between(lsn_start, lsn_end) -- and yes, possibly with
> an optional argument to limit the number or records returned.

I'm imagining that the backend data storage for this would be a file
containing, essentially, a struct for each commit repeated over and
over again, packed tightly. It's easy to index into such a file using
a sequence number (give me the 1000th commit), but searching by LSN
would require (a) storing the LSNs and (b) binary search. Maybe it's
worth adding that complexity, but I'm not sure that it is. Keeping
the size of this file small is important for ensuring that it has
minimal performance impact (which is also why I'm not sold on trying
to include the tuple counters that Jan proposed - I think we can solve
the problem he's worried about there more cleanly in other ways).
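The O(1) lookup Robert describes falls out of the fixed-size records: the file offset of the Nth commit is just N times the record size, so serial access needs no index, while lookup by LSN would require storing LSNs and a binary search. A small Python sketch of that layout (field set borrowed from Jan's 32-byte record proposal, but with timestamps flattened to int64 for simplicity; function names are illustrative only):

```python
import io
import struct

# One fixed-size commit record: xid, begin_ts, commit_ts, rowcount as
# four little-endian int64 fields -- 32 bytes, matching the proposal.
REC = struct.Struct("<qqqq")

def write_commit(f, seqno, xid, begin_ts, commit_ts, rowcount):
    # The record's position is fully determined by its sequence number.
    f.seek(seqno * REC.size)
    f.write(REC.pack(xid, begin_ts, commit_ts, rowcount))

def read_commit(f, seqno):
    # O(1) lookup by serial: compute the offset, seek, read one record.
    # Finding a record by LSN instead would need a scan or binary search.
    f.seek(seqno * REC.size)
    return REC.unpack(f.read(REC.size))
```

An in-memory `io.BytesIO` stands in for one CLOG-style segment file here; in the real proposal the high bits of the serial would select the file and the low bits the offset within it.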

> So you could do:
>
> old := select pg_current_xlog_location();
> while (1)
> {
>  sleep 60s;
>  new := select pg_current_xlog_location()
>  process(select xids_committed_between(old,new))
>  old := new
> }
>
> This might be more useful for PITR recovery for example where you want
> to find out what transactions committed between now and some known
> point of corruption.

This could also be done by selecting the current commit sequence
number, getting the XIDs committed between the two commit sequence
numbers, etc.

> I could also see it being useful to have a function
> pg_xlog_location_of_commit(xid). That would let you run recovery until
> a particular transaction committed or test whether your replica is
> caught up to a particular commit. It could be useful for monitoring
> Hot Standby slaves.

Well, you'd need to index the commit data to make that work, I think,
so that adds a lot of complexity. The implementation as proposed lets
you find the commits after a known point in order of occurrence, but
it doesn't let you inquire about the location of a particular commit.
If you want to run recovery until a particular transaction commits, we
could teach the recovery code to look for the commit record for that
XID and then pause at that point (or just before that point, if
someone wanted that as an alternative behavior), which would be much
simpler than using this mechanism. And if you want to check whether
slaves are caught up, it would probably be better to use LSN rather
than commits, because you could be caught up on commits but way behind
on WAL replay.

>>  I think
>> we should be very careful about assuming that we understand
>> replication and its needs better than someone who has spent many years
>> developing one of the major PostgreSQL replication solutions.
>
> Well the flip side of that is that we want an interface that's useful
> for more than just one replication system. This is something basic
> enough that I think it will be useful for more than just replication
> if we design it generally enough. It should be useful for
> backup/restore processes and monitoring as well as various forms of
> replication including master-slave trigger based systems but also
> including PITR-based replication, log-parsing systems, multi-master
> trigger based systems, 2PC-based systems, etc.

Making it general enough to serve multiple needs is good, but we've
got to make sure that the extra complexity is buying us something.
Jan seems pretty confident that this could be used by Londiste also,
though it would be nice to have some confirmation from the Londiste
developer(s) on that. I think it may also have applications for
distributed transactions and multi-master replication, but I am not
too sure it helps much for PITR-based replication or log-parsing
systems. We want to design something that is good, but trying to
solve too many problems may end up solving none of them well.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Jan Wieck <janwieck(at)yahoo(dot)com>
Cc: Chris Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-06-04 16:52:49
Message-ID: 1275670221-sup-2744@alvh.no-ip.org

Excerpts from Jan Wieck's message of Thu Jun 03 19:52:19 -0400 2010:
> On 6/3/2010 7:11 PM, Alvaro Herrera wrote:

> > Why not send separate numbers of tuple inserts/updates/deletes, which we
> > already have from pgstats?
>
> We only have them for the entire database. The purpose of this is just a
> guesstimate about what data volume to expect if I were to select all log
> from a particular transaction.

But we already have per table counters. Couldn't we aggregate them per
transaction as well, if this feature is enabled? I'm guessing that this
is going to have some uses besides Slony; vague measurements could turn
out to be unusable for some of these.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Chris Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-06-04 20:02:12
Message-ID: 4C095BC4.7060307@Yahoo.com

On 6/4/2010 10:44 AM, Greg Stark wrote:
> On Fri, Jun 4, 2010 at 2:32 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> I find the skeptical attitude on this thread altogether unwarranted.
>> Jan made his case and, at least IMHO, presented it pretty clearly.
>
> Just to be clear I think the idea of exposing commit order is a
> no-brainer. The specific interface is what I was questioning.
>
> A function which takes a starting xid and a number of transactions to
> return seems very tied to one particular application. I could easily
> see other systems such as a multi-master system instead only wanting
> to compare two transactions to find out which committed first. Or
> non-replication applications where you have an LSN and want to know
> whether a given transaction had committed by that time.

Read the proposal again. I mean the original mail that started this
thread. The function does NOT take an xid as argument.

Being able to compare two xids against each other with respect to their
commit order may eventually be useful. The serial number of the data set,
returned by the SRF as proposed, would perfectly satisfy that need, but
not in the way you envision for multi-master. Multi-master would ask "did
xid X from server A commit before or after xid Y from server B?" That is
a question completely outside the scope of this proposal.

Please keep it real.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Chris Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-06-04 20:20:30
Message-ID: 4C09600E.6080501@Yahoo.com

On 6/4/2010 12:52 PM, Alvaro Herrera wrote:
> Excerpts from Jan Wieck's message of Thu Jun 03 19:52:19 -0400 2010:
>> On 6/3/2010 7:11 PM, Alvaro Herrera wrote:
>
>> > Why not send separate numbers of tuple inserts/updates/deletes, which we
>> > already have from pgstats?
>>
>> We only have them for the entire database. The purpose of this is just a
>> guesstimate about what data volume to expect if I were to select all log
>> from a particular transaction.
>
> But we already have per table counters. Couldn't we aggregate them per
> transaction as well, if this feature is enabled? I'm guessing that this
> is going to have some uses besides Slony; vague measurements could turn
> out to be unusable for some of these.

We have them per table and per index, summarized over all transactions.
It is debatable whether bloating this feature with detailed statistics is
useful or not, but I'd rather not have that bloat at the beginning,
because otherwise I know exactly what is going to happen: people will
just come back and say "zero impact my a..".

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


From: Marko Kreen <markokr(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Bruce Momjian <bruce(at)momjian(dot)us>, Jan Wieck <JanWieck(at)yahoo(dot)com>, Chris Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-06-10 22:10:50
Message-ID: AANLkTikon8oSaoqGb6zYa1kQtPyaBxyFCd_ZgugLkoDB@mail.gmail.com

On 6/4/10, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Jun 4, 2010 at 10:44 AM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> > A function which takes a starting xid and a number of transactions to
> > return seems very tied to one particular application. I could easily
> > see other systems such as a multi-master system instead only wanting
> > to compare two transactions to find out which committed first. Or
> > non-replication applications where you have an LSN and want to know
> > whether a given transaction had committed by that time.
> >
> > So one possible interface would be to do something like
> > xids_committed_between(lsn_start, lsn_end) -- and yes, possibly with
> > an optional argument to limit the number or records returned.
>
>
> I'm imagining that the backend data storage for this would be a file
> containing, essentially, a struct for each commit repeated over and
> over again, packed tightly. It's easy to index into such a file using
> a sequence number (give me the 1000'th commit) but searching by LSN
> would require (a) storing the LSNs and (b) binary search. Maybe it's
> worth adding that complexity, but I'm not sure that it is. Keeping
> the size of this file small is important for ensuring that it has
> minimal performance impact (which is also why I'm not sold on trying
> to include the tuple counters that Jan proposed - I think we can solve
> the problem he's worried about there more cleanly in other ways).

AIUI, you index the file by offset.

> >> I think
> >> we should be very careful about assuming that we understand
> >> replication and its needs better than someone who has spent many years
> >> developing one of the major PostgreSQL replication solutions.
> >
> > Well the flip side of that is that we want an interface that's useful
> > for more than just one replication system. This is something basic
> > enough that I think it will be useful for more than just replication
> > if we design it generally enough. It should be useful for
> > backup/restore processes and monitoring as well as various forms of
> > replication including master-slave trigger based systems but also
> > including PITR-based replication, log-parsing systems, multi-master
> > trigger based systems, 2PC-based systems, etc.
>
>
> Making it general enough to serve multiple needs is good, but we've
> got to make sure that the extra complexity is buying us something.
> Jan seems pretty confident that this could be used by Londiste also,
> though it would be nice to have some confirmation from the Londiste
> developer(s) on that. I think it may also have applications for
> distributed transactions and multi-master replication, but I am not
> too sure it helps much for PITR-based replication or log-parsing
> systems. We want to design something that is good, but trying to
> solve too many problems may end up solving none of them well.

The potential for a single shared queue implementation, with the
additional potential for merging async replication implementations,
sounds attractive. (By merging, I mean having a single implementation
that satisfies a broad range of needs.)

Unless the functionality accepted into core is limited to replication
only and/or performs worse than the current snapshot-based grouping;
then it is uninteresting, of course.

Jan's proposal of storing a small struct into segmented files sounds
like it could work. I can't say anything more, because I can't picture
it as well as Jan can. I would need to play with a working
implementation to say more...

--
marko


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Marko Kreen <markokr(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Bruce Momjian <bruce(at)momjian(dot)us>, Jan Wieck <janwieck(at)yahoo(dot)com>, Chris Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-06-14 20:36:07
Message-ID: 1276547694-sup-4057@alvh.no-ip.org

Excerpts from Marko Kreen's message of Thu Jun 10 18:10:50 -0400 2010:

> Jan's proposal of storing small struct into segmented files
> sounds like it could work. Can't say anything more because
> I can't imagine it as well as Jan. Would need to play with
> working implementation to say more...

We already have such a thing -- see pg_multixact

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Exposing the Xact commit order to the user
Date: 2010-08-30 09:18:53
Message-ID: 1283159933.1800.1844.camel@ebony

On Sun, 2010-05-23 at 16:21 -0400, Jan Wieck wrote:

> In some systems (data warehousing, replication), the order of commits is
> important, since that is the order in which changes have become visible.
> This information could theoretically be extracted from the WAL, but
> scanning the entire WAL just to extract this tidbit of information would
> be excruciatingly painful.

This idea had support from at least 6 hackers. I'm happy to add my own.

Can I suggest it is added as a hook, rather than argue about the details
too much? The main use case is in combination with external systems, so
that way we can maintain the relevant code with the system that cares
about it.

> CommitTransaction() inside of xact.c will call a function, that inserts
> a new record into this array. The operation will for most of the time be
> nothing than taking a spinlock and adding the record to shared memory.
> All the data for the record is readily available, does not require
> further locking and can be collected locally before taking the spinlock.
> The begin_timestamp is the transactions idea of CURRENT_TIMESTAMP, the
> commit_timestamp is what CommitTransaction() just decided to write into
> the WAL commit record and the total_rowcount is the sum of inserted,
> updated and deleted heap tuples during the transaction, which should be
> easily available from the statistics collector, unless row stats are
> disabled, in which case the datum would be zero.

Does this need to be called while in a critical section? Or can we wait
until after the actual marking of the commit before calling this?

> Checkpoint handling will call a function to flush the shared buffers.
> Together with this, the information from WAL records will be sufficient
> to recover this data (except for row counts) during crash recovery.

So it would need to work identically in recovery also?

These two values are not currently stored in the commit WAL record.

timestamptz xci_begin_timestamp
int64 xci_total_rowcount

Both of those seem optional, so I don't really want them added to WAL.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services