From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: <andres(at)2ndquadrant(dot)com>,<simon(at)2ndquadrant(dot)com>, <robertmhaas(at)gmail(dot)com>, <daniel(at)heroku(dot)com>, <pgsql-hackers(at)postgresql(dot)org>, <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Date: 2012-06-20 12:37:04
Message-ID: 4FE17DA002000025000487AE@gw.wicourts.gov
Lists: pgsql-hackers

> Heikki Linnakangas wrote:

> I don't like the idea of adding the origin id to the record header.
> It's only required in some occasions, and on some record types.

Right.

> And I'm worried it might not even be enough in more complicated
> scenarios.
>
> Perhaps we need a more generic WAL record annotation system, where
> a plugin can tack arbitrary information to WAL records. The extra
> information could be stored in the WAL record after the rmgr
> payload, similar to how backup blocks are stored. WAL replay could
> just ignore the annotations, but a replication system could use it
> to store the origin id or whatever extra information it needs.

Not only would that handle absolute versus relative updates and
origin id, but application frameworks could take advantage of such a
system for passing transaction metadata. I've held back on one
concern so far that I'll bring up now because this suggestion would
address it nicely.

Our current trigger-driven logical replication includes a summary
which includes transaction run time, commit time, the transaction
type identifier, the source code line from which that transaction was
invoked, the user ID with which the user connected to the application
(which isn't the same as the database login), etc. Being able to
"decorate" a database transaction with arbitrary (from the DBMS POV)
metadata would be very valuable. In fact, our shop can't maintain
the current level of capabilities without *some* way to associate
such information with a transaction.

I think that using up the only unused space in the fixed header to
capture one piece of the transaction metadata needed for logical
replication, and that only in some configurations, is short-sighted.
If we solve the general problem of transaction metadata, this one
specific case will fall out of that.

I think removing origin ID from this patch and submitting a separate
patch for a generalized transaction metadata system is the sensible
way to go.

-Kevin


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: heikki(dot)linnakangas(at)enterprisedb(dot)com, andres(at)2ndquadrant(dot)com, robertmhaas(at)gmail(dot)com, daniel(at)heroku(dot)com, pgsql-hackers(at)postgresql(dot)org, tgl(at)sss(dot)pgh(dot)pa(dot)us
Subject: Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Date: 2012-06-20 13:46:46
Message-ID: CA+U5nMK_fC5TWCFyUQ2nCYi_Rd=4YhRtviBWAbYzZi-1idYPeg@mail.gmail.com
Lists: pgsql-hackers

On 20 June 2012 20:37, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
>> Heikki Linnakangas  wrote:
>
>> I don't like the idea of adding the origin id to the record header.
>> It's only required in some occasions, and on some record types.
>
> Right.

Wrong, as explained.

>> And I'm worried it might not even be enough in more complicated
>> scenarios.
>>
>> Perhaps we need a more generic WAL record annotation system, where
>> a plugin can tack arbitrary information to WAL records. The extra
>> information could be stored in the WAL record after the rmgr
>> payload, similar to how backup blocks are stored. WAL replay could
>> just ignore the annotations, but a replication system could use it
>> to store the origin id or whatever extra information it needs.
>
> Not only would that handle absolute versus relative updates and
> origin id, but application frameworks could take advantage of such a
> system for passing transaction metadata.  I've held back on one
> concern so far that I'll bring up now because this suggestion would
> address it nicely.
>
> Our current trigger-driven logical replication includes a summary
> which includes transaction run time, commit time, the transaction
> type identifier, the source code line from which that transaction was
> invoked, the user ID with which the user connected to the application
> (which isn't the same as the database login), etc.  Being able to
> "decorate" a database transaction with arbitrary (from the DBMS POV)
> metadata would be very valuable.  In fact, our shop can't maintain
> the current level of capabilities without *some* way to associate
> such information with a transaction.

> I think that using up the only unused space in the fixed header to
> capture one piece of the transaction metadata needed for logical
> replication, and that only in some configurations, is short-sighted.
> If we solve the general problem of transaction metadata, this one
> specific case will fall out of that.

The proposal now includes flag bits that would allow the addition of a
variable length header, should that ever become necessary. So the
unused space in the fixed header is not being "used up" as you say. In
any case, the fixed header still has 4 wasted bytes on 64bit systems
even after the patch is applied. So this claim of short sightedness is
just plain wrong.

It isn't true that this is needed only for some configurations of
multi-master, per discussion.

This is not transaction metadata, it is WAL record metadata required
for multi-master replication, see later point.

We need to add information to every WAL record that is used as the
source for generating LCRs. It is also possible to add this to HEAP
and HEAP2 records, but doing that *will* bloat the WAL stream, whereas
using the *currently wasted* bytes on a WAL record header does *not*
bloat the WAL stream.
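
For concreteness, a rough self-contained sketch of what that would look
like (not the actual PostgreSQL header; the field names, the stdint
stand-in types and the xl_origin_id field are illustrative only):

/*
 * Rough sketch only -- not the real PostgreSQL definition.  Field names
 * loosely mirror the 9.2-era XLogRecord; xl_origin_id is the proposed
 * addition, occupying what is currently alignment padding after xl_rmid.
 * <stdint.h> types stand in for the backend's own typedefs.
 */
#include <stdint.h>

typedef struct SketchXLogRecord
{
    uint32_t    xl_tot_len;     /* total length of the whole record */
    uint32_t    xl_xid;         /* transaction id */
    uint32_t    xl_len;         /* length of the rmgr payload */
    uint8_t     xl_info;        /* flag bits */
    uint8_t     xl_rmid;        /* resource manager id */
    uint16_t    xl_origin_id;   /* proposed: originating node id; today
                                 * these two bytes are just padding */
    uint64_t    xl_prev;        /* link to the previous record */
    uint32_t    xl_crc;         /* CRC of the record */
} SketchXLogRecord;

/*
 * Reusing the padding keeps sizeof(SketchXLogRecord) unchanged; on a
 * 64-bit build the struct still rounds up to a multiple of 8 bytes,
 * which is roughly where the "4 wasted bytes" mentioned above live.
 */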

> I think removing origin ID from this patch and submitting a separate
> patch for a generalized transaction metadata system is the sensible
> way to go.

We already have a very flexible WAL system for recording data of
interest to various resource managers. If you wish to annotate a
transaction, you can either generate a new kind of WAL record or you
can enhance a commit record. There are already unused flag bits on
commit records for just such a purpose.

XLOG_NOOP records can already be generated by your application if you
wish to inject additional metadata to the WAL stream. So no changes
are required for you to implement the generalised transaction metadata
scheme you say you require.
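
As a sketch only (untested, assuming the existing XLogInsert()/XLogRecData
interface; there is no SQL-level wrapper today, so this would have to live
in a C-language function or extension):

/*
 * Sketch: emit an XLOG_NOOP record carrying arbitrary annotation bytes.
 * Redo ignores XLOG_NOOP, but a logical replication reader scanning the
 * WAL could pick the payload up.  Must run inside a backend.
 */
#include "postgres.h"

#include "access/xlog.h"
#include "catalog/pg_control.h"     /* XLOG_NOOP */

static XLogRecPtr
emit_noop_annotation(char *payload, uint32 payload_len)
{
    XLogRecData rdata;

    rdata.data = payload;           /* opaque bytes, ignored on replay */
    rdata.len = payload_len;
    rdata.buffer = InvalidBuffer;   /* not tied to any data page */
    rdata.buffer_std = false;
    rdata.next = NULL;

    return XLogInsert(RM_XLOG_ID, XLOG_NOOP, &rdata);
}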

Not sure how or why that relates to requirements for multi-master.

Please note that I've suggested review changes to Andres' work myself.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc: <andres(at)2ndquadrant(dot)com>,<heikki(dot)linnakangas(at)enterprisedb(dot)com>, <robertmhaas(at)gmail(dot)com>, <daniel(at)heroku(dot)com>, <pgsql-hackers(at)postgresql(dot)org>, <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Date: 2012-06-20 15:34:42
Message-ID: 4FE1A74202000025000487FB@gw.wicourts.gov
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> wrote:
> Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
>>> Heikki Linnakangas wrote:
>>
>>> I don't like the idea of adding the origin id to the record
>>> header. It's only required in some occasions, and on some record
>>> types.
>>
>> Right.
>
> Wrong, as explained.

The point is not wrong; you are simply not responding to what is
being said.

You have not explained why an origin ID is required when there is no
replication, or if there is master/slave logical replication, or
there are multiple masters with non-overlapping primary keys
replicating to a single table in a consolidated database, or each
master replicates to all other masters directly, or any of various
other scenarios raised on this thread. You've only explained why
it's necessary for certain configurations of multi-master
replication where all rows in a table can be updated on any of the
masters. I understand that this is the configuration you find most
interesting, at least for initial implementation. That does not
mean that the other situations don't exist as use cases or should not
be considered in the overall design.

I don't think there is anyone here who would not love to see this
effort succeed, all the way to multi-master replication in the
configuration you are emphasizing. What is happening is that people
are expressing concerns about parts of the design which they feel
are problematic, and brainstorming about possible alternatives. As
I'm sure you know, fixing a design problem at this stage in
development is a lot less expensive than letting the problem slide
and trying to deal with it later.

> It isn't true that this is needed only for some configurations of
> multi-master, per discussion.

I didn't get that out of the discussion; I saw a lot of cases
mentioned as not needing it to which you simply did not respond.

> This is not transaction metadata, it is WAL record metadata
> required for multi-master replication, see later point.
>
> We need to add information to every WAL record that is used as the
> source for generating LCRs.

If the origin ID of a transaction doesn't count as transaction
metadata (i.e., data about the transaction), what does? It may be a
metadata element about which you have special concerns, but it is
transaction metadata. You don't plan on supporting individual WAL
records within a transaction containing different values for origin
ID, do you? If not, why is it something to store in every WAL
record rather than once per transaction? That's not intended to be
a rhetorical question. I think it's because you're still thinking
of the WAL stream as *the medium* for logical replication data
rather than *the source* of logical replication data.

As long as the WAL stream is the medium, options are very
constrained. You can code a very fast engine to handle a single
type of configuration that way, and perhaps that should be a
supported feature, but it's not a configuration I've needed yet.
(Well, on reflection, if it had been available and easy to use, I
can think of *one* time I *might* have used it for a pair of nodes.)
It seems to me that you are so focused on this one use case that you
are not considering how design choices which facilitate fast
development of that use case paint us into a corner in terms of
expanding to other use cases.

>> I think removing origin ID from this patch and submitting a
>> separate patch for a generalized transaction metadata system is
>> the sensible way to go.
>
> We already have a very flexible WAL system for recording data of
> interest to various resource managers. If you wish to annotate a
> transaction, you can either generate a new kind of WAL record or
> you can enhance a commit record.

Right. Like many of us are suggesting should be done for origin ID.

> XLOG_NOOP records can already be generated by your application if
> you wish to inject additional metadata to the WAL stream. So no
> changes are required for you to implement the generalised
> transaction metadata scheme you say you require.

I'm glad it's that easy. Are there SQL functions for that yet?

> Not sure how or why that relates to requirements for multi-master.

That depends on whether you want to leave the door open to other
logical replication than the one use case on which you are currently
focused. I even consider some of those other cases multi-master,
especially when multiple databases are replicating to a single table
on another server. I'm not clear on your definition -- it seems to
be rather more narrow. Maybe we need to define some terms somewhere
to facilitate discussion. Is there a Wiki page where that would
make sense?

-Kevin


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>, heikki(dot)linnakangas(at)enterprisedb(dot)com, robertmhaas(at)gmail(dot)com, daniel(at)heroku(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us
Subject: Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Date: 2012-06-20 15:50:18
Message-ID: 201206201750.18655.andres@2ndquadrant.com
Lists: pgsql-hackers

On Wednesday, June 20, 2012 05:34:42 PM Kevin Grittner wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> wrote:
> > This is not transaction metadata, it is WAL record metadata
> > required for multi-master replication, see later point.

> > We need to add information to every WAL record that is used as the
> > source for generating LCRs.
> If the origin ID of a transaction doesn't count as transaction
> metadata (i.e., data about the transaction), what does? It may be a
> metadata element about which you have special concerns, but it is
> transaction metadata. You don't plan on supporting individual WAL
> records within a transaction containing different values for origin
> ID, do you? If not, why is it something to store in every WAL
> record rather than once per transaction? That's not intended to be
> a rhetorical question.
It's definitely possible to store it per transaction (see the discussion around
http://archives.postgresql.org/message-id/201206201605(dot)43634(dot)andres(at)2ndquadrant(dot)com);
it just makes the filtering via the originating node a considerably more
complex thing. With our proposal you can do it without any complexity
involved, on a low level. Storing it per transaction means you can only
stream out the data to other nodes *after* fully reassembling the
transaction. That's a pity, especially if we go for a design where the
decoding happens in a proxy instance.
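
As a toy illustration of what "without any complexity involved" means
(hypothetical names, nothing from the actual patch): with the origin id
on every record header, deciding whether to forward a record to a given
peer is a stateless per-record test, no transaction reassembly required:

/*
 * Toy illustration only (hypothetical names, not PostgreSQL code): with a
 * per-record origin id, the sending side can drop foreign-origin records
 * as it streams, without buffering whole transactions first.
 */
#include <stdbool.h>
#include <stdint.h>

typedef struct SketchRecordHeader
{
    uint16_t    origin_id;          /* node the change originated on */
    /* ... remaining header fields ... */
} SketchRecordHeader;

/* Skip records the peer itself originated, to avoid echoing them back. */
static bool
should_forward(const SketchRecordHeader *rec, uint16_t peer_origin_id)
{
    return rec->origin_id != peer_origin_id;
}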

Other metadata will not be needed on such a low level.

I also have to admit that I am very hesitant to start developing some generic
"transaction metadata" framework atm. That seems to be a good way to spend a
good part of the time in discussion and disagreement. Imo that's something for
later.

> I think it's because you're still thinking
> of the WAL stream as *the medium* for logical replication data
> rather than *the source* of logical replication data.
I don't think that's true. See the above-referenced subthread for reasons why I
think the origin id is important.

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: andres(at)2ndquadrant(dot)com, heikki(dot)linnakangas(at)enterprisedb(dot)com, robertmhaas(at)gmail(dot)com, daniel(at)heroku(dot)com, pgsql-hackers(at)postgresql(dot)org, tgl(at)sss(dot)pgh(dot)pa(dot)us
Subject: Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Date: 2012-06-20 16:57:29
Message-ID: CA+U5nM+fPynmZ+L_qB1fFMkkVc8k3vyKYxXBGEmEO5A5Nt7qhw@mail.gmail.com
Lists: pgsql-hackers

On 20 June 2012 23:34, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> wrote:
>> Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
>>>> Heikki Linnakangas  wrote:
>>>
>>>> I don't like the idea of adding the origin id to the record
>>>> header. It's only required in some occasions, and on some record
>>>> types.
>>>
>>> Right.
>>
>> Wrong, as explained.
>
> The point is not wrong; you are simply not responding to what is
> being said.

Heikki said that the origin ID was not required for all MMR
configs/scenarios. IMHO that is wrong, with explanation given.

By agreeing with him, I assumed you were sharing that assertion,
rather than saying something else.

> You have not explained why an origin ID is required when there is no
> replication, or if there is master/slave logical replication, or
...

You're right; I never claimed it was needed. Origin Id is only needed
for multi-master replication, and that is the only context in which I've
discussed it.

>> This is not transaction metadata, it is WAL record metadata
>> required for multi-master replication, see later point.
>>
>> We need to add information to every WAL record that is used as the
>> source for generating LCRs.
>
> If the origin ID of a transaction doesn't count as transaction
> metadata (i.e., data about the transaction), what does?  It may be a
> metadata element about which you have special concerns, but it is
> transaction metadata.  You don't plan on supporting individual WAL
> records within a transaction containing different values for origin
> ID, do you?  If not, why is it something to store in every WAL
> record rather than once per transaction?  That's not intended to be
> a rhetorical question.  I think it's because you're still thinking
> of the WAL stream as *the medium* for logical replication data
> rather than *the source* of logical replication data.

> As long as the WAL stream is the medium, options are very
> constrained.  You can code a very fast engine to handle a single
> type of configuration that way, and perhaps that should be a
> supported feature, but it's not a configuration I've needed yet.
> (Well, on reflection, if it had been available and easy to use, I
> can think of *one* time I *might* have used it for a pair of nodes.)
> It seems to me that you are so focused on this one use case that you
> are not considering how design choices which facilitate fast
> development of that use case paint us into a corner in terms of
> expanding to other use cases.

>>> I think removing origin ID from this patch and submitting a
>>> separate patch for a generalized transaction metadata system is
>>> the sensible way to go.
>>
>> We already have a very flexible WAL system for recording data of
>> interest to various resource managers. If you wish to annotate a
>> transaction, you can either generate a new kind of WAL record or
>> you can enhance a commit record.
>
> Right.  Like many of us are suggesting should be done for origin ID.
>
>> XLOG_NOOP records can already be generated by your application if
>> you wish to inject additional metadata to the WAL stream. So no
>> changes are required for you to implement the generalised
>> transaction metadata scheme you say you require.
>
> I'm glad it's that easy.  Are there SQL functions for that yet?

Yes, another possible design is to generate a new kind of WAL record
for the origin id.

Doing it that way will slow down multi-master by a measurable amount,
and slightly bloat the WAL stream.

The proposed way uses space that is currently wasted and likely to
remain so. Only 2 of the 6 available bytes are proposed for use,
with a flag design that allows future extension if required. When MMR
is not in use, the WAL records would look completely identical to the
way they look now, in size, settings and speed of writing them.

Putting the origin id onto each WAL record allows very fast and simple
stateless filtering. I suggest using it because those bytes have been
sitting there unused for close to 10 years now and no better use
springs to mind.

The proposed design is the fastest way of implementing MMR, without
any loss for non-users.

As I noted before, slowing down MMR by a small amount causes geometric
losses in performance across the whole cluster.

>> Not sure how or why that relates to requirements for multi-master.
>
> That depends on whether you want to leave the door open to other
> logical replication than the one use case on which you are currently
> focused.  I even consider some of those other cases multi-master,
> especially when multiple databases are replicating to a single table
> on another server.  I'm not clear on your definition -- it seems to
> be rather more narrow.  Maybe we need to define some terms somewhere
> to facilitate discussion.  Is there a Wiki page where that would
> make sense?

The project is called BiDirectional Replication to ensure that people
understand this is not just multi-master. But that doesn't mean that
multi-master can't have its own specific requirements.

Adding originid is also useful for the use case you mention, since it's
helpful to know where the data came from for validation. So having an
originid on each insert record would be important. That case must also
handle conflicts from duplicate inserts, and originid priority is then
an option for conflict handling.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Christopher Browne <cbbrowne(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Simon Riggs <simon(at)2ndquadrant(dot)com>, heikki(dot)linnakangas(at)enterprisedb(dot)com, robertmhaas(at)gmail(dot)com, daniel(at)heroku(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us
Subject: Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Date: 2012-06-20 17:06:28
Message-ID: CAFNqd5U3XPP2uurg+sn2M2RtsxDwj36y_FDm+N_u31zQfHNLpA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jun 20, 2012 at 11:50 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On Wednesday, June 20, 2012 05:34:42 PM Kevin Grittner wrote:
>> Simon Riggs <simon(at)2ndQuadrant(dot)com> wrote:
>> > This is not transaction metadata, it is WAL record metadata
>> > required for multi-master replication, see later point.
>
>> > We need to add information to every WAL record that is used as the
>> > source for generating LCRs.
>> If the origin ID of a transaction doesn't count as transaction
>> metadata (i.e., data about the transaction), what does?  It may be a
>> metadata element about which you have special concerns, but it is
>> transaction metadata.  You don't plan on supporting individual WAL
>> records within a transaction containing different values for origin
>> ID, do you?  If not, why is it something to store in every WAL
>> record rather than once per transaction?  That's not intended to be
>> a rhetorical question.
> It's definitely possible to store it per transaction (see the discussion around
> http://archives.postgresql.org/message-
> id/201206201605(dot)43634(dot)andres(at)2ndquadrant(dot)com); it just makes the filtering via
> the originating node a considerably more complex thing. With our proposal you
> can do it without any complexity involved, on a low level. Storing it per
> transaction means you can only stream out the data to other nodes *after*
> fully reassembling the transaction. That's a pity, especially if we go for a
> design where the decoding happens in a proxy instance.

I guess I'm not seeing the purpose to having the origin node id in the
WAL stream either.

We have it in the Slony sl_log_* stream, however there is a crucial
difference, in that sl_log_* is expressly a shared structure. In
contrast, WAL isn't directly sharable; you don't mix together multiple
WAL streams.

It seems as though the point in time at which you need to know the
origin ID is the moment at which you're deciding to read data from the
WAL files, and knowing which stream you are reading from is an
assertion that might be satisfied by looking at configuration that
doesn't need to be in the WAL stream itself. It might be *nice* for
the WAL stream to be self-identifying, but that doesn't seem to be
forcibly necessary.

The case where it *would* be needful is if you are in the process of
assembling together updates coming in from multiple masters, and need
to know:
- This INSERT was replicated from node #1, so should be ignored downstream
- That INSERT was replicated from node #2, so should be ignored downstream
- This UPDATE came from the local node, so needs to be passed to
downstream users

Or perhaps something else is behind the node id being deeply embedded
into the stream that I'm not seeing altogether.

> Other metadata will not be needed on such a low level.
>
> I also have to admit that I am very hesitant to start developing some generic
> "transaction metadata" framework atm. That seems to be a good way to spend a
> good part of time in discussion and disagreeing. Imo thats something for
> later.

Well, I see there being a use in there being at least 3 sorts of LCR records:
a) Capturing literal SQL that is to be replayed downstream. This
parallels two use cases in existing replication systems:
i) In pre-2.2 versions of Slony, statements are replayed literally.
So there's a stream of INSERT/UPDATE/DELETE statements.
ii) DDL capture and replay. In existing replication systems, DDL
isn't captured implicitly, the way Dimitri's Event Triggers are intended
to do, but rather is captured explicitly.
There should be a function to allow injecting such SQL explicitly;
that is sure to be a useful sort of thing to be able to do.

b) Capturing tuple updates in a binary form that can be turned readily
into heap updates on a replica.
Unfortunately, this form is likely not to play well when
replicating across platforms or Postgres versions, so I suspect that
this performance optimization should be implemented as a *last*
resort, rather than first. Michael Jackson had some "rules of
optimization" that said "don't do it", and, for the expert, "don't do
it YET..."

c) Capturing tuple data in some reasonably portable and readily
re-writable form.
Slony 2.2 changes from "SQL fragments" (of a) i) above) to storing
updates as an array of text values indicating:
- relation name
- attribute names
- attribute values, serialized into strings
I don't know that this provably represents the *BEST*
representation, but it definitely will be portable where b) would not
be, and lends itself to being able to reuse query plans, where a)
requires extraordinary amounts of parsing work, today. So I'm pretty
sure it's better than a) and b) for a sizable set of cases.
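
Purely as an illustration of the shape I mean (made-up names, not
anything from Slony's source):

/*
 * Illustration only: the "portable" shape described for option c) --
 * relation name, attribute names and attribute values all carried as
 * strings, so a replica on a different platform or PostgreSQL version
 * can rebuild the change from it.
 */
#include <stddef.h>

typedef struct PortableTupleChange
{
    const char  *relname;       /* schema-qualified relation name */
    int          natts;         /* number of attributes captured */
    const char **attnames;      /* attribute names */
    const char **attvalues;     /* attribute values serialized as text,
                                 * NULL for SQL NULL */
} PortableTupleChange;

/*
 * Example: the change produced by
 *   INSERT INTO public.accounts (id, balance) VALUES (42, '10.00');
 */
static const char *example_names[]  = { "id", "balance" };
static const char *example_values[] = { "42", "10.00" };
static const PortableTupleChange example_change = {
    "public.accounts", 2, example_names, example_values
};
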
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Christopher Browne <cbbrowne(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Simon Riggs <simon(at)2ndquadrant(dot)com>, heikki(dot)linnakangas(at)enterprisedb(dot)com, robertmhaas(at)gmail(dot)com, daniel(at)heroku(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us
Subject: Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Date: 2012-06-20 17:52:59
Message-ID: 201206201952.59865.andres@2ndquadrant.com
Lists: pgsql-hackers

Hi Chris!

On Wednesday, June 20, 2012 07:06:28 PM Christopher Browne wrote:
> On Wed, Jun 20, 2012 at 11:50 AM, Andres Freund <andres(at)2ndquadrant(dot)com>
wrote:
> > On Wednesday, June 20, 2012 05:34:42 PM Kevin Grittner wrote:
> >> Simon Riggs <simon(at)2ndQuadrant(dot)com> wrote:
> >> > This is not transaction metadata, it is WAL record metadata
> >> > required for multi-master replication, see later point.
> >> >
> >> > We need to add information to every WAL record that is used as the
> >> > source for generating LCRs.
> >>
> >> If the origin ID of a transaction doesn't count as transaction
> >> metadata (i.e., data about the transaction), what does? It may be a
> >> metadata element about which you have special concerns, but it is
> >> transaction metadata. You don't plan on supporting individual WAL
> >> records within a transaction containing different values for origin
> >> ID, do you? If not, why is it something to store in every WAL
> >> record rather than once per transaction? That's not intended to be
> >> a rhetorical question.
> >
> > It's definitely possible to store it per transaction (see the discussion
> > around http://archives.postgresql.org/message-
> > id/201206201605(dot)43634(dot)andres(at)2ndquadrant(dot)com); it just makes the filtering
> > via the originating node a considerably more complex thing. With our
> > proposal you can do it without any complexity involved, on a low level.
> > Storing it per transaction means you can only stream out the data to
> > other nodes *after* fully reassembling the transaction. That's a pity,
> > especially if we go for a design where the decoding happens in a proxy
> > instance.
>
> I guess I'm not seeing the purpose to having the origin node id in the
> WAL stream either.
>
> We have it in the Slony sl_log_* stream, however there is a crucial
> difference, in that sl_log_* is expressly a shared structure. In
> contrast, WAL isn't directly sharable; you don't mix together multiple
> WAL streams.
>
> It seems as though the point in time at which you need to know the
> origin ID is the moment at which you're deciding to read data from the
> WAL files, and knowing which stream you are reading from is an
> assertion that might be satisfied by looking at configuration that
> doesn't need to be in the WAL stream itself. It might be *nice* for
> the WAL stream to be self-identifying, but that doesn't seem to be
> forcibly necessary.
>
> The case where it *would* be needful is if you are in the process of
> assembling together updates coming in from multiple masters, and need
> to know:
> - This INSERT was replicated from node #1, so should be ignored
> downstream - That INSERT was replicated from node #2, so should be ignored
> downstream - This UPDATE came from the local node, so needs to be passed
> to downstream users
Exactly that is the point. And you want to do that in an efficient manner
without too much logic; that's why something simple like the record header is
so appealing.

> > I also have to admit that I am very hesitant to start developing some
> > generic "transaction metadata" framework atm. That seems to be a good
> > way to spend a good part of time in discussion and disagreeing. Imo
> > thats something for later.

> Well, I see there being a use in there being at least 3 sorts of LCR
> records:
> a) Capturing literal SQL that is to be replayed downstream
> b) Capturing tuple updates in a binary form that can be turned readily
> into heap updates on a replica.
> c) Capturing tuple data in some reasonably portable and readily
> re-writable form
I think we should provide the utilities to do all of those. a) is a
consequence of being able to do c).

That doesn't really have anything to do with this subthread, though? The part
you quoted above was my response to the suggestion to add some generic
framework to attach metadata to individual transactions on the generating
side. We quite possibly will end up needing that, but I personally don't think
we should be designing that part atm.

> b) Capturing tuple updates in a binary form that can be turned readily
> into heap updates on a replica.
> Unfortunately, this form is likely not to play well when
> replicating across platforms or Postgres versions, so I suspect that
> this performance optimization should be implemented as a *last*
> resort, rather than first. Michael Jackson had some "rules of
> optimization" that said "don't do it", and, for the expert, "don't do
> it YET..."
Well, apply is a bottleneck. Besides field experience, I/we have benchmarked it
and it's rather plausible that it is. And I don't think we can magically make
that faster in pg in general, so my plan is to remove the biggest cost factor I
can see.
And yes, it will have restrictions...

Regards,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Christopher Browne <cbbrowne(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, heikki(dot)linnakangas(at)enterprisedb(dot)com, robertmhaas(at)gmail(dot)com, daniel(at)heroku(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us
Subject: Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Date: 2012-06-20 18:01:09
Message-ID: CA+U5nMLv4_SShO_TQj=at99kCU+Pp4BH5WtCcbKy6N5Htuuf5A@mail.gmail.com
Lists: pgsql-hackers

On 21 June 2012 01:06, Christopher Browne <cbbrowne(at)gmail(dot)com> wrote:

> I guess I'm not seeing the purpose to having the origin node id in the
> WAL stream either.
>
> We have it in the Slony sl_log_* stream, however there is a crucial
> difference, in that sl_log_* is expressly a shared structure.  In
> contrast, WAL isn't directly sharable; you don't mix together multiple
> WAL streams.

Unfortunately you do. That's really the core of how this differs from
current Slony.

Every change we make creates WAL records. Whether that is changes
originating on the current node, or changes originating on upstream
nodes that need to be applied on the current node.

The WAL stream is then read and filtered for changes to pass onto
other nodes. So we want to be able to filter out the applied changes
to avoid passing them back to the original nodes.

Having each record know the origin makes the filtering much simpler,
so if it's possible to do it efficiently then it's the best design. It
turns out to be the best way so far known to do this. There are other
designs, however, as noted. In all cases we need the origin id in the
WAL.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, andres(at)2ndquadrant(dot)com, robertmhaas(at)gmail(dot)com, daniel(at)heroku(dot)com, pgsql-hackers(at)postgresql(dot)org, tgl(at)sss(dot)pgh(dot)pa(dot)us
Subject: Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Date: 2012-06-20 18:45:24
Message-ID: 4FE21A44.8090701@enterprisedb.com
Lists: pgsql-hackers

On 20.06.2012 16:46, Simon Riggs wrote:
> The proposal now includes flag bits that would allow the addition of a
> variable length header, should that ever become necessary. So the
> unused space in the fixed header is not being "used up" as you say. In
> any case, the fixed header still has 4 wasted bytes on 64bit systems
> even after the patch is applied. So this claim of short sightedness is
> just plain wrong.
>
> ...
>
> We need to add information to every WAL record that is used as the
> source for generating LCRs. It is also possible to add this to HEAP
> and HEAP2 records, but doing that *will* bloat the WAL stream, whereas
> using the *currently wasted* bytes on a WAL record header does *not*
> bloat the WAL stream.

Or, we could provide a mechanism for resource managers to use those
padding bytes for whatever data they wish to use. Or modify the record
format so that the last 4 bytes of the data in the WAL record are always
automatically stored in those padding bytes, thus making all WAL records
4 bytes shorter. That would make the WAL even more compact, with only a
couple of extra CPU instructions in the critical path.
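
A rough sketch of that second idea, purely hypothetical and with made-up
names (assuming, for brevity, that the payload is at least 4 bytes):

/*
 * Hypothetical sketch, not real PostgreSQL code: always stash the last
 * 4 bytes of the rmgr payload in the header's padding bytes, so 4 fewer
 * payload bytes need to be written after the header.
 */
#include <stdint.h>
#include <string.h>

typedef struct SketchHeader
{
    uint32_t    xl_len;         /* full payload length as the rmgr gave it */
    uint8_t     xl_pad[4];      /* padding reused to hold the payload tail */
    /* ... other header fields ... */
} SketchHeader;

/* Write side: keep the tail in the header, return how many bytes follow it. */
static uint32_t
pack_payload(SketchHeader *hdr, const uint8_t *payload, uint32_t len)
{
    hdr->xl_len = len;
    memcpy(hdr->xl_pad, payload + len - 4, 4);
    return len - 4;
}

/* Read side: re-append the stashed tail to recover the full payload. */
static void
unpack_payload(const SketchHeader *hdr, const uint8_t *stored, uint8_t *out)
{
    memcpy(out, stored, hdr->xl_len - 4);
    memcpy(out + hdr->xl_len - 4, hdr->xl_pad, 4);
}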

My point is that it's wrong to think that it's free to use those bytes,
just because they're currently unused. If we use them for one thing, we
can't use them for other things anymore. If we're so concerned about WAL
bloat that we can't afford to add any more bytes to the WAL record
header or heap WAL records, then it would be equally fruitful to look at
ways to use those padding bytes to save that precious WAL space.

I don't think we're *that* concerned about the WAL bloat, however. So
let's see what is the most sensible place to add whatever extra
information we need in the WAL, from the point of view of
maintainability, flexibility, readability etc. Then we can decide where
to put it.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, andres(at)2ndquadrant(dot)com, robertmhaas(at)gmail(dot)com, daniel(at)heroku(dot)com, pgsql-hackers(at)postgresql(dot)org, tgl(at)sss(dot)pgh(dot)pa(dot)us
Subject: Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Date: 2012-06-20 18:56:35
Message-ID: CA+U5nMJMbiCQCiVnbPSRbfoQc4qsJqVAmoFshau7MTSWZOmF=g@mail.gmail.com
Lists: pgsql-hackers

On 21 June 2012 02:45, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> On 20.06.2012 16:46, Simon Riggs wrote:
>>
>> The proposal now includes flag bits that would allow the addition of a
>> variable length header, should that ever become necessary. So the
>> unused space in the fixed header is not being "used up" as you say. In
>> any case, the fixed header still has 4 wasted bytes on 64bit systems
>> even after the patch is applied. So this claim of short sightedness is
>> just plain wrong.
>>
>> ...
>
>>
>>
>> We need to add information to every WAL record that is used as the
>> source for generating LCRs. It is also possible to add this to HEAP
>> and HEAP2 records, but doing that *will* bloat the WAL stream, whereas
>> using the *currently wasted* bytes on a WAL record header does *not*
>> bloat the WAL stream.
>

Wonderful ideas, these look good.

> Or, we could provide a mechanism for resource managers to use those padding
> bytes for whatever data they wish to use.

Sounds better to me.

> Or modify the record format so that
> the last 4 bytes of the data in the WAL record are always automatically
> stored in those padding bytes, thus making all WAL records 4 bytes shorter.
> That would make the WAL even more compact, with only a couple of extra CPU
> instructions in the critical path.

Sounds cool, but a little weird, even for me.

> My point is that it's wrong to think that it's free to use those bytes, just
> because they're currently unused. If we use them for one thing, we can't use
> them for other things anymore. If we're so concerned about WAL bloat that we
> can't afford to add any more bytes to the WAL record header or heap WAL
> records, then it would be equally fruitful to look at ways to use those
> padding bytes to save that precious WAL space.

Agreed. Thanks for sharing those ideas. Exactly why I like the list (really...)

> I don't think we're *that* concerned about the WAL bloat, however. So let's
> see what is the most sensible place to add whatever extra information we
> need in the WAL, from the point of view of maintainability, flexibility,
> readability etc. Then we can decide where to put it.

Removing FPW is still the most important aspect there.

I think allowing rmgrs to redefine the wasted bytes in the header is
the best idea.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, andres(at)2ndquadrant(dot)com, robertmhaas(at)gmail(dot)com, daniel(at)heroku(dot)com, pgsql-hackers(at)postgresql(dot)org, tgl(at)sss(dot)pgh(dot)pa(dot)us
Subject: Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Date: 2012-06-20 19:11:08
Message-ID: CA+U5nM+3N9-y0GNhkXOVj1TP1NUvr9r3gRgWaTdiFSjEcBi9Ew@mail.gmail.com
Lists: pgsql-hackers

On 21 June 2012 02:56, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> I think allowing rmgrs to redefine the wasted bytes in the header is
> the best idea.

Hmm, I think the best idea is to save 2 bytes off the WAL header for
all records, so there are no wasted bytes on 64bit or 32bit.

That way the potential for use goes away and there's benefit for all,
plus no argument about how to use those bytes in rarer cases.

I'll work on that.

And then we just put the originid on each heap record for MMR, in some
manner, discussed later.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, andres(at)2ndquadrant(dot)com, robertmhaas(at)gmail(dot)com, daniel(at)heroku(dot)com, pgsql-hackers(at)postgresql(dot)org, tgl(at)sss(dot)pgh(dot)pa(dot)us
Subject: Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Date: 2012-06-20 19:23:34
Message-ID: 4FE22336.7090509@enterprisedb.com
Lists: pgsql-hackers

On 20.06.2012 22:11, Simon Riggs wrote:
> On 21 June 2012 02:56, Simon Riggs<simon(at)2ndquadrant(dot)com> wrote:
>
>> I think allowing rmgrs to redefine the wasted bytes in the header is
>> the best idea.
>
> Hmm, I think the best idea is to save 2 bytes off the WAL header for
> all records, so there are no wasted bytes on 64bit or 32bit.
>
> That way the potential for use goes away and there's benefit for all,
> plus no argument about how to use those bytes in rarer cases.
>
> I'll work on that.

I don't think that's actually necessary; the WAL bloat isn't *that* bad,
and we don't need to start shaving bytes from there. I was just trying to
make a point.

> And then we just put the originid on each heap record for MMR, in some
> manner, discussed later.

I reserve the right to object to that, too :-). Others raised the
concern that a 16-bit integer is not a very intuitive identifier. Also,
as discussed, for more complex scenarios just the originid is not
sufficient. ISTM that we need more flexibility.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, andres(at)2ndquadrant(dot)com, robertmhaas(at)gmail(dot)com, daniel(at)heroku(dot)com, pgsql-hackers(at)postgresql(dot)org, tgl(at)sss(dot)pgh(dot)pa(dot)us
Subject: Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Date: 2012-06-20 19:27:47
Message-ID: CA+U5nMJ34RWxq6+DuPW8N3O=HB8K_wX38xtR8TGB6FeZo9h57g@mail.gmail.com
Lists: pgsql-hackers

On 21 June 2012 03:23, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:

>> And then we just put the originid on each heap record for MMR, in some
>> manner, discussed later.
>
>
> I reserve the right to object to that, too :-).

OK. But that would be only for MMR, using special record types.

> Others raised the concern
> that a 16-bit integer is not a very intuitive identifier.

Of course

> Also, as
> discussed, for more complex scenarios just the originid is not sufficient.
> ISTM that we need more flexibility.

Of course

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, robertmhaas(at)gmail(dot)com, daniel(at)heroku(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us
Subject: Re: [PATCH 10/16] Introduce the concept that wal has a 'origin' node
Date: 2012-06-20 19:32:39
Message-ID: 201206202132.40042.andres@2ndquadrant.com
Lists: pgsql-hackers

On Wednesday, June 20, 2012 09:23:34 PM Heikki Linnakangas wrote:
> > And then we just put the originid on each heap record for MMR, in some
> > manner, discussed later.
>
> I reserve the right to object to that, too :-). Others raised the
> concern that a 16-bit integer is not a very intuitive identifier. Also,
> as discussed, for more complex scenarios just the originid is not
> sufficient. ISTM that we need more flexibility.
I think the '16bit integer is unintuitive' argument isn't that interesting.
As pointed out by multiple people in the thread, the origin_id can be local
and mapped to something more complex in the communication between the
different nodes and in the configuration.
Before applying changes from another node you look up their "complex id" to
get the locally mapped 16bit origin_id, which then gets written into the WAL
stream. When decoding the WAL stream into the LCR stream it's mapped the
other way.
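
A trivial sketch of that mapping (made-up names and a flat array purely
for illustration; in reality it would live in a catalog or shared memory):

#include <stdint.h>
#include <string.h>

typedef struct OriginMapEntry
{
    const char  *node_name;     /* globally meaningful identifier */
    uint16_t     origin_id;     /* compact id used inside this node's WAL */
} OriginMapEntry;

static const OriginMapEntry origin_map[] = {
    { "node-eu-frankfurt", 1 },
    { "node-us-chicago",   2 },
};

/* Before applying a remote change: global name -> local 16-bit id. */
static int
lookup_origin_id(const char *node_name, uint16_t *origin_id)
{
    for (size_t i = 0; i < sizeof(origin_map) / sizeof(origin_map[0]); i++)
    {
        if (strcmp(origin_map[i].node_name, node_name) == 0)
        {
            *origin_id = origin_map[i].origin_id;
            return 0;
        }
    }
    return -1;                  /* unknown node: caller must add a mapping */
}

/* When decoding WAL into LCRs, the same table is walked the other way. */
static const char *
lookup_node_name(uint16_t origin_id)
{
    for (size_t i = 0; i < sizeof(origin_map) / sizeof(origin_map[0]); i++)
        if (origin_map[i].origin_id == origin_id)
            return origin_map[i].node_name;
    return NULL;
}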

We might need more information than that at a later point, but those probably
won't be needed during low-level filtering of WAL before reassembling it into
transactions...

Andres

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services