Re: Replication identifiers, take 3

Lists: pgsql-hackers
From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Replication identifiers, take 3
Date: 2014-09-23 18:24:22
Message-ID: 20140923182422.GA15776@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

I've previously started two threads about replication identifiers. Check
http://archives.postgresql.org/message-id/20131114172632.GE7522%40alap2.anarazel.de
and
http://archives.postgresql.org/message-id/20131211153833.GB25227%40awork2.anarazel.de
.

The've also been discussed in the course of another thread:
http://archives.postgresql.org/message-id/20140617165011.GA3115%40awork2.anarazel.de

As the topic has garnered some heat and confusion I thought it'd be
worthwile to start afresh with an explanation why I think they're
useful.

I don't really want to discuss about implementation specifics for now,
but rather about (details of the) concept. Once we've hashed those out,
I'll adapt the existing patch to match them.

There are three primary use cases for replication identifiers:
1) The ability Monitor how for replication has progressed in a
crashsafe manner to allow it to restart at the right point after
errors/crashes.
2) Efficiently identify the origin of individual changes and
transactions. In multimaster and some cascading scenarios it is
necessary to do so to avoid sending out replayed changes again.
3) Allow to efficiently filter out replayed changes from logical
decoding. It's currently possible to filter changes from inside the
output plugin, but it's much more efficient to filter them out
before decoding.

== Logical Decoding Background ==

To understand the need for 1) it's important to roughly understand how
logical decoding/walsender streams changes and handles feedback from
the receiving side. A walsender performing logical decoding
*continously* sends out transactions. As long as there's new local
changes (i.e. unprocessed WAL) and the network buffers aren't full it
will send changes. *Without* waiting for the client. Everything else
would lead to horrible latency.

Because it sends data without waiting for the client to have processed
them it obviously can't remove resources that are needed to stream
them out again. The client or network connection could crash after
all.

To let the sender know when it can remove resources the receiver
regularly sends back 'feedback' messages acknowledging up to where
changes have been safely received. Whenever such a feedback message
arrives the sender can release resources that aren't needed to decode
the changes below that horizon.

When the receiver ask the server to stream changes out it tells the
sender at which LSN it should start sending changes. All
*transactions* that *commit* after that LSN are sent out. Possibly
again.

== Crashsafe apply ==

Based on those explanations, when building a logical replication
solution on top of logical decoding, one must remember the latest
*remote* LSN that already has been replayed. So that, when the apply
process or the whole database crashes, it is possibly to ask for all
changes since the last transaction that has been successfully applied.

The trivial solution here is to have a table (remote_node,
last_replayed_lsn) and update it for every replayed
transaction. Unfortunately that doesn't perform very well because that
table quickly gets heavily bloated. It's also hard to avoid page level
contention when replaying transaction from multiple remote
nodes. Additionally these changes have to be filtered out when
replicating these changes in a cascading fashion.

To do this more efficiently there needs to be a crashsafe way to
associate the latest successfully replayed remote transaction.

== Identify the origin of changes ==

Say you're building a replication solution that allows two nodes to
insert into the same table on two nodes. Ignoring conflict resolution
and similar fun, one needs to prevent the same change being replayed
over and over. In logical replication the changes to the heap have to
be WAL logged, and thus the *replay* of changes from a remote node
produce WAL which then will be decoded again.

To avoid that it's very useful to tag individual changes/transactions
with their 'origin'. I.e. mark changes that have been directly
triggered by the user sending SQL as originating 'locally' and changes
originating from replaying another node's changes as originating
somewhere else.

If that origin is exposed to logical decoding output plugins they can
easily check whether to stream out the changes/transactions or not.

It is possible to do this by adding extra columns to every table and
store the origin of a row in there, but that a) permanently needs
storage b) makes things much more invasive.

== Proposed solution ==

These two fundamental problems happen to have overlapping
requirements.

A rather efficient solution for 1) is to attach the 'origin node' and
the remote commit LSN to every local commit record produced by
replay. That allows to have a shared memory "table" (remote_node,
local_lsn, remote_lsn).

During replay that table is kept up2date in sync with transaction
commits. If updated within the transaction commit's critical section
it's guaranteed to be correct, even if transactions can abort due to
constraint violations and such. When the cluster crashes it can be
rebuilt during crash recovery, by updating values whenever a commit
record is read.

The primary complexity here is that the identification of the
'origin' node should be as small as possible to keep the WAL volume
down.

Similarly, to solve the problem of identifying the origin of changes
during decoding, the problem can be solved nicely by adding the origin
node of every change to changes/transactions. At first it might seem
to be sufficient to do so on transaction level, but for cascading
scenarios it's very useful to be able to have changes from different
source transactions combinded into a larger one.

Again the primary problem here is how to efficiently identify origin
nodes.

== Replication Identifiers ==

The above explains the need to have as small as possible identifiers
for nodes. Two years back I'd suggested that we rely on the user to
manually assign 16bit ids to individual nodes. Not very surprisingly
that was shot down because a) 16bit numbers are not descriptive b) a
per node identifier is problematic because it prohibits replication
inside the same cluster.

So, what I've proposed since is to have two different forms of
identifiers. A long one, that's as descriptive as
$replication_solution wants. And a small one (16bit in my case) that's
*only meaningful within one node*. The long, user facing, identifier
is locally mapped to the short one.

In the first version I proposed these long identifiers had a fixed
form, including the system identifier, timeline id, database id, and a
freeform name. That wasn't well received and I agree that that's too
narrow. I think it should be a freeform text of arbitrary length.

Note that it depends on the replication solution whether these
external identifiers need to be coordinated across systems or not. I
think it's *good* if we don't propose a solution for that - different
replication solutions will have different requirements.

What I've previously suggested (and which works well in BDR) is to add
the internal id to the XLogRecord struct. There's 2 free bytes of
padding that can be used for that purpose.

There's a example of how this can be used from SQL at
http://archives.postgresql.org/message-id/20131114172632.GE7522@alap2.anarazel.de
That version is built on top of commit timestamps, but that only shows
because pg_replication_identifier_setup_tx_origin() allows to set the
source transaction's timestamp.

With that, far too long, explanation, is it clearer what I think
replication identifiers are for? What's your thougts?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Steve Singer <steve(at)ssinger(dot)info>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 3
Date: 2014-09-26 02:44:49
Message-ID: CA+TgmoYifqWdUvgPw+ND3qEUPSXB4WV513_OmaVXMzdfUL5LFw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Thanks for this write-up.

On Tue, Sep 23, 2014 at 2:24 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> 1) The ability Monitor how for replication has progressed in a
> crashsafe manner to allow it to restart at the right point after
> errors/crashes.
> 2) Efficiently identify the origin of individual changes and
> transactions. In multimaster and some cascading scenarios it is
> necessary to do so to avoid sending out replayed changes again.
> 3) Allow to efficiently filter out replayed changes from logical
> decoding. It's currently possible to filter changes from inside the
> output plugin, but it's much more efficient to filter them out
> before decoding.

I agree with these goals.

Let me try to summarize the information requirements for each of these
things. For #1, you need to know, after crash recovery, for each
standby, the last commit LSN which the client has confirmed via a
feedback message. For #2, you need to know, when decoding each
change, what the origin node was. And for #3, you need to know, when
decoding each change, whether it is of local origin. The information
requirements for #3 are a subset of those for #2.

> A rather efficient solution for 1) is to attach the 'origin node' and
> the remote commit LSN to every local commit record produced by
> replay. That allows to have a shared memory "table" (remote_node,
> local_lsn, remote_lsn).

This seems OK to me, modulo some arguing about what the origin node
information ought to look like. People who are not using logical
replication can use the compact form of the commit record in most
cases, and people who are using logical replication can pay for it.

> Similarly, to solve the problem of identifying the origin of changes
> during decoding, the problem can be solved nicely by adding the origin
> node of every change to changes/transactions. At first it might seem
> to be sufficient to do so on transaction level, but for cascading
> scenarios it's very useful to be able to have changes from different
> source transactions combinded into a larger one.

I think this is a lot more problematic. I agree that having the data
in the commit record isn't sufficient here, because for filtering
purposes (for example) you really want to identify the problematic
transactions at the beginning, so you can chuck their WAL, rather than
reassembling the transaction first and then throwing it out. But
putting the origin ID in every insert/update/delete is pretty
unappealing from my point of view - not just because it adds bytes to
WAL, though that's a non-trivial concern, but also because it adds
complexity - IMHO, a non-trivial amount of complexity. I'd be a lot
happier with a solution where, say, we have a separate WAL record that
says "XID 1234 will be decoding for origin 567 until further notice".

> == Replication Identifiers ==
>
> The above explains the need to have as small as possible identifiers
> for nodes. Two years back I'd suggested that we rely on the user to
> manually assign 16bit ids to individual nodes. Not very surprisingly
> that was shot down because a) 16bit numbers are not descriptive b) a
> per node identifier is problematic because it prohibits replication
> inside the same cluster.
>
> So, what I've proposed since is to have two different forms of
> identifiers. A long one, that's as descriptive as
> $replication_solution wants. And a small one (16bit in my case) that's
> *only meaningful within one node*. The long, user facing, identifier
> is locally mapped to the short one.
>
> In the first version I proposed these long identifiers had a fixed
> form, including the system identifier, timeline id, database id, and a
> freeform name. That wasn't well received and I agree that that's too
> narrow. I think it should be a freeform text of arbitrary length.
>
> Note that it depends on the replication solution whether these
> external identifiers need to be coordinated across systems or not. I
> think it's *good* if we don't propose a solution for that - different
> replication solutions will have different requirements.

I'm pretty fuzzy on how this actually works. Like, the short form
here is just getting injected into WAL by the apply process. How does
it figure out what value to inject? What if it injects a value that
doesn't have a short-to-long mapping? What's the point of the
short-to-long mappings in the first place? Is that only required
because of the possibility that there might be multiple replication
solutions in play on the same node?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Steve Singer <steve(at)ssinger(dot)info>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 3
Date: 2014-09-26 09:05:53
Message-ID: 20140926090553.GF1169@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-09-25 22:44:49 -0400, Robert Haas wrote:
> Thanks for this write-up.
>
> On Tue, Sep 23, 2014 at 2:24 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > 1) The ability Monitor how for replication has progressed in a
> > crashsafe manner to allow it to restart at the right point after
> > errors/crashes.
> > 2) Efficiently identify the origin of individual changes and
> > transactions. In multimaster and some cascading scenarios it is
> > necessary to do so to avoid sending out replayed changes again.
> > 3) Allow to efficiently filter out replayed changes from logical
> > decoding. It's currently possible to filter changes from inside the
> > output plugin, but it's much more efficient to filter them out
> > before decoding.
>
> I agree with these goals.
>
> Let me try to summarize the information requirements for each of these
> things. For #1, you need to know, after crash recovery, for each
> standby, the last commit LSN which the client has confirmed via a
> feedback message.

I'm not sure I understand what you mean here? This is all happening on
the *standby*. The standby needs to know, after crash recovery, the
latest commit LSN from the primary that it has successfully replayed.

Say you replay the following:
SET synchronous_commit = off;

BEGIN;
INSERT INTO foo ...
COMMIT /* original LSN 0/10 */;

BEGIN;
INSERT INTO foo ...
COMMIT /* original LSN 0/20 */;

BEGIN;
INSERT INTO foo ...
COMMIT /* original LSN 0/30 */;

If postgres crashes at any point during this, we need to know whether we
successfully replayed up to 0/10, 0/20 or 0/30. Note that the problem
exists independent of s_c=off, it just excerbates the issue.

Then, after finishing recovery and discovering only 0/10 has persisted,
the standby can reconnect to the primary and do
START_REPLICATION SLOT .. LOGICAL 0/10;
and it'll receive all transactions that have committed since 0/10.

> For #2, you need to know, when decoding each
> change, what the origin node was. And for #3, you need to know, when
> decoding each change, whether it is of local origin. The information
> requirements for #3 are a subset of those for #2.

Right. For #3 it's more important to have the information available
efficiently on individual records.

> > A rather efficient solution for 1) is to attach the 'origin node' and
> > the remote commit LSN to every local commit record produced by
> > replay. That allows to have a shared memory "table" (remote_node,
> > local_lsn, remote_lsn).
>
> This seems OK to me, modulo some arguing about what the origin node
> information ought to look like. People who are not using logical
> replication can use the compact form of the commit record in most
> cases, and people who are using logical replication can pay for it.

Exactly.

> > Similarly, to solve the problem of identifying the origin of changes
> > during decoding, the problem can be solved nicely by adding the origin
> > node of every change to changes/transactions. At first it might seem
> > to be sufficient to do so on transaction level, but for cascading
> > scenarios it's very useful to be able to have changes from different
> > source transactions combinded into a larger one.
>
> I think this is a lot more problematic. I agree that having the data
> in the commit record isn't sufficient here, because for filtering
> purposes (for example) you really want to identify the problematic
> transactions at the beginning, so you can chuck their WAL, rather than
> reassembling the transaction first and then throwing it out. But
> putting the origin ID in every insert/update/delete is pretty
> unappealing from my point of view - not just because it adds bytes to
> WAL, though that's a non-trivial concern, but also because it adds
> complexity - IMHO, a non-trivial amount of complexity. I'd be a lot
> happier with a solution where, say, we have a separate WAL record that
> says "XID 1234 will be decoding for origin 567 until further notice".

I think it actually ends up much simpler than what you propose. In the
apply process, you simply execute
SELECT pg_replication_identifier_setup_replaying_from('bdr: this-is-my-identifier');
or it's C equivalent. That sets a global variable which XLogInsert()
includes the record.
Note that this doesn't actually require any additional space in the WAL
- padding bytes in struct XLogRecord are used to store the
identifier. These have been unused at least since 8.0.

I don't think a solution which logs the change of origin will be
simpler. When the origin is in every record, you can filter without keep
track of any state. That's different if you can switch the origin per
tx. At the very least you need a in memory entry for the origin.

> > == Replication Identifiers ==
> >
> > The above explains the need to have as small as possible identifiers
> > for nodes. Two years back I'd suggested that we rely on the user to
> > manually assign 16bit ids to individual nodes. Not very surprisingly
> > that was shot down because a) 16bit numbers are not descriptive b) a
> > per node identifier is problematic because it prohibits replication
> > inside the same cluster.
> >
> > So, what I've proposed since is to have two different forms of
> > identifiers. A long one, that's as descriptive as
> > $replication_solution wants. And a small one (16bit in my case) that's
> > *only meaningful within one node*. The long, user facing, identifier
> > is locally mapped to the short one.
> >
> > In the first version I proposed these long identifiers had a fixed
> > form, including the system identifier, timeline id, database id, and a
> > freeform name. That wasn't well received and I agree that that's too
> > narrow. I think it should be a freeform text of arbitrary length.
> >
> > Note that it depends on the replication solution whether these
> > external identifiers need to be coordinated across systems or not. I
> > think it's *good* if we don't propose a solution for that - different
> > replication solutions will have different requirements.
>
> I'm pretty fuzzy on how this actually works. Like, the short form
> here is just getting injected into WAL by the apply process. How does
> it figure out what value to inject?

Thy apply process once does
SELECT pg_replication_identifier_setup_replaying_from('bdr: this-is-my-identifier');
that looks up the internal identifier and stores it in a global
variable. That's then filled in struct XLogRecord.
To setup the origin LSN of a transaction
SELECT pg_replication_identifier_setup_tx_origin('0/123456', '2013-12-11 15:14:59.219737+01')
is used. If setup that'll emit the 'extended' commit record with the
remote commit LSN.

> What if it injects a value that doesn't have a short-to-long mapping?

Shouldn't be possible unless you drop a replication identifier after it
has been setup by *_replaying_from(). If we feel that's a actually
dangerous scenario we can prohibit it with a session level lock.

> What's the point of the short-to-long mappings in the first place? Is
> that only required because of the possibility that there might be
> multiple replication solutions in play on the same node?

In my original proposal, 2 years+ back, I only used short numeric
ids. And people didn't like it because it requires coordination between
the replication solutions and possibly between servers. Using a string
identifier like in the above allows to easily build unique names; and
allows every solution to add the information it needs into replication
identifiers.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Steve Singer <steve(at)ssinger(dot)info>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 3
Date: 2014-09-26 09:15:08
Message-ID: 54252E9C.4060402@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 26/09/14 04:44, Robert Haas wrote:
> On Tue, Sep 23, 2014 at 2:24 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>>
>> Note that it depends on the replication solution whether these
>> external identifiers need to be coordinated across systems or not. I
>> think it's *good* if we don't propose a solution for that - different
>> replication solutions will have different requirements.
>
> I'm pretty fuzzy on how this actually works. Like, the short form
> here is just getting injected into WAL by the apply process. How does
> it figure out what value to inject? What if it injects a value that
> doesn't have a short-to-long mapping? What's the point of the
> short-to-long mappings in the first place? Is that only required
> because of the possibility that there might be multiple replication
> solutions in play on the same node?
>

From my perspective the short-to-long mapping is mainly convenience
thing, long id should be something that can be used to map the
identifier to the specific node for the purposes of configuration,
monitoring, troubleshooting, etc. You also usually don't use just Oids
to represent the DB objects, I see some analogy there.
This could be potentially done by the solution itself, not by the
framework, but it seems logical (pardon the pun) that most (if not all)
solutions will want some kind of mapping of the generated ids to
something that represents the logical node.

So answer to your first two questions depends on the specific solution,
it can map it from connection configuration, it can get the it from the
output plugin as part of wire protocol, it can generate it based on some
internal logic, etc.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Steve Singer <steve(at)ssinger(dot)info>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 3
Date: 2014-09-26 13:53:09
Message-ID: CA+TgmoZOhCT9_ENZ165=3pRKH+p=Ads8g5E3voM8sSufW_cYYQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Sep 26, 2014 at 5:05 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> Let me try to summarize the information requirements for each of these
>> things. For #1, you need to know, after crash recovery, for each
>> standby, the last commit LSN which the client has confirmed via a
>> feedback message.
>
> I'm not sure I understand what you mean here? This is all happening on
> the *standby*. The standby needs to know, after crash recovery, the
> latest commit LSN from the primary that it has successfully replayed.

Ah, sorry, you're right: so, you need to know, after crash recovery,
for each machine you are replicating *from*, the last transaction (in
terms of LSN) from that server that you successfully replayed.

>> > Similarly, to solve the problem of identifying the origin of changes
>> > during decoding, the problem can be solved nicely by adding the origin
>> > node of every change to changes/transactions. At first it might seem
>> > to be sufficient to do so on transaction level, but for cascading
>> > scenarios it's very useful to be able to have changes from different
>> > source transactions combinded into a larger one.
>>
>> I think this is a lot more problematic. I agree that having the data
>> in the commit record isn't sufficient here, because for filtering
>> purposes (for example) you really want to identify the problematic
>> transactions at the beginning, so you can chuck their WAL, rather than
>> reassembling the transaction first and then throwing it out. But
>> putting the origin ID in every insert/update/delete is pretty
>> unappealing from my point of view - not just because it adds bytes to
>> WAL, though that's a non-trivial concern, but also because it adds
>> complexity - IMHO, a non-trivial amount of complexity. I'd be a lot
>> happier with a solution where, say, we have a separate WAL record that
>> says "XID 1234 will be decoding for origin 567 until further notice".
>
> I think it actually ends up much simpler than what you propose. In the
> apply process, you simply execute
> SELECT pg_replication_identifier_setup_replaying_from('bdr: this-is-my-identifier');
> or it's C equivalent. That sets a global variable which XLogInsert()
> includes the record.
> Note that this doesn't actually require any additional space in the WAL
> - padding bytes in struct XLogRecord are used to store the
> identifier. These have been unused at least since 8.0.

Sure, that's simpler for logical decoding, for sure. That doesn't
make it the right decision for the system overall.

> I don't think a solution which logs the change of origin will be
> simpler. When the origin is in every record, you can filter without keep
> track of any state. That's different if you can switch the origin per
> tx. At the very least you need a in memory entry for the origin.

But again, that complexity pertains only to logical decoding.
Somebody who wants to tweak the WAL format for an UPDATE in the future
doesn't need to understand how this works, or care. You know me: I've
been a huge advocate of logical decoding. But just like row-level
security or BRIN indexes or any other feature, I think it needs to be
designed in a way that minimizes the impact it has on the rest of the
system. I simply don't believe your contention that this isn't adding
any complexity to the code path for regular DML operations. It's
entirely possible we could need bit space in those records in the
future for something that actually pertains to those operations; if
you've burned it for logical decoding, it'll be difficult to claw it
back. And what if Tom gets around, some day, to doing that pluggable
heap AM work? Then every heap AM has got to allow for those bits, and
maybe that doesn't happen to be free for them.

Admittedly, these are hypothetical scenarios, but I don't think
they're particularly far-fetched. And as a fringe benefit, if you do
it the way that I'm proposing, you can use an OID instead of a 16-bit
thing that we picked to be 16 bits because that happens to be 100% of
the available bit-space. Yeah, there's some complexity on decoding,
but it's minimal: one more piece of fixed-size state to track per XID.
That's trivial compared to what you've already got.

>> What's the point of the short-to-long mappings in the first place? Is
>> that only required because of the possibility that there might be
>> multiple replication solutions in play on the same node?
>
> In my original proposal, 2 years+ back, I only used short numeric
> ids. And people didn't like it because it requires coordination between
> the replication solutions and possibly between servers. Using a string
> identifier like in the above allows to easily build unique names; and
> allows every solution to add the information it needs into replication
> identifiers.

I get that, but what I'm asking is why those mappings can't be managed
on a per-replication-solution basis. I think that's just because
there's a limited namespace and so coordination is needed between
multiple replication solutions that might possibly be running on the
same system. But I want to confirm if that's actually what you're
thinking.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Steve Singer <steve(at)ssinger(dot)info>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 3
Date: 2014-09-26 14:21:47
Message-ID: 20140926142147.GL1169@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-09-26 09:53:09 -0400, Robert Haas wrote:
> On Fri, Sep 26, 2014 at 5:05 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> >> Let me try to summarize the information requirements for each of these
> >> things. For #1, you need to know, after crash recovery, for each
> >> standby, the last commit LSN which the client has confirmed via a
> >> feedback message.
> >
> > I'm not sure I understand what you mean here? This is all happening on
> > the *standby*. The standby needs to know, after crash recovery, the
> > latest commit LSN from the primary that it has successfully replayed.
>
> Ah, sorry, you're right: so, you need to know, after crash recovery,
> for each machine you are replicating *from*, the last transaction (in
> terms of LSN) from that server that you successfully replayed.

Precisely.

> > I don't think a solution which logs the change of origin will be
> > simpler. When the origin is in every record, you can filter without keep
> > track of any state. That's different if you can switch the origin per
> > tx. At the very least you need a in memory entry for the origin.
>
> But again, that complexity pertains only to logical decoding.

> Somebody who wants to tweak the WAL format for an UPDATE in the future
> doesn't need to understand how this works, or care.

I agree that that's a worthy goal. But I don't see how this isn't the
case with what I propose? This isn't happening on the level of
individual rmgrs/ams - there've been two padding bytes after 'xl_rmid'
in struct XLogRecord for a long time.

There's also the significant advantage that not basing this on the xid
allows it to work correctly with records not tied to a
transaction. There's not that much of that happening yet, but I've
several features in mind:

* separately replicate 2PC commits. 2PC commits don't have an xid
anymore... With some tooling on top replication 2PC in two phases
allow for very cool stuff. Like optionally synchronous multimaster
replication.
* I have a pending patch that allows to send 'messages' through logical
decoding - yielding a really fast and persistent queue. For that it's
useful have transactional *and* nontransactional messages.
* Sanely replicating CONCURRENTLY stuff gets harder if you tie things to
the xid.

The absolutely, super, uber most convincing reason is:
It's trivial to build tools to analyze how much WAL traffic is generated
by which replication stream and how much by originates locally. A
pg_xlogdump --stats=replication_identifier wouldn't be hard ;)

> You know me: I've
> been a huge advocate of logical decoding. But just like row-level
> security or BRIN indexes or any other feature, I think it needs to be
> designed in a way that minimizes the impact it has on the rest of the
> system.

Totally agreed. And that always will take some arguing...

> I simply don't believe your contention that this isn't adding
> any complexity to the code path for regular DML operations. It's
> entirely possible we could need bit space in those records in the
> future for something that actually pertains to those operations; if
> you've burned it for logical decoding, it'll be difficult to claw it
> back. And what if Tom gets around, some day, to doing that pluggable
> heap AM work? Then every heap AM has got to allow for those bits, and
> maybe that doesn't happen to be free for them.

As explained above this isn't happening on the level of individual AMs.

> Admittedly, these are hypothetical scenarios, but I don't think
> they're particularly far-fetched. And as a fringe benefit, if you do
> it the way that I'm proposing, you can use an OID instead of a 16-bit
> thing that we picked to be 16 bits because that happens to be 100% of
> the available bit-space. Yeah, there's some complexity on decoding,
> but it's minimal: one more piece of fixed-size state to track per XID.
> That's trivial compared to what you've already got.

But it forces you to track the xids/transactions. With my proposal you
can ignore transaction *entirely* unless they manipulate the
catalog. For concurrent OLTP workloads that's quite the advantage.

> >> What's the point of the short-to-long mappings in the first place? Is
> >> that only required because of the possibility that there might be
> >> multiple replication solutions in play on the same node?
> >
> > In my original proposal, 2 years+ back, I only used short numeric
> > ids. And people didn't like it because it requires coordination between
> > the replication solutions and possibly between servers. Using a string
> > identifier like in the above allows to easily build unique names; and
> > allows every solution to add the information it needs into replication
> > identifiers.
>
> I get that, but what I'm asking is why those mappings can't be managed
> on a per-replication-solution basis. I think that's just because
> there's a limited namespace and so coordination is needed between
> multiple replication solutions that might possibly be running on the
> same system. But I want to confirm if that's actually what you're
> thinking.

Yes, that and that such a mapping needs to be done across all database
are the primary reasons. As it's currently impossible to create further
shared relations you'd have to invent something living in the data
directory on filesystem level... Brr.

I think it'd also be much worse for debugging if there'd be no way to
map such a internal identifier back to the replication solution in some
form.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Steve Singer <steve(at)ssinger(dot)info>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 3
Date: 2014-09-26 14:40:37
Message-ID: CA+TgmoYn873dx+QZSoD9ZPHNBVnueiAcJkHs70hXcKOgnZ+UMA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Sep 26, 2014 at 10:21 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> As explained above this isn't happening on the level of individual AMs.

Well, that's even worse. You want to grab 100% of the available
generic bitspace applicable to all record types for purposes specific
to logical decoding (and it's still not really enough bits).

>> I get that, but what I'm asking is why those mappings can't be managed
>> on a per-replication-solution basis. I think that's just because
>> there's a limited namespace and so coordination is needed between
>> multiple replication solutions that might possibly be running on the
>> same system. But I want to confirm if that's actually what you're
>> thinking.
>
> Yes, that and that such a mapping needs to be done across all database
> are the primary reasons. As it's currently impossible to create further
> shared relations you'd have to invent something living in the data
> directory on filesystem level... Brr.
>
> I think it'd also be much worse for debugging if there'd be no way to
> map such a internal identifier back to the replication solution in some
> form.

OK.

One question I have is what the structure of the names should be. It
seems some coordination could be needed here. I mean, suppose BDR
uses bdr:$NODENAME and Slony uses
$SLONY_CLUSTER_NAME:$SLONY_INSTANCE_NAME and EDB's xDB replication
server uses xdb__$NODE_NAME. That seems like it would be sad. Maybe
we should decide that names ought to be of the form
<replication-solution>.<further-period-separated-components> or
something like that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Steve Singer <steve(at)ssinger(dot)info>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 3
Date: 2014-09-26 14:55:47
Message-ID: 20140926145547.GC7550@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-09-26 10:40:37 -0400, Robert Haas wrote:
> On Fri, Sep 26, 2014 at 10:21 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > As explained above this isn't happening on the level of individual AMs.
>
> Well, that's even worse. You want to grab 100% of the available
> generic bitspace applicable to all record types for purposes specific
> to logical decoding (and it's still not really enough bits).

I don't think that's a fair characterization. Right now it's available
to precisely nobody. You can't put any data in there in *any* way. It
just has been sitting around unused for at least 8 years.

> One question I have is what the structure of the names should be. It
> seems some coordination could be needed here. I mean, suppose BDR
> uses bdr:$NODENAME and Slony uses
> $SLONY_CLUSTER_NAME:$SLONY_INSTANCE_NAME and EDB's xDB replication
> server uses xdb__$NODE_NAME. That seems like it would be sad. Maybe
> we should decide that names ought to be of the form
> <replication-solution>.<further-period-separated-components> or
> something like that.

I've also wondered about that. Perhaps we simply should have an
additional 'name' column indicating the replication solution?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Steve Singer <steve(at)ssinger(dot)info>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 3
Date: 2014-09-26 15:02:16
Message-ID: CA+TgmoYWKq7NCM-3_v9__w75ucXmtMfmhz1Ysn_0j9tpcupkLQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Sep 26, 2014 at 10:55 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2014-09-26 10:40:37 -0400, Robert Haas wrote:
>> On Fri, Sep 26, 2014 at 10:21 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> > As explained above this isn't happening on the level of individual AMs.
>>
>> Well, that's even worse. You want to grab 100% of the available
>> generic bitspace applicable to all record types for purposes specific
>> to logical decoding (and it's still not really enough bits).
>
> I don't think that's a fair characterization. Right now it's available
> to precisely nobody. You can't put any data in there in *any* way. It
> just has been sitting around unused for at least 8 years.

Huh? That's just to say that the unused bit space is, in fact,
unused. But so what? We've always been very careful about using up
things like infomask bits, because there are only so many bits
available, and when they're gone they are gone.

>> One question I have is what the structure of the names should be. It
>> seems some coordination could be needed here. I mean, suppose BDR
>> uses bdr:$NODENAME and Slony uses
>> $SLONY_CLUSTER_NAME:$SLONY_INSTANCE_NAME and EDB's xDB replication
>> server uses xdb__$NODE_NAME. That seems like it would be sad. Maybe
>> we should decide that names ought to be of the form
>> <replication-solution>.<further-period-separated-components> or
>> something like that.
>
> I've also wondered about that. Perhaps we simply should have an
> additional 'name' column indicating the replication solution?

Yeah, maybe, but there's still the question of substructure within the
non-replication-solution part of the name. Not sure if we can assume
that a bipartite identifier, specifically, is right, or whether some
solutions will end up with different numbers of components.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Steve Singer <steve(at)ssinger(dot)info>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 3
Date: 2014-09-26 16:32:10
Message-ID: 20140926163210.GD7550@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-09-26 11:02:16 -0400, Robert Haas wrote:
> On Fri, Sep 26, 2014 at 10:55 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > On 2014-09-26 10:40:37 -0400, Robert Haas wrote:
> >> On Fri, Sep 26, 2014 at 10:21 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> >> > As explained above this isn't happening on the level of individual AMs.
> >>
> >> Well, that's even worse. You want to grab 100% of the available
> >> generic bitspace applicable to all record types for purposes specific
> >> to logical decoding (and it's still not really enough bits).
> >
> > I don't think that's a fair characterization. Right now it's available
> > to precisely nobody. You can't put any data in there in *any* way. It
> > just has been sitting around unused for at least 8 years.
>
> Huh? That's just to say that the unused bit space is, in fact,
> unused. But so what? We've always been very careful about using up
> things like infomask bits, because there are only so many bits
> available, and when they're gone they are gone.

I don't think that's a very meaningful comparison. The problem with
infomask bits is that it's impossible to change anything once added
because of pg_upgrade'ability. That problem does not exist for
XLogRecord. We've twiddled with the WAL format pretty much in every
release. We can reconsider every release.

I can't remember anyone but me thinking about using these two bytes. So
the comparison here really is using two free bytes vs. issuing at least
~30 (record + origin) for every replayed transaction. Don't think that's
a unfair tradeof.

> >> One question I have is what the structure of the names should be. It
> >> seems some coordination could be needed here. I mean, suppose BDR
> >> uses bdr:$NODENAME and Slony uses
> >> $SLONY_CLUSTER_NAME:$SLONY_INSTANCE_NAME and EDB's xDB replication
> >> server uses xdb__$NODE_NAME. That seems like it would be sad. Maybe
> >> we should decide that names ought to be of the form
> >> <replication-solution>.<further-period-separated-components> or
> >> something like that.
> >
> > I've also wondered about that. Perhaps we simply should have an
> > additional 'name' column indicating the replication solution?
>
> Yeah, maybe, but there's still the question of substructure within the
> non-replication-solution part of the name. Not sure if we can assume
> that a bipartite identifier, specifically, is right, or whether some
> solutions will end up with different numbers of components.

Ah. I thought you only wanted to suggest a separator between the
replication solution and it's internal dat. But you actually want to
suggest an internal separator to be used in the solution's namespace?
I'm fine with that. I don't think we can suggest much beyond that -
different solutions will have fundamentally differing requirements about
which information to store.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Steve Singer <steve(at)ssinger(dot)info>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 3
Date: 2014-09-26 18:57:12
Message-ID: CA+TgmoapkG7mT=7VdFD3Az8dHQpQdRi2d7Q6=edjprPAnWBbgQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Sep 26, 2014 at 12:32 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> Huh? That's just to say that the unused bit space is, in fact,
>> unused. But so what? We've always been very careful about using up
>> things like infomask bits, because there are only so many bits
>> available, and when they're gone they are gone.
>
> I don't think that's a very meaningful comparison. The problem with
> infomask bits is that it's impossible to change anything once added
> because of pg_upgrade'ability. That problem does not exist for
> XLogRecord. We've twiddled with the WAL format pretty much in every
> release. We can reconsider every release.
>
> I can't remember anyone but me thinking about using these two bytes. So
> the comparison here really is using two free bytes vs. issuing at least
> ~30 (record + origin) for every replayed transaction. Don't think that's
> a unfair tradeof.

Mmph. You have a point about the WAL format being easier to change.
"Reconsidering", though, would mean that some developer who probably
isn't you needs those bytes for something that really is a more
general need than this, so they write a patch to get them back by
doing what I proposed - and then it gets rejected because it's not as
good for logical replication. So I'm not sure I really buy this as an
argument. For all practical purposes, if you grab them, they'll be
gone.

>> > I've also wondered about that. Perhaps we simply should have an
>> > additional 'name' column indicating the replication solution?
>>
>> Yeah, maybe, but there's still the question of substructure within the
>> non-replication-solution part of the name. Not sure if we can assume
>> that a bipartite identifier, specifically, is right, or whether some
>> solutions will end up with different numbers of components.
>
> Ah. I thought you only wanted to suggest a separator between the
> replication solution and it's internal dat. But you actually want to
> suggest an internal separator to be used in the solution's namespace?
> I'm fine with that. I don't think we can suggest much beyond that -
> different solutions will have fundamentally differing requirements about
> which information to store.

Agreed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Steve Singer <steve(at)ssinger(dot)info>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 3
Date: 2014-09-26 22:05:08
Message-ID: 20140926220508.GF7550@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-09-26 14:57:12 -0400, Robert Haas wrote:
> On Fri, Sep 26, 2014 at 12:32 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> >> Huh? That's just to say that the unused bit space is, in fact,
> >> unused. But so what? We've always been very careful about using up
> >> things like infomask bits, because there are only so many bits
> >> available, and when they're gone they are gone.
> >
> > I don't think that's a very meaningful comparison. The problem with
> > infomask bits is that it's impossible to change anything once added
> > because of pg_upgrade'ability. That problem does not exist for
> > XLogRecord. We've twiddled with the WAL format pretty much in every
> > release. We can reconsider every release.
> >
> > I can't remember anyone but me thinking about using these two bytes. So
> > the comparison here really is using two free bytes vs. issuing at least
> > ~30 (record + origin) for every replayed transaction. Don't think that's
> > a unfair tradeof.
>
> Mmph. You have a point about the WAL format being easier to change.
> "Reconsidering", though, would mean that some developer who probably
> isn't you needs those bytes for something that really is a more
> general need than this, so they write a patch to get them back by
> doing what I proposed - and then it gets rejected because it's not as
> good for logical replication. So I'm not sure I really buy this as an
> argument. For all practical purposes, if you grab them, they'll be
> gone.

Sure, it'll possibly not be trivial to move them elsewhere. On the other
hand, the padding bytes have been unused for 8+ years without somebody
laying "claim" on them but "me". I don't think it's a good idea to leave
them there unused when nobody even has proposed another use for a long
while. That'll just end up with them continuing to be unused. And
there's actually four more consecutive bytes on 64bit systems that are
unused.

Should there really be a dire need after that, we can simply bump the
record size. WAL volume wise it'd not be too bad to make the record a
tiny bit bigger - the header is only a relatively small fraction of the
entire content.

> >> > I've also wondered about that. Perhaps we simply should have an
> >> > additional 'name' column indicating the replication solution?
> >>
> >> Yeah, maybe, but there's still the question of substructure within the
> >> non-replication-solution part of the name. Not sure if we can assume
> >> that a bipartite identifier, specifically, is right, or whether some
> >> solutions will end up with different numbers of components.
> >
> > Ah. I thought you only wanted to suggest a separator between the
> > replication solution and it's internal dat. But you actually want to
> > suggest an internal separator to be used in the solution's namespace?
> > I'm fine with that. I don't think we can suggest much beyond that -
> > different solutions will have fundamentally differing requirements about
> > which information to store.
>
> Agreed.

So, let's recommend underscores as that separator?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Steve Singer <steve(at)ssinger(dot)info>
To: Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 3
Date: 2014-09-27 16:11:16
Message-ID: BLU436-SMTP1177428C39BCCF07635FE1DDCBC0@phx.gbl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 09/26/2014 06:05 PM, Andres Freund wrote:
> On 2014-09-26 14:57:12 -0400, Robert Haas wrote:
> Sure, it'll possibly not be trivial to move them elsewhere. On the other
> hand, the padding bytes have been unused for 8+ years without somebody
> laying "claim" on them but "me". I don't think it's a good idea to leave
> them there unused when nobody even has proposed another use for a long
> while. That'll just end up with them continuing to be unused. And
> there's actually four more consecutive bytes on 64bit systems that are
> unused.
>
> Should there really be a dire need after that, we can simply bump the
> record size. WAL volume wise it'd not be too bad to make the record a
> tiny bit bigger - the header is only a relatively small fraction of the
> entire content.

If we were now increasing the WAL record size anyway for some unrelated
reason, would we be willing to increase it by a further 2 bytes for the
node identifier?
If the answer is 'no' then I don't think we can justify using the 2
padding bytes just because they are there and have been unused for
years. But if the answer is yes, we feel this important enough to
justfiy a slightly (2 byte) larger WAL record header then we shouldn't
use the excuse of maybe needing those 2 bytes for something else. When
something else comes along that needs the WAL space we'll have to
increase the record size.

To say that if some other patch comes along that needs the space we'll
redo this feature to use the method Robert describes is unrealistic. If
we think that the replication identifier isn't
general/important/necessary to justify 2 bytes of WAL header space then
we should start out with something that doesn't use the WAL header,

Steve

>>>>> I've also wondered about that. Perhaps we simply should have an
>>>>> additional 'name' column indicating the replication solution?
>>>> Yeah, maybe, but there's still the question of substructure within the
>>>> non-replication-solution part of the name. Not sure if we can assume
>>>> that a bipartite identifier, specifically, is right, or whether some
>>>> solutions will end up with different numbers of components.
>>> Ah. I thought you only wanted to suggest a separator between the
>>> replication solution and it's internal dat. But you actually want to
>>> suggest an internal separator to be used in the solution's namespace?
>>> I'm fine with that. I don't think we can suggest much beyond that -
>>> different solutions will have fundamentally differing requirements about
>>> which information to store.
>> Agreed.
> So, let's recommend underscores as that separator?
>
> Greetings,
>
> Andres Freund
>


From: Steve Singer <steve(at)ssinger(dot)info>
To: Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 3
Date: 2014-09-27 16:16:16
Message-ID: BLU436-SMTP18739974A640974EFAD403FDCBC0@phx.gbl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 09/26/2014 10:21 AM, Andres Freund wrote:
> On 2014-09-26 09:53:09 -0400, Robert Haas wrote:
>> On Fri, Sep 26, 2014 at 5:05 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>>>> Let me try to summarize the information requirements for each of these
>>>> things. For #1, you need to know, after crash recovery, for each
>>>> standby, the last commit LSN which the client has confirmed via a
>>>> feedback message.
>>> I'm not sure I understand what you mean here? This is all happening on
>>> the *standby*. The standby needs to know, after crash recovery, the
>>> latest commit LSN from the primary that it has successfully replayed.
>> Ah, sorry, you're right: so, you need to know, after crash recovery,
>> for each machine you are replicating *from*, the last transaction (in
>> terms of LSN) from that server that you successfully replayed.
> Precisely.
>
>>> I don't think a solution which logs the change of origin will be
>>> simpler. When the origin is in every record, you can filter without keep
>>> track of any state. That's different if you can switch the origin per
>>> tx. At the very least you need a in memory entry for the origin.
>> But again, that complexity pertains only to logical decoding.
>> Somebody who wants to tweak the WAL format for an UPDATE in the future
>> doesn't need to understand how this works, or care.
> I agree that that's a worthy goal. But I don't see how this isn't the
> case with what I propose? This isn't happening on the level of
> individual rmgrs/ams - there've been two padding bytes after 'xl_rmid'
> in struct XLogRecord for a long time.
>
> There's also the significant advantage that not basing this on the xid
> allows it to work correctly with records not tied to a
> transaction. There's not that much of that happening yet, but I've
> several features in mind:
>
> * separately replicate 2PC commits. 2PC commits don't have an xid
> anymore... With some tooling on top replication 2PC in two phases
> allow for very cool stuff. Like optionally synchronous multimaster
> replication.
> * I have a pending patch that allows to send 'messages' through logical
> decoding - yielding a really fast and persistent queue. For that it's
> useful have transactional *and* nontransactional messages.
> * Sanely replicating CONCURRENTLY stuff gets harder if you tie things to
> the xid.

At what point in the decoding stream should something related to a
CONCURRENTLY command show up?
Also, for a logical message queue why couldn't you have a xid associated
with the message that had nothing else in the transaction?

l


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Steve Singer <steve(at)ssinger(dot)info>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 3
Date: 2014-09-29 08:15:39
Message-ID: 20140929081539.GA14652@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-09-27 12:11:16 -0400, Steve Singer wrote:
> On 09/26/2014 06:05 PM, Andres Freund wrote:
> >On 2014-09-26 14:57:12 -0400, Robert Haas wrote:
> >Sure, it'll possibly not be trivial to move them elsewhere. On the other
> >hand, the padding bytes have been unused for 8+ years without somebody
> >laying "claim" on them but "me". I don't think it's a good idea to leave
> >them there unused when nobody even has proposed another use for a long
> >while. That'll just end up with them continuing to be unused. And
> >there's actually four more consecutive bytes on 64bit systems that are
> >unused.
> >
> >Should there really be a dire need after that, we can simply bump the
> >record size. WAL volume wise it'd not be too bad to make the record a
> >tiny bit bigger - the header is only a relatively small fraction of the
> >entire content.
>
> If we were now increasing the WAL record size anyway for some unrelated
> reason, would we be willing to increase it by a further 2 bytes for the node
> identifier?
> If the answer is 'no' then I don't think we can justify using the 2 padding
> bytes just because they are there and have been unused for years. But if
> the answer is yes, we feel this important enough to justfiy a slightly (2
> byte) larger WAL record header then we shouldn't use the excuse of maybe
> needing those 2 bytes for something else. When something else comes along
> that needs the WAL space we'll have to increase the record size.

I don't think that's a good way to see this. By that argument these
bytes will never be used.

Also there's four more free bytes on 64bit systems...

> To say that if some other patch comes along that needs the space we'll redo
> this feature to use the method Robert describes is unrealistic. If we think
> that the replication identifier isn't general/important/necessary to
> justify 2 bytes of WAL header space then we should start out with something
> that doesn't use the WAL header,

Maintaining complexity also has its costs. And I think that's much more
concrete than some imaginary feature (of which nothing was heard for the
last 8+ years) also needing two bytes.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Steve Singer <steve(at)ssinger(dot)info>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 3
Date: 2014-09-29 14:59:20
Message-ID: CA+TgmoYgEALOrWajg9vhx9ymxh8ykkKm5svD3d_sdvwO-DrBmg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Sep 27, 2014 at 12:11 PM, Steve Singer <steve(at)ssinger(dot)info> wrote:
> If we were now increasing the WAL record size anyway for some unrelated
> reason, would we be willing to increase it by a further 2 bytes for the node
> identifier?

Obviously not. Otherwise Andres would be proposing to put an OID in
there instead of a kooky 16-bit identifier.

> If the answer is 'no' then I don't think we can justify using the 2 padding
> bytes just because they are there and have been unused for years. But if
> the answer is yes, we feel this important enough to justfiy a slightly (2
> byte) larger WAL record header then we shouldn't use the excuse of maybe
> needing those 2 bytes for something else. When something else comes along
> that needs the WAL space we'll have to increase the record size.
>
> To say that if some other patch comes along that needs the space we'll redo
> this feature to use the method Robert describes is unrealistic. If we think
> that the replication identifier isn't general/important/necessary to
> justify 2 bytes of WAL header space then we should start out with something
> that doesn't use the WAL header,

I lean in that direction too, but would welcome more input from others.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 3
Date: 2014-10-02 08:49:31
Message-ID: 542D119B.2070704@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 09/23/2014 09:24 PM, Andres Freund wrote:
> I've previously started two threads about replication identifiers. Check
> http://archives.postgresql.org/message-id/20131114172632.GE7522%40alap2.anarazel.de
> and
> http://archives.postgresql.org/message-id/20131211153833.GB25227%40awork2.anarazel.de
> .
>
> The've also been discussed in the course of another thread:
> http://archives.postgresql.org/message-id/20140617165011.GA3115%40awork2.anarazel.de

And even earlier here:
http://www.postgresql.org/message-id/flat/1339586927-13156-10-git-send-email-andres(at)2ndquadrant(dot)com#1339586927-13156-10-git-send-email-andres@2ndquadrant.com
The thread branched a lot, the relevant branch is the one with subject
"[PATCH 10/16] Introduce the concept that wal has a 'origin' node"

> == Identify the origin of changes ==
>
> Say you're building a replication solution that allows two nodes to
> insert into the same table on two nodes. Ignoring conflict resolution
> and similar fun, one needs to prevent the same change being replayed
> over and over. In logical replication the changes to the heap have to
> be WAL logged, and thus the *replay* of changes from a remote node
> produce WAL which then will be decoded again.
>
> To avoid that it's very useful to tag individual changes/transactions
> with their 'origin'. I.e. mark changes that have been directly
> triggered by the user sending SQL as originating 'locally' and changes
> originating from replaying another node's changes as originating
> somewhere else.
>
> If that origin is exposed to logical decoding output plugins they can
> easily check whether to stream out the changes/transactions or not.
>
>
> It is possible to do this by adding extra columns to every table and
> store the origin of a row in there, but that a) permanently needs
> storage b) makes things much more invasive.

An origin column in the table itself helps tremendously to debug issues
with the replication system. In many if not most scenarios, I think
you'd want to have that extra column, even if it's not strictly required.

> What I've previously suggested (and which works well in BDR) is to add
> the internal id to the XLogRecord struct. There's 2 free bytes of
> padding that can be used for that purpose.

Adding a field to XLogRecord for this feels wrong. This is for *logical*
replication - why do you need to mess with something as physical as the
WAL record format?

And who's to say that a node ID is the most useful piece of information
for a replication system to add to the WAL header. I can easily imagine
that you'd want to put a changeset ID or something else in there,
instead. (I mentioned another example of this in
http://www.postgresql.org/message-id/4FE17043.60403@enterprisedb.com)

If we need additional information added to WAL records, for extensions,
then that should be made in an extensible fashion. IIRC (I couldn't find
a link right now), when we discussed the changes to heap_insert et al
for wal_level=logical, I already argued back then that we should make it
possible for extensions to annotate WAL records, with things like "this
is the primary key", or whatever information is needed for conflict
resolution, or handling loops. I don't like it that we're adding little
pieces of information to the WAL format, bit by bit.

- Heikki


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Replication identifiers, take 3
Date: 2014-10-02 09:30:06
Message-ID: 20141002093006.GG7158@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-10-02 11:49:31 +0300, Heikki Linnakangas wrote:
> On 09/23/2014 09:24 PM, Andres Freund wrote:
> >I've previously started two threads about replication identifiers. Check
> >http://archives.postgresql.org/message-id/20131114172632.GE7522%40alap2.anarazel.de
> >and
> >http://archives.postgresql.org/message-id/20131211153833.GB25227%40awork2.anarazel.de
> >.
> >
> >The've also been discussed in the course of another thread:
> >http://archives.postgresql.org/message-id/20140617165011.GA3115%40awork2.anarazel.de
>
> And even earlier here:
> http://www.postgresql.org/message-id/flat/1339586927-13156-10-git-send-email-andres(at)2ndquadrant(dot)com#1339586927-13156-10-git-send-email-andres@2ndquadrant.com
> The thread branched a lot, the relevant branch is the one with subject
> "[PATCH 10/16] Introduce the concept that wal has a 'origin' node"

Right. Long time ago already ;)

> >== Identify the origin of changes ==
> >
> >Say you're building a replication solution that allows two nodes to
> >insert into the same table on two nodes. Ignoring conflict resolution
> >and similar fun, one needs to prevent the same change being replayed
> >over and over. In logical replication the changes to the heap have to
> >be WAL logged, and thus the *replay* of changes from a remote node
> >produce WAL which then will be decoded again.
> >
> >To avoid that it's very useful to tag individual changes/transactions
> >with their 'origin'. I.e. mark changes that have been directly
> >triggered by the user sending SQL as originating 'locally' and changes
> >originating from replaying another node's changes as originating
> >somewhere else.
> >
> >If that origin is exposed to logical decoding output plugins they can
> >easily check whether to stream out the changes/transactions or not.
> >
> >
> >It is possible to do this by adding extra columns to every table and
> >store the origin of a row in there, but that a) permanently needs
> >storage b) makes things much more invasive.
>
> An origin column in the table itself helps tremendously to debug issues with
> the replication system. In many if not most scenarios, I think you'd want to
> have that extra column, even if it's not strictly required.

I don't think you'll have much success convincing actual customers of
that. It's one thing to increase the size of the WAL stream a bit, it's
entirely different to persistently increase the table size of all their
tables.

> >What I've previously suggested (and which works well in BDR) is to add
> >the internal id to the XLogRecord struct. There's 2 free bytes of
> >padding that can be used for that purpose.
>
> Adding a field to XLogRecord for this feels wrong. This is for *logical*
> replication - why do you need to mess with something as physical as the WAL
> record format?

XLogRecord isn't all that "physical". It doesn't encode anything in that
regard but the fact that there's backup blocks in the record. It's
essentially just an implementation detail of logging. Whether that's
physical or logical doesn't really matter much.

There's basically two primary reasons I think it's a good idea to add it
there:

a) There's many different type of records where it's useful to add the
origin. Adding the information to all these will make things more
complicated, using more space, and be more fragile. And I'm pretty
sure that the number of things people will want to expose over
logical replication will increase.

I know of at least two things that have at least some working code:
Exposing 2PC to logical decoding to allow optionally synchronous
replication, and allowing to send transactional/nontransactional
'messages' via the WAL without writing to a table.

Now, we could add a framework to attach general information to every
record - but I have a very hard time seing how this will be of
comparable complexity *and* efficiency.

b) It's dead simple with a pretty darn low cost. Both from a runtime as
well as a maintenance perspective.

c) There needs to be crash recovery interation anyway to compute the
state of how far replication succeeded before crashing. So it's not
like we could make this completely extensible without core code
knowing.

> And who's to say that a node ID is the most useful piece of information for
> a replication system to add to the WAL header. I can easily imagine that
> you'd want to put a changeset ID or something else in there, instead. (I
> mentioned another example of this in
> http://www.postgresql.org/message-id/4FE17043.60403@enterprisedb.com)

I'm onboard with adding a extensible facility to attach data to
successful transactions. There've been at least two people asking me
directly about how to e.g. attach user information to transactions.

I don't think that's equivalent with what I'm talking about here
though. One important thing about this proposal is that it allows to
completely skip (nearly, except cache inval) all records with a
uninteresting origin id *before* decoding them. Without having to keep
any per transaction state about 'uninteresting' transactions.

> If we need additional information added to WAL records, for extensions, then
> that should be made in an extensible fashion

I can see how we'd do that for individual records (e.g. the various
commit records, after combining them), but i have a hard time seing the
cost of doing that for all records worth it. Especially as it seems
likely to require significant increases in wal volume?

> IIRC (I couldn't find a link
> right now), when we discussed the changes to heap_insert et al for
> wal_level=logical, I already argued back then that we should make it
> possible for extensions to annotate WAL records, with things like "this is
> the primary key", or whatever information is needed for conflict resolution,
> or handling loops. I don't like it that we're adding little pieces of
> information to the WAL format, bit by bit.

I don't think this is "adding little pieces of information to the WAL
format, bit by bit.". It's a relatively central piece for allowing
efficient and maintainable logical replication.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 3
Date: 2014-10-02 12:28:36
Message-ID: CA+TgmoZ8UqRKyS7GP88QmHCx2gvwn_3hNnU4SmL=H_aRf7R1Lg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Oct 2, 2014 at 4:49 AM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> An origin column in the table itself helps tremendously to debug issues with
> the replication system. In many if not most scenarios, I think you'd want to
> have that extra column, even if it's not strictly required.

I like a lot of what you wrote here, but I strongly disagree with this
part. A good replication solution shouldn't require changes to the
objects being replicated. The triggers that Slony and other current
logical replication solutions use are a problem not only because
they're slow (although that is a problem) but also because they
represent a user-visible wart that people who don't otherwise care
about the fact that their database is being replicated have to be
concerned with. I would agree that some people might, for particular
use cases, want to include origin information in the table that the
replication system knows about, but it shouldn't be required.

When you look at the replication systems that we have today, you've
basically got streaming replication, which is high-performance and
fairly hands-off (at least once you get it set up properly; that part
can be kind of a bear) but can't cross versions let alone database
systems and requires that the slaves be strictly read-only. Then on
the flip side you've got things like Slony, Bucardo, and others. Some
of these allow multi-master; all of them at least allow table-level
determination of which server has the writable copy. Nearly all of
them are cross-version and some even allow replication into
non-PostgreSQL systems. But they are slow *and administratively
complex*. If we're able to get something that feels like streaming
replication from a performance and administrative complexity point of
view but can be cross-version and allow at least some writes on
slaves, that's going to be an epic leap forward for the project.

In my mind, that means it's got to be completely hands-off from a
schema design point of view: you should be able to start up a database
and design it however you want, put anything you like into it, and
then decide later that you want to bolt logical replication on top of
it, just as you can for streaming physical replication.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 3
Date: 2014-10-04 20:13:30
Message-ID: 543054EA.50902@BlueTreble.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/2/14, 7:28 AM, Robert Haas wrote:
> On Thu, Oct 2, 2014 at 4:49 AM, Heikki Linnakangas
> <hlinnakangas(at)vmware(dot)com> wrote:
>> >An origin column in the table itself helps tremendously to debug issues with
>> >the replication system. In many if not most scenarios, I think you'd want to
>> >have that extra column, even if it's not strictly required.
> I like a lot of what you wrote here, but I strongly disagree with this
> part. A good replication solution shouldn't require changes to the
> objects being replicated.
I agree that asking users to modify objects is bad, but I also think that if you do have records coming into one table from multiple sources then you will need to know what system they originated on.

Maybe some sort of "hidden" column would work here? That means users don't need to modify anything (including anything doing SELECT *), but the data is there.

As for space concerns I think the answer there is to somehow normalize the identifiers themselves. That has the added benefit of allowing a rename of a source to propagate to all the data already replicated from that source.


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 3
Date: 2014-10-29 19:46:41
Message-ID: CA+U5nMKsrV+vYt7kcp4bCC4c+5jL_RPYEcySuL3R-WNo_jp08Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2 October 2014 09:49, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> wrote:

>> What I've previously suggested (and which works well in BDR) is to add
>> the internal id to the XLogRecord struct. There's 2 free bytes of
>> padding that can be used for that purpose.
>
>
> Adding a field to XLogRecord for this feels wrong. This is for *logical*
> replication - why do you need to mess with something as physical as the WAL
> record format?

Some reasons why this would be OK:

1. We've long agreed that adding information to the WAL record is OK,
just as long as it doesn't interfere (much) with the primary purpose.
It seems OK to change wal_level so it is a list of additional info,
e.g. wal_level = 'hot standby, logical, originid', or just wal_level =
'$new_level_name' that includes originid info

2. We seem to have agreed elsewhere that extensions to WAL will be
allowed. So perhaps redefining those bytes is something that can be
re-used. That way we don't all have to agree what we'll use them for.
Just call a user defined function that returns a 2 byte integer, or
zero if no plugin.

3. As for how many bytes there are - I count 6 wasted bytes on a
MAXALIGN=8 record. Plus we discussed long ago ways we can save another
4 bytes on records that don't have a backup block, since in that case
xl_tot_len == xl_len. I've also got a feeling that WAL records that
are 2 billion bytes long might be overkill. I figure we could easily
make a length limit of something less than that - only commit records
can be longer than 2^19 bytes when they have >65536 subxids, which is
hardly common. Plus xl_prev is always at least 4 byte aligned, so
there are at least 3 bits to be reclaimed from there. (Plus we have 7
unused RmgrId bits, though maybe we want to keep those)

So I count about 14 bytes of potentially reclaimable space in the WAL
record header, or 10 with backup blocks. The reason we never reclaimed
it before was that benchmarks previously showed that reducing the
volume of WAL had little effect on performance, we weren't looking
specifically at WAL volume or info content. (And the perf result is
probably different now anyway).

Should we grovel around to reclaim any of that? Not necessary, the
next person with a good reason to use some space can do that.

Pick any of those: I see no reason to prevent reusing 2 bytes for a
useful purpose, when we do all agree it is a useful purpose.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: Replication identifiers, take 4
Date: 2015-02-16 00:21:55
Message-ID: 20150216002155.GI15326@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

Here's my next attept attempt at producing something we can agree
upon.

The major change that might achieve that is that I've now provided a
separate method to store the origin_id of a node. I've made it
conditional on !REPLICATION_IDENTIFIER_REUSE_PADDING, to show both
paths. That new method uses Heikki's xlog rework to dynamically add the
origin to the record if a origin is set up. That works surprisingly
simply.

Other changes:

* Locking preventing several backends to replay changes at the same
time. This is actually overly restrictive in some cases, but I think
good enough for now.
* Logical decoding grew a filter_by_origin callback that allows to
ignore changes that were replayed on a remote system. Such filters are
executed before much is done with records, potentially saving a fair
bit of costs.
* Rebased. That took a bit due the xlog and other changes.
* A couple more SQL interface functions (like dropping a replication
identifier).

I also want to quickly recap replication identifiers, given that
in-person conversations with several people proved that the concept was
slightly misunderstood:

Think about a logical replication solution trying to replay changes. The
postmaster in which the data is replayed into crashes every now and
then. Replication identifiers allow you to do something like:

do_replication()
{
source = ConnectToSourceSystem('mysource');
target = ConnectToSourceSystem('target');

# mark we're replayin
target.exec($$SELECT pg_replication_identifier_setup_replaying_from('myrep_mysource')$$);
# get how far we've replayed last time round
remote_lsn = target.exec($$SELECT remote_lsn FROM pg_get_replication_identifier_progress WHERE external_id = 'myrep_mysource');

# and now replay changes
copystream = source.exec('START_LOGICAL_REPLICATION SLOT ... START %x', remote_lsn);

while (record = copystream.get_record())
{
if (record.type = 'begin')
{
target.exec('BEGIN');
# setup the position of this individual xact
target.exec('SELECT pg_replication_identifier_setup_tx_origin($1, $2);',
record.origin_lsn, record.origin_commit_timestamp);
}
else if (record.type = 'change')
target.exec(record.change_sql)
else if (record.type = 'commit')
target.exec('COMMIT');
}
}

A non pseudocode version of the above would be safe against crashes of
both the source and the target system. If the target system crashes the
replication identifier logic will recover how far we replayed during
crash recovery. If the source system crashes/disconnects we'll have the
current value in memory. Note that this works perfectly well if the
target system (and obviously the source system, but that's obvious) use
synchronous_commit = off - we'll not miss any changes.

Furthermore the fact that the origin of records is recorded allows to
avoid decoding them in logical decoding. That has both efficiency
advantages (we can do so before they are stored in memory/disk) and
functionality advantages. Imagine using a logical replication solution
to replicate inserts to a single table between two databases where
inserts are allowed on both - unless you prevent the replicated inserts
from being replicated again you obviously have a loop. This
infrastructure lets you avoid that.

The SQL interface consists out of:
# manage existance of identifiers
internal_id pg_replication_identifier_create(external_id);
void pg_replication_identifier_drop(external_id);

# replay management
void pg_replication_identifier_setup_replaying_from(external_id);
void pg_replication_identifier_reset_replaying_from();
bool pg_replication_identifier_is_replaying();
void pg_replication_identifier_setup_tx_origin(remote_lsn, remote_commit_time);

# replication progress status view
SELECT * FROM pgreplication_identifier_progress;

# replicatation identifiers
SELECT * FROM pg_replication_identifier;

Petr has developed (for UDR, i.e. logical replication ontop of 9.4) a
SQL reimplementation of replication identifiers and that has proven that
for busier workloads doing a table update to store the replication
progress indeed has a noticeable overhead. Especially if there's some
longer running activity on the standby.

The bigger questions I have are:

1) Where to store the origin. I personally still think that using the
padding is fine. Now that I have proven that it's pretty simple to
store additional information the argument that it might be needed for
something else doesn't really hold anymore. But I can live with the
other solution as well - 3 bytes additional overhead ain't so bad.

2) If we go with the !REPLICATION_IDENTIFIER_REUSE_PADDING solution, do
we want to store the origin only on relevant records? That'd be
XLOG_HEAP_INSERT/XLOG_HEAPMULTI_INSERT/XLOG_HEAP_UPDATE //
XLOG_XACT_COMMIT/XLOG_XACT_COMMIT_PREPARED. I'm thinking of something
like XLogLogOriginIfAvailable() before the emitting log
XLogInsert()s.

3) There should be a lwlock for the individual replication identifier
progress slots.

4) Right now identifier progress is stored during checkpoints in special
files - maybe it'd be better to store them inside the checkpoint
record somehow. We read that even after a clean shutdown, so that
should be fine.

5) I'm think there are issues with a streaming replication standby if
many identifiers are created/dropped. Those shouldn't be too hard to
fix.

6) Obviously the hack in bootstrap.c to get riname marked NOT NULL isn't
acceptable. Either I need to implement boostrap support for marking
varlenas NOT NULL as discussed nearby or replace the syscache lookup
with a index lookup.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: Replication identifiers, take 4
Date: 2015-02-16 00:24:21
Message-ID: 20150216002421.GJ15326@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2015-02-16 01:21:55 +0100, Andres Freund wrote:
> Here's my next attept attempt at producing something we can agree
> upon.
>
> The major change that might achieve that is that I've now provided a
> separate method to store the origin_id of a node. I've made it
> conditional on !REPLICATION_IDENTIFIER_REUSE_PADDING, to show both
> paths. That new method uses Heikki's xlog rework to dynamically add the
> origin to the record if a origin is set up. That works surprisingly
> simply.
>
> Other changes:
>
> * Locking preventing several backends to replay changes at the same
> time. This is actually overly restrictive in some cases, but I think
> good enough for now.
> * Logical decoding grew a filter_by_origin callback that allows to
> ignore changes that were replayed on a remote system. Such filters are
> executed before much is done with records, potentially saving a fair
> bit of costs.
> * Rebased. That took a bit due the xlog and other changes.
> * A couple more SQL interface functions (like dropping a replication
> identifier).

And here an actual patch.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
0001-Introduce-replication-identifiers-to-keep-track-of-r.patch text/x-patch 91.4 KB

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 4
Date: 2015-02-16 09:07:09
Message-ID: 54E1B33D.6030204@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 02/16/2015 02:21 AM, Andres Freund wrote:
> Furthermore the fact that the origin of records is recorded allows to
> avoid decoding them in logical decoding. That has both efficiency
> advantages (we can do so before they are stored in memory/disk) and
> functionality advantages. Imagine using a logical replication solution
> to replicate inserts to a single table between two databases where
> inserts are allowed on both - unless you prevent the replicated inserts
> from being replicated again you obviously have a loop. This
> infrastructure lets you avoid that.

That makes sense.

How does this work if you have multiple replication systems in use in
the same cluster? You might use Slony to replication one table to one
system, and BDR to replication another table with another system. Or the
same replication software, but different hosts.

- Heikki


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Replication identifiers, take 4
Date: 2015-02-16 09:18:39
Message-ID: 20150216091839.GA20205@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2015-02-16 11:07:09 +0200, Heikki Linnakangas wrote:
> On 02/16/2015 02:21 AM, Andres Freund wrote:
> >Furthermore the fact that the origin of records is recorded allows to
> >avoid decoding them in logical decoding. That has both efficiency
> >advantages (we can do so before they are stored in memory/disk) and
> >functionality advantages. Imagine using a logical replication solution
> >to replicate inserts to a single table between two databases where
> >inserts are allowed on both - unless you prevent the replicated inserts
> >from being replicated again you obviously have a loop. This
> >infrastructure lets you avoid that.
>
> That makes sense.
>
> How does this work if you have multiple replication systems in use in the
> same cluster? You might use Slony to replication one table to one system,
> and BDR to replication another table with another system. Or the same
> replication software, but different hosts.

It should "just work". Replication identifiers are identified by a free
form text, each replication solution can add the
information/distinguising data they need in there.

Bdr uses something like
#define BDR_NODE_ID_FORMAT "bdr_"UINT64_FORMAT"_%u_%u_%u_%s"
with
remote_sysid, remote_tlid, remote_dboid, MyDatabaseId, configurable_name
as parameters as a replication identifier name.

I've been wondering whether the bdr_ part in the above should be split
of into a separate field, similar to how the security label stuff does
it. But I don't think it'd really buy us much, especially as we did
not do that for logical slot names.

Each of the used replication solution would probably ask their output
plugin to only stream locally generated (i.e. origin_id =
InvalidRepNodeId) changes, and possibly from a defined list of other
known hosts in the cascading case.

Does that answer your question?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 4
Date: 2015-02-16 09:34:10
Message-ID: 54E1B992.9040002@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 02/16/2015 11:18 AM, Andres Freund wrote:
> On 2015-02-16 11:07:09 +0200, Heikki Linnakangas wrote:
>> How does this work if you have multiple replication systems in use in the
>> same cluster? You might use Slony to replication one table to one system,
>> and BDR to replication another table with another system. Or the same
>> replication software, but different hosts.
>
> It should "just work". Replication identifiers are identified by a free
> form text, each replication solution can add the
> information/distinguising data they need in there.
>
> Bdr uses something like
> #define BDR_NODE_ID_FORMAT "bdr_"UINT64_FORMAT"_%u_%u_%u_%s"
> with
> remote_sysid, remote_tlid, remote_dboid, MyDatabaseId, configurable_name
> as parameters as a replication identifier name.
>
> I've been wondering whether the bdr_ part in the above should be split
> of into a separate field, similar to how the security label stuff does
> it. But I don't think it'd really buy us much, especially as we did
> not do that for logical slot names.
>
> Each of the used replication solution would probably ask their output
> plugin to only stream locally generated (i.e. origin_id =
> InvalidRepNodeId) changes, and possibly from a defined list of other
> known hosts in the cascading case.
>
> Does that answer your question?

Yes, thanks. Note to self and everyone else looking at this: It's
important to keep in mind is that the replication IDs are completely
internal to the local cluster. They are *not* sent over the wire.

That's not completely true if you also use physical replication, though.
The replication IDs will be included in the WAL stream. Can you have
logical decoding running in a hot standby server? And how does the
progress report stuff that's checkpointed in pg_logical/checkpoints work
in a hot standby? (Sorry if I'm not making sense, I haven't really
thought hard about this myself)

At a quick glance, this basic design seems workable. I would suggest
expanding the replication IDs to regular 4 byte oids. Two extra bytes is
a small price to pay, to make it work more like everything else in the
system.

- Heikki


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Replication identifiers, take 4
Date: 2015-02-16 09:46:29
Message-ID: 20150216094628.GC20205@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2015-02-16 11:34:10 +0200, Heikki Linnakangas wrote:
> Yes, thanks. Note to self and everyone else looking at this: It's important
> to keep in mind is that the replication IDs are completely internal to the
> local cluster. They are *not* sent over the wire.

Well, if you *want* to, you could send the external (free form text)
replication identifiers over the wire in the output plugin. There are
scenarios where that might make sense.

> That's not completely true if you also use physical replication, though. The
> replication IDs will be included in the WAL stream.

Right. But then a physical rep server isn't realy a different server.

> Can you have logical decoding running in a hot standby server?

Not at the moment, there's some minor stuff missing (following
timelines, enforcing tuple visibility on the primary).

> And how does the progress report stuff that's checkpointed in
> pg_logical/checkpoints work in a hot standby? (Sorry if I'm not
> making sense, I haven't really thought hard about this myself)

It doesn't work that greatly atm, they'd need to be WAL logged for that
- which would not be hard. It'd be better to include the information in
the checkpoint, but that unfortunately doesn't really work, because we
store the checkpoint in the control file. And that has to be <=
512 bytes.

What, I guess, we could do is log it in the checkpoint, after
determining the redo pointer, and store the LSN for it in the checkpoint
record proper. That'd mean we'd read one more record at startup, but
that isn't particularly bad.

> At a quick glance, this basic design seems workable. I would suggest
> expanding the replication IDs to regular 4 byte oids. Two extra bytes is a
> small price to pay, to make it work more like everything else in the system.

I don't know. Growing from 3 to 5 byte overhead per relevant record (or
even 0 to 5 in case the padding is reused) is rather noticeable. If we
later find it to be a limit (I seriously doubt that), we can still
increase it in a major release without anybody really noticing.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Replication identifiers, take 4
Date: 2015-02-18 23:49:50
Message-ID: 54E5251E.5060702@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 16/02/15 10:46, Andres Freund wrote:
> On 2015-02-16 11:34:10 +0200, Heikki Linnakangas wrote:
>
>> At a quick glance, this basic design seems workable. I would suggest
>> expanding the replication IDs to regular 4 byte oids. Two extra bytes is a
>> small price to pay, to make it work more like everything else in the system.
>
> I don't know. Growing from 3 to 5 byte overhead per relevant record (or
> even 0 to 5 in case the padding is reused) is rather noticeable. If we
> later find it to be a limit (I seriously doubt that), we can still
> increase it in a major release without anybody really noticing.
>

I agree that limiting the overhead is important.

But I have related though about this - I wonder if it's worth to try to
map this more directly to the CommitTsNodeId. I mean it is currently
using that for the actual storage, but I am thinking of having the
Replication Identifiers be the public API and the CommitTsNodeId stuff
be just hidden implementation detail. This thought is based on the fact
that CommitTsNodeId provides somewhat meaningless number while
Replication Identifiers give us nice name for that number. And if we do
that then the type should perhaps be same for both?

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Petr Jelinek <petr(at)2ndquadrant(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Replication identifiers, take 4
Date: 2015-02-22 03:59:30
Message-ID: 54E95422.3060502@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Now that the issue with padding seems to no longer exists since the
patch works both with and without padding, I went through the code and
here are some comments I have (in no particular order).

In CheckPointReplicationIdentifier:
> + * FIXME: Add a CRC32 to the end.

The function already does it (I guess you forgot to remove the comment).

Using max_replication_slots as limit for replication_identifier states
does not really make sense to me as replication_identifiers track remote
info while and slots are local and in case of master-slave replication
you need replication identifiers but don't need slots.

In bootstrap.c:
> #define MARKNOTNULL(att) \
> ((att)->attlen > 0 || \
> (att)->atttypid == OIDVECTOROID || \
> - (att)->atttypid == INT2VECTOROID)
> + (att)->atttypid == INT2VECTOROID || \
> + strcmp(NameStr((att)->attname), "riname") == 0 \
> + )

Huh? Can this be solved in a nicer way?

Since we call XLogFlush with local_lsn as parameter, shouldn't we check
that it's actually within valid range?
Currently we'll get errors like this if set to invalid value:
ERROR: xlog flush request 123/123 is not satisfied --- flushed only to
0/168FB18

In AdvanceReplicationIndentifier:
> + /*
> + * XXX: should we restore into a hashtable and dump into shmem only after
> + * recovery finished?
> + */

Probably no given that the function is also callable via SQL interface.

As I wrote in another email, I would like to integrate the RepNodeId and
CommitTSNodeId into single thing.

There are no docs for the new sql interfaces.

The replication_identifier.c might deserve some intro/notes text.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Replication identifiers, take 4
Date: 2015-02-22 08:52:07
Message-ID: 20150222085207.GA21496@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2015-02-22 04:59:30 +0100, Petr Jelinek wrote:
> Now that the issue with padding seems to no longer exists since the patch
> works both with and without padding, I went through the code and here are
> some comments I have (in no particular order).
>
> In CheckPointReplicationIdentifier:
> >+ * FIXME: Add a CRC32 to the end.
>
> The function already does it (I guess you forgot to remove the comment).

Yep. I locally have a WIP version that's much cleaned up and doesn't
contain it anymore.

> Using max_replication_slots as limit for replication_identifier states does
> not really make sense to me as replication_identifiers track remote info
> while and slots are local and in case of master-slave replication you need
> replication identifiers but don't need slots.

On the other hand, it's quite cheap if unused. Not sure if several
variables are worth it.

> In bootstrap.c:
> > #define MARKNOTNULL(att) \
> > ((att)->attlen > 0 || \
> > (att)->atttypid == OIDVECTOROID || \
> >- (att)->atttypid == INT2VECTOROID)
> >+ (att)->atttypid == INT2VECTOROID || \
> >+ strcmp(NameStr((att)->attname), "riname") == 0 \
> >+ )
>
> Huh? Can this be solved in a nicer way?

Yes. I'd mentioned that this is just a temporary hack ;). I've since
pushed a more proper solution; BKI_FORCE_NOT_NULL can now be specified
on the column.

> Since we call XLogFlush with local_lsn as parameter, shouldn't we check that
> it's actually within valid range?
> Currently we'll get errors like this if set to invalid value:
> ERROR: xlog flush request 123/123 is not satisfied --- flushed only to
> 0/168FB18

I think we should rather remove local_lsn from all parameters that are
user callable, adding them there was a mistake. It's really only
relevant for the cases where it's called by commit.

> In AdvanceReplicationIndentifier:
> >+ /*
> >+ * XXX: should we restore into a hashtable and dump into shmem only after
> >+ * recovery finished?
> >+ */
>
> Probably no given that the function is also callable via SQL interface.

Well, it's still a good idea regardless...

> As I wrote in another email, I would like to integrate the RepNodeId and
> CommitTSNodeId into single thing.

Will reply separately there.

> There are no docs for the new sql interfaces.

Yea. The whole thing needs docs.

Thanks,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Replication identifiers, take 4
Date: 2015-02-22 08:57:16
Message-ID: 20150222085716.GB21496@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2015-02-19 00:49:50 +0100, Petr Jelinek wrote:
> On 16/02/15 10:46, Andres Freund wrote:
> >On 2015-02-16 11:34:10 +0200, Heikki Linnakangas wrote:
> >
> >>At a quick glance, this basic design seems workable. I would suggest
> >>expanding the replication IDs to regular 4 byte oids. Two extra bytes is a
> >>small price to pay, to make it work more like everything else in the system.
> >
> >I don't know. Growing from 3 to 5 byte overhead per relevant record (or
> >even 0 to 5 in case the padding is reused) is rather noticeable. If we
> >later find it to be a limit (I seriously doubt that), we can still
> >increase it in a major release without anybody really noticing.
> >
>
> I agree that limiting the overhead is important.
>
> But I have related though about this - I wonder if it's worth to try to map
> this more directly to the CommitTsNodeId.

Maybe. I'd rather go the other way round though;
replication_identifier.c/h's stuff seems much more generic than
CommitTsNodeId.

> I mean it is currently using that
> for the actual storage, but I am thinking of having the Replication
> Identifiers be the public API and the CommitTsNodeId stuff be just hidden
> implementation detail. This thought is based on the fact that CommitTsNodeId
> provides somewhat meaningless number while Replication Identifiers give us
> nice name for that number. And if we do that then the type should perhaps be
> same for both?

I'm not sure. Given that this is included in a significant number of
recordsd I'd really rather not increase the overhead as described
above. Maybe we can just limit CommitTsNodeId to the same size for now?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Replication identifiers, take 4
Date: 2015-02-22 13:04:18
Message-ID: 54E9D3D2.4080103@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 22/02/15 09:57, Andres Freund wrote:
> On 2015-02-19 00:49:50 +0100, Petr Jelinek wrote:
>> On 16/02/15 10:46, Andres Freund wrote:
>>> On 2015-02-16 11:34:10 +0200, Heikki Linnakangas wrote:
>>>
>>>> At a quick glance, this basic design seems workable. I would suggest
>>>> expanding the replication IDs to regular 4 byte oids. Two extra bytes is a
>>>> small price to pay, to make it work more like everything else in the system.
>>>
>>> I don't know. Growing from 3 to 5 byte overhead per relevant record (or
>>> even 0 to 5 in case the padding is reused) is rather noticeable. If we
>>> later find it to be a limit (I seriously doubt that), we can still
>>> increase it in a major release without anybody really noticing.
>>>
>>
>> I agree that limiting the overhead is important.
>>
>> But I have related though about this - I wonder if it's worth to try to map
>> this more directly to the CommitTsNodeId.
>
> Maybe. I'd rather go the other way round though;
> replication_identifier.c/h's stuff seems much more generic than
> CommitTsNodeId.
>

Probably not more generic but definitely nicer as the nodes are named
and yes it has richer API.

>> I mean it is currently using that
>> for the actual storage, but I am thinking of having the Replication
>> Identifiers be the public API and the CommitTsNodeId stuff be just hidden
>> implementation detail. This thought is based on the fact that CommitTsNodeId
>> provides somewhat meaningless number while Replication Identifiers give us
>> nice name for that number. And if we do that then the type should perhaps be
>> same for both?
>
> I'm not sure. Given that this is included in a significant number of
> recordsd I'd really rather not increase the overhead as described
> above. Maybe we can just limit CommitTsNodeId to the same size for now?
>

That would make sense.

I also wonder about the default nodeid infrastructure the committs
provides. I can't think of use-case which it can be used for and which
isn't solved by replication identifiers - in the end the main reason I
added that infra was to make it possible to write something like
replication identifiers as part of an extension. In any case I don't
think the default nodeid can be used in parallel with replication
identifiers since one will overwrite the SLRU record for the other.
Maybe it's enough if this is documented but I think it might be better
if this patch removed that default committs nodeid infrastructure. It's
just few lines of code which nobody should be using yet.

Thinking about this some more and rereading the code I see that you are
setting TransactionTreeSetCommitTsData during xlog replay, but not
during transaction commit, that does not seem correct as the local
records will not have any nodeid/origin.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: Replication identifiers, take 4
Date: 2015-02-22 15:03:59
Message-ID: 54E9EFDF.1060105@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2/15/15 7:24 PM, Andres Freund wrote:
> On 2015-02-16 01:21:55 +0100, Andres Freund wrote:
>> Here's my next attept attempt at producing something we can agree
>> upon.
>>
>> The major change that might achieve that is that I've now provided a
>> separate method to store the origin_id of a node. I've made it
>> conditional on !REPLICATION_IDENTIFIER_REUSE_PADDING, to show both
>> paths. That new method uses Heikki's xlog rework to dynamically add the
>> origin to the record if a origin is set up. That works surprisingly
>> simply.

I'm trying to figure out what this feature is supposed to do, but the
patch contains no documentation or a commit message. Where is one
supposed to start?


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: Replication identifiers, take 4
Date: 2015-02-22 15:12:34
Message-ID: 20150222151234.GD21496@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2015-02-22 10:03:59 -0500, Peter Eisentraut wrote:
> On 2/15/15 7:24 PM, Andres Freund wrote:
> > On 2015-02-16 01:21:55 +0100, Andres Freund wrote:
> >> Here's my next attept attempt at producing something we can agree
> >> upon.
> >>
> >> The major change that might achieve that is that I've now provided a
> >> separate method to store the origin_id of a node. I've made it
> >> conditional on !REPLICATION_IDENTIFIER_REUSE_PADDING, to show both
> >> paths. That new method uses Heikki's xlog rework to dynamically add the
> >> origin to the record if a origin is set up. That works surprisingly
> >> simply.
>
> I'm trying to figure out what this feature is supposed to do, but the
> patch contains no documentation or a commit message. Where is one
> supposed to start?

For a relatively short summary:
http://archives.postgresql.org/message-id/20150216002155.GI15326%40awork2.anarazel.de

For a longer one:
http://www.postgresql.org/message-id/20140923182422.GA15776@alap3.anarazel.de

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: Replication identifiers, take 4
Date: 2015-03-24 15:33:13
Message-ID: 20150324153313.GC28418@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

Here's the next version of this patch. I've tried to address the biggest
issue (documentation) and some more. Now that both the more flexible
commit WAL record format and the BKI_FORCE_NOT_NULL thing is in, it
looks much cleaner.

Changes:
* Loads of documentation and comments
* Revamped locking strategy. There's now a LWLock protecting all the
replication progress array and spinlock for the individual sltos.
* Simpler checkpoint format.
* A new pg_replication_identifier_progress() function returning a
individual identifier's replication progress; previously there was
only the view showing all of them.
* Lots of minor cleanup
* Some more tests

I'd greatly appreciate some feedback on the documentation. I'm not
entirely sure into how much detail to go; and where exactly in the docs
to place it. I do wonder if we shouldn't merge this with the logical
decoding section and whether we could also document commit timestamps
somewhere in there.

I've verified that this correctly works on a stanby; replication
progress is replicated correctly. I think there's two holes though:
Changing the replication progress without replicating anything and
dropping a replication identifier with some replication progress might
not work correctly. That's fairly easily fixed and I intend to do so.

Other than that I'm not aware of outstanding issues.

Comments?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
0001-Introduce-replication-identifiers-v1.0.patch text/x-patch 122.8 KB

From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>
Cc: pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: Replication identifiers, take 4
Date: 2015-03-24 21:22:29
Message-ID: 5511D595.3060207@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 24/03/15 16:33, Andres Freund wrote:
> Hi,
>
> Here's the next version of this patch. I've tried to address the biggest
> issue (documentation) and some more. Now that both the more flexible
> commit WAL record format and the BKI_FORCE_NOT_NULL thing is in, it
> looks much cleaner.
>

Nice, I see you also did the more close integration with CommitTs. I
only skimmed the patch but so far and it looks quite good, I'll take
closer look around end of the week.

>
> I'd greatly appreciate some feedback on the documentation. I'm not
> entirely sure into how much detail to go; and where exactly in the docs
> to place it. I do wonder if we shouldn't merge this with the logical
> decoding section and whether we could also document commit timestamps
> somewhere in there.

Perhaps we should have some Logical replication developer documentation
section and put all those three as subsections of that?

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 4
Date: 2015-03-25 03:11:26
Message-ID: CA+TgmoaJ_-CbnqiTw-xe_7_y-nPag=qwZrDYaCQUJvMECTcF7Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Feb 16, 2015 at 4:46 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> At a quick glance, this basic design seems workable. I would suggest
>> expanding the replication IDs to regular 4 byte oids. Two extra bytes is a
>> small price to pay, to make it work more like everything else in the system.
>
> I don't know. Growing from 3 to 5 byte overhead per relevant record (or
> even 0 to 5 in case the padding is reused) is rather noticeable. If we
> later find it to be a limit (I seriously doubt that), we can still
> increase it in a major release without anybody really noticing.

You might notice that Heikki is making the same point here that I've
attempted to make multiple times in the past: limiting to replication
identifier to 2 bytes because that's how much padding space you happen
to have available is optimizing for the wrong thing. What we should
be optimizing for is consistency and uniformity of design. System
catalogs have OIDs, so this one should, too. You're not going to be
able to paper over the fact that the column has some funky data type
that is unlike every other column in the system.

To the best of my knowledge, the statement that there is a noticeable
performance cost for those 2 extra bytes is also completely
unsupported by any actual benchmarking.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Petr Jelinek <petr(at)2ndquadrant(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>
Cc: pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: Replication identifiers, take 4
Date: 2015-03-28 22:50:20
Message-ID: 5517302C.6080007@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

So I did some more in depth look, I do have couple of comments.

I would really like to have something like "Logical Replication
Infrastructure" doc section that would have both decoding and
identifiers (and possibly even CommitTs) underneath.

There is typo in docs:
> + <para>
> + The optional <function>filter_by_origin_cb</function> callback
> + is called to determine wheter data that has been replayed

wheter -> whether

And finally I have issue with how the new identifiers are allocated.
Currently, if you create identifier 'foo', remove identifier 'foo' and
create identifier 'bar', the identifier 'bar' will have same id as the
old 'foo' identifier. This can be problem because the identifier id is
used as origin of the data and the replication solution using the
replication identifiers can end up thinking that data came from node
'bar' even though they came from the node 'foo' which no longer exists.
This can have bad effects for example on conflict detection or debugging
problems with replication.

Maybe another reason to use standard Oids?

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)anarazel(dot)de>
To: Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: Replication identifiers, take 4
Date: 2015-04-07 14:30:25
Message-ID: 20150407143025.GC12291@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2015-03-28 23:50:20 +0100, Petr Jelinek wrote:
> And finally I have issue with how the new identifiers are allocated.
> Currently, if you create identifier 'foo', remove identifier 'foo' and
> create identifier 'bar', the identifier 'bar' will have same id as the old
> 'foo' identifier. This can be problem because the identifier id is used as
> origin of the data and the replication solution using the replication
> identifiers can end up thinking that data came from node 'bar' even though
> they came from the node 'foo' which no longer exists. This can have bad
> effects for example on conflict detection or debugging problems with
> replication.
>
> Maybe another reason to use standard Oids?

As the same reason exists for oids, just somewhat less likely, I don't
see it as a reason for much. It's really not that hard to get oid
conflicts once your server has lived for a while. As soon as the oid
counter has wrapped around once, it's not unlikely to have
conflicts. And with temp tables (or much more extremely WITH OID tables)
and such it's not that hard to reach that point. The only material
difference this makes is that it's much easier to notice the problem
during development.

Greetings,

Andres Freund


From: Andres Freund <andres(at)anarazel(dot)de>
To: Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: Replication identifiers, take 4
Date: 2015-04-07 14:37:05
Message-ID: 20150407143705.GD12291@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2015-04-07 16:30:25 +0200, Andres Freund wrote:
> And with temp tables (or much more extremely WITH OID tables)
> and such it's not that hard to reach that point.

Oh, and obviously toast data. A couple tables with toasted columns is
also a good way to rapidly consume oids.

Greetings,

Andres Freund


From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 4
Date: 2015-04-07 15:08:16
Message-ID: 20150407150816.GF12291@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2015-03-24 23:11:26 -0400, Robert Haas wrote:
> On Mon, Feb 16, 2015 at 4:46 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> >> At a quick glance, this basic design seems workable. I would suggest
> >> expanding the replication IDs to regular 4 byte oids. Two extra bytes is a
> >> small price to pay, to make it work more like everything else in the system.
> >
> > I don't know. Growing from 3 to 5 byte overhead per relevant record (or
> > even 0 to 5 in case the padding is reused) is rather noticeable. If we
> > later find it to be a limit (I seriously doubt that), we can still
> > increase it in a major release without anybody really noticing.
>
> You might notice that Heikki is making the same point here that I've
> attempted to make multiple times in the past: limiting to replication
> identifier to 2 bytes because that's how much padding space you happen
> to have available is optimizing for the wrong thing. What we should
> be optimizing for is consistency and uniformity of design. System
> catalogs have OIDs, so this one should, too. You're not going to be
> able to paper over the fact that the column has some funky data type
> that is unlike every other column in the system.
>
> To the best of my knowledge, the statement that there is a noticeable
> performance cost for those 2 extra bytes is also completely
> unsupported by any actual benchmarking.

I'm starting benchmarks now.

But I have to say: I find the idea that you'd need more than 2^16
identifiers anytime soon not very credible. The likelihood that
replication identifiers are the limiting factor towards that seems
incredibly small. Just consider how you'd apply changes from so many
remotes; how to stream changes to them; how to even configure such a
complex setup. We can easily change the size limits in the next major
release without anybody being inconvenienced.

We've gone through quite some lengths reducing the overhead of WAL. I
don't understand why it's important that we do not make compromises
here; but why that doesn't matter elsewhere.

Greetings,

Andres Freund


From: Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>, Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: Replication identifiers, take 4
Date: 2015-04-07 15:57:27
Message-ID: 5523FE67.5020900@BlueTreble.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 4/7/15 9:30 AM, Andres Freund wrote:
> On 2015-03-28 23:50:20 +0100, Petr Jelinek wrote:
>> And finally I have issue with how the new identifiers are allocated.
>> Currently, if you create identifier 'foo', remove identifier 'foo' and
>> create identifier 'bar', the identifier 'bar' will have same id as the old
>> 'foo' identifier. This can be problem because the identifier id is used as
>> origin of the data and the replication solution using the replication
>> identifiers can end up thinking that data came from node 'bar' even though
>> they came from the node 'foo' which no longer exists. This can have bad
>> effects for example on conflict detection or debugging problems with
>> replication.
>>
>> Maybe another reason to use standard Oids?
>
> As the same reason exists for oids, just somewhat less likely, I don't
> see it as a reason for much. It's really not that hard to get oid
> conflicts once your server has lived for a while. As soon as the oid
> counter has wrapped around once, it's not unlikely to have
> conflicts. And with temp tables (or much more extremely WITH OID tables)
> and such it's not that hard to reach that point. The only material
> difference this makes is that it's much easier to notice the problem
> during development.

Why not just create a sequence? I suspect it may not be as fast to
assign as an OID, but it's not like you'd be doing this all the time.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


From: Andres Freund <andres(at)anarazel(dot)de>
To: Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: Replication identifiers, take 4
Date: 2015-04-07 15:58:18
Message-ID: 20150407155818.GG12291@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Why not just create a sequence? I suspect it may not be as fast to assign as
> an OID, but it's not like you'd be doing this all the time.

What does that have to do with the thread?

Greetings,

Andres Freund


From: Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: Replication identifiers, take 4
Date: 2015-04-07 22:39:32
Message-ID: 55245CA4.5050306@BlueTreble.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 4/7/15 10:58 AM, Andres Freund wrote:
>> Why not just create a sequence? I suspect it may not be as fast to assign as
>> an OID, but it's not like you'd be doing this all the time.
>
> What does that have to do with the thread?

The original bit was...

> And finally I have issue with how the new identifiers are allocated.
> Currently, if you create identifier 'foo', remove identifier 'foo' and
> create identifier 'bar', the identifier 'bar' will have same id as the old
> 'foo' identifier. This can be problem because the identifier id is used as
> origin of the data and the replication solution using the replication
> identifiers can end up thinking that data came from node 'bar' even though
> they came from the node 'foo' which no longer exists. This can have bad
> effects for example on conflict detection or debugging problems with
> replication.
>
> Maybe another reason to use standard Oids?

Wasn't the reason for using OIDs so that we're not doing the equivalent
of max(identifier) + 1?

Perhaps I'm just confused...
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: Replication identifiers, take 4
Date: 2015-04-08 04:59:46
Message-ID: CAB7nPqQt3g_wvPNOb5SYnYVBF0v=4AJO5w3+hnZg5iU-eGKWSA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Apr 7, 2015 at 11:37 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> On 2015-04-07 16:30:25 +0200, Andres Freund wrote:
>> And with temp tables (or much more extremely WITH OID tables)
>> and such it's not that hard to reach that point.
>
> Oh, and obviously toast data. A couple tables with toasted columns is
> also a good way to rapidly consume oids.

You are forgetting as well large objects on the stack, when client
application does not assign an OID by itself.
--
Michael


From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>
Subject: Re: Replication identifiers, take 4
Date: 2015-04-08 12:17:04
Message-ID: 55251C40.1030009@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 08/04/15 06:59, Michael Paquier wrote:
> On Tue, Apr 7, 2015 at 11:37 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>> On 2015-04-07 16:30:25 +0200, Andres Freund wrote:
>>> And with temp tables (or much more extremely WITH OID tables)
>>> and such it's not that hard to reach that point.
>>
>> Oh, and obviously toast data. A couple tables with toasted columns is
>> also a good way to rapidly consume oids.
>
> You are forgetting as well large objects on the stack, when client
> application does not assign an OID by itself.
>

And you guys are not getting my point. What I proposed was to not reuse
the RI id immediately because that can make debugging issues with
replication/conflict handling harder when something happens after
cluster configuration has changed. Whether it's done using Oid or some
other way, I don't really care and wrapping around eventually is ok,
since the old origin info for transactions will be cleared out during
the freeze at the latest anyway.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)anarazel(dot)de>
To: Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>
Subject: Re: Replication identifiers, take 4
Date: 2015-04-08 12:22:29
Message-ID: 20150408122229.GA9764@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2015-04-08 14:17:04 +0200, Petr Jelinek wrote:
> And you guys are not getting my point. What I proposed was to not reuse the
> RI id immediately because that can make debugging issues with
> replication/conflict handling harder when something happens after cluster
> configuration has changed.

If that's the goal, you shouldn't delete the replication identifier at
that point. That's the only sane way preventing it from being reused.

> Whether it's done using Oid or some other way, I don't really care and
> wrapping around eventually is ok, since the old origin info for
> transactions will be cleared out during the freeze at the latest
> anyway.

How are you proposing to do the allocation then? There's no magic
preventing immediate reuse with oids or anything else. The oid counter
might *already* have wrapped around and point exactly to the identifier
you're about to delete. Then when you deleted it it's going to be reused
for the next allocated oid.

Andres Freund


From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>
Subject: Re: Replication identifiers, take 4
Date: 2015-04-08 12:24:53
Message-ID: 55251E15.6030704@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 08/04/15 14:22, Andres Freund wrote:
> On 2015-04-08 14:17:04 +0200, Petr Jelinek wrote:
>> And you guys are not getting my point. What I proposed was to not reuse the
>> RI id immediately because that can make debugging issues with
>> replication/conflict handling harder when something happens after cluster
>> configuration has changed.
>
> If that's the goal, you shouldn't delete the replication identifier at
> that point. That's the only sane way preventing it from being reused.
>

Ok, I am happy with that solution.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 4
Date: 2015-04-10 16:03:05
Message-ID: 20150410160305.GG32335@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2015-04-07 17:08:16 +0200, Andres Freund wrote:
> I'm starting benchmarks now.

What I'm benchmarking here is the WAL overhead, since that's what we're
debating.

The test setup I used was a pgbench scale 10 instance. I've run with
full_page_write=off to have more reproducible results. This of course
over-emphasizes the overhead a bit. But for a long checkpoint interval
and a memory resident working set it's not that unrealistic.

I ran 50k transactions in a signle b
baseline:
- 20445024
- 20437128
- 20436864
- avg: 20439672
extern 2byte identifiers:
- 23318368
- 23148648
- 23128016
- avg: 23198344
- avg overhead: 13.5%
padding 2byte identifiers:
- 21160408
- 21319720
- 21164280
- avg: 21214802
- avg overhead: 3.8%
extern 4byte identifiers:
- 23514216
- 23540128
- 23523080
- avg: 23525808
- avg overhead: 15.1%

To me that shows pretty clearly that a) reusing the padding is
worthwhile b) even without that using 2byte instead of 4 byte
identifiers is beneficial.

Now. Especially in the case of extern identifiers we *can* optimize a
bit more. But there's no way we can get the efficiency of the version
reusing padding.

To run the benchmarks you need to
SELECT pg_replication_identifier_create('frak');
before starting pgbench with the attached file.

Greetings,

Andres Freund

Attachment Content-Type Size
pgbench.sql text/plain 790 bytes

From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Steve Singer <steve(at)ssinger(dot)info>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 4
Date: 2015-04-11 23:56:49
Message-ID: 5529B4C1.8030604@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/04/15 18:03, Andres Freund wrote:
> On 2015-04-07 17:08:16 +0200, Andres Freund wrote:
>> I'm starting benchmarks now.
>
> What I'm benchmarking here is the WAL overhead, since that's what we're
> debating.
>
> The test setup I used was a pgbench scale 10 instance. I've run with
> full_page_write=off to have more reproducible results. This of course
> over-emphasizes the overhead a bit. But for a long checkpoint interval
> and a memory resident working set it's not that unrealistic.
>
> I ran 50k transactions in a signle b
> baseline:
> - 20445024
> - 20437128
> - 20436864
> - avg: 20439672
> extern 2byte identifiers:
> - 23318368
> - 23148648
> - 23128016
> - avg: 23198344
> - avg overhead: 13.5%
> padding 2byte identifiers:
> - 21160408
> - 21319720
> - 21164280
> - avg: 21214802
> - avg overhead: 3.8%
> extern 4byte identifiers:
> - 23514216
> - 23540128
> - 23523080
> - avg: 23525808
> - avg overhead: 15.1%
>
> To me that shows pretty clearly that a) reusing the padding is
> worthwhile b) even without that using 2byte instead of 4 byte
> identifiers is beneficial.
>

My opinion is that 10% of WAL size difference is quite high price to pay
so that we can keep the padding for some other, yet unknown feature that
hasn't come up in several years, which would need those 2 bytes.

But if we are willing to pay it then we can really go all the way and
just use Oids...

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Petr Jelinek <petr(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Steve Singer <steve(at)ssinger(dot)info>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 4
Date: 2015-04-12 19:02:38
Message-ID: 552AC14E.7080308@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 04/12/2015 02:56 AM, Petr Jelinek wrote:
> On 10/04/15 18:03, Andres Freund wrote:
>> On 2015-04-07 17:08:16 +0200, Andres Freund wrote:
>>> I'm starting benchmarks now.
>>
>> What I'm benchmarking here is the WAL overhead, since that's what we're
>> debating.
>>
>> The test setup I used was a pgbench scale 10 instance. I've run with
>> full_page_write=off to have more reproducible results. This of course
>> over-emphasizes the overhead a bit. But for a long checkpoint interval
>> and a memory resident working set it's not that unrealistic.
>>
>> I ran 50k transactions in a signle b
>> baseline:
>> - 20445024
>> - 20437128
>> - 20436864
>> - avg: 20439672
>> extern 2byte identifiers:
>> - 23318368
>> - 23148648
>> - 23128016
>> - avg: 23198344
>> - avg overhead: 13.5%
>> padding 2byte identifiers:
>> - 21160408
>> - 21319720
>> - 21164280
>> - avg: 21214802
>> - avg overhead: 3.8%
>> extern 4byte identifiers:
>> - 23514216
>> - 23540128
>> - 23523080
>> - avg: 23525808
>> - avg overhead: 15.1%
>>
>> To me that shows pretty clearly that a) reusing the padding is
>> worthwhile b) even without that using 2byte instead of 4 byte
>> identifiers is beneficial.
>>
>
> My opinion is that 10% of WAL size difference is quite high price to pay
> so that we can keep the padding for some other, yet unknown feature that
> hasn't come up in several years, which would need those 2 bytes.
>
> But if we are willing to pay it then we can really go all the way and
> just use Oids...

This needs to be weighed against removing the padding bytes altogether.
See attached. That would reduce the WAL size further when you don't need
replication IDs. It's very straightforward, but need to do some
performance/scalability testing to make sure that using memcpy instead
of a straight 32-bit assignment doesn't hurt performance, since it
happens in very performance critical paths.

I'm surprised there's such a big difference between the "extern" and
"padding" versions above. At a quick approximation, storing the ID as a
separate "fragment", along with XLogRecordDataHeaderShort and
XLogRecordDataHeaderLong, should add one byte of overhead plus the ID
itself. So that would be 3 extra bytes for 2-byte identifiers, or 5
bytes for 4-byte identifiers. Does that mean that the average record
length is only about 30 bytes? That's what it seems like, if adding the
"extern 2 byte identifiers" added about 10% of overhead compared to the
"padding 2 byte identifiers" version. That doesn't sound right, 30 bytes
is very little. Perhaps the size of the records created by pgbench
happen to cross a 8-byte alignment boundary at that point, making a big
difference. In another workload, there might be no difference at all,
due to alignment.

Also, you don't need to tag every record type with the replication ID.
All indexam records can skip it, for starters, since logical decoding
doesn't care about them. That should remove a fair amount of bloat.

- Heikki

Attachment Content-Type Size
remove-xlogrecord-padding-1.patch application/x-patch 3.8 KB

From: Andres Freund <andres(at)anarazel(dot)de>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 4
Date: 2015-04-17 08:54:51
Message-ID: 20150417085451.GJ2361@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2015-04-12 22:02:38 +0300, Heikki Linnakangas wrote:
> This needs to be weighed against removing the padding bytes
> altogether.

Hrmpf. Says the person that used a lot of padding, without much
discussion, for the WAL level infrastructure making pg_rewind more
maintainable. And you deemed to be perfectly ok to use them up to avoid
*increasing* the WAL size with the *additional data* (which so far
nothing but pg_rewind needs in that way). While it perfectly well could
have been used to shrink the WAL size to less than it now is. And that's
*far*, *far* harder to back out/refactor changes than this (which are
pretty localized and thus can easily be changed); to the point that I
think it's infeasible to do so...

If you want to shrink the WAL size, send in a patch independently. Not
as a way to block somebody else implementing something.

> I'm surprised there's such a big difference between the "extern" and
> "padding" versions above. At a quick approximation, storing the ID as a
> separate "fragment", along with XLogRecordDataHeaderShort and
> XLogRecordDataHeaderLong, should add one byte of overhead plus the ID
> itself. So that would be 3 extra bytes for 2-byte identifiers, or 5 bytes
> for 4-byte identifiers. Does that mean that the average record length is
> only about 30 bytes?

Yes, nearly. That's xlogdump --stats=record from the above scenario with
replication identifiers used and reusing the padding:

Type N (%) Record size (%) FPI size (%) Combined size (%)
---- - --- ----------- --- -------- --- ------------- ---
Transaction/COMMIT 50003 ( 16.89) 2600496 ( 23.38) 0 ( -nan) 2600496 ( 23.38)
CLOG/ZEROPAGE 1 ( 0.00) 28 ( 0.00) 0 ( -nan) 28 ( 0.00)
Standby/RUNNING_XACTS 5 ( 0.00) 248 ( 0.00) 0 ( -nan) 248 ( 0.00)
Heap2/CLEAN 46034 ( 15.55) 1473088 ( 13.24) 0 ( -nan) 1473088 ( 13.24)
Heap2/VISIBLE 2 ( 0.00) 56 ( 0.00) 0 ( -nan) 56 ( 0.00)
Heap/INSERT 49682 ( 16.78) 1341414 ( 12.06) 0 ( -nan) 1341414 ( 12.06)
Heap/HOT_UPDATE 150013 ( 50.67) 5700494 ( 51.24) 0 ( -nan) 5700494 ( 51.24)
Heap/INPLACE 5 ( 0.00) 130 ( 0.00) 0 ( -nan) 130 ( 0.00)
Heap/INSERT+INIT 318 ( 0.11) 8586 ( 0.08) 0 ( -nan) 8586 ( 0.08)
Btree/VACUUM 2 ( 0.00) 56 ( 0.00) 0 ( -nan) 56 ( 0.00)
-------- -------- -------- --------
Total 296065 11124596 [100.00%] 0 [0.00%] 11124596 [100%

(The FPI percentage display above is arguably borked. Interesting.)

So the average record size is ~37.5 bytes including the increased commit
record size due to the origin information (which is the part that
increases the size for that version that reuses the padding).

This *most definitely* isn't representative of every workload. But it
*is* *a* common type of workload.

Note that --stats will *not* show the size difference in xlog records
when adding data as an additional chunk vs. padding as it uses
XLogRecGetDataLen() to compute the record length... That confused me for
a while.

> That doesn't sound right, 30 bytes is very little.

Well, it's mostly HOT_UPDATES and INSERTS into not indexed tables. So
that's not too surprising. Obviously that'd look different with FPIs
enabled.

> Perhaps the size
> of the records created by pgbench happen to cross a 8-byte alignment
> boundary at that point, making a big difference. In another workload,
> there might be no difference at all, due to alignment.

Right.

> Also, you don't need to tag every record type with the replication ID. All
> indexam records can skip it, for starters, since logical decoding doesn't
> care about them. That should remove a fair amount of bloat.

Yes. I mentioned that. It's additional complexity because now the
decision has to be made at each xlog insertion callsite. Making
refactoring this into a different representation a bit harder. I don't
think it will make that much of a differenced in the above workload
(just CLEAN will be smaller); but it clearly might in others.

I've attached a rebased patch, that adds decision about origin logging
to the relevant XLogInsert() callsites for "external" 2 byte identifiers
and removes the pad-reusing version in the interest of moving forward. I
still don't see a point in using 4 byte identifiers atm, given the above
numbers that just seems like a waste for unrealistic use cases (>2^16
nodes). It's just two lines to change if we feel the need in the future.

Working on fixing the issue with WAL logging of deletions and
rearranging docs as Petr suggested. Not sure if the latter will really
look good, but I guess we'll see ;)

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
0001-Introduce-replication-identifiers-v1.1.patch text/x-patch 125.2 KB

From: Simon Riggs <simon(dot)riggs(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 4
Date: 2015-04-17 09:04:41
Message-ID: CANP8+jLNByVYJn1Q3scYf-iZ8Q0d4OehzboxP4wnV9roBdqctQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 17 April 2015 at 09:54, Andres Freund <andres(at)anarazel(dot)de> wrote:

> Hrmpf. Says the person that used a lot of padding, without much
> discussion, for the WAL level infrastructure making pg_rewind more
> maintainable.

Sounds bad. What padding are we talking about?

--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Abhijit Menon-Sen <ams(at)2ndQuadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: fix xlogdump percentage display (was Re: Replication identifiers, take 4)
Date: 2015-04-17 09:18:11
Message-ID: 20150417091811.GA14008@toroid.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

At 2015-04-17 10:54:51 +0200, andres(at)anarazel(dot)de wrote:
>
> (The FPI percentage display above is arguably borked. Interesting.)

Sorry for the trouble. Patch attached.

-- Abhijit

Attachment Content-Type Size
0001-Don-t-divide-by-zero-when-calculating-percentages.patch text/x-diff 3.9 KB

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Simon Riggs <simon(dot)riggs(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 4
Date: 2015-04-17 17:12:32
Message-ID: 55313F00.7010609@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 04/17/2015 12:04 PM, Simon Riggs wrote:
> On 17 April 2015 at 09:54, Andres Freund <andres(at)anarazel(dot)de> wrote:
>
>> Hrmpf. Says the person that used a lot of padding, without much
>> discussion, for the WAL level infrastructure making pg_rewind more
>> maintainable.
>
> Sounds bad. What padding are we talking about?

In the new WAL format, the data chunks are stored unaligned, without
padding, to save space. The new format is quite different to the old
one, so it's not straightforward to compare how much that saved. The
fixed-size XLogRecord header is 8 bytes shorter in the new format,
because it doesn't have the xl_len field anymore. But the same
information is stored elsewhere in the record, where it takes 2 or 5
bytes (XLogRecordDataHeaderShort/Long).

But it's a fair point that we could've just made small adjustments to
the old format, without revamping every record type and the way the
block information is stored, and that the space saving of the new format
should be compared with that instead, for a fair comparison.

As an example, one simple thing we could've done with the old format:
remove xl_len, and store the length in place of the two unused padding
bytes instead, as long as it fits in 16 bits. For longer records, set a
flag and store it right after XLogRecord header. For practically all WAL
records, that would've shrunk XLogRecord from 32 to 24 bytes, and made
each record 8 bytes shorter.

I ran the same pgbench test Andres used, with scale 10, and 50000
transactions, and compared the WAL size between master and 9.4:

master: 20738352
9.4: 23915800

According to pg_xlogdump, there were 301153 WAL records. If you take the
9.4 figure, and imagine that we had saved those 8 bytes on each WAL
record, 9.4 would've been 21506576 bytes instead. So yeah, we could've
achieved much of the WAL savings with that much smaller change. That's a
useful thing to compare with.

BTW, those numbers are with wal_level=minimal. With wal_level=logical,
the WAL size from the same test on master was 26503520 bytes. That's
quite a bump. Looking at pg_xlogdump output, it seems that it's all
because the commit records are wider.

- Heikki


From: Simon Riggs <simon(dot)riggs(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 4
Date: 2015-04-17 17:36:48
Message-ID: CANP8+jKhb=ahExxmDhFjO+p_-gEOXAY6moytOZSeYczsySGWGQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 17 April 2015 at 18:12, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:

> On 04/17/2015 12:04 PM, Simon Riggs wrote:
>
>> On 17 April 2015 at 09:54, Andres Freund <andres(at)anarazel(dot)de> wrote:
>>
>> Hrmpf. Says the person that used a lot of padding, without much
>>> discussion, for the WAL level infrastructure making pg_rewind more
>>> maintainable.
>>>
>>
>> Sounds bad. What padding are we talking about?
>>
>
> In the new WAL format, the data chunks are stored unaligned, without
> padding, to save space. The new format is quite different to the old one,
> so it's not straightforward to compare how much that saved.

The key point here is the whole WAL format was changed to accommodate a
minor requirement for one utility. Please notice that nobody tried to stop
you doing that.

The changes Andres is requesting have a very significant effect on a major
new facility. Perhaps there is concern that it is an external utility?

If we can trust Heikki to include code into core that was written
externally then I think we can do the same for Andres.

I think its time to stop the padding discussion and commit something
useful. We need this.

--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Simon Riggs <simon(dot)riggs(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 4
Date: 2015-04-17 18:18:08
Message-ID: 55314E60.3090809@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 04/17/2015 08:36 PM, Simon Riggs wrote:
> On 17 April 2015 at 18:12, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>
>> On 04/17/2015 12:04 PM, Simon Riggs wrote:
>>
>>> On 17 April 2015 at 09:54, Andres Freund <andres(at)anarazel(dot)de> wrote:
>>>
>>> Hrmpf. Says the person that used a lot of padding, without much
>>>> discussion, for the WAL level infrastructure making pg_rewind more
>>>> maintainable.
>>>
>>> Sounds bad. What padding are we talking about?
>>
>> In the new WAL format, the data chunks are stored unaligned, without
>> padding, to save space. The new format is quite different to the old one,
>> so it's not straightforward to compare how much that saved.
>
> The key point here is the whole WAL format was changed to accommodate a
> minor requirement for one utility. Please notice that nobody tried to stop
> you doing that.
>
> The changes Andres is requesting have a very significant effect on a major
> new facility. Perhaps there is concern that it is an external utility?
>
> If we can trust Heikki to include code into core that was written
> externally then I think we can do the same for Andres.

I'm not concerned of the fact it is an external utility. Well, it
concerns me a little bit, because that means that it'll get little
testing with PostgreSQL. But that has nothing to do with the WAL size
question.

> I think its time to stop the padding discussion and commit something
> useful. We need this.

To be honest, I'm not entirely sure what we're arguing over. I said that
IMO the difference in WAL size is so small that we should just use
4-byte OIDs for the replication identifiers, instead of trying to make
do with 2 bytes. Not because I find it too likely that you'll run out of
IDs (although it could happen), but more to make replication IDs more
like all other system objects we have. Andreas did some pgbench
benchmarking to show that the difference in WAL size is about 10%. The
WAL records generated by pgbench happen to have just the right sizes so
that the 2-3 extra bytes bump them over to the next alignment boundary.
That's why there is such a big difference - on average it'll be less. I
think that's acceptable, Andreas seems to think otherwise. But if the
WAL size really is so precious, we could remove the two padding bytes
from XLogRecord, instead of dedicating them for the replication ids.
That would be an even better use for them.

- Heikki


From: Simon Riggs <simon(dot)riggs(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 4
Date: 2015-04-17 20:36:24
Message-ID: CANP8+j+ih6Z86fAmy2J1pLjPLgUGPUYoXmr_XOsGVrmZc3GXTg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 17 April 2015 at 19:18, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:

> To be honest, I'm not entirely sure what we're arguing over.

When arguing over something you consider small, it is customary to allow
the author precedence. We can't do things our own way all the time.

I didn't much like pg_rewind, but it doesn't hurt and you like it, so I
didn't object. We've all got better things to do.

> I said that IMO the difference in WAL size is so small that we should just
> use 4-byte OIDs for the replication identifiers, instead of trying to make
> do with 2 bytes. Not because I find it too likely that you'll run out of
> IDs (although it could happen), but more to make replication IDs more like
> all other system objects we have. Andreas did some pgbench benchmarking to
> show that the difference in WAL size is about 10%. The WAL records
> generated by pgbench happen to have just the right sizes so that the 2-3
> extra bytes bump them over to the next alignment boundary. That's why there
> is such a big difference - on average it'll be less. I think that's
> acceptable, Andreas seems to think otherwise. But if the WAL size really is
> so precious, we could remove the two padding bytes from XLogRecord, instead
> of dedicating them for the replication ids. That would be an even better
> use for them.

The argument to move to 4 bytes is a poor one. If it was reasonable in
terms of code or cosmetic value then all values used in the backend would
be 4 bytes. We wouldn't have any 2 byte values anywhere. But we don't do
that.

The change does nothing useful, since I doubt anyone will ever need >32768
nodes in their cluster.

Increasing WAL size for any non-zero amount is needlessly wasteful for a
change with only cosmetic value. But for a change that has significant
value for database resilience, it is a sensible use of bytes.

+1 to Andres' very reasonable suggestion. Lets commit this and go home.

--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Simon Riggs <simon(dot)riggs(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 4
Date: 2015-04-17 20:45:31
Message-ID: 553170EB.60807@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 17/04/15 22:36, Simon Riggs wrote:
>
> I said that IMO the difference in WAL size is so small that we
> should just use 4-byte OIDs for the replication identifiers, instead
> of trying to make do with 2 bytes. Not because I find it too likely
> that you'll run out of IDs (although it could happen), but more to
> make replication IDs more like all other system objects we have.
> Andreas did some pgbench benchmarking to show that the difference in
> WAL size is about 10%. The WAL records generated by pgbench happen
> to have just the right sizes so that the 2-3 extra bytes bump them
> over to the next alignment boundary. That's why there is such a big
> difference - on average it'll be less. I think that's acceptable,
> Andreas seems to think otherwise. But if the WAL size really is so
> precious, we could remove the two padding bytes from XLogRecord,
> instead of dedicating them for the replication ids. That would be an
> even better use for them.
>
>
> The argument to move to 4 bytes is a poor one. If it was reasonable in
> terms of code or cosmetic value then all values used in the backend
> would be 4 bytes. We wouldn't have any 2 byte values anywhere. But we
> don't do that.
>
> The change does nothing useful, since I doubt anyone will ever need
> >32768 nodes in their cluster.
>

And if they did there would be other much bigger problems than
replication identifier being 16bit (it's actually >65534 as it's
unsigned btw).

Considering the importance and widespread use of replication I think we
should really make sure the related features have small overhead.

> Increasing WAL size for any non-zero amount is needlessly wasteful for a
> change with only cosmetic value. But for a change that has significant
> value for database resilience, it is a sensible use of bytes.
>
> +1 to Andres' very reasonable suggestion. Lets commit this and go home.
>

+1

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Simon Riggs <simon(dot)riggs(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 4
Date: 2015-04-20 08:05:47
Message-ID: 5534B35B.8040309@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 04/17/2015 11:36 PM, Simon Riggs wrote:
> On 17 April 2015 at 19:18, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>> To be honest, I'm not entirely sure what we're arguing over.
>
> When arguing over something you consider small, it is customary to allow
> the author precedence. We can't do things our own way all the time.

Sure, I'm not going to throw a tantrum if Andres commits this as it is.

>> I said that IMO the difference in WAL size is so small that we should just
>> use 4-byte OIDs for the replication identifiers, instead of trying to make
>> do with 2 bytes. Not because I find it too likely that you'll run out of
>> IDs (although it could happen), but more to make replication IDs more like
>> all other system objects we have. Andreas did some pgbench benchmarking to
>> show that the difference in WAL size is about 10%. The WAL records
>> generated by pgbench happen to have just the right sizes so that the 2-3
>> extra bytes bump them over to the next alignment boundary. That's why there
>> is such a big difference - on average it'll be less. I think that's
>> acceptable, Andreas seems to think otherwise. But if the WAL size really is
>> so precious, we could remove the two padding bytes from XLogRecord, instead
>> of dedicating them for the replication ids. That would be an even better
>> use for them.
>
> The argument to move to 4 bytes is a poor one. If it was reasonable in
> terms of code or cosmetic value then all values used in the backend would
> be 4 bytes. We wouldn't have any 2 byte values anywhere. But we don't do
> that.

That's a straw man argument. I'm not saying we should never use 2 byte
values anywhere. OID is usually used as the primary key in system
tables. There are exceptions, but that is nevertheless the norm. I'm
saying that saving in WAL size is not worth making an exception here,
and we should go with the simplest option of using OIDs.

- Heikki


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Petr Jelinek <petr(at)2ndquadrant(dot)com>, Simon Riggs <simon(dot)riggs(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 4
Date: 2015-04-20 08:09:20
Message-ID: 5534B430.4090301@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 04/17/2015 11:45 PM, Petr Jelinek wrote:
>> >The argument to move to 4 bytes is a poor one. If it was reasonable in
>> >terms of code or cosmetic value then all values used in the backend
>> >would be 4 bytes. We wouldn't have any 2 byte values anywhere. But we
>> >don't do that.
>> >
>> >The change does nothing useful, since I doubt anyone will ever need
>> > >32768 nodes in their cluster.
>> >
> And if they did there would be other much bigger problems than
> replication identifier being 16bit (it's actually >65534 as it's
> unsigned btw).

Can you name some of the bigger problems you'd have?

Obviously, if you have 100000 high-volume OLTP nodes connected to a
single server, feeding transactions as a continous stream, you're going
to choke the system. But you might have 100000 tiny satellite databases
that sync up with the master every few hours, and each of them do only a
few updates per day.

- Heikki


From: Andres Freund <andres(at)anarazel(dot)de>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, Simon Riggs <simon(dot)riggs(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 4
Date: 2015-04-20 08:19:56
Message-ID: 20150420081956.GA25107@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2015-04-20 11:09:20 +0300, Heikki Linnakangas wrote:
> Can you name some of the bigger problems you'd have?

Several parts of the system are O(#max_replication_slots). Having 100k
outgoing logical replication slots is going to be expensive.

Nothing unsolvable, but the 65k 16 bit limit surely isn't going to be
the biggest problem.

Greetings,

Andres Freund


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 4
Date: 2015-04-20 08:26:29
Message-ID: 5534B835.1010402@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 04/17/2015 11:54 AM, Andres Freund wrote:
> I've attached a rebased patch, that adds decision about origin logging
> to the relevant XLogInsert() callsites for "external" 2 byte identifiers
> and removes the pad-reusing version in the interest of moving forward.

Putting aside the 2 vs. 4 byte identifier issue, let's discuss naming:

I just realized that it talks about "replication identifier" as the new
fundamental concept. The system table is called
"pg_replication_identifier". But that's like talking about "index
identifiers", instead of just indexes, and calling the system table
pg_index_oid.

The important concept this patch actually adds is the *origin* of each
transaction. That term is already used in some parts of the patch. I
think we should roughly do a search-replace of "replication identifier"
-> "replication origin" to the patch. Or even "transaction origin".

- Heikki


From: Andres Freund <andres(at)anarazel(dot)de>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 4
Date: 2015-04-20 08:28:02
Message-ID: 20150420082802.GB25107@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote:
> I just realized that it talks about "replication identifier" as the new
> fundamental concept. The system table is called "pg_replication_identifier".
> But that's like talking about "index identifiers", instead of just indexes,
> and calling the system table pg_index_oid.
>
> The important concept this patch actually adds is the *origin* of each
> transaction. That term is already used in some parts of the patch. I think
> we should roughly do a search-replace of "replication identifier" ->
> "replication origin" to the patch. Or even "transaction origin".

Sounds good to me.

Greetings,

Andres Freund


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 4
Date: 2015-04-21 07:50:19
Message-ID: CANP8+jLZhntGSQ4FzPAGnPqyAGAfW_oK_zYpKZxB+WyWw47TsQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 20 April 2015 at 09:28, Andres Freund <andres(at)anarazel(dot)de> wrote:

> On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote:
> > I just realized that it talks about "replication identifier" as the new
> > fundamental concept. The system table is called
> "pg_replication_identifier".
> > But that's like talking about "index identifiers", instead of just
> indexes,
> > and calling the system table pg_index_oid.
> >
> > The important concept this patch actually adds is the *origin* of each
> > transaction. That term is already used in some parts of the patch. I
> think
> > we should roughly do a search-replace of "replication identifier" ->
> > "replication origin" to the patch. Or even "transaction origin".
>
> Sounds good to me.
>

+1

--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Andres Freund <andres(at)anarazel(dot)de>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 4
Date: 2015-04-21 12:08:56
Message-ID: 20150421120856.GG14483@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2015-04-20 10:28:02 +0200, Andres Freund wrote:
> On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote:
> > I just realized that it talks about "replication identifier" as the new
> > fundamental concept. The system table is called "pg_replication_identifier".
> > But that's like talking about "index identifiers", instead of just indexes,
> > and calling the system table pg_index_oid.
> >
> > The important concept this patch actually adds is the *origin* of each
> > transaction. That term is already used in some parts of the patch. I think
> > we should roughly do a search-replace of "replication identifier" ->
> > "replication origin" to the patch. Or even "transaction origin".
>
> Sounds good to me.

I'm working on changing this (I've implemented the missing WAL
bits). I'd like to discuss the new terms for a sec, before I go and
revise the docs.

I'm now calling the feature 'replication progress tracking'. There's
"replication origins" and there's progress tracking infrastructure that
tracks how far data from a "replication origin" has replicated.

Catalog wise there's an actual table 'pg_replication_origin' that maps
between 'roident' and 'roname'. There's a pg_replication_progress view
(used to be named pg_replication_identifier_progress). I'm not sure if
the latter name isn't too generic? Maybe
pg_logical_replication_progress?

I've now named the functions:

* pg_replication_origin_create
* pg_replication_origin_drop
* pg_replication_origin_get (map from name to id)
* pg_replication_progress_setup_origin : configure session to replicate
from a specific origin
* pg_replication_progress_reset_origin
* pg_replication_progress_setup_tx_details : configure per transaction
details (LSN and timestamp currently)
* pg_replication_progress_is_replaying : Is a origin configured for the session
* pg_replication_progress_advance : "manually" set the replication
progress to a value. Primarily useful for copying values from other
systems and such.
* pg_replication_progress_get : How far did replay progress for a
certain origin
* pg_get_replication_progress : SRF returning the replay progress for
all origin.

Any comments?

Andres


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 4
Date: 2015-04-21 15:20:42
Message-ID: 20150421152042.GC4369@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andres Freund wrote:

> I'm working on changing this (I've implemented the missing WAL
> bits). I'd like to discuss the new terms for a sec, before I go and
> revise the docs.
>
> I'm now calling the feature 'replication progress tracking'. There's
> "replication origins" and there's progress tracking infrastructure that
> tracks how far data from a "replication origin" has replicated.

Sounds good to me.

> Catalog wise there's an actual table 'pg_replication_origin' that maps
> between 'roident' and 'roname'. There's a pg_replication_progress view
> (used to be named pg_replication_identifier_progress). I'm not sure if
> the latter name isn't too generic? Maybe
> pg_logical_replication_progress?

I think if we wanted "pg_logical_replication_progress" (and I don't
really agree that we do) then we would add the "logical" bit to the
names above as well. This seems unnecessary. pg_replication_progress
seems okay to me.

> I've now named the functions:
>
> * pg_replication_origin_create
> * pg_replication_origin_drop
> * pg_replication_origin_get (map from name to id)
> * pg_replication_progress_setup_origin : configure session to replicate
> from a specific origin
> * pg_replication_progress_reset_origin
> * pg_replication_progress_is_replaying : Is a origin configured for the session
> * pg_replication_progress_advance : "manually" set the replication
> progress to a value. Primarily useful for copying values from other
> systems and such.

These all look acceptable to me.

> * pg_replication_progress_get : How far did replay progress for a
> certain origin
> * pg_get_replication_progress : SRF returning the replay progress for
> all origin.

This combination seems confusing. In some other thread not too long ago
there was the argument that "all functions 'get' something, so that verb
should not appear in the function name". That would call for
"pg_replication_progress" on the singleton. Maybe to distinguish the
SRF, add "all" as a suffix?

> * pg_replication_progress_setup_tx_details : configure per transaction
> details (LSN and timestamp currently)

Not sure about the "tx" here. We use "xact" as an abbreviation for
"transaction" in most places. If nowadays we don't like that term,
maybe just spell out "transaction" in full. I assume this function
pairs up with pg_replication_progress_setup_origin, yes?

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Andres Freund <andres(at)anarazel(dot)de>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 4
Date: 2015-04-21 15:30:27
Message-ID: 20150421153027.GJ14483@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2015-04-21 12:20:42 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
> > Catalog wise there's an actual table 'pg_replication_origin' that maps
> > between 'roident' and 'roname'. There's a pg_replication_progress view
> > (used to be named pg_replication_identifier_progress). I'm not sure if
> > the latter name isn't too generic? Maybe
> > pg_logical_replication_progress?
>
> I think if we wanted "pg_logical_replication_progress" (and I don't
> really agree that we do) then we would add the "logical" bit to the
> names above as well. This seems unnecessary. pg_replication_progress
> seems okay to me.

Cool.

> > * pg_replication_progress_get : How far did replay progress for a
> > certain origin
> > * pg_get_replication_progress : SRF returning the replay progress for
> > all origin.
>
> This combination seems confusing. In some other thread not too long ago
> there was the argument that "all functions 'get' something, so that verb
> should not appear in the function name".

> That would call for "pg_replication_progress" on the singleton.

Hm. I don't like that. That'd e.g. clash with the above view. I think
it's good to distinguish between functions (that have a verb in the
name) and views/tables (that don't).

I agree that the above combination isn't optimal. Although pg_get (and
pg_stat_get) is what's used for a lot of other SRF backed views. Maybe
naming the SRF pg_get_all_replication_progress?

> > * pg_replication_progress_setup_tx_details : configure per transaction
> > details (LSN and timestamp currently)
>
> Not sure about the "tx" here. We use "xact" as an abbreviation for
> "transaction" in most places.

Oh, yea. Xact is more consistent.

> If nowadays we don't like that term, maybe just spell out
> "transaction" in full. I assume this function pairs up with
> pg_replication_progress_setup_origin, yes?

pg_replication_progress_setup_origin sets up the per session state,
setup_xact_details the "per replayed transaction" state.

Greetings,

Andres Freund


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 4
Date: 2015-04-21 20:26:08
Message-ID: CA+TgmoY7cqW6-7798w5qcvZVw91mvVc=xdaGW6Fo2ms4Q4=F9w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Apr 21, 2015 at 8:08 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> I've now named the functions:
>
> * pg_replication_origin_create
> * pg_replication_origin_drop
> * pg_replication_origin_get (map from name to id)
> * pg_replication_progress_setup_origin : configure session to replicate
> from a specific origin
> * pg_replication_progress_reset_origin
> * pg_replication_progress_setup_tx_details : configure per transaction
> details (LSN and timestamp currently)
> * pg_replication_progress_is_replaying : Is a origin configured for the session
> * pg_replication_progress_advance : "manually" set the replication
> progress to a value. Primarily useful for copying values from other
> systems and such.
> * pg_replication_progress_get : How far did replay progress for a
> certain origin
> * pg_get_replication_progress : SRF returning the replay progress for
> all origin.
>
> Any comments?

Why are we using functions for this rather than DDL?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 4
Date: 2015-04-21 20:36:09
Message-ID: 20150421203609.GO14483@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2015-04-21 16:26:08 -0400, Robert Haas wrote:
> On Tue, Apr 21, 2015 at 8:08 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > I've now named the functions:
> >
> > * pg_replication_origin_create
> > * pg_replication_origin_drop
> > * pg_replication_origin_get (map from name to id)
> > * pg_replication_progress_setup_origin : configure session to replicate
> > from a specific origin
> > * pg_replication_progress_reset_origin
> > * pg_replication_progress_setup_tx_details : configure per transaction
> > details (LSN and timestamp currently)
> > * pg_replication_progress_is_replaying : Is a origin configured for the session
> > * pg_replication_progress_advance : "manually" set the replication
> > progress to a value. Primarily useful for copying values from other
> > systems and such.
> > * pg_replication_progress_get : How far did replay progress for a
> > certain origin
> > * pg_get_replication_progress : SRF returning the replay progress for
> > all origin.
> >
> > Any comments?
>
> Why are we using functions for this rather than DDL?

Unless I miss something the only two we really could use ddl for is
pg_replication_origin_create/pg_replication_origin_drop. We could use
DDL for them if we really want, but I'm not really seeing the advantage.

Greetings,

Andres Freund


From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Steve Singer <steve(at)ssinger(dot)info>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 4
Date: 2015-04-22 08:17:50
Message-ID: 5537592E.50507@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 21/04/15 22:36, Andres Freund wrote:
> On 2015-04-21 16:26:08 -0400, Robert Haas wrote:
>> On Tue, Apr 21, 2015 at 8:08 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>>> I've now named the functions:
>>>
>>> * pg_replication_origin_create
>>> * pg_replication_origin_drop
>>> * pg_replication_origin_get (map from name to id)
>>> * pg_replication_progress_setup_origin : configure session to replicate
>>> from a specific origin
>>> * pg_replication_progress_reset_origin
>>> * pg_replication_progress_setup_tx_details : configure per transaction
>>> details (LSN and timestamp currently)
>>> * pg_replication_progress_is_replaying : Is a origin configured for the session
>>> * pg_replication_progress_advance : "manually" set the replication
>>> progress to a value. Primarily useful for copying values from other
>>> systems and such.
>>> * pg_replication_progress_get : How far did replay progress for a
>>> certain origin
>>> * pg_get_replication_progress : SRF returning the replay progress for
>>> all origin.
>>>
>>> Any comments?
>>
>> Why are we using functions for this rather than DDL?
>
> Unless I miss something the only two we really could use ddl for is
> pg_replication_origin_create/pg_replication_origin_drop. We could use
> DDL for them if we really want, but I'm not really seeing the advantage.
>

I think the only value of having DDL for this would be consistency
(catalog objects are created via DDL) as it looks like something that
will be called only by extensions and not users during normal operation.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)anarazel(dot)de>
To: Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: Replication identifiers, take 4
Date: 2015-04-23 12:29:34
Message-ID: 20150423122934.GD3055@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2015-03-24 22:22:29 +0100, Petr Jelinek wrote:
> Perhaps we should have some Logical replication developer documentation
> section and put all those three as subsections of that?

So I just played around with this and it didn't find it
worthwhile. Primarily because there's lots of uses of logical decoding
besides building a logical replication solution. I've reverted to
putting it into a separate chapter 'besides' logical decoding.

Greetings,

Andres Freund


From: Andres Freund <andres(at)anarazel(dot)de>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 4
Date: 2015-04-24 12:32:29
Message-ID: 20150424123229.GQ3055@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote:
> On 04/17/2015 11:54 AM, Andres Freund wrote:
> >I've attached a rebased patch, that adds decision about origin logging
> >to the relevant XLogInsert() callsites for "external" 2 byte identifiers
> >and removes the pad-reusing version in the interest of moving forward.
>
> Putting aside the 2 vs. 4 byte identifier issue, let's discuss naming:
>
> I just realized that it talks about "replication identifier" as the new
> fundamental concept. The system table is called "pg_replication_identifier".
> But that's like talking about "index identifiers", instead of just indexes,
> and calling the system table pg_index_oid.
>
> The important concept this patch actually adds is the *origin* of each
> transaction. That term is already used in some parts of the patch. I think
> we should roughly do a search-replace of "replication identifier" ->
> "replication origin" to the patch. Or even "transaction origin".

Attached is a patch that does this, and some more, renaming. That was
more work than I'd imagined. I've also made the internal naming in
origin.c more consistent/simpler and did a bunch of other cleanup.

I'm pretty happy with this state.

Greetings,

Andres Freund

Attachment Content-Type Size
0001-Introduce-replication-progress-tracking-infrastructu.patch text/x-patch 135.1 KB

From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 4
Date: 2015-04-24 13:09:55
Message-ID: 553A40A3.9090604@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 24/04/15 14:32, Andres Freund wrote:
> On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote:
>> On 04/17/2015 11:54 AM, Andres Freund wrote:
>>> I've attached a rebased patch, that adds decision about origin logging
>>> to the relevant XLogInsert() callsites for "external" 2 byte identifiers
>>> and removes the pad-reusing version in the interest of moving forward.
>>
>> Putting aside the 2 vs. 4 byte identifier issue, let's discuss naming:
>>
>> I just realized that it talks about "replication identifier" as the new
>> fundamental concept. The system table is called "pg_replication_identifier".
>> But that's like talking about "index identifiers", instead of just indexes,
>> and calling the system table pg_index_oid.
>>
>> The important concept this patch actually adds is the *origin* of each
>> transaction. That term is already used in some parts of the patch. I think
>> we should roughly do a search-replace of "replication identifier" ->
>> "replication origin" to the patch. Or even "transaction origin".
>
> Attached is a patch that does this, and some more, renaming. That was
> more work than I'd imagined. I've also made the internal naming in
> origin.c more consistent/simpler and did a bunch of other cleanup.
>

There are few oversights in the renaming:

doc/src/sgml/func.sgml:
+ Return the replay position for the passed in replication
+ identifier. The parameter <parameter>flush</parameter>

src/include/replication/origin.h:
+ * replication_identifier.h
----
+extern PGDLLIMPORT RepOriginId replident_sesssion_origin;
+extern PGDLLIMPORT XLogRecPtr replident_sesssion_origin_lsn;
+extern PGDLLIMPORT TimestampTz replident_sesssion_origin_timestamp;

(these are used then in multiple places in code afterwards and also
mentioned in comment above replorigin_advance)

src/backend/replication/logical/origin.c:
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_IN_USE),
+ errmsg("replication identiefer
----
+ default:
+ elog(PANIC, "replident_redo: unknown op code
----
+ * This function may only be called if a origin was setup with
+ * replident_session_setup().

I also think the "replident_checkpoint" file should be renamed to
"replorigin_checkpoint".

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 4
Date: 2015-04-24 20:26:23
Message-ID: 553AA6EF.70209@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 4/24/15 8:32 AM, Andres Freund wrote:
> On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote:
>> On 04/17/2015 11:54 AM, Andres Freund wrote:
>>> I've attached a rebased patch, that adds decision about origin logging
>>> to the relevant XLogInsert() callsites for "external" 2 byte identifiers
>>> and removes the pad-reusing version in the interest of moving forward.
>>
>> Putting aside the 2 vs. 4 byte identifier issue, let's discuss naming:
>>
>> I just realized that it talks about "replication identifier" as the new
>> fundamental concept. The system table is called "pg_replication_identifier".
>> But that's like talking about "index identifiers", instead of just indexes,
>> and calling the system table pg_index_oid.
>>
>> The important concept this patch actually adds is the *origin* of each
>> transaction. That term is already used in some parts of the patch. I think
>> we should roughly do a search-replace of "replication identifier" ->
>> "replication origin" to the patch. Or even "transaction origin".
>
> Attached is a patch that does this, and some more, renaming. That was
> more work than I'd imagined. I've also made the internal naming in
> origin.c more consistent/simpler and did a bunch of other cleanup.
>
> I'm pretty happy with this state.

Shouldn't this be backed up by pg_dump(all?)?


From: Andres Freund <andres(at)anarazel(dot)de>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>,Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 4
Date: 2015-04-24 20:29:55
Message-ID: 8E4505CD-C10A-4117-B174-6A028F5FFBC2@anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On April 24, 2015 10:26:23 PM GMT+02:00, Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
>On 4/24/15 8:32 AM, Andres Freund wrote:
>> On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote:
>>> On 04/17/2015 11:54 AM, Andres Freund wrote:
>>>> I've attached a rebased patch, that adds decision about origin
>logging
>>>> to the relevant XLogInsert() callsites for "external" 2 byte
>identifiers
>>>> and removes the pad-reusing version in the interest of moving
>forward.
>>>
>>> Putting aside the 2 vs. 4 byte identifier issue, let's discuss
>naming:
>>>
>>> I just realized that it talks about "replication identifier" as the
>new
>>> fundamental concept. The system table is called
>"pg_replication_identifier".
>>> But that's like talking about "index identifiers", instead of just
>indexes,
>>> and calling the system table pg_index_oid.
>>>
>>> The important concept this patch actually adds is the *origin* of
>each
>>> transaction. That term is already used in some parts of the patch. I
>think
>>> we should roughly do a search-replace of "replication identifier" ->
>>> "replication origin" to the patch. Or even "transaction origin".
>>
>> Attached is a patch that does this, and some more, renaming. That was
>> more work than I'd imagined. I've also made the internal naming in
>> origin.c more consistent/simpler and did a bunch of other cleanup.
>>
>> I'm pretty happy with this state.
>
>Shouldn't this be backed up by pg_dump(all?)?

Given it deals with LSNs and is, quite fundamentally due to concurrency, non transactional, I doubt it's worth it. The other side's slots also aren't going to be backed up as pg dump obviously can't know about then. So the represented data won't make much sense.

Andres

---
Please excuse brevity and formatting - I am writing this on my mobile phone.


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 4
Date: 2015-04-29 02:00:13
Message-ID: 55403B2D.8040401@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 4/24/15 4:29 PM, Andres Freund wrote:
>> Shouldn't this be backed up by pg_dump(all?)?
>
> Given it deals with LSNs and is, quite fundamentally due to concurrency, non transactional, I doubt it's worth it. The other side's slots also aren't going to be backed up as pg dump obviously can't know about then. So the represented data won't make much sense.

I agree it might not be the best match. But we should consider that we
used to say, a backup by pg_dumpall plus configuration files is a
complete backup. Now we have replication slots and possibly replication
identifiers and maybe similar features in the future that are not
covered by this backup method.


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 4
Date: 2015-04-29 20:03:41
Message-ID: CA+Tgmoa4pVWshbPYKUmgL7VmyBUnddmWw7bXxdLtcaFSuY3hiA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Apr 28, 2015 at 10:00 PM, Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
> On 4/24/15 4:29 PM, Andres Freund wrote:
>>> Shouldn't this be backed up by pg_dump(all?)?
>>
>> Given it deals with LSNs and is, quite fundamentally due to concurrency, non transactional, I doubt it's worth it. The other side's slots also aren't going to be backed up as pg dump obviously can't know about then. So the represented data won't make much sense.
>
> I agree it might not be the best match. But we should consider that we
> used to say, a backup by pg_dumpall plus configuration files is a
> complete backup. Now we have replication slots and possibly replication
> identifiers and maybe similar features in the future that are not
> covered by this backup method.

That's true. But if you did backup the replication slots with
pg_dump, and then you restored them as part of restoring the dump,
your replication setup would be just as broken as if you had never
backed up those replication slots at all.

I think the problem here is that replication slots are part of
*cluster* configuration, not individual node configuration. If you
back up a set of nodes that make up a cluster, and then restore them,
you might hope that you will end up with working slots established
between the same pairs of machines that had working slots between them
before. But I don't see a way to make that happen when you look at it
from the point of view of backing up and restoring just one node. As
we get better clustering facilities into core, we may develop more
instances of this problem; the best way of solving it is not clear to
me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 4
Date: 2015-04-29 20:12:14
Message-ID: CAApHDvqeJWJtY-=AP6biHn=RHmEUeFVQy6uCGV8yZTMokht84g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 25 April 2015 at 00:32, Andres Freund <andres(at)anarazel(dot)de> wrote:

>
> Attached is a patch that does this, and some more, renaming. That was
> more work than I'd imagined. I've also made the internal naming in
> origin.c more consistent/simpler and did a bunch of other cleanup.
>
>
Hi,

It looks like bowerbird is not too happy with this, neither is my build
environment.

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbird&dt=2015-04-29%2019%3A31%3A06

The attached seems to fix it.
I put the include in the origin.h rather than in the origin.c as the
DoNotReplicateId macro uses UINT16_MAX and is also used in xact.c.

Regards

David Rowley

Attachment Content-Type Size
UINT16_MAX_fix.patch application/octet-stream 382 bytes

From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 4
Date: 2015-04-29 20:14:36
Message-ID: 55413BAC.1000901@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 29/04/15 22:12, David Rowley wrote:
> On 25 April 2015 at 00:32, Andres Freund <andres(at)anarazel(dot)de
> <mailto:andres(at)anarazel(dot)de>> wrote:
>
>
> Attached is a patch that does this, and some more, renaming. That was
> more work than I'd imagined. I've also made the internal naming in
> origin.c more consistent/simpler and did a bunch of other cleanup.
>
>
> Hi,
>
> It looks like bowerbird is not too happy with this, neither is my build
> environment.
>
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbird&dt=2015-04-29%2019%3A31%3A06
>
> The attached seems to fix it.
> I put the include in the origin.h rather than in the origin.c as the
> DoNotReplicateId macro uses UINT16_MAX and is also used in xact.c.
>

I think correct fix is using PG_UINT16_MAX.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services