Replication Node Identifiers and crashsafe Apply Progress

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Steve Singer <steve(at)ssinger(dot)info>
Subject: Replication Node Identifiers and crashsafe Apply Progress
Date: 2013-11-14 17:26:32
Message-ID: 20131114172632.GE7522@alap2.anarazel.de
Lists: pgsql-hackers

Hello,

As you know, the reason we are working on changeset extraction is that we
want to build logical unidirectional and bidirectional replication on
top. To use changeset extraction effectively, I think one set of
related features on top is very useful:

When extracting changes using the changeset extraction patchset (latest
version at [1]) the START_LOGICAL_REPLICATION command is used to stream
changes from a source system. Once started, it will continue to send
changes as long as the connection is up, or until it is aborted. For obvious
performance reasons it will *not* wait for an ACK for each transaction
commit it streams out.
Instead it relies on the receiver, exactly as in physical replication,
sending feedback messages containing the LSN up to which data has safely
been received.
That means frequently something like:
walsender: => COMMIT 0/10000000
walsender: => COMMIT 0/10000200
walsender: => COMMIT 0/10000400
walsender: => COMMIT 0/10000600
receiver: <= ACKNOWLEDGE 0/10000270
walsender: => COMMIT 0/10000800
is possible and important for performance. I.e. the server has streamed
out more changes than it got confirmation for.
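
To make the contract concrete, here is a rough sketch (in C) of the
receiver side; every helper named here is hypothetical, the one invariant
being that only a locally durable LSN is ever acknowledged:

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* hypothetical helpers provided by the apply side */
extern XLogRecPtr apply_next_transaction(void);
extern bool flush_local_state(XLogRecPtr lsn);
extern bool feedback_due(void);
extern void send_feedback(XLogRecPtr durable_lsn);

static void
receive_loop(void)
{
    XLogRecPtr  durable_lsn = 0;

    for (;;)
    {
        /* replay one remote commit; returns its LSN on the source */
        XLogRecPtr  commit_lsn = apply_next_transaction();

        /* an ACK is only valid once the local effects are durable */
        if (flush_local_state(commit_lsn))
            durable_lsn = commit_lsn;

        /* feedback is sent periodically, not once per transaction,
         * which is why the sender can be ahead of the last ACK */
        if (feedback_due())
            send_feedback(durable_lsn);
    }
}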

So, when the replication connection goes down, e.g. because the
receiving side has crashed, we need to tell the server from where to
start. Every position between the last ACKed and the end of WAL is
legal.
The receiver then can ask the source to start replication from the last
replayed commit using START_LOGICAL_REPLICATION 'slot_name'
'0/10000600', which would then re-stream all the changes in the
transaction that committed at 0/10000600 and all that follow.

But for that the receiving side needs to know up to where changes have
been applied. One relatively easy solution for that is that the
receiving side does something like:
UPDATE replication_progress SET lsn = '0/10000600' WHERE source_id = ...;
before the end of every replayed transaction. But that obviously will
quickly cause bloat.

Our solution to that is that a replaying process can tell the backend
that it is currently doing so and set up three variables for every
transaction:
1) an identifier for the source database
2) the LSN at which the replayed transaction has committed remotely
3) the time at which the replayed transaction has committed remotely

When the transaction then commits, the commit record will set the
XACT_CONTAINS_ORIGIN flag in ->xinfo and will add that data to the end
of the commit record. During crash recovery the startup process will
remember the newest LSN for each remote database in shared memory.

This way, after a crash, restart, or disconnect, the replay process can
look into shared memory, check how far it has already replayed, and
restart seamlessly, with minimal effort.

We previously discussed the topic and some were very averse to using
any sort of numeric node identifiers across systems and suggested that
those should only be used internally. So what the attached patch does is
to add a new shared system catalog called 'pg_replication_identifier'
(suggestions for a better name welcome) which translates a number of
identifying traits into a numeric identifier.
The set of identifying traits currently is:
* the sysid of the remote system, combined with the remote TLI
* the oid of the local database
* the oid of the remote database
* an optional name
but that's just what we needed in our multimaster prototype, and not
what I necessarily think is correct.

The added API (which surely needs some work; I am not particularly happy
with the naming of the functions, for one) basically consists of two
parts (a usage sketch follows the list):
1) functions to query/create replication identifiers:
* GetReplicationIdentifier(identifying traits) - search for a numeric replication identifier
* CreateReplicationIdentifier(identifying traits) - creates a numeric replication identifier
* GetReplicationInfoByIdentifier(numeric identifier) - returns identifying traits

2) functions to query/manipulate replication progress:
* AdvanceReplicationIdentifier(node, local_lsn, remote_lsn)
* XLogRecPtr RemoteCommitFromReplicationIdentifier(node)
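
To make the intended flow concrete, here is a sketch of an apply process
using these functions across a reconnect. The function names and the
GetReplicationIdentifier() signature are the ones from this patch; the
RepNodeId typedef, the InvalidRepNodeId sentinel, and the receive/replay
step are assumptions for illustration:

#include "postgres.h"
#include "access/xlogdefs.h"    /* XLogRecPtr */

typedef uint16 RepNodeId;       /* assumed: small internal id */
#define InvalidRepNodeId 0      /* assumed sentinel for "not found" */

extern RepNodeId GetReplicationIdentifier(uint64 remotesysid, Oid remotetli,
                                          Oid remotedb, Name riname,
                                          Oid rilocaldb);
extern RepNodeId CreateReplicationIdentifier(uint64 remotesysid, Oid remotetli,
                                             Oid remotedb, Name riname,
                                             Oid rilocaldb);
extern void AdvanceReplicationIdentifier(RepNodeId node, XLogRecPtr local_lsn,
                                         XLogRecPtr remote_lsn);
extern XLogRecPtr RemoteCommitFromReplicationIdentifier(RepNodeId node);

static void
apply_loop(uint64 sysid, Oid tli, Oid remotedb, Name riname, Oid localdb)
{
    RepNodeId   node;
    XLogRecPtr  start_lsn;

    /* map the identifying traits to the internal id, creating it once */
    node = GetReplicationIdentifier(sysid, tli, remotedb, riname, localdb);
    if (node == InvalidRepNodeId)
        node = CreateReplicationIdentifier(sysid, tli, remotedb, riname,
                                           localdb);

    /*
     * The crash-safe restart point: the remote commit LSN of the last
     * durably applied transaction, rebuilt from commit records during
     * recovery.  This is what gets handed to START_LOGICAL_REPLICATION.
     */
    start_lsn = RemoteCommitFromReplicationIdentifier(node);

    for (;;)
    {
        XLogRecPtr  remote_commit_lsn = start_lsn;  /* placeholder */
        XLogRecPtr  local_lsn = start_lsn;          /* placeholder */

        /* ... receive and replay one remote transaction, filling in
         * remote_commit_lsn and local_lsn (not shown) ... */

        /*
         * Record progress before committing, so the commit record itself
         * carries the origin data (XACT_CONTAINS_ORIGIN) instead of a
         * bloat-producing heap update.
         */
        AdvanceReplicationIdentifier(node, local_lsn, remote_commit_lsn);
        /* CommitTransactionCommand(); */
    }
}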

Internally the code also maintains some on-disk data which is updated
during checkpoints to store the replication progress; otherwise it'd
vanish if we shut down gracefully ;).

The attached code also integrates with the "commit timestamp" module
that Alvaro submitted ([2]). Every time a remote transaction is committed
we store a) the remote commit's timestamp and b) the origin node id in it.
That makes it relatively easy to build multimaster systems with conflict
resolution on top, since whenever there's a conflict, the originating
node and originating commit timestamp for a row can be queried
efficiently (see the sketch below).
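
As a concrete - and entirely hypothetical - example, a last-update-wins
policy could look like this; the lookup function is an assumption about
what the commit timestamp module exposes, and the tie-break on node id
is merely one deterministic choice:

#include "postgres.h"
#include "utils/timestamp.h"

typedef uint16 RepNodeId;       /* as in the sketch above */

/* assumed: commit timestamp and origin node of a local transaction */
extern bool TransactionIdGetCommitTsData(TransactionId xid,
                                         TimestampTz *ts,
                                         RepNodeId *nodeid);

static bool
remote_update_wins(TransactionId local_xid,
                   TimestampTz remote_ts, RepNodeId remote_node)
{
    TimestampTz local_ts;
    RepNodeId   local_node;

    if (!TransactionIdGetCommitTsData(local_xid, &local_ts, &local_node))
        return true;            /* no origin info recorded: accept remote */

    if (remote_ts != local_ts)
        return remote_ts > local_ts;        /* newer commit wins */

    return remote_node > local_node;        /* deterministic tie-break */
}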

Having information about the origin of a change/transaction makes it
possible to implement complex replication topologies, since the
information is available to changeset extraction output plugins.
It allows writing plugins that (see the sketch after this list):
* decode all changes, independent of the system on which they were
originally executed by the user
* decode changes generated locally, but none from remote systems
* pick and choose between those, say, decode only those changes the
receiving system isn't replicating from itself
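
A sketch of such an origin filter; the callback shape and the idea of an
origin id attached to each change are assumptions about how an output
plugin would see this data:

#include "postgres.h"

typedef uint16 RepNodeId;
#define InvalidRepNodeId 0      /* assumed: "originated locally" */

/* decide whether to forward a change, e.g. to avoid replication loops */
static bool
forward_change_p(RepNodeId origin_id, RepNodeId receiver_id)
{
    if (origin_id == InvalidRepNodeId)
        return true;            /* local change: always forward */
    if (origin_id == receiver_id)
        return false;           /* the receiver already has this change */
    return true;                /* cascade changes from third-party nodes */
}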

Questions are:
* Which identifying traits do we want to use to identify nodes?
* How do we want to manipulate replication identifiers? Currently they
can only be manipulated by using C functions, which is fine for some
users, but probably not for others.
* Do we want to allow setting (remote_lsn, remote_timestamp,
remote_node_id) via SQL? Currently the remote_node_id can be set as a
GUC, but the others can't. They should probably be settable via a
function rather than GUCs.
* Suggestions for better names!
* Would slony et al need something on top to use this?

Todo:
* cleanup/naming
* Set returning function to see the replication progress
* remove old checkpoint files

Note that this only applies on top of a) the changeset extraction code
and b) the commit timestamp code. The 'replication-identifiers' git
branch ([3]) contains it all integrated together.

Comments welcome!

Greetings,

Andres Freund

[1] http://archives.postgresql.org/message-id/20131114134647.GA26172%40alap2.anarazel.de
[2] http://archives.postgresql.org/message-id/20131022221600.GE4987%40eldon.alvh.no-ip.org
[3] http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/replication-identifiers
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
0001-Introduce-replication_identifiers-to-keep-track-of-r.patch text/x-patch 51.9 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Steve Singer <steve(at)ssinger(dot)info>
Subject: Re: Replication Node Identifiers and crashsafe Apply Progress
Date: 2013-11-19 12:40:30
Message-ID: CA+TgmoaMAbzJiRoUPWOKYeyk5-615bxw-nrvtubWysVqM6-XCg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Nov 14, 2013 at 12:26 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> As you know, the reason we are working on changeset extraction is that we
> want to build logical unidirectional and bidirectional replication on
> top. To use changeset extraction effectively, I think one set of
> related features on top is very useful:
>
> When extracting changes using the changeset extraction patchset (latest
> version at [1]) the START_LOGICAL_REPLICATION command is used to stream
> changes from a source system. Once started, it will continue to send
> changes as long as the connection is up, or until it is aborted. For obvious
> performance reasons it will *not* wait for an ACK for each transaction
> commit it streams out.
> Instead it relies on the receiver, exactly as in physical replication,
> sending feedback messages containing the LSN up to which data has safely
> been received.
> That means frequently something like:
> walsender: => COMMIT 0/10000000
> walsender: => COMMIT 0/10000200
> walsender: => COMMIT 0/10000400
> walsender: => COMMIT 0/10000600
> receiver: <= ACKNOWLEDGE 0/10000270
> walsender: => COMMIT 0/10000800
> is possible and important for performance. I.e. the server has streamed
> out more changes than it got confirmation for.
>
> So, when the replication connection goes down, e.g. because the
> receiving side has crashed, we need to tell the server from where to
> start. Every position between the last ACKed and the end of WAL is
> legal.
> The receiver then can ask the source to start replication from the last
> replayed commit using START_LOGICAL_REPLICATION 'slot_name'
> '0/10000600', which would then re-stream all the changes in the
> transaction that committed at 0/10000600 and all that follow.
>
> But for that the receiving side needs to know up to where changes have
> been applied. One relatively easy solution for that is that the
> receiving side does something like:
> UPDATE replication_progress SET lsn = '0/10000600' WHERE source_id = ...;
> before the end of every replayed transaction. But that obviously will
> quickly cause bloat.
>
> Our solution to that is that a replaying process can tell the backend
> that it is currently doing so and set up three variables for every
> transaction:
> 1) an identifier for the source database
> 2) the LSN at which the replayed transaction has committed remotely
> 3) the time at which the replayed transaction has committed remotely
>
> When the transaction then commits, the commit record will set the
> XACT_CONTAINS_ORIGIN flag in ->xinfo and will add that data to the end
> of the commit record. During crash recovery the startup process will
> remember the newest LSN for each remote database in shared memory.
>
> This way, after a crash, restart, or disconnect, the replay process can
> look into shared memory, check how far it has already replayed, and
> restart seamlessly, with minimal effort.

It would be much less invasive for the replication apply code to fsync
its own state on the apply side. Obviously, that means doubling the
fsync rate, which is not appealing, but I think that's still a useful
way to think about what you're aiming to accomplish here: avoid
doubling the fsync rate when applying remote transactions in a
crash-safe manner.

Although I agree that we need a way to do that, I don't have a
particularly warm and fuzzy feeling about this particular proposal:
there are too many bits of it that feel like entirely arbitrary design
decisions. If we're going to build a full-fledged logical replication
solution into core, attempting to obsolete Slony and Bucardo and
Londiste and everything that's out there, then I think we have a great
deal of design work that we have to do before we start committing
things, or even finalizing designs. If we're going to continue with
the philosophy of building a toolkit that can serve as a building
block for multiple solutions, then color me unconvinced that this will
do the job.

If we made the xlog system truly extensible, that seems like it'd
punch your ticket here. I'm not sure how practical that is, though.

> We previously discussed the topic and some were very averse to using
> any sort of numeric node identifiers across systems and suggested that
> those should only be used internally. So what the attached patch does is
> to add a new shared system catalog called 'pg_replication_identifier'
> (suggestions for a better name welcome) which translates a number of
> identifying traits into a numeric identifier.
> The set of identifying traits currently is:
> * the sysid of the remote system, combined with the remote TLI
> * the oid of the local database
> * the oid of the remote database
> * an optional name
> but that's just what we needed in our multimaster prototype, and not
> what I necessarily think is correct.

The fact that you've included both local and remote database OIDs
seems wrong; shouldn't the replication identifier only serve to
identify the source node, not the replication stream? What if you
want to replicate from table A to table B within the same database?
(e.g. so that you can lock them for DDL changes in alternation) What
if you want to replicate tables A and B in one database into a
database on another node, but making the two of them independent
replication streams - e.g. because we know that transactions on table
A will be large but not latency-sensitive, but transactions on B will
be small but highly sensitive to replication delay? What if we want
to use changeset extraction to produce delta relations on the base
tables for a materialized view, feed them through a bunch of
relational algebra, and apply the resulting changes to the MV, keeping
track of how far we got?

We need some kind of pretty flexible system here, if we're not to box
ourselves into a corner.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Greg Stark <stark(at)mit(dot)edu>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Steve Singer <steve(at)ssinger(dot)info>
Subject: Re: Replication Node Identifiers and crashsafe Apply Progress
Date: 2013-11-19 14:16:04
Message-ID: CAM-w4HPEaQ0eCPLztryW=4dON0UCA_+YFnc5879TUeGk14ymDQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Nov 14, 2013 at 5:26 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:

> But for that the receiving side needs to know up to where changes have
> been applied. One relatively easy solution for that is that the
> receiving side does something like:
> UPDATE replication_progress SET lsn = '0/10000600' WHERE source_id = ...;
> before the end of every replayed transaction. But that obviously will
> quickly cause bloat.
>
> Our solution to that is that a replaying process can tell the backend
> that it is currently doing so and set up three variables for every
> transaction:
>

This is a pretty massive design decision to hinge on such a minor
implementation detail as table bloat (which I don't think would actually
be an issue anyway -- isn't that what we have HOT for?).

Fundamentally the question here is where to keep all the book-keeping state
about replicas, in a central repository in the master or locally in each
replica. At first blush it seems obvious to me that locally in each replica
is the more flexible choice.

Replication systems become complex when you start restoring from old
backups and not every node has the same view of the topology as every other
node. I fear what will happen to a central repository when you fail over
the master and it's out of sync with where the slaves have actually
restored up to. Or where you fail over a slave to a standby of the slave
and it needs to redo some of the logical replication to catch up. Or where
you restore all your nodes, both master and slaves from backups taken at
different points in time (presumably with the master ahead of the slaves).

Having a central repository makes the whole system simpler but it also
makes it much more fragile. It's nice to have a single place to go to find
out what the state of every replica is but it should do that by actually
asking the replicas, not by maintaining state that might be out of sync.

--
greg


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Steve Singer <steve(at)ssinger(dot)info>
Subject: Re: Replication Node Identifiers and crashsafe Apply Progress
Date: 2013-11-19 16:57:22
Message-ID: 20131119165722.GF19293@alap2.anarazel.de
Lists: pgsql-hackers

Hi,

On 2013-11-19 07:40:30 -0500, Robert Haas wrote:
> > This way, after a crash, restart, or disconnect, the replay process can
> > look into shared memory, check how far it has already replayed, and
> > restart seamlessly, with minimal effort.
>
> It would be much less invasive for the replication apply code to fsync
> its own state on the apply side. Obviously, that means doubling the
> fsync rate, which is not appealing, but I think that's still a useful
> way to think about what you're aiming to accomplish here: avoid
> doubling the fsync rate when applying remote transactions in a
> crash-safe manner.

Exactly.

> Although I agree that we need a way to do that, I don't have a
> particularly warm and fuzzy feeling about this particular proposal:
> there are too many bits of it that feel like entirely arbitrary design
> decisions. If we're going to build a full-fledged logical replication
> solution into core, attempting to obsolete Slony and Bucardo and
> Londiste and everything that's out there, then I think we have a great
> deal of design work that we have to do before we start committing
> things, or even finalizing designs. If we're going to continue with
> the philosophy of building a toolkit that can serve as a building
> block for multiple solutions, then color me unconvinced that this will
> do the job.

Imo we actually want and need both; wanting something built in doesn't
preclude important use cases that need to be served by other solutions.

I think - while the API certainly needs work - the general idea
integrates pretty well with the fairly generic changeset extraction
mechanism and with possible solutions for replication between postgres
servers.

Note that this really is a draft of what I think is needed, written
after the experience of developing a solution for the problem in a
specific replication solution and talking to some people implementing
replication solutions. Maybe somebody has a far better idea to implement
this: I am all ears!

> If we made the xlog system truly extensible, that seems like it'd
> punch your ticket here. I'm not sure how practical that is, though.

I don't think it is.

> > We previously discussed the topic and some were very averse to using
> > any sort of numeric node identifiers across systems and suggested that
> > those should only be used internally. So what the attached patch does is
> > to add a new shared system catalog called 'pg_replication_identifier'
> > (suggestions for a better name welcome) which translates a number of
> > identifying traits into a numeric identifier.
> > The set of identifying traits currently is:
> > * the sysid of the remote system, combined with the remote TLI
> > * the oid of the local database
> > * the oid of the remote database
> > * an optional name
> > but that's just what we needed in our multimaster prototype, and not
> > what I necessarily think is correct.
>
> The fact that you've included both local and remote database OIDs
> seems wrong; shouldn't the replication identifier only serve to
> identify the source node, not the replication stream? What if you
> want to replicate from table A to table B within the same database?

The reason I chose those parameters is that they avoid the need for a
human to assign identifiers in many situations, since they are already
unique. For the cases where they aren't, I've included the "name" to
distinguish several streams.

The reason both source and target database are included is that it
avoids manual work if you want to replicate between two databases in
both directions.

> We need some kind of pretty flexible system here, if we're not to box
> ourselves into a corner.

Agreed. As an alternative we could just have a single - probably longer
than NAMEDATALEN - string to identify replication progress and rely on
the users of the facility to build the identifier automatically
themselves using components that are helpful in their system.

Thanks,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Steve Singer <steve(at)ssinger(dot)info>
Subject: Re: Replication Node Identifiers and crashsafe Apply Progress
Date: 2013-11-19 17:47:29
Message-ID: CA+TgmoZDrV3kWGXzC5=4NwSpjWWj5UrSEqJpG0bX0nASh1p+7Q@mail.gmail.com
Lists: pgsql-hackers

On Tue, Nov 19, 2013 at 11:57 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> Agreed. As an alternative we could just have a single - probably longer
> than NAMEDATALEN - string to identify replication progress and rely on
> the users of the facility to build the identifier automatically
> themselves using components that are helpful in their system.

I tend to feel like a generic identifier would be better. I'm not
sure why something like a UUID wouldn't be enough, though.
Arbitrary-length identifiers will be bad for performance, and 128 bits
ought to be enough for anyone.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Steve Singer <steve(at)ssinger(dot)info>
Subject: Re: Replication Node Identifiers and crashsafe Apply Progress
Date: 2013-11-19 18:20:06
Message-ID: 20131119182006.GA7240@alap2.anarazel.de
Lists: pgsql-hackers

On 2013-11-19 12:47:29 -0500, Robert Haas wrote:
> On Tue, Nov 19, 2013 at 11:57 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > Agreed. As an alternative we could just have a single - probably longer
> > than NAMEDATALEN - string to identify replication progress and rely on
> > the users of the facility to build the identifier automatically
> > themselves using components that are helpful in their system.
>
> I tend to feel like a generic identifier would be better. I'm not
> sure why something like a UUID wouldn't be enough, though.
> Arbitrary-length identifiers will be bad for performance, and 128 bits
> ought to be enough for anyone.

That's what I had suggested to some people originally and the response
was, well, somewhat unenthusiastic. It's not that easy to assign them in
a meaningful automated manner. How do you automatically assign a pg
cluster an id?
But yes, maybe the answer is to balk on that part, let the users figure
out what's best, and then only later implement more policy based on that
experience.

WRT performance: I agree that fixed-width identifiers are more
performant, that's why I went for them, but I am not sure it's that
important. The performance sensitive parts should all be done using the
internal id the identifier maps to, not the public one.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Steve Singer <steve(at)ssinger(dot)info>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Replication Node Identifiers and crashsafe Apply Progress
Date: 2013-11-19 23:49:27
Message-ID: BLU0-SMTP40787261A4CCD7A35F1028DCE70@phx.gbl
Lists: pgsql-hackers

On 11/14/2013 12:26 PM, Andres Freund wrote:
> Hello,
>
> As you know, the reason we are working on changeset extraction is that we
> want to build logical unidirectional and bidirectional replication on
> top. To use changeset extraction effectively, I think one set of
> related features on top is very useful:
>
> When extracting changes using the changeset extraction patchset (latest
> version at [1]) the START_LOGICAL_REPLICATION command is used to stream
> changes from a source system. Once started, it will continue to send
> changes as long as the connection is up, or until it is aborted. For obvious
> performance reasons it will *not* wait for an ACK for each transaction
> commit it streams out.
> Instead it relies on the receiver, exactly as in physical replication,
> sending feedback messages containing the LSN up to which data has safely
> been received.
> That means frequently something like:
> walsender: => COMMIT 0/10000000
> walsender: => COMMIT 0/10000200
> walsender: => COMMIT 0/10000400
> walsender: => COMMIT 0/10000600
> receiver: <= ACKNOWLEDGE 0/10000270
> walsender: => COMMIT 0/10000800
> is possible and important for performance. I.e. the server has streamed
> out more changes than it got confirmation for.
>
> So, when the replication connection goes down, e.g. because the
> receiving side has crashed, we need to tell the server from where to
> start. Every position between the last ACKed and the end of WAL is
> legal.
> The receiver then can ask the source to start replication from the last
> replayed commit using START_LOGICAL_REPLICATION 'slot_name'
> '0/10000600', which would then re-stream all the changes in the
> transaction that committed at 0/10000600 and all that follow.
>
> But for that the receiving side needs to know up to where changes have
> been applied. One relatively easy solution for that is that the
> receiving side does something like:
> UPDATE replication_progress SET lsn = '0/10000600' WHERE source_id = ...;
> before the end of every replayed transaction. But that obviously will
> quickly cause bloat.

I don't see how this is going to cause any more bloat than what
trigger-based slony does today with sl_confirm, and I don't hear a lot of
complaints about that being a big problem. This might be because slony
doesn't do a commit on the replica for every transaction but groups the
transactions together; logical slony will behave the same way, where we
would only commit on SYNC transactions.

> Our solution to that is that a replaying process can tell the backend
> that it is currently doing so and set up three variables for every
> transaction:
> 1) an identifier for the source database
> 2) the LSN at which the replayed transaction has committed remotely
> 3) the time at which the replayed transaction has committed remotely
>
> When the transaction then commits, the commit record will set the
> XACT_CONTAINS_ORIGIN flag in ->xinfo and will add that data to the end
> of the commit record. During crash recovery the startup process will
> remember the newest LSN for each remote database in shared memory.
>
> This way, after a crash, restart, or disconnect, the replay process can
> look into shared memory, check how far it has already replayed, and
> restart seamlessly, with minimal effort.
>
> We previously discussed the topic and some were very averse to using
> any sort of numeric node identifiers across systems and suggested that
> those should only be used internally. So what the attached patch does is
> to add a new shared system catalog called 'pg_replication_identifier'
> (suggestions for a better name welcome) which translates a number of
> identifying traits into a numeric identifier.
> The set of identifying traits currently is:
> * the sysid of the remote system, combined with the remote TLI
> * the oid of the local database
> * the oid of the remote database
> * an optional name
> but that's just what we needed in our multimaster prototype, and not
> what I necessarily think is correct.
>
> The added API (which surely needs some work; I am not particularly happy
> with the naming of the functions, for one) basically consists of two parts:
> 1) functions to query/create replication identifiers:
> * GetReplicationIdentifier(identifying traits) - search for a numeric replication identifier
> * CreateReplicationIdentifier(identifying traits) - creates a numeric replication identifier
> * GetReplicationInfoByIdentifier(numeric identifier) - returns identifying traits
>
> 2) functions to query/manipulate replication progress:
> * AdvanceReplicationIdentifier(node, local_lsn, remote_lsn)
> * XLogRecPtr RemoteCommitFromReplicationIdentifier(node)
>
> Internally the code also maintains some on-disk data which is updated
> during checkpoints to store the replication progress; otherwise it'd
> vanish if we shut down gracefully ;).
>
> The attached code also integrates with the "commit timestamp" module
> that Alvaro submitted ([2]). Every time a remote transaction is committed
> we store a) the remote commit's timestamp and b) the origin node id in it.
> That makes it relatively easy to build multimaster systems with conflict
> resolution on top, since whenever there's a conflict, the originating
> node and originating commit timestamp for a row can be queried
> efficiently.
>
> Having information about the origin of a change/transaction makes it
> possible to implement complex replication topologies, since the
> information is available to changeset extraction output plugins.
> It allows writing plugins that:
> * decode all changes, independent of the system on which they were
> originally executed by the user
> * decode changes generated locally, but none from remote systems
> * pick and choose between those, say, decode only those changes the
> receiving system isn't replicating from itself
>
> Questions are:
> * Which identifying traits do we want to use to identify nodes?
> * How do we want to manipulate replication identifiers? Currently they
> can only be manipulated by using C functions, which is fine for some
> users, but probably not for others.
> * Do we want to allow setting (remote_lsn, remote_timestamp,
> remote_node_id) via SQL? Currently the remote_node_id can be set as a
> GUC, but the others can't. They should probably be settable via a
> function rather than GUCs.

A way of advancing the replication pointer via SQL would be nice;
otherwise I'll just have to write my own C function that I invoke via
SQL (which isn't hard, but everyone would need to do the same).

> * Suggestions for better names!
> * Would slony et al need something on top to use this?

In the slony world we identify nodes with a 32-bit integer. I think
the idea is that I'm going to have to pass arguments into

extern RepNodeId GetReplicationIdentifier(uint64 remotesysid, Oid remotetli,
                                          Oid remotedb, Name riname,
                                          Oid rilocaldb);

to map my slony concept of a node id to a 16-bit "node id" that is only
useful on the local system. In addition to a slony 32-bit node id I have
a conninfo that I can use to contact that node. I think the slon would
need to connect to the remote node with that conninfo (which it does
anyway), get the remote oids, and then use the slony node_id converted
to a string as the "riname" value (sketched below).
I would then have to invoke AdvanceReplicationIdentifier at some point
before I issue the commit.
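
Something like this is what I have in mind; the GetReplicationIdentifier()
signature is the one quoted above, while the RepNodeId typedef and the
name format are guesses on my part:

#include "postgres.h"

typedef uint16 RepNodeId;

extern RepNodeId GetReplicationIdentifier(uint64 remotesysid, Oid remotetli,
                                          Oid remotedb, Name riname,
                                          Oid rilocaldb);

static RepNodeId
slony_node_to_repid(int32 slony_node_id, uint64 remote_sysid,
                    Oid remote_tli, Oid remote_db, Oid local_db)
{
    NameData    riname;

    /* the stringified slony node id disambiguates the stream */
    snprintf(NameStr(riname), NAMEDATALEN, "slony:%d", slony_node_id);

    return GetReplicationIdentifier(remote_sysid, remote_tli, remote_db,
                                    &riname, local_db);
}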

What does building up the node_id key from (sysid, tlid, remote_dbid,
local_dbid, name) get us over just mapping from an arbitrary name field
to a 16-bit node_id?
I agree with the other comments on the thread that letting the
replication system figure out its own unique naming is better. If we
were going to come up with a schema, then I am also not sure that using
the remote TLI as part of the node key is a good idea.

> Todo:
> * cleanup/naming
> * Set returning function to see the replication progress
> * remove old checkpoint files
>
> Note that this only applies on top of a) the changeset extraction code
> and b) the commit timestamp code. The 'replication-identifiers' git
> branch ([3]) contains it all integrated together.
>
> Comments welcome!
>
> Greetings,
>
> Andres Freund
>
> [1] http://archives.postgresql.org/message-id/20131114134647.GA26172%40alap2.anarazel.de
> [2] http://archives.postgresql.org/message-id/20131022221600.GE4987%40eldon.alvh.no-ip.org
> [3] http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/replication-identifiers
>
>


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Steve Singer <steve(at)ssinger(dot)info>
Subject: Re: Replication Node Identifiers and crashsafe Apply Progress
Date: 2013-11-20 20:05:17
Message-ID: CA+TgmoY0noN36JQkHNR86gq5Hr=9g=xGF8ckwMvV1xEopTAt=w@mail.gmail.com
Lists: pgsql-hackers

On Tue, Nov 19, 2013 at 1:20 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2013-11-19 12:47:29 -0500, Robert Haas wrote:
>> On Tue, Nov 19, 2013 at 11:57 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> > Agreed. As an alternative we could just have a single - probably longer
>> > than NAMEDATALEN - string to identify replication progress and rely on
>> > the users of the facility to build the identifier automatically
>> > themselves using components that are helpful in their system.
>>
>> I tend to feel like a generic identifier would be better. I'm not
>> sure why something like a UUID wouldn't be enough, though.
>> Arbitrary-length identifiers will be bad for performance, and 128 bits
>> ought to be enough for anyone.
>
> That's what I had suggested to some people originally and the response
> was, well, somewhat unenthusiastic. It's not that easy to assign them in
> a meaningful automated manner. How do you automatically assign a pg
> cluster an id?

/dev/urandom

> But yes, maybe the answer is to balk on that part, let the users figure
> out what's best, and then only later implement more policy based on that
> experience.
>
> WRT performance: I agree that fixed-width identifiers are more
> performant, that's why I went for them, but I am not sure it's that
> important. The performance sensitive parts should all be done using the
> internal id the identifier maps to, not the public one.

But I thought the internal identifier was exactly what we're creating.

I think we should also take note of Steve Singer's comments. Perhaps
this entire endeavor is premature.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Steve Singer <steve(at)ssinger(dot)info>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Replication Node Identifiers and crashsafe Apply Progress
Date: 2013-11-21 11:04:55
Message-ID: 20131121110455.GH7240@alap2.anarazel.de
Lists: pgsql-hackers

Hi,

On 2013-11-19 18:49:27 -0500, Steve Singer wrote:
> >But for that the receiving side needs to know up to where changes have
> >been applied. One relatively easy solution for that is that the
> >receiving side does something like:
> >UPDATE replication_progress SET lsn = '0/10000600' WHERE source_id = ...;
> >before the end of every replayed transaction. But that obviously will
> >quickly cause bloat.
>
> I don't see how this is going to cause any more bloat than what
> trigger-based slony does today with sl_confirm, and I don't hear a lot of
> complaints about that being a big problem.

FWIW, bloat on slony's tables (including sl_confirm) is one of the major
reasons I've seen people move away from slony for production and use it
only for upgrades.
It's only really a problem if you have long-running transactions on the
standby, but that's a pretty major use case of having replicas.

> This might be because slony doesn't do a commit on the replica for
> every transaction but groups the transactions together; logical slony
> will behave the same way, where we would only commit on SYNC
> transactions.

But yes, the grouping of transactions certainly makes for a major
difference. I don't think we want to force solutions to commit
transactions in batches, not least because that obviously prohibits
using a standby as a synchronous replica.

> > * Do we want to allow setting (remote_lsn, remote_timestamp,
> > remote_node_id) via SQL? Currently the remote_node_id can be set as a
> > GUC, but the others can't. They should probably be settable via a
> > function rather than GUCs.
>
> A way of advancing the replication pointer via SQL would be nice;
> otherwise I'll just have to write my own C function that I invoke via
> SQL (which isn't hard, but everyone would need to do the same).

But don't you already essentially perform the actual inserts via C in
newer versions of slony? That's mainly the reason I wasn't sure it's needed.

But then, providing a function to do that setup isn't hard.

> What does building up the node_id key from (sysid, tlid, remote_dbid,
> local_dbid, name) get us over just mapping from an arbitrary name
> field to a 16-bit node_id?

It avoids the need to manually assign ids to systems in many cases. I've
seen people complain about that a fair bit.

But it seems pretty clear that a more arbitrary identifier is preferred
so far, so I'll go for that.

Thanks for the comments,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Steve Singer <steve(at)ssinger(dot)info>
Subject: Re: Replication Node Identifiers and crashsafe Apply Progress
Date: 2013-11-21 11:12:27
Message-ID: 20131121111227.GI7240@alap2.anarazel.de
Lists: pgsql-hackers

On 2013-11-19 14:16:04 +0000, Greg Stark wrote:
> On Thu, Nov 14, 2013 at 5:26 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>
> > But for that the receiving side needs to know up to where changes have
> > been applied. One relatively easy solution for that is that the
> > receiving side does something like:
> > UPDATE replication_progress SET lsn = '0/10000600' WHERE source_id = ...;
> > before the end of every replayed transaction. But that obviously will
> > quickly cause bloat.
> >
> > Our solution to that is that a replaying process can tell the backend
> > that it is currently doing so and set up three variables for every
> > transaction:
> >
>
> This is a pretty massive design decision to hinge on such a minor
> implementation detail as table bloat (which I don't think would actually
> be an issue anyway -- isn't that what we have HOT for?).

Not sure what HOT is going to help with? Even with HOT we can only
remove tuples that are invisible to everyone. If there are long-running
queries on the standby - and running analytics on standbys is a rather
frequent use case - that won't be the case for a long, long time.

> Fundamentally the question here is where to keep all the book-keeping state
> about replicas, in a central repository in the master or locally in each
> replica. At first blush it seems obvious to me that locally in each replica
> is the more flexible choice.

This really is about storing the state of apply on each replica
efficiently.

Imagine the standby just received data for a transaction x and has
replayed it locally. If it crashes in that moment, it needs to know
whether that transaction has successfully committed or not. And that has
to work even if the commit succeeded internally but hasn't yet returned
success!
So, what this provides is a facility to say 'hey, this local transaction
was at X on the source Y' and a way to get the last X for each Y at the
end of crash recovery.
Then the replication solution can restart replication from X onwards for
each Y.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Steve Singer <steve(at)ssinger(dot)info>
Subject: Re: Replication Node Identifiers and crashsafe Apply Progress
Date: 2013-11-21 11:15:41
Message-ID: 20131121111541.GJ7240@alap2.anarazel.de
Lists: pgsql-hackers

On 2013-11-20 15:05:17 -0500, Robert Haas wrote:
> > That's what I had suggested to some people originally and the response
> > was, well, somewhat unenthusiastic. It's not that easy to assign them in
> > a meaningful automated manner. How do you automatically assign a pg
> > cluster an id?
>
> /dev/urandom

Well yes. But then you need a way to store and change that random id for
each cluster.

Anyway, the preference is clear, so I am going to go for that in v2. I
am not sure about the type of the public identifier yet; I'll think a
bit about it.

> > But yes, maybe the answer is to balk on that part, let the users figure
> > out what's best, and then only later implement more policy based on that
> > experience.
> >
> > WRT performance: I agree that fixed-width identifiers are more
> > performant, that's why I went for them, but I am not sure it's that
> > important. The performance sensitive parts should all be done using the
> > internal id the identifier maps to, not the public one.
>
> But I thought the internal identifier was exactly what we're creating.

Sure. But how often are we a) going to create such an identifier and
b) look it up? Hopefully both will be rather infrequent operations.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Steve Singer <steve(at)ssinger(dot)info>
Subject: Re: Replication Node Identifiers and crashsafe Apply Progress
Date: 2013-11-21 13:22:05
Message-ID: CA+TgmoZWjfqdrDYTDH=oU0yh+mOuqiv3GORtraQrPPft79_Ykg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Nov 21, 2013 at 6:15 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> > WRT performance: I agree that fixed-width identifiers are more
>> > performant, that's why I went for them, but I am not sure it's that
>> > important. The performance sensitive parts should all be done using the
>> > internal id the identifier maps to, not the public one.
>>
>> But I thought the internal identifier was exactly what we're creating.
>
> Sure. But how often are we a) going to create such an identifier and
> b) look it up?

Never. Make that the replication solution's problem. Make the core
support deal only with UUIDs or pairs of 64-bit integers or something
like that, and let the replication solution decide what they mean.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Steve Singer <steve(at)ssinger(dot)info>
Subject: Re: Replication Node Identifiers and crashsafe Apply Progress
Date: 2013-11-21 13:26:02
Message-ID: 20131121132602.GA2592@awork2.anarazel.de
Lists: pgsql-hackers

On 2013-11-21 08:22:05 -0500, Robert Haas wrote:
> On Thu, Nov 21, 2013 at 6:15 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> >> > WRT performance: I agree that fixed-width identifiers are more
> >> > performant, that's why I went for them, but I am not sure it's that
> >> > important. The performance sensitive parts should all be done using the
> >> > internal id the identifier maps to, not the public one.
> >>
> >> But I thought the internal identifier was exactly what we're creating.
> >
> > Sure. But how often are we a) going to create such an identifier and
> > b) look it up?
>
> Never. Make that the replication solution's problem. Make the core
> support deal only with UUIDs or pairs of 64-bit integers or something
> like that, and let the replication solution decide what they mean.

I think we're misunderstanding each other. I was commenting on your fear
that strings longer than NAMEDATALEN or something would be bad for
performance - which I don't think is very relevant, because the lookups
from "public" to "internal" identifier shouldn't be in any
performance-critical path.

I personally would prefer a string because it'd allow me to build an
identifier, using the criteria I'd originally outlined, outside of this
infrastructure.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Steve Singer <steve(at)ssinger(dot)info>
Subject: Re: Replication Node Identifiers and crashsafe Apply Progress
Date: 2013-11-21 22:03:52
Message-ID: 528E8348.8060203@gmx.net
Lists: pgsql-hackers

On 11/21/13, 6:15 AM, Andres Freund wrote:
> On 2013-11-20 15:05:17 -0500, Robert Haas wrote:
>>> That's what I had suggested to some people originally and the response
>>> was, well, somewhat unenthusiastic. It's not that easy to assign them in
>>> a meaningful automated manner. How do you automatically assign a pg
>>> cluster an id?
>>
>> /dev/urandom
>
> Well yes. But then you need a way to store and change that random id for
> each cluster.

You can use a v3 UUID, which is globally reproducible.
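
For example, derived deterministically from traits the cluster already
has. A sketch, assuming a libuuid that provides uuid_generate_md5(); the
namespace and the name format are illustrative choices:

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <uuid/uuid.h>

static void
cluster_uuid(uint64_t sysid, uint32_t dboid, uuid_t out)
{
    uuid_t  ns;
    char    name[64];

    /* any fixed namespace works; here the standard DNS namespace UUID */
    uuid_parse("6ba7b810-9dad-11d1-80b4-00c04fd430c8", ns);

    snprintf(name, sizeof(name), "%llu/%u",
             (unsigned long long) sysid, (unsigned) dboid);

    /* the same (sysid, dboid) always yields the same UUID */
    uuid_generate_md5(out, ns, name, strlen(name));
}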


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Steve Singer <steve(at)ssinger(dot)info>
Subject: Re: Replication Node Identifiers and crashsafe Apply Progress
Date: 2013-11-22 19:43:15
Message-ID: CA+TgmoZ_duATCo3givHmZ0W95O4Vg+oW1i+=uPyyizeqeWyh6g@mail.gmail.com
Lists: pgsql-hackers

On Thu, Nov 21, 2013 at 8:26 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2013-11-21 08:22:05 -0500, Robert Haas wrote:
>> On Thu, Nov 21, 2013 at 6:15 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> >> > WRT performance: I agree that fixed-width identifiers are more
>> >> > performant, that's why I went for them, but I am not sure it's that
>> >> > important. The performance sensitive parts should all be done using the
>> >> > internal id the identifier maps to, not the public one.
>> >>
>> >> But I thought the internal identifier was exactly what we're creating.
>> >
>> > Sure. But how often are we a) going to create such an identifier and
>> > b) look it up?
>>
>> Never. Make that the replication solution's problem. Make the core
>> support deal only with UUIDs or pairs of 64-bit integers or something
>> like that, and let the replication solution decide what they mean.
>
> I think we're misunderstanding each other. I was commenting on your fear
> that strings longer than NAMEDATALEN or something would be bad for
> performance - which I don't think is very relevant, because the lookups
> from "public" to "internal" identifier shouldn't be in any
> performance-critical path.
>
> I personally would prefer a string because it'd allow me to build an
> identifier, using the criteria I'd originally outlined, outside of this
> infrastructure.

Yeah, there's some confusion here. I don't care at all about the
performance characteristics of long strings here, because we shouldn't
be using them anywhere in the core code. What I do care about is
making sure that whatever core support we use here is agnostic to how
the internal identifiers - relatively short bit strings - are
generated. The patch as proposed puts forward a particular way of
doing that, and I think that neither that method *nor any other*
should be part of core.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Steve Singer <steve(at)ssinger(dot)info>
Subject: Re: Replication Node Identifiers and crashsafe Apply Progress
Date: 2013-11-22 19:55:21
Message-ID: 20131122195521.GD17400@alap2.anarazel.de
Lists: pgsql-hackers

On 2013-11-22 14:43:15 -0500, Robert Haas wrote:
> The patch as proposed puts forward a particular way of
> doing that, and I think that neither that method *nor any other*
> should be part of core.

Working on something like that; I've updated the patch state to "waiting
on author".

Thanks,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services