Logical replication existing data copy

From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Subject: Logical replication existing data copy
Date: 2016-12-19 09:30:07
Message-ID: 3caaecb9-6c1e-7824-4004-97b89b449096@2ndquadrant.com
Lists: pgsql-hackers

Hi,

as the logical replication patch is progressing closer to integration, it
seems to be a good time to revisit what to do about preexisting data.

This patch implements data synchronization for logical replication.
It works both for the initial setup of a subscription and for tables
added to replication later, when the subscription is already active.

There is a new catalog, pg_subscription_rel, which tracks the
synchronization/replication state of each table that is subscribed via
publications.

From the user's perspective, it adds some additional options/clauses to
CREATE/ALTER SUBSCRIPTION, namely:
- CREATE SUBSCRIPTION ... [ WITH (COPY DATA | NOCOPY DATA) ], where the user
can specify whether the initial data should be copied or not, with COPY
being the default.
- ALTER SUBSCRIPTION ... REFRESH PUBLICATION [ WITH (COPY DATA | NOCOPY
DATA) ], which updates the local info about tables in the subscribed
publication, again with the option to either copy or not copy the data.
- The same WITH (COPY DATA | NOCOPY DATA) clause was also added to ALTER
SUBSCRIPTION ... SET PUBLICATION, which automatically triggers the
REFRESH PUBLICATION as well.
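
For example (object names and the connection string below are made up), the
proposed syntax looks like this:

  -- create the subscription, but skip copying the preexisting data
  CREATE SUBSCRIPTION mysub
    CONNECTION 'host=publisher dbname=postgres'
    PUBLICATION mypub
    WITH (NOCOPY DATA);

  -- later, pick up newly published tables; copying their data is the default
  ALTER SUBSCRIPTION mysub REFRESH PUBLICATION WITH (COPY DATA);

  -- switching publications triggers the refresh automatically
  ALTER SUBSCRIPTION mysub SET PUBLICATION otherpub WITH (NOCOPY DATA);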

So existing table data can be copied once a subscription is created, but
new tables can also be added and their data copied. This is where
REFRESH PUBLICATION comes into play. Adding a table to a publication does
not automatically make it replicated by the subscription, as the
subscription does not have tracking info for that table. So to add a new
table, the user must run ALTER SUBSCRIPTION ... REFRESH PUBLICATION on the
subscriber, otherwise its data won't be replicated.
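
For instance, to start replicating a table that gets added later (names
again made up; the table must already exist on the subscriber):

  -- on the publisher
  ALTER PUBLICATION mypub ADD TABLE newtab;

  -- on the subscriber, pull in the new table and copy its data
  ALTER SUBSCRIPTION mysub REFRESH PUBLICATION;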

The copy process runs in parallel with normal replication and is
visible in the pg_stat_subscription view. There is also a new GUC,
max_subscription_sync_workers, which specifies how many tables can be
synchronized in parallel.
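
For example, on the subscriber (the exact columns of the view may of course
still change):

  -- check the running apply/sync workers for all subscriptions
  SELECT * FROM pg_stat_subscription;

  -- and in postgresql.conf, e.g.:
  --   max_subscription_sync_workers = 2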

Implementation-wise, this adds a modified apply process called sync
(tablesync.c) which at the beginning creates a temporary slot, copies the
existing data, then syncs up to the correct LSN with the main apply and
exits (this is actually quite complex, but it's explained in the
tablesync.c header).

Standard COPY TO/FROM is used. This is done by enhancing COPY FROM to
accept data from a callback function, which tablesync implements.

On the publisher side, the walsender is enhanced to accept standard SQL
(simple query protocol only) so that COPY can be run. This also helps with
getting info about tables using plain SELECT. The way this works is that
the replication command parser was changed to accept strings that are not
valid replication protocol commands; if such a string arrives, it is handed
to exec_simple_query instead of the walsender command processing.
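
So over a single replication connection the sync process can send statements
roughly like these (the queries are simplified here, and the table name is
made up):

  -- get info about the table to be copied, via plain SELECT
  SELECT attname, atttypid
    FROM pg_attribute
   WHERE attrelid = 'public.mytable'::regclass
     AND attnum > 0 AND NOT attisdropped;

  -- and then stream out the existing data
  COPY public.mytable TO STDOUT;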

The attached patch applies on top of the logical replication patch set v14.
I will add this as a separate entry to the CF.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
0001-Logical-replication-support-for-initial-data-copy.patch text/x-diff 141.5 KB

From: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical replication existing data copy
Date: 2016-12-19 17:25:50
Message-ID: 961aec4f-6433-d9b3-7db4-64ade6e2c818@2ndquadrant.com
Lists: pgsql-hackers

On 12/19/16 4:30 AM, Petr Jelinek wrote:
> So existing table data can be copied once a subscription is created, but
> new tables can also be added and their data copied. This is where
> REFRESH PUBLICATION comes into play. Adding a table to a publication does
> not automatically make it replicated by the subscription, as the
> subscription does not have tracking info for that table. So to add a new
> table, the user must run ALTER SUBSCRIPTION ... REFRESH PUBLICATION on the
> subscriber, otherwise its data won't be replicated.

Couldn't the subscriber automatically add tracking info when apply
stream data arrives for a relation it has not seen before?

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Craig Ringer <craig(dot)ringer(at)2ndquadrant(dot)com>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Subject: Re: Logical replication existing data copy
Date: 2016-12-20 00:46:24
Message-ID: CAMsr+YEUryigUb4ZaFv-WtKBo1Vhumff3Qm-PBtWz3UY7FHEMw@mail.gmail.com
Lists: pgsql-hackers

On 20 Dec. 2016 1:27 am, "Peter Eisentraut" <peter(dot)eisentraut(at)2ndquadrant(dot)com> wrote:

On 12/19/16 4:30 AM, Petr Jelinek wrote:
> So existing table data can be copied once a subscription is created, but
> new tables can also be added and their data copied. This is where
> REFRESH PUBLICATION comes into play. Adding a table to a publication does
> not automatically make it replicated by the subscription, as the
> subscription does not have tracking info for that table. So to add a new
> table, the user must run ALTER SUBSCRIPTION ... REFRESH PUBLICATION on the
> subscriber, otherwise its data won't be replicated.

Couldn't the subscriber automatically add tracking info when apply
stream data arrives for a relation it has not seen before?

If no table has been created by the user and we start trying to apply a
data stream, apply will break.

Since manual action is needed to create the destination, I don't see a
problem with requiring manual enabling too, personally.

Let the fully transparent way wait until we can do DDL replication in v11+.


From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical replication existing data copy
Date: 2016-12-20 01:08:09
Message-ID: aaf025c4-8450-6bc3-f124-68156c7bc2ec@2ndquadrant.com
Lists: pgsql-hackers

On 19/12/16 18:25, Peter Eisentraut wrote:
> On 12/19/16 4:30 AM, Petr Jelinek wrote:
>> So existing table data can be copied once a subscription is created, but
>> new tables can also be added and their data copied. This is where
>> REFRESH PUBLICATION comes into play. Adding a table to a publication does
>> not automatically make it replicated by the subscription, as the
>> subscription does not have tracking info for that table. So to add a new
>> table, the user must run ALTER SUBSCRIPTION ... REFRESH PUBLICATION on the
>> subscriber, otherwise its data won't be replicated.
>
> Couldn't the subscriber automatically add tracking info when apply
> stream data arrives for a relation it has not seen before?
>

Sure, but it has many caveats:
- what if the table does not exist
- what if it exists and already has data
- what if the table is rarely written to

We can't control any of that until we have DDL replication/automatic
structure dumping. Once we have those, we can add options to control the
default behavior per subscriber, but with the current feature set, anything
that does not require user action will behave non-deterministically,
which is usually confusing.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical replication existing data copy
Date: 2017-01-11 16:10:05
Message-ID: 88c743e3-2c16-1eaf-bb35-ef60afb16f52@2ndquadrant.com
Lists: pgsql-hackers

On 12/19/16 4:30 AM, Petr Jelinek wrote:
> This patch implements data synchronization for logical replication.
> It works both for the initial setup of a subscription and for tables
> added to replication later, when the subscription is already active.

First detailed read-through. General structure makes sense.

Comments on some details:

No catalogs.sgml documentation for pg_subscription_rel. When that is
added, I would emphasize that entries are only added when relations
are first encountered, not immediately when a table is added to a
publication. So it is unlike pg_publication_rel in that respect.

Rename max_subscription_sync_workers ->
max_sync_workers_per_subscription (similar to
max_parallel_workers_per_gather and others)

About the changes to COPY: It took me a while to get clarity on the
direction of things. Is the read_cb reading the table, or the data?
You are copying data produced by a function into a table, so
produce_cb or data_source_cb could be clearer.

SetSubscriptionRelState(): This is doing an "upsert", so I don't think
a RowExclusiveLock is enough, or it needs to be able to retry on
concurrent changes. I suppose there shouldn't be more than one
concurrent change per sub/rel pair, but that would need to be
explained there.

SetSubscriptionRelState(): The memset(values, 0, sizeof(values)) is
kind of in an odd place. Possibly not actually needed.

GetSubscriptionRelState(): Prefer error messages that identify the
objects by name. (Also subid should be %u.)

GetSubscriptionRelationsFilter(): The state and filter arguments are
kind of weird. And there are only two callers, so all this complexity
is not even used. Perhaps, if state == SUBREL_STATE_UNKNOWN, then
return everything, else return exactly the state specified. The case
of everything-but-the-state-specified does not appear to be needed.

GetSubscriptionRelationsFilter(): RowExclusiveLock is probably too
much

This patch adds the fourth definition of oid_cmp() (also known as
oidComparator()) and the 12th definition of atooid(). Let's collect
those in a central place. I'm sending a separate patch for that. (No
need to change here until that is resolved, of course.)

AlterSubscription_refresh(): Put the if (wrconn == NULL) case first
and error out, and then put the rest of the code into the main
function block, saving one level of indentation.

AlterSubscription_refresh(): remote_table_oids isn't really the
remote OIDs of the tables but the local OIDs of the remote tables.
Consider clearer variable naming in that function.

interesting_relation(): very generic name, maybe
should_apply_changes_for_rel(). Also the comment mentions a "parallel
worker" -- maybe better a "separate worker process running in
parallel", since a parallel worker is something different.

LogicalRepApplyLoop() changed to non-static, but not used anywhere
else.

process_syncing_tables_*(): Function names and associated comments are
not very clear (process what?, handle what?).

In process_syncing_tables_apply(), it says that
logicalrep_worker_count() counts the apply worker as well, but what
happens if the apply worker has temporarily disappeared? Since
logicalrep_worker_count() is only used in this one place, maybe have
it count just the sync workers and rename it slightly.

Changed some LOG messages to DEBUG, not clear what the purpose there is.

check_max_subscription_sync_workers(): Same problem as discussed
previously, checking a GUC setting against another GUC setting like
this doesn't work. Just skip it and check at run time.

wait_for_sync_status_change(): Comment is wrong/misleading: It doesn't
wait until it matches any specific state, it just waits for any state
change.

LogicalRepSyncTableStart(): The code that assembles the slot name
needs to be aware of NAMEDATALEN. (Maybe at least throw in a static
assert that NAMEDATALEN>=64.)

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical replication existing data copy
Date: 2017-01-18 13:46:56
Message-ID: 5a90e2ff-6375-2e30-d0af-f6d1f672429a@2ndquadrant.com
Lists: pgsql-hackers

On 11/01/17 17:10, Peter Eisentraut wrote:
> On 12/19/16 4:30 AM, Petr Jelinek wrote:
>> This patch implements data synchronization for logical replication.
>> It works both for the initial setup of a subscription and for tables
>> added to replication later, when the subscription is already active.
>
> First detailed read-through. General structure makes sense.

Thanks!

>
> Comments on some details:
>
> No catalogs.sgml documentation for pg_subscription_rel. When that is
> added, I would emphasize that entries are only added when relations
> are first encountered, not immediately when a table is added to a
> publication. So it is unlike pg_publication_rel in that respect.
>

It's not even at first encounter, actually, but I did add an explanation.

> Rename max_subscription_sync_workers ->
> max_sync_workers_per_subscription (similar to
> max_parallel_workers_per_gather and others)
>

Makes sense.

> About the changes to COPY: It took me a while to get clarity on the
> direction of things. Is the read_cb reading the table, or the data?
> You are copying data produced by a function into a table, so
> produce_cb or data_source_cb could be clearer.
>

The data_source_cb sounds good to me.

> SetSubscriptionRelState(): This is doing an "upsert", so I don't think
> a RowExclusiveLock is enough, or it needs to be able to retry on
> concurrent changes. I suppose there shouldn't be more than one
> concurrent change per sub/rel pair, but that would need to be
> explained there.
>

Well, technically there can be some concurrency via ALTER
SUBSCRIPTION ... REFRESH, so I increased the lock level (and the same for Remove).

> SetSubscriptionRelState(): The memset(values, 0, sizeof(values)) is
> kind of in an odd place. Possibly not actually needed.
>

It might not be needed but we traditionally do it anyway. I moved it a bit.

> GetSubscriptionRelState(): Prefer error messages that identify the
> objects by name. (Also subid should be %u.)
>

Well, this is a cache lookup failure though; who's to say that the objects
actually exist in that case.

> GetSubscriptionRelationsFilter(): The state and filter arguments are
> kind of weird. And there are only two callers, so all this complexity
> is not even used. Perhaps, if state == SUBREL_STATE_UNKNOWN, then
> return everything, else return exactly the state specified. The case
> of everything-but-the-state-specified does not appear to be needed.
>

I see this was a bit confusing, so I split the function into two. Actually,
the two use cases are everything, and everything except SUBREL_STATE_READY.

> GetSubscriptionRelationsFilter(): RowExclusiveLock is probably too
> much
>

Yes, same with GetSubscriptionRelState().

> AlterSubscription_refresh(): remote_table_oids isn't really the
> remote OIDs of the tables but the local OIDs of the remote tables.
> Consider clearer variable naming in that function.
>

Done.

> interesting_relation(): very generic name, maybe
> should_apply_changes_for_rel(). Also the comment mentions a "parallel
> worker" -- maybe better a "separate worker process running in
> parallel", since a parallel worker is something different.
>

Done.

>
> process_syncing_tables_*(): Function names and associated comments are
> not very clear (process what?, handle what?).
>

OK, I added some more explanation; hopefully it's better now.

> In process_syncing_tables_apply(), it says that
> logicalrep_worker_count() counts the apply worker as well, but what
> happens if the apply worker has temporarily disappeared? Since

Then the function is never going to be called for that subscription, as
it's only called from the apply worker.

> logicalrep_worker_count() is only used in this one place, maybe have
> it count just the sync workers and rename it slightly.
>

Makes sense anyway though.

> Changed some LOG messages to DEBUG, not clear what the purpose there is.

In fact I changed some DEBUG messages to LOG in the main patch set, git
rebase just does not handle that very well. Fixed.

> check_max_subscription_sync_workers(): Same problem as discussed
> previously, checking a GUC setting against another GUC setting like
> this doesn't work. Just skip it and check at run time.
>

Yeah, we always check at run time.

> wait_for_sync_status_change(): Comment is wrong/misleading: It doesn't
> wait until it matches any specific state, it just waits for any state
> change.
>

Fixed.

> LogicalRepSyncTableStart(): The code that assembles the slot name
> needs to be aware of NAMEDATALEN. (Maybe at least throw in a static
> assert that NAMEDATALEN>=64.)
>

I switched to snprintf, which seems cleaner (it also removes the need to
know the proper memory context of a private variable from
LogicalRepSyncTableStart()).

I also added an option called SKIP CONNECT to CREATE SUBSCRIPTION (yes,
that's a WIP name, I welcome suggestions). It skips the initial connection
attempt, which is otherwise now always needed since we need to copy the
table list.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
0001-Logical-replication-support-for-initial-data-copy-v2.patch text/plain 147.0 KB

From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-01-18 16:35:30
Message-ID: 38b9a72089536068c1c645c859b9a750@xs4all.nl
Lists: pgsql-hackers

On 2017-01-18 14:46, Petr Jelinek wrote:

> 0001-Logical-replication-support-for-initial-data-copy-v2.patch

Applies and builds fine on top of the previous 5 patches.

Two problems:

1. alter_subscription.sgml has an unpaired <command>-tag, which breaks
the doc-build. This is the offending patch line:
+ <command>CREATE SUBSCRIPTION</command> with <command>COPY DATA<command>

2. Running the below (a version of the earlier pgbench_derail.sh) I have
found that
  create subscription sub1 .. with (disabled);
and then
  alter subscription sub1 enable;
cannot be run immediately, consecutively. The error is avoided when the two
commands are separated (for instance, below, in separate psql calls).

I don't understand why this is, but it is reliably so.

The error(s):
2017-01-18 17:26:56.126 CET 24410 LOG: starting logical replication worker for subscription "sub1"
2017-01-18 17:26:56.132 CET 26291 LOG: logical replication apply for subscription sub1 started
2017-01-18 17:26:56.139 CET 26291 LOG: starting logical replication worker for subscription "sub1"
2017-01-18 17:26:56.145 CET 26295 LOG: logical replication sync for subscription sub1, table pgbench_accounts started
2017-01-18 17:26:56.534 CET 26295 ERROR: duplicate key value violates unique constraint "pgbench_accounts_pkey"
2017-01-18 17:26:56.534 CET 26295 DETAIL: Key (aid)=(1) already exists.
2017-01-18 17:26:56.534 CET 26295 CONTEXT: COPY pgbench_accounts, line 1
2017-01-18 17:26:56.536 CET 21006 LOG: worker process: logical replication worker 41015 sync 40991 (PID 26295) exited with exit code 1
2017-01-18 17:26:56.536 CET 26291 LOG: starting logical replication worker for subscription "sub1"
2017-01-18 17:26:56.542 CET 26297 LOG: logical replication sync for subscription sub1, table pgbench_branches started
2017-01-18 17:26:57.015 CET 26297 ERROR: duplicate key value violates unique constraint "pgbench_branches_pkey"
2017-01-18 17:26:57.015 CET 26297 DETAIL: Key (bid)=(1) already exists.
2017-01-18 17:26:57.015 CET 26297 CONTEXT: COPY pgbench_branches, line 1
2017-01-18 17:26:57.017 CET 21006 LOG: worker process: logical replication worker 41015 sync 40994 (PID 26297) exited with exit code 1
2017-01-18 17:26:57.017 CET 26291 LOG: starting logical replication worker for subscription "sub1"
2017-01-18 17:26:57.023 CET 26299 LOG: logical replication sync for subscription sub1, table pgbench_history started
2017-01-18 17:26:57.487 CET 26299 LOG: logical replication synchronization worker finished processing
2017-01-18 17:26:57.488 CET 26291 LOG: starting logical replication worker for subscription "sub1"
2017-01-18 17:26:57.491 CET 26301 LOG: logical replication sync for subscription sub1, table pgbench_tellers started
2017-01-18 17:26:57.948 CET 26301 ERROR: duplicate key value violates unique constraint "pgbench_tellers_pkey"
2017-01-18 17:26:57.948 CET 26301 DETAIL: Key (tid)=(1) already exists.
2017-01-18 17:26:57.948 CET 26301 CONTEXT: COPY pgbench_tellers, line 1
etc, etc.

#!/bin/sh

# assumes both instances are running, initially without publication or subscription

unset PGSERVICEFILE PGSERVICE PGPORT PGDATA PGHOST
env | grep PG

PGDATABASE=testdb

# clear logs
echo > /home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/logfile.logical_replication
echo > /home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication2/logfile.logical_replication2

port1=6972
port2=6973

function cb()
{
  # display the 4 pgbench tables' accumulated content as md5s
  # a,b,t,h stand for: pgbench_accounts, -branches, -tellers, -history
  for port in $port1 $port2
  do
    md5_a=$(echo "select * from pgbench_accounts order by aid" | psql -qtAXp$port | md5sum | cut -b 1-9)
    md5_b=$(echo "select * from pgbench_branches order by bid" | psql -qtAXp$port | md5sum | cut -b 1-9)
    md5_t=$(echo "select * from pgbench_tellers order by tid"  | psql -qtAXp$port | md5sum | cut -b 1-9)
    md5_h=$(echo "select * from pgbench_history order by hid"  | psql -qtAXp$port | md5sum | cut -b 1-9)
    cnt_a=$(echo "select count(*) from pgbench_accounts" | psql -qtAXp $port)
    cnt_b=$(echo "select count(*) from pgbench_branches" | psql -qtAXp $port)
    cnt_t=$(echo "select count(*) from pgbench_tellers"  | psql -qtAXp $port)
    cnt_h=$(echo "select count(*) from pgbench_history"  | psql -qtAXp $port)
    printf "$port a,b,t,h: %6d %6d %6d %6d" $cnt_a $cnt_b $cnt_t $cnt_h
    echo -n " $md5_a $md5_b $md5_t $md5_h"
    if   [[ $port -eq $port1 ]]; then echo " master"
    elif [[ $port -eq $port2 ]]; then echo " replica"
    else echo " ERROR"
    fi
  done
}

echo "
drop table if exists pgbench_accounts;
drop table if exists pgbench_branches;
drop table if exists pgbench_tellers;
drop table if exists pgbench_history;" | psql -X -p $port1 \
&& echo "
drop table if exists pgbench_accounts;
drop table if exists pgbench_branches;
drop table if exists pgbench_tellers;
drop table if exists pgbench_history;" | psql -X -p $port2 \
&& pgbench -p $port1 -qis 1 \
&& echo "
alter table pgbench_history add column hid serial primary key;
-- alter table pgbench_history replica identity full;
" | psql -1p $port1 \
&& pg_dump -F c -p $port1 \
   -t pgbench_accounts \
   -t pgbench_branches \
   -t pgbench_tellers \
   -t pgbench_history \
 | pg_restore -p $port2 -d testdb

echo "$(cb)"

sleep 2

echo "$(cb)"

echo "create publication pub1 for all tables;" | psql -p $port1 -aqtAX

# this demonstrates the bug:
echo "create subscription sub1 connection 'port=${port1}' publication pub1 with (disabled);
alter subscription sub1 enable; " | psql -p $port2 -aqtAX
# like this (create and alter together in a single psql call) it fails;
# with the commands separated (sending them to psql separately suffices) there is no problem.

#------------------------------------

#exit
#echo "$(cb)"
#echo "-- pgbench -c 1 -T 10 -P 5 -n (short run, first)"
# pgbench -c 1 -T 10 -P 5 -n


From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: Erik Rijkers <er(at)xs4all(dot)nl>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-01-18 19:45:54
Message-ID: 4323fc2b-9410-cf12-b648-2b8d55603e9e@2ndquadrant.com
Lists: pgsql-hackers

On 18/01/17 17:35, Erik Rijkers wrote:
> On 2017-01-18 14:46, Petr Jelinek wrote:
>
>> 0001-Logical-replication-support-for-initial-data-copy-v2.patch
>
> Applies and builds fine on top of the previous 5 patches.
>
> Two problems:
>
> 1. alter_subscription.sgml has an unpaired <command>-tag, which breaks
> the doc-build:
> This is the offending patch-line:
> + <command>CREATE SUBSCRIPTION</command> with <command>COPY
> DATA<command>
>

Hmm, I wonder how that compiled on my machine, as it is indeed a syntax
error.

>
> 2. Running the below (a version of the earlier pgbench_derail.sh) I have
> found that
> create subscription sub1 .. with (disabled); and then alter
> subscription sub1 enable;
> cannot be run immediately, consecutively. The error is avoided when the
> two
> commands are separated (for instance, below in separate psql- calls).
>
> I don't understand why this is but it is reliably so.
>

AFAICS you should always get an error from that test after you enable the
subscription, no matter whether you enable it immediately or later. The
default behavior is to copy the data, and your test already copies it via
pg_dump/pg_restore.
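
One way to keep your test as it is would be to empty the tables on the
replica before the subscription is enabled, so the copy does not collide
with the pre-loaded rows, e.g.:

  -- on the replica, after the pg_restore and before enabling sub1
  TRUNCATE pgbench_accounts, pgbench_branches, pgbench_tellers, pgbench_history;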

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-01-19 08:39:16
Message-ID: e2eb061aa540dde92fefc643b9303599@xs4all.nl
Lists: pgsql-hackers

On 2017-01-18 20:45, Petr Jelinek wrote:
>
> AFAICS you should always get an error from that test after you enable the

Ah yes, you were right, of course; I had assumed the earlier mentioned
CREATE SUBSCRIPTION ... [ WITH (COPY DATA | NOCOPY DATA) ]
but that syntax wasn't implemented, I now understand.

Taking that into account, my older tests work OK again (using the
7 patches below).

Other small issue: using this patch-set:

> 0001-Add-PUBLICATION-catalogs-and-DDL-v18.patch
> 0002-Add-SUBSCRIPTION-catalog-and-DDL-v18.patch
> 0003-Define-logical-replication-protocol-and-output-plugi-v18.patch
> 0004-Add-logical-replication-workers-v18fixed.patch
> 0005-Add-separate-synchronous-commit-control-for-logical--v18.patch
> 0001-Logical-replication-support-for-initial-data-copy-v2.patch
> 0006-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch

( This is now the patch-set to test, is that correct? )

make check complains:

*** /home/aardvark/pg_stuff/pg_sandbox/pgsql.logical_replication/src/test/regress/expected/subscription.out  2017-01-19 09:26:41.354703032 +0100
--- /home/aardvark/pg_stuff/pg_sandbox/pgsql.logical_replication/src/test/regress/results/subscription.out  2017-01-19 09:29:56.104685043 +0100
***************
*** 53,62 ****
COMMIT;
ALTER SUBSCRIPTION testsub RENAME TO testsub_foo;
\dRs
! List of subscriptions
! Name | Owner | Enabled | Publication
! -------------+---------------------------+---------+--------------------
! testsub_foo | regress_subscription_user | f | {testpub,testpub1}
(1 row)

DROP SUBSCRIPTION testsub_foo NODROP SLOT;
--- 53,62 ----
COMMIT;
ALTER SUBSCRIPTION testsub RENAME TO testsub_foo;
\dRs
! List of subscriptions
! Name | Owner | Enabled | Publication
! -------------+---------------------------+---------+-------------
! testsub_foo | regress_subscription_user | f | {testpub}
(1 row)

DROP SUBSCRIPTION testsub_foo NODROP SLOT;

======================================================================

Thanks,

Erik Rijkers


From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: Erik Rijkers <er(at)xs4all(dot)nl>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-01-19 18:03:29
Message-ID: 0f48754e-6288-45ea-dfdb-05ed8eadc2da@2ndquadrant.com
Lists: pgsql-hackers

On 19/01/17 09:39, Erik Rijkers wrote:
> On 2017-01-18 20:45, Petr Jelinek wrote:
>>
> AFAICS you should always get an error from that test after you enable the
>
> Ah tes, you were right, of course; I had assumed the earlier mentioned
> CREATE SUBSCRIPTION ... [ WITH (COPY DATA | NOCOPY DATA) ]
> but that syntax wasn't implemented, I now understand.
>
> Taking that into account, my older tests work OK again (using the
> 7-patches below).
>

Good to hear.

>
> Other small issue: using this patch-set:
>
>> 0001-Add-PUBLICATION-catalogs-and-DDL-v18.patch
>> 0002-Add-SUBSCRIPTION-catalog-and-DDL-v18.patch
>> 0003-Define-logical-replication-protocol-and-output-plugi-v18.patch
>> 0004-Add-logical-replication-workers-v18fixed.patch
>> 0005-Add-separate-synchronous-commit-control-for-logical--v18.patch
>> 0001-Logical-replication-support-for-initial-data-copy-v2.patch
>> 0006-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch
>
> ( This is now the patch-set to test, is that correct? )
>

The rename patch should come before this one, but I guess with v2 that
didn't work well yet, as it didn't know about rename.

> make check complains:
>
> ***
> /home/aardvark/pg_stuff/pg_sandbox/pgsql.logical_replication/src/test/regress/expected/subscription.out
> 2017-01-19 09:26:41.354703032 +0100
> ---
> /home/aardvark/pg_stuff/pg_sandbox/pgsql.logical_replication/src/test/regress/results/subscription.out
> 2017-01-19 09:29:56.104685043 +0100
> ***************
> *** 53,62 ****
> COMMIT;
> ALTER SUBSCRIPTION testsub RENAME TO testsub_foo;
> \dRs
> ! List of subscriptions
> ! Name | Owner | Enabled | Publication
> ! -------------+---------------------------+---------+--------------------
> ! testsub_foo | regress_subscription_user | f | {testpub,testpub1}
> (1 row)
>
> DROP SUBSCRIPTION testsub_foo NODROP SLOT;
> --- 53,62 ----
> COMMIT;
> ALTER SUBSCRIPTION testsub RENAME TO testsub_foo;
> \dRs
> ! List of subscriptions
> ! Name | Owner | Enabled | Publication
> ! -------------+---------------------------+---------+-------------
> ! testsub_foo | regress_subscription_user | f | {testpub}
> (1 row)
>
> DROP SUBSCRIPTION testsub_foo NODROP SLOT;
>
> ======================================================================
>

Meh, I really messed up the rebase this time.

Okay, here is v3 with some small fixes and rebased on top of rename.
Also it's rebased without the
0005-Add-separate-synchronous-commit-control-for-logical--v18.patch as I
don't expect that one to go further for now.

Thanks for testing!

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
0001-Logical-replication-support-for-initial-data-copy-v3.patch text/plain 147.6 KB

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-02-01 06:18:08
Message-ID: CAB7nPqQMd2Aay-fLjLrGqZ5asoC52B1r0bt1L-6KT=V7JfQk5A@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jan 20, 2017 at 3:03 AM, Petr Jelinek
<petr(dot)jelinek(at)2ndquadrant(dot)com> wrote:
> Okay, here is v3 with some small fixes and rebased on top of rename.
> Also it's rebased without the
> 0005-Add-separate-synchronous-commit-control-for-logical--v18.patch as I
> don't expect that one to go further for now.
>
> Thanks for testing!

This patch needs a rebase, moved to next CF with "waiting on author".
--
Michael


From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical replication existing data copy
Date: 2017-02-08 22:24:12
Message-ID: 769517d9-4186-b66e-b3f8-436ca400b1ab@2ndquadrant.com
Lists: pgsql-hackers

Hi,

here is updated patch.

Note that it's rebased on top of logical replication improvements
patches [1] (which still apply fine to my surprise).

It will probably need another rebase once patches from Masahiko Sawada
and Fujii Masao get in.

[1]
https://www.postgresql.org/message-id/42655eb4-6b2e-b35b-c184-509efd6eba10%402ndquadrant.com

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical replication existing data copy
Date: 2017-02-08 22:25:40
Message-ID: 9fb3f37e-0737-e9b1-a104-bb7b58a776c3@2ndquadrant.com
Lists: pgsql-hackers

On 08/02/17 23:24, Petr Jelinek wrote:
> Hi,
>
> here is updated patch.
>
> Note that it's rebased on top of logical replication improvements
> patches [1] (which still apply fine to my surprise).
>
> It will probably need another rebase once patches from Masahiko Sawada
> and Fujii Masao get in.
>
> [1]
> https://www.postgresql.org/message-id/42655eb4-6b2e-b35b-c184-509efd6eba10%402ndquadrant.com
>

And now with the actual patch attached :)

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
0001-Logical-replication-support-for-initial-data-copy-v4.patch text/plain 145.9 KB

From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical replication existing data copy
Date: 2017-02-09 01:25:01
Message-ID: ab5a69bd5945a45747493bc76c17a87c@xs4all.nl
Lists: pgsql-hackers

On 2017-02-08 23:25, Petr Jelinek wrote:

> 0001-Use-asynchronous-connect-API-in-libpqwalreceiver-v2.patch
> 0002-Always-initialize-stringinfo-buffers-in-walsender-v2.patch
> 0003-Fix-after-trigger-execution-in-logical-replication-v2.patch
> 0004-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION-v2.patch
> 0001-Logical-replication-support-for-initial-data-copy-v4.patch

test 'object_address' fails, see attachment.

That's all I found in a quick first trial.

thanks,

Erik Rijkers

Attachment Content-Type Size
regression.diffs text/x-diff 8.0 KB

From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical replication existing data copy
Date: 2017-02-11 10:16:34
Message-ID: 479dceb8c0acc8026ced81caf9cce750@xs4all.nl
Lists: pgsql-hackers

On 2017-02-08 23:25, Petr Jelinek wrote:

> 0001-Use-asynchronous-connect-API-in-libpqwalreceiver-v2.patch
> 0002-Always-initialize-stringinfo-buffers-in-walsender-v2.patch
> 0003-Fix-after-trigger-execution-in-logical-replication-v2.patch
> 0004-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION-v2.patch
> 0001-Logical-replication-support-for-initial-data-copy-v4.patch

Apart from the one failing make check test ('object_address'), which
I reported earlier, I find it is easy to 'confuse' the replication.

I attach a script that intends to test the default COPY DATA. There
are two instances, initially without any replication. The script inits
pgbench on the master, adds a serial column to pgbench_history, and
dump-restores the 4 pgbench-tables to the future replica. It then
empties the 4 pgbench-tables on the 'replica'. The idea is that when
logrep is initiated, data will be replicated from master, with the end
result being that there are 4 identical tables on master and replica.

This often works but it also fails far too often (in my hands). I test
whether the tables are identical by comparing an md5 from an ordered
resultset, from both replica and master. I estimate that 1 in 5 tries
fail; 'fail' being a somewhat different table on replica (compared to
mater), most often pgbench_accounts (typically there are 10-30 differing
rows). No errors or warnings in either logfile. I'm not sure but I
think testing on faster machines seem to be doing somewhat better
('better' being less replication error).

Another, probably unrelated, problem occurs (but much more rarely) when
executing 'DROP SUBSCRIPTION sub1' on the replica (see the beginning of
the script). Sometimes that command hangs, and refuses to accept
shutdown of the server. I don't know how to recover from this -- I just
have to kill the replica server (master server still obeys normal
shutdown) and restart the instances.

The script accepts 2 parameters, scale and clients (used for pgbench -s
and -c, respectively).

I don't think I've managed to successfully run the script with more than
1 client yet.

Can you have a look whether this is reproducible elsewhere?

thanks,

Erik Rijkers

Attachment Content-Type Size
pgbench_derail2.sh text/x-shellscript 6.0 KB

From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy - sgml fixes
Date: 2017-02-13 10:09:12
Message-ID: 15ad9d609043b773a111b97cf9c784a3@xs4all.nl
Lists: pgsql-hackers

On 2017-02-09 02:25, Erik Rijkers wrote:
> On 2017-02-08 23:25, Petr Jelinek wrote:
>
>> 0001-Use-asynchronous-connect-API-in-libpqwalreceiver-v2.patch
>> 0002-Always-initialize-stringinfo-buffers-in-walsender-v2.patch
>> 0003-Fix-after-trigger-execution-in-logical-replication-v2.patch
>> 0004-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION-v2.patch
>> 0001-Logical-replication-support-for-initial-data-copy-v4.patch
>

fixes in create_subscription.sgml

Attachment Content-Type Size
create_subscription.sgml.diff text/x-diff 2.4 KB

From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical replication existing data copy
Date: 2017-02-13 13:51:46
Message-ID: b03bc0134161272b6bc4577b96c59de2@xs4all.nl
Lists: pgsql-hackers

On 2017-02-11 11:16, Erik Rijkers wrote:
> On 2017-02-08 23:25, Petr Jelinek wrote:
>
>> 0001-Use-asynchronous-connect-API-in-libpqwalreceiver-v2.patch
>> 0002-Always-initialize-stringinfo-buffers-in-walsender-v2.patch
>> 0003-Fix-after-trigger-execution-in-logical-replication-v2.patch
>> 0004-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION-v2.patch
>> 0001-Logical-replication-support-for-initial-data-copy-v4.patch
>
> This often works but it also fails far too often (in my hands). I
> test whether the tables are identical by comparing an md5 from an
> ordered resultset, from both replica and master. I estimate that 1 in
> 5 tries fail; 'fail' being a somewhat different table on replica
> (compared to master), most often pgbench_accounts (typically there are
> 10-30 differing rows). No errors or warnings in either logfile. I'm
> not sure but I think testing on faster machines seems to be doing
> somewhat better ('better' being less replication error).
>

I have noticed that when I insert a few seconds wait-state after the
create subscription (or actually: the 'enable'ing of the subscription)
the problem does not occur. Apparently, (I assume) the initial snapshot
occurs somewhere when the subsequent pgbench-run has already started, so
that the logical replication also starts somewhere 'into' that
pgbench-run. Does that make sense?

I don't know what to make of it. Now that I think that I understand
what happens I hesitate to call it a bug. But I'd say it's still a
usability problem that the subscription is only 'valid' after some
time, even if it's only a few seconds.
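
A more robust wait than a fixed sleep would probably be to poll the new
catalog until every table reports the ready state, something like the query
below (the column name and the 'r' state value are my guess at the patch's
catalog):

  -- run on the subscriber; repeat until it returns 0
  SELECT count(*) FROM pg_subscription_rel WHERE srsubstate <> 'r';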

(the other problem I mentioned (drop subscription hangs) still happens
every now and then)


From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: Erik Rijkers <er(at)xs4all(dot)nl>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical replication existing data copy
Date: 2017-02-15 23:43:58
Message-ID: 16135dcb-0b52-2347-2173-9fb2cfeef7ad@2ndquadrant.com
Lists: pgsql-hackers

On 13/02/17 14:51, Erik Rijkers wrote:
> On 2017-02-11 11:16, Erik Rijkers wrote:
>> On 2017-02-08 23:25, Petr Jelinek wrote:
>>
>>> 0001-Use-asynchronous-connect-API-in-libpqwalreceiver-v2.patch
>>> 0002-Always-initialize-stringinfo-buffers-in-walsender-v2.patch
>>> 0003-Fix-after-trigger-execution-in-logical-replication-v2.patch
>>> 0004-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION-v2.patch
>>> 0001-Logical-replication-support-for-initial-data-copy-v4.patch
>>
>> This often works but it also fails far too often (in my hands). I
>> test whether the tables are identical by comparing an md5 from an
>> ordered resultset, from both replica and master. I estimate that 1 in
>> 5 tries fail; 'fail' being a somewhat different table on replica
>> (compared to master), most often pgbench_accounts (typically there are
>> 10-30 differing rows). No errors or warnings in either logfile. I'm
>> not sure but I think testing on faster machines seems to be doing
>> somewhat better ('better' being less replication error).
>>
>
> I have noticed that when I insert a few seconds wait-state after the
> create subscription (or actually: the 'enable'ing of the subscription)
> the problem does not occur. Apparently, (I assume) the initial snapshot
> occurs somewhere when the subsequent pgbench-run has already started, so
> that the logical replication also starts somewhere 'into' that
> pgbench-run. Does that make sense?
>
> I don't know what to make of it. Now that I think that I understand
> what happens I hesitate to call it a bug. But I'd say it's still a
> usability problem that the subscription is only 'valid' after some
> time, even if it's only a few seconds.
>

It is a bug. We go to great lengths to create a data snapshot that
corresponds to a specific LSN so that we are able to decode exactly the
changes that happened since the data snapshot was taken. And
tablesync.c does quite a lot to synchronize the table handover to the main
apply process so that there is a correct continuation of the data stream as
well. So the end result is that concurrent changes are supposed to be okay,
and eventually replication should catch up and the contents should be the same.

That being said, I am so far having problems reproducing this on my test
machine(s) so no idea what causes it yet.

Could you periodically dump the contents of pg_subscription_rel on the
subscriber (ideally when dumping the md5 of the data) and attach that as
well?
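
Something along these lines, next to each md5 dump, should do (column names
as in the current patch; adjust if your applied version differs):

  SELECT srrelid::regclass AS tbl, srsubstate, srsublsn
    FROM pg_subscription_rel
   ORDER BY 1;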

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical replication existing data copy
Date: 2017-02-16 09:36:58
Message-ID: 56f3ec6f1989c738a0fa865b13d25761@xs4all.nl
Lists: pgsql-hackers

On 2017-02-16 00:43, Petr Jelinek wrote:
> On 13/02/17 14:51, Erik Rijkers wrote:
>> On 2017-02-11 11:16, Erik Rijkers wrote:
>>> On 2017-02-08 23:25, Petr Jelinek wrote:
>>>
>>>> 0001-Use-asynchronous-connect-API-in-libpqwalreceiver-v2.patch
>>>> 0002-Always-initialize-stringinfo-buffers-in-walsender-v2.patch
>>>> 0003-Fix-after-trigger-execution-in-logical-replication-v2.patch
>>>> 0004-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION-v2.patch
>>>> 0001-Logical-replication-support-for-initial-data-copy-v4.patch
>>>
>>> This often works but it also fails far too often (in my hands). I
>
> Could you periodically dump the contents of pg_subscription_rel on the
> subscriber (ideally when dumping the md5 of the data) and attach that as
> well?

I attach a bash script (and its output) that polls the 4 pgbench tables'
md5s and the pg_subscription_rel table, each second, while I run
pgbench_derail2.sh (for that, see my earlier mail).

pgbench_derail2.sh writes a 'header' into the same output stream (search
for '^===' ).

The .out file reflects a session where I started pgbench_derail2.sh
twice (it removes the publication and subscription at startup). So
there are 2 headers in the attached cb_20170216_10_04_47.out. The first
run ended in a successful replication (= all 4 pgbench tables
md5-identical). The second run does not end correctly: it has (one of)
the typical faulty end-states: pgbench_accounts, the copy, has a few
fewer rows than the master table.

Other typical end-states are:
same number of rows but content not identical (for some, typically < 20
rows). Mostly pgbench_accounts and pgbench_history are affected.

(I see now that I made some mistakes in generating the timestamps in the
.out file but I suppose it doesn't matter too much)

I hope it helps; let me know if I can do any other test(s).

Attachment Content-Type Size
cb_20170216_10_04_47.out text/plain 46.1 KB
cb.sh text/x-shellscript 2.5 KB

From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical replication existing data copy
Date: 2017-02-18 13:17:26
Message-ID: 7e143a286c673386db18848ac29fe6c8@xs4all.nl
Lists: pgsql-hackers

On 2017-02-16 00:43, Petr Jelinek wrote:
> On 13/02/17 14:51, Erik Rijkers wrote:
>> On 2017-02-11 11:16, Erik Rijkers wrote:
>>> On 2017-02-08 23:25, Petr Jelinek wrote:
>>>
>>>> 0001-Use-asynchronous-connect-API-in-libpqwalreceiver-v2.patch
>>>> 0002-Always-initialize-stringinfo-buffers-in-walsender-v2.patch
>>>> 0003-Fix-after-trigger-execution-in-logical-replication-v2.patch
>>>> 0004-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION-v2.patch
>>>> 0001-Logical-replication-support-for-initial-data-copy-v4.patch
>>>
>>> This often works but it also fails far too often (in my hands). I
>
> That being said, I am so far having problems reproducing this on my
> test
> machine(s) so no idea what causes it yet.
>

A few extra bits:

- I have repeated this now on three different machines (debian 7, 8,
centos6; one a pretty big server); there is always failure within a few
tries of that test program (i.e. pgbench_derail2.sh, with the above 5
patches).

- I have also tried to go back to an older version of logrep: running
with 2 instances with only the first four patches (i.e., leaving out the
support-for-existing-data patch). With only those 4, the logical
replication is solid. (a quick 25x repetition of a (very similar) test
program is 100% successful). So the problem is likely somehow in that
last 5th patch.

- A 25x repetition of a test on a master + replica 5-patch server yields
13 ok, 12 NOK.

- Is the 'make check' FAILED test 'object_address' unrelated? (Can you
at least reproduce that failed test?)

Maybe add this to the 10 Open Items list?
https://wiki.postgresql.org/wiki/PostgreSQL_10_Open_Items

It may garner a bit more attention.

thanks,

Erik Rijkers


From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-02-18 13:24:18
Message-ID: 95fb5ed539594c9244c67bca255985c5@xs4all.nl
Lists: pgsql-hackers

>
> Maybe add this to the 10 Open Items list?
> https://wiki.postgresql.org/wiki/PostgreSQL_10_Open_Items
>
> It may garner a bit more attention.
>

Ah sorry, it's there already.


From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-02-18 13:48:11
Message-ID: d4a5b160d4484adda4f4c755979b4973@xs4all.nl
Lists: pgsql-hackers

On 2017-02-11 11:16, Erik Rijkers wrote:
> On 2017-02-08 23:25, Petr Jelinek wrote:
>
>> 0001-Use-asynchronous-connect-API-in-libpqwalreceiver-v2.patch
>> 0002-Always-initialize-stringinfo-buffers-in-walsender-v2.patch
>> 0003-Fix-after-trigger-execution-in-logical-replication-v2.patch
>> 0004-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION-v2.patch
>> 0001-Logical-replication-support-for-initial-data-copy-v4.patch
>

Let me add the script ('instances.sh') that I use to start up the two
logical replication instances for testing.

Together with the earlier posted 'pgbench_derail2.sh' it makes up the
failing test.

pg_config of the master is:

$ pg_config
BINDIR = /home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/bin
DOCDIR = /home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/share/doc
HTMLDIR = /home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/share/doc
INCLUDEDIR = /home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/include
PKGINCLUDEDIR = /home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/include
INCLUDEDIR-SERVER = /home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/include/server
LIBDIR = /home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/lib
PKGLIBDIR = /home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/lib
LOCALEDIR = /home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/share/locale
MANDIR = /home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/share/man
SHAREDIR = /home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/share
SYSCONFDIR = /home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/etc
PGXS = /home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/lib/pgxs/src/makefiles/pgxs.mk
CONFIGURE = '--prefix=/home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication' '--bindir=/home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/bin' '--libdir=/home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/lib' '--with-pgport=6972' '--enable-depend' '--enable-cassert' '--enable-debug' '--with-openssl' '--with-perl' '--with-libxml' '--with-libxslt' '--with-zlib' '--enable-tap-tests' '--with-extra-version=_logical_replication_20170218_1221_e3a58c8835a2'
CC = gcc
CPPFLAGS = -DFRONTEND -D_GNU_SOURCE -I/usr/include/libxml2
CFLAGS = -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -O2
CFLAGS_SL = -fpic
LDFLAGS = -L../../src/common -Wl,--as-needed -Wl,-rpath,'/home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/lib',--enable-new-dtags
LDFLAGS_EX =
LDFLAGS_SL =
LIBS = -lpgcommon -lpgport -lxslt -lxml2 -lssl -lcrypto -lz -lreadline -lrt -lcrypt -ldl -lm
VERSION = PostgreSQL 10devel_logical_replication_20170218_1221_e3a58c8835a2

I hope it helps someone to reproduce the errors I get. (If you can't,
I'd like to hear that too.)

thanks,

Erik Rijkers

Attachment Content-Type Size
instances.sh text/x-shellscript 1.0 KB

From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: Erik Rijkers <er(at)xs4all(dot)nl>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-02-18 14:01:11
Message-ID: 79d40e7c-7977-b503-1bad-bba18b5b4711@2ndquadrant.com
Lists: pgsql-hackers

On 18/02/17 14:24, Erik Rijkers wrote:
>>
>> Maybe add this to the 10 Open Items list?
>> https://wiki.postgresql.org/wiki/PostgreSQL_10_Open_Items
>>
>> It may garner a bit more attention.
>>
>
> Ah sorry, it's there already.

Hmm that's interesting given that it's not even committed yet :)

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: Erik Rijkers <er(at)xs4all(dot)nl>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical replication existing data copy
Date: 2017-02-18 14:05:12
Message-ID: 5e9bce06-1bb7-4618-2d0b-c427b55624b6@2ndquadrant.com
Lists: pgsql-hackers

On 18/02/17 14:17, Erik Rijkers wrote:
> On 2017-02-16 00:43, Petr Jelinek wrote:
>> On 13/02/17 14:51, Erik Rijkers wrote:
>>> On 2017-02-11 11:16, Erik Rijkers wrote:
>>>> On 2017-02-08 23:25, Petr Jelinek wrote:
>>>>
>>>>> 0001-Use-asynchronous-connect-API-in-libpqwalreceiver-v2.patch
>>>>> 0002-Always-initialize-stringinfo-buffers-in-walsender-v2.patch
>>>>> 0003-Fix-after-trigger-execution-in-logical-replication-v2.patch
>>>>> 0004-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION-v2.patch
>>>>> 0001-Logical-replication-support-for-initial-data-copy-v4.patch
>>>>
>>>> This often works but it also fails far too often (in my hands). I
>>
>> That being said, I am so far having problems reproducing this on my test
>> machine(s) so no idea what causes it yet.
>>
>
> A few extra bits:
>
> - I have repeated this now on three different machines (debian 7, 8,
> centos6; one a pretty big server); there is always failure within a few
> tries of that test program (i.e. pgbench_derail2.sh, with the above 5
> patches).
>
> - I have also tried to go back to an older version of logrep: running
> with 2 instances with only the first four patches (i.e., leaving out the
> support-for-existing-data patch). With only those 4, the logical
> replication is solid. (a quick 25x repetition of a (very similar) test
> program is 100% successful). So the problem is likely somehow in that
> last 5th patch.
>
> - A 25x repetition of a test on a master + replica 5-patch server yields
> 13 ok, 12 NOK.
>
> - Is the 'make check' FAILED test 'object_address' unrelated? (Can you
> at least reproduce that failed test?)
>

Yes, it has nothing to do with that; the test just needs to be updated to
use SKIP CONNECT instead of NOCREATE SLOT in this patch, since NOCREATE
SLOT is no longer enough to skip the connection attempt. And I have that
fixed locally, but it does not warrant a new patch version given the
main issue you reported.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical replication existing data copy - comments origin.c
Date: 2017-02-19 16:21:37
Message-ID: 0a00a988fb5e73429f0e3e42e972f3ea@xs4all.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> 0001-Use-asynchronous-connect-API-in-libpqwalreceiver-v2.patch
> 0002-Always-initialize-stringinfo-buffers-in-walsender-v2.patch
> 0003-Fix-after-trigger-execution-in-logical-replication-v2.patch
> 0004-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION-v2.patch
> 0001-Logical-replication-support-for-initial-data-copy-v4.patch

Improve readability of comment blocks
in src/backend/replication/logical/origin.c

thanks,

Erik Rijkers


From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy - comments origin.c
Date: 2017-02-19 16:23:29
Message-ID: caaa025e96b310d10a09a3cf37fd8779@xs4all.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2017-02-19 17:21, Erik Rijkers wrote:
>> 0001-Use-asynchronous-connect-API-in-libpqwalreceiver-v2.patch
>> 0002-Always-initialize-stringinfo-buffers-in-walsender-v2.patch
>> 0003-Fix-after-trigger-execution-in-logical-replication-v2.patch
>> 0004-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION-v2.patch
>> 0001-Logical-replication-support-for-initial-data-copy-v4.patch
>
> Improve readability of comment blocks
> in src/backend/replication/logical/origin.c
>

now attached

>
> thanks,
>
> Erik Rijkers

Attachment Content-Type Size
origin.c.diff text/x-diff 6.7 KB

From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy - comments snapbuild.c
Date: 2017-02-19 22:24:16
Message-ID: d848e7828c53bdbe5a8cd25983fe16c7@xs4all.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>>> 0001-Use-asynchronous-connect-API-in-libpqwalreceiver-v2.patch
>>> 0002-Always-initialize-stringinfo-buffers-in-walsender-v2.patch
>>> 0003-Fix-after-trigger-execution-in-logical-replication-v2.patch
>>> 0004-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION-v2.patch
>>> 0001-Logical-replication-support-for-initial-data-copy-v4.patch

Improve comment blocks in
src/backend/replication/logical/snapbuild.c


From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy - comments snapbuild.c
Date: 2017-02-19 22:27:00
Message-ID: daf2b5fb5adfa5a566b3fb7db16582ee@xs4all.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2017-02-19 23:24, Erik Rijkers wrote:
>>>> 0001-Use-asynchronous-connect-API-in-libpqwalreceiver-v2.patch
>>>> 0002-Always-initialize-stringinfo-buffers-in-walsender-v2.patch
>>>> 0003-Fix-after-trigger-execution-in-logical-replication-v2.patch
>>>> 0004-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION-v2.patch
>>>> 0001-Logical-replication-support-for-initial-data-copy-v4.patch
>
> Improve comment blocks in
> src/backend/replication/logical/snapbuild.c
>

[deep sigh...] attached...

Attachment Content-Type Size
snapbuild.c.diff text/x-diff 2.5 KB

From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: Erik Rijkers <er(at)xs4all(dot)nl>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical replication existing data copy
Date: 2017-02-22 12:03:53
Message-ID: 840bbff0-b062-c66d-be18-8153ddf8dc5e@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

so I managed to reproduce this. The main issue is the bug in snapbuild.c
mentioned in the nearby thread. But there are a couple of fixes in this
patch to make it work correctly (and also to fix the regression test
failure). So here is an updated version of the patch that should work
correctly (in combination with the other fixes).

I also decided to change SKIP CONNECT to NOCONNECT for consistency with
other options/commands.
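
For illustration, the renamed option would be used roughly like this
(subscription and publication names are the ones from the tests in this
thread; treat this as a sketch rather than the final syntax):

    CREATE SUBSCRIPTION sub1
        CONNECTION 'port=6972'
        PUBLICATION pub1
        WITH (NOCONNECT);  -- formerly SKIP CONNECT: don't connect to the publisher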

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
0001-Logical-replication-support-for-initial-data-copy-v5.patch text/x-patch 147.6 KB

From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical replication existing data copy
Date: 2017-02-22 13:48:33
Message-ID: 6df65ee030416533cef97d3e12b1c425@xs4all.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2017-02-22 13:03, Petr Jelinek wrote:

> 0001-Skip-unnecessary-snapshot-builds.patch
> 0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
> 0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
> 0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch
> 0002-Fix-after-trigger-execution-in-logical-replication.patch
> 0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch
> 0001-Logical-replication-support-for-initial-data-copy-v5.patch

It works well now, or at least my particular test case seems now solved.

thanks,

Erik Rijkers


From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical replication existing data copy
Date: 2017-02-22 17:13:11
Message-ID: b0dbcb2a1066d6728cbf62e391e7edf4@xs4all.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2017-02-22 14:48, Erik Rijkers wrote:
> On 2017-02-22 13:03, Petr Jelinek wrote:
>
>> 0001-Skip-unnecessary-snapshot-builds.patch
>> 0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
>> 0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
>> 0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch
>> 0002-Fix-after-trigger-execution-in-logical-replication.patch
>> 0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch
>> 0001-Logical-replication-support-for-initial-data-copy-v5.patch
>
> It works well now, or at least my particular test case seems now
> solved.

Cried victory too early, I'm afraid.

The logical replication is now certainly much more stable but there are
still errors, just less often.

The rare 'hang'-error that I mentioned a few emails back I have not yet
encountered; I am beginning to trust that that is indeed solved.

But there is still sometimes incorrect replication. The symptoms are
the ones I mentioned earlier:
- incorrect number of rows in one of the tables (mostly pgbench_accounts or
pgbench_history).
  The counts are always off by a very small number, say fewer than 20 rows,
often even only 1 row.
- incorrect content in one of pgbench_accounts or pgbench_history
(detected via md5). Also mostly the two tables named above.

I sometimes see primary key violations on the replica. That should not
be possible if I have understood the intent of logical replication
correctly.
( ERROR: duplicate key value violates unique constraint
"pgbench_tellers_pkey" )
mostly *_tellers, also seen *_branches

Understandably, the errors become more frequent with higher client
counts: a 25x repeat with 1 client yielded only 1 failed run whereas a
25x repeat with 16 clients gave 16 failures.

I attach once more the current incarnation of my test-bash pgbench
runner, pgbench_derail2.sh.
Easiest to run it yourself, I guess.

I also attach the output (of pgbench_derail2.sh) of those two 25x
repeats:
d2_scale__1_client__1_25x.txt
d2_scale__1_client_16_25x.txt

I worry a bit about the correctness of that test program
(pgbench_derail2.sh). I especially wonder if it should look around
better at startup (e.g., at stuff left over from previous iterations).
If you see any incorrect/dumb things there, or a better way to monitor
(aka pre-flight checks), please let me know.
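
(One possible pre-flight check, just as a sketch based on the
pg_subscription_rel states visible in this thread, would be to wait on the
subscriber until every subscribed table reports 'r' (ready) before starting
pgbench:

  SELECT count(*)
  FROM pg_subscription_rel
  WHERE srsubstate <> 'r';   -- loop until this returns 0

)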

But the current state is certainly a big step forward -- I guess it's
just your bad luck that I had the afternoon off ;)

thanks,

Erik Rijkers

Attachment Content-Type Size
pgbench_derail2.sh text/x-shellscript 7.1 KB
d2_scale__1_client__1_25x.txt text/plain 42.3 KB
d2_scale__1_client_16_25x.txt text/plain 82.1 KB

From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical replication existing data copy
Date: 2017-02-23 00:02:13
Message-ID: e312e4096bff9773e57c1c397558f8e4@xs4all.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2017-02-22 18:13, Erik Rijkers wrote:
> On 2017-02-22 14:48, Erik Rijkers wrote:
>> On 2017-02-22 13:03, Petr Jelinek wrote:
>>
>>> 0001-Skip-unnecessary-snapshot-builds.patch
>>> 0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
>>> 0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
>>> 0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch
>>> 0002-Fix-after-trigger-execution-in-logical-replication.patch
>>> 0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch
>>> 0001-Logical-replication-support-for-initial-data-copy-v5.patch
>>
>> It works well now, or at least my particular test case seems now
>> solved.
>
> Cried victory too early, I'm afraid.
>

I got into a 'hung' state while repeating pgbench_derail2.sh.

Below is some state. I notice that master pg_stat_replication.state is
'startup'.
Maybe I should only start the test after that state has changed. Any of the
other possible values (catchup, streaming) would be OK, I would think.
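
(A sketch of such a wait, run against the master; purely illustrative, the
script does not currently do this:

  SELECT application_name, state FROM pg_stat_replication;
  -- only proceed once no row reports 'startup' anymore

)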

$ ( dbactivity.sh ; echo "; table pg_subscription; table pg_subscription_rel;" ) | psql -qXp 6973

         now         |  pid  |                     app                      | state  |         wt_evt         | wt_evt_type |     query_start     | duration |              query
---------------------+-------+----------------------------------------------+--------+------------------------+-------------+---------------------+----------+----------------------------------
 2017-02-23 00:37:57 | 31352 | logical replication worker 47435             | active | relation               | Lock        |                     |          |
 2017-02-23 00:37:57 |   397 | psql                                         | active | BgWorkerShutdown       | IPC         | 2017-02-23 00:22:14 | 00:15:42 | drop subscription if exists sub1
 2017-02-23 00:37:57 | 31369 | logical replication worker 47435 sync 47423  |        | LogicalSyncStateChange | IPC         |                     |          |
 2017-02-23 00:37:57 |   398 | logical replication worker 47435 sync 47418  |        | transactionid          | Lock        |                     |          |
(4 rows)

 subdbid | subname | subowner | subenabled | subconninfo | subslotname | subpublications
---------+---------+----------+------------+-------------+-------------+-----------------
   16384 | sub1    |       10 | t          | port=6972   | sub1        | {pub1}
(1 row)

srsubid | srrelid | srsubstate | srsublsn
---------+---------+------------+------------
47435 | 47423 | w | 2/CB078260
47435 | 47412 | r |
47435 | 47415 | r |
47435 | 47418 | c | 2/CB06E158
(4 rows)

Replica (port 6973):

[bulldog aardvark] [local]:6973 (Thu) 00:52:47 [pid:5401] [testdb] # table pg_stat_subscription ;
 subid | subname |  pid  | relid | received_lsn |      last_msg_send_time       |     last_msg_receipt_time     | latest_end_lsn |        latest_end_time
-------+---------+-------+-------+--------------+-------------------------------+-------------------------------+----------------+-------------------------------
 47435 | sub1    | 31369 | 47423 |              | 2017-02-23 00:20:45.758072+01 | 2017-02-23 00:20:45.758072+01 |                | 2017-02-23 00:20:45.758072+01
 47435 | sub1    |   398 | 47418 |              | 2017-02-23 00:22:14.896471+01 | 2017-02-23 00:22:14.896471+01 |                | 2017-02-23 00:22:14.896471+01
 47435 | sub1    | 31352 |       | 2/CB06E158   |                               | 2017-02-23 00:20:47.034664+01 |                | 2017-02-23 00:20:45.679245+01
(3 rows)

Master (port 6972):

[bulldog aardvark] [local]:6972 (Thu) 00:48:27 [pid:5307] [testdb] # \x on \\ table pg_stat_replication ;
Expanded display is on.
-[ RECORD 1 ]----+------------------------------
pid | 399
usesysid | 10
usename | aardvark
application_name | sub1_47435_sync_47418
client_addr |
client_hostname |
client_port | -1
backend_start | 2017-02-23 00:22:14.902701+01
backend_xmin |
state | startup
sent_location |
write_location |
flush_location |
replay_location |
sync_priority | 0
sync_state | async
-[ RECORD 2 ]----+------------------------------
pid | 31371
usesysid | 10
usename | aardvark
application_name | sub1_47435_sync_47423
client_addr |
client_hostname |
client_port | -1
backend_start | 2017-02-23 00:20:45.762852+01
backend_xmin |
state | startup
sent_location |
write_location |
flush_location |
replay_location |
sync_priority | 0
sync_state | async

( above 'dbactivity.sh' is:

select
rpad(now()::text,19) as now
, pid as pid
, application_name as app
, state as state
, wait_event as wt_evt
, wait_event_type as wt_evt_type
, date_trunc('second', query_start::timestamp) as query_start
, substring((now() - query_start)::text, 1, position('.' in (now() -
query_start)::text)-1) as duration
, query
from pg_stat_activity
where query !~ 'pg_stat_activity' ;

)


From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: Erik Rijkers <er(at)xs4all(dot)nl>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical replication existing data copy
Date: 2017-02-23 00:41:06
Message-ID: 2215dc56-5bbc-a15d-8945-867952798946@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 23/02/17 01:02, Erik Rijkers wrote:
> On 2017-02-22 18:13, Erik Rijkers wrote:
>> On 2017-02-22 14:48, Erik Rijkers wrote:
>>> On 2017-02-22 13:03, Petr Jelinek wrote:
>>>
>>>> 0001-Skip-unnecessary-snapshot-builds.patch
>>>> 0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
>>>> 0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
>>>> 0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch
>>>> 0002-Fix-after-trigger-execution-in-logical-replication.patch
>>>> 0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch
>>>> 0001-Logical-replication-support-for-initial-data-copy-v5.patch
>>>
>>> It works well now, or at least my particular test case seems now solved.
>>
>> Cried victory too early, I'm afraid.
>>
>
> I got into a 'hung' state while repeating pgbench_derail2.sh.
>
> Below is some state. I notice that master pg_stat_replication.state is
> 'startup'.
> Maybe I should only start the test after that state has changed. Any of the
> other possible values (catchup, streaming) would be OK, I would think.
>

I think that's a known issue (see the comment in tablesync.c about hanging
forever). I think I may have fixed it locally.

I will submit a patch once I've fixed the other snapshot issue (I managed to
reproduce it as well, although only very rarely, so it's rather hard to test).

Thanks for all this testing btw, I really appreciate it.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: Erik Rijkers <er(at)xs4all(dot)nl>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical replication existing data copy
Date: 2017-02-24 21:58:32
Message-ID: c0f90176-efff-0770-1e79-0249fb4b9b0c@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 23/02/17 01:41, Petr Jelinek wrote:
> On 23/02/17 01:02, Erik Rijkers wrote:
>> On 2017-02-22 18:13, Erik Rijkers wrote:
>>> On 2017-02-22 14:48, Erik Rijkers wrote:
>>>> On 2017-02-22 13:03, Petr Jelinek wrote:
>>>>
>>>>> 0001-Skip-unnecessary-snapshot-builds.patch
>>>>> 0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
>>>>> 0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
>>>>> 0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch
>>>>> 0002-Fix-after-trigger-execution-in-logical-replication.patch
>>>>> 0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch
>>>>> 0001-Logical-replication-support-for-initial-data-copy-v5.patch
>>>>
>>>> It works well now, or at least my particular test case seems now solved.
>>>
>>> Cried victory too early, I'm afraid.
>>>
>>
>> I got into a 'hung' state while repeating pgbench_derail2.sh.
>>
>> Below is some state. I notice that master pg_stat_replication.state is
>> 'startup'.
>> Maybe I should only start the test after that state has changed. Any of the
>> other possible values (catchup, streaming) would be OK, I would think.
>>
>
> I think that's known issue (see comment in tablesync.c about hanging
> forever). I think I may have fixed it locally.
>
> I will submit patch once I fixed the other snapshot issue (I managed to
> reproduce it as well, although very rarely so it's rather hard to test).
>

Hi,

Here it is. But also check the snapbuild-related thread for updated
patches (the issue you had with this not copying all rows is yet another
pre-existing Postgres bug).

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
0001-Logical-replication-support-for-initial-data-copy-v6.patch text/x-patch 146.9 KB

From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical replication existing data copy
Date: 2017-02-24 23:05:24
Message-ID: e1fa399c1af0c9c9b57fee23a7a95bac@xs4all.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2017-02-24 22:58, Petr Jelinek wrote:
> On 23/02/17 01:41, Petr Jelinek wrote:
>> On 23/02/17 01:02, Erik Rijkers wrote:
>>> On 2017-02-22 18:13, Erik Rijkers wrote:
>>>> On 2017-02-22 14:48, Erik Rijkers wrote:
>>>>> On 2017-02-22 13:03, Petr Jelinek wrote:
>>>>>
>>>>>> 0001-Skip-unnecessary-snapshot-builds.patch
>>>>>> 0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
>>>>>> 0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
>>>>>> 0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch
>>>>>> 0002-Fix-after-trigger-execution-in-logical-replication.patch
>>>>>> 0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch
>>>>>> 0001-Logical-replication-support-for-initial-data-copy-v5.patch
>>>>>
>>>>> It works well now, or at least my particular test case seems now
>>>>> solved.
>>>>
>>>> Cried victory too early, I'm afraid.
>>>>
>>>
>>> I got into a 'hung' state while repeating pgbench_derail2.sh.
>>>
>>> Below is some state. I notice that master pg_stat_replication.state
>>> is 'startup'.
>>> Maybe I should only start the test after that state has changed. Any
>>> of the other possible values (catchup, streaming) would be OK, I would
>>> think.
>>>
>>
>> I think that's known issue (see comment in tablesync.c about hanging
>> forever). I think I may have fixed it locally.
>>
>> I will submit patch once I fixed the other snapshot issue (I managed
>> to
>> reproduce it as well, although very rarely so it's rather hard to
>> test).
>>
>
> Hi,
>
> Here it is. But check also the snapbuild related thread for updated
> patches related to that (the issue you had with this not copying all
> rows is yet another pre-existing Postgres bug).
>

The four earlier snapbuild patches apply cleanly, but
then I get errors while applying
0001-Logical-replication-support-for-initial-data-copy-v6.patch:

patching file src/test/regress/expected/sanity_check.out
(Stripping trailing CRs from patch.)
patching file src/test/regress/expected/subscription.out
Hunk #2 FAILED at 25.
1 out of 2 hunks FAILED -- saving rejects to file
src/test/regress/expected/subscription.out.rej
(Stripping trailing CRs from patch.)
patching file src/test/regress/sql/object_address.sql
(Stripping trailing CRs from patch.)
patching file src/test/regress/sql/subscription.sql
(Stripping trailing CRs from patch.)
patching file src/test/subscription/t/001_rep_changes.pl
Hunk #9 succeeded at 175 with fuzz 2.
Hunk #10 succeeded at 193 (offset -9 lines).
(Stripping trailing CRs from patch.)
patching file src/test/subscription/t/002_types.pl
(Stripping trailing CRs from patch.)
can't find file to patch at input line 4296
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------
|diff --git a/src/test/subscription/t/003_constraints.pl
b/src/test/subscription/t/003_constraints.pl
|index 17d4565..9543b91 100644
|--- a/src/test/subscription/t/003_constraints.pl
|+++ b/src/test/subscription/t/003_constraints.pl
--------------------------
File to patch:

Can you have a look?

thanks,

Erik Rijkers


From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: Erik Rijkers <er(at)xs4all(dot)nl>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical replication existing data copy
Date: 2017-02-24 23:08:00
Message-ID: c8caf66f-0bbd-a25b-0c93-e152bbee13a2@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 25/02/17 00:05, Erik Rijkers wrote:
> On 2017-02-24 22:58, Petr Jelinek wrote:
>> On 23/02/17 01:41, Petr Jelinek wrote:
>>> On 23/02/17 01:02, Erik Rijkers wrote:
>>>> On 2017-02-22 18:13, Erik Rijkers wrote:
>>>>> On 2017-02-22 14:48, Erik Rijkers wrote:
>>>>>> On 2017-02-22 13:03, Petr Jelinek wrote:
>>>>>>
>>>>>>> 0001-Skip-unnecessary-snapshot-builds.patch
>>>>>>> 0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
>>>>>>> 0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
>>>>>>> 0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch
>>>>>>> 0002-Fix-after-trigger-execution-in-logical-replication.patch
>>>>>>> 0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch
>>>>>>> 0001-Logical-replication-support-for-initial-data-copy-v5.patch
>>>>>>
>>>>>> It works well now, or at least my particular test case seems now
>>>>>> solved.
>>>>>
>>>>> Cried victory too early, I'm afraid.
>>>>>
>>>>
>>>> I got into a 'hung' state while repeating pgbench_derail2.sh.
>>>>
>>>> Below is some state. I notice that master pg_stat_replication.state is
>>>> 'startup'.
>>>> Maybe I should only start the test after that state has changed. Any
>>>> of the other possible values (catchup, streaming) would be OK, I would think.
>>>>
>>>
>>> I think that's known issue (see comment in tablesync.c about hanging
>>> forever). I think I may have fixed it locally.
>>>
>>> I will submit patch once I fixed the other snapshot issue (I managed to
>>> reproduce it as well, although very rarely so it's rather hard to test).
>>>
>>
>> Hi,
>>
>> Here it is. But check also the snapbuild related thread for updated
>> patches related to that (the issue you had with this not copying all
>> rows is yet another pre-existing Postgres bug).
>>
>
>
> The four earlier snapbuild patches apply cleanly, but
> then I get errors while applying
> 0001-Logical-replication-support-for-initial-data-copy-v6.patch:
>
>
> patching file src/test/regress/expected/sanity_check.out
> (Stripping trailing CRs from patch.)
> patching file src/test/regress/expected/subscription.out
> Hunk #2 FAILED at 25.
> 1 out of 2 hunks FAILED -- saving rejects to file
> src/test/regress/expected/subscription.out.rej
> (Stripping trailing CRs from patch.)
> patching file src/test/regress/sql/object_address.sql
> (Stripping trailing CRs from patch.)
> patching file src/test/regress/sql/subscription.sql
> (Stripping trailing CRs from patch.)
> patching file src/test/subscription/t/001_rep_changes.pl
> Hunk #9 succeeded at 175 with fuzz 2.
> Hunk #10 succeeded at 193 (offset -9 lines).
> (Stripping trailing CRs from patch.)
> patching file src/test/subscription/t/002_types.pl
> (Stripping trailing CRs from patch.)
> can't find file to patch at input line 4296
> Perhaps you used the wrong -p or --strip option?
> The text leading up to this was:
> --------------------------
> |diff --git a/src/test/subscription/t/003_constraints.pl
> b/src/test/subscription/t/003_constraints.pl
> |index 17d4565..9543b91 100644
> |--- a/src/test/subscription/t/003_constraints.pl
> |+++ b/src/test/subscription/t/003_constraints.pl
> --------------------------
> File to patch:
>
>
> Can you have a look?
>

Hi,

I think you are missing the other fixes (this specific error says that
you didn't apply
0002-Fix-after-trigger-execution-in-logical-replication.patch).

There are now a lot of fixes for existing code that this patch depends
on. Hopefully some of them get committed soonish.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical replication existing data copy
Date: 2017-02-24 23:39:21
Message-ID: 73c894648ec0629a8ca25d167812f05f@xs4all.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2017-02-25 00:08, Petr Jelinek wrote:
>
> There is now a lot of fixes for existing code that this patch depends
> on. Hopefully some of the fixes get committed soonish.

Indeed - could you look over the below list of 8 patches; is it correct
and in the right (apply) order?

0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch
0002-Fix-after-trigger-execution-in-logical-replication.patch
0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch
snapbuild-v3-0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch
snapbuild-v3-0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
snapbuild-v3-0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
snapbuild-v3-0004-Skip-unnecessary-snapshot-builds.patch
0001-Logical-replication-support-for-initial-data-copy-v6.patch

(they do apply & compile like this...)


From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: Erik Rijkers <er(at)xs4all(dot)nl>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical replication existing data copy
Date: 2017-02-24 23:40:55
Message-ID: a4566dcd-1b8f-32fb-005a-0c39e7c54593@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 25/02/17 00:39, Erik Rijkers wrote:
> On 2017-02-25 00:08, Petr Jelinek wrote:
>>
>> There is now a lot of fixes for existing code that this patch depends
>> on. Hopefully some of the fixes get committed soonish.
>
> Indeed - could you look over the below list of 8 patches; is it correct
> and in the right (apply) order?
>
> 0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch
> 0002-Fix-after-trigger-execution-in-logical-replication.patch
> 0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch
> snapbuild-v3-0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch
> snapbuild-v3-0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
>
> snapbuild-v3-0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
> snapbuild-v3-0004-Skip-unnecessary-snapshot-builds.patch
> 0001-Logical-replication-support-for-initial-data-copy-v6.patch
>

Yes that should be correct.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical replication existing data copy
Date: 2017-02-25 08:40:45
Message-ID: 0a4418aff31920c92c1a446ad20d89f3@xs4all.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2017-02-25 00:40, Petr Jelinek wrote:

> 0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch
> 0002-Fix-after-trigger-execution-in-logical-replication.patch
> 0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch
> snapbuild-v3-0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch
> snapbuild-v3-0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
> snapbuild-v3-0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
> snapbuild-v3-0004-Skip-unnecessary-snapshot-builds.patch
> 0001-Logical-replication-support-for-initial-data-copy-v6.patch

Here are some results. There is improvement although it's not an
unqualified success.

Several repeat runs of pgbench_derail2.sh, with different parameters for
number-of-clients, yielded an output file each.

Those show that logrep is now pretty stable when there is only 1 client
(pgbench -c 1). But it starts making mistakes with 4, 8, 16 clients.
I'll just show a grep of the output files; I think it is
self-explanatory:

Output-files (lines counted with grep | sort | uniq -c):

-- out_20170225_0129.txt
250 -- pgbench -c 1 -j 8 -T 10 -P 5 -n
250 -- All is well.

-- out_20170225_0654.txt
25 -- pgbench -c 4 -j 8 -T 10 -P 5 -n
24 -- All is well.
1 -- Not good, but breaking out of wait (waited more than 60s)

-- out_20170225_0711.txt
25 -- pgbench -c 8 -j 8 -T 10 -P 5 -n
23 -- All is well.
2 -- Not good, but breaking out of wait (waited more than 60s)

-- out_20170225_0803.txt
25 -- pgbench -c 16 -j 8 -T 10 -P 5 -n
11 -- All is well.
14 -- Not good, but breaking out of wait (waited more than 60s)

So, that says:
1 client: 250x success, zero fail (250 not a typo, ran this overnight)
4 clients: 24x success, 1 fail
8 clients: 23x success, 2 fail
16 clients: 11x success, 14 fail

I want to repeat what I said a few emails back: problems seem to
disappear when a short wait state is introduced (directly after the
'alter subscription sub1 enable' line) to give the logrep machinery time
to 'settle'. It makes one think of a timing error somewhere (now don't
ask me where..).
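
(A sketch of that wait, assuming it is done from psql right after enabling
the subscription -- in the actual script INIT_WAIT is a shell-level sleep,
so this is only an illustration:

  ALTER SUBSCRIPTION sub1 ENABLE;
  SELECT pg_sleep(10);  -- 'settle' period before starting pgbench

)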

To show that, here is pgbench_derail2.sh output from a run that waited 10
seconds (INIT_WAIT in the script); with such a 'settle' period it works
faultlessly (with 16 clients):

-- out_20170225_0852.txt
25 -- pgbench -c 16 -j 8 -T 10 -P 5 -n
25 -- All is well.

QED.

(By the way, no hanged sessions so far, so that's good)

thanks

Erik Rijkers


From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: Erik Rijkers <er(at)xs4all(dot)nl>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical replication existing data copy
Date: 2017-02-26 00:45:36
Message-ID: c735e2ac-9745-0e76-6bab-02a8cfdd064b@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 25/02/17 09:40, Erik Rijkers wrote:
> On 2017-02-25 00:40, Petr Jelinek wrote:
>
>> 0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch
>> 0002-Fix-after-trigger-execution-in-logical-replication.patch
>> 0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch
>> snapbuild-v3-0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch
>>
>> snapbuild-v3-0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
>>
>> snapbuild-v3-0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
>> snapbuild-v3-0004-Skip-unnecessary-snapshot-builds.patch
>> 0001-Logical-replication-support-for-initial-data-copy-v6.patch
>
> Here are some results. There is improvement although it's not an
> unqualified success.
>

Yep, I found yet another bug in snapbuild (see the relevant thread). I
hope with the updated patches from there (now 5 of them) it should
finally work correctly.

> I want to repeat what I said a few emails back: problems seem to
> disappear when a short wait state is introduced (directly after the
> 'alter subscription sub1 enable' line) to give the logrep machinery time
> to 'settle'. It makes one think of a timing error somewhere (now don't
> ask me where..).

I think that's because if you sleep, the COPY finishes in the meantime and
so there is no concurrency with pgbench while the initial snapshot is
being taken, so that's not a solution for production.

>
> (By the way, no hanged sessions so far, so that's good)
>

Great.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical replication existing data copy
Date: 2017-02-26 09:53:21
Message-ID: 2370b8600b167e96af7e6b9748347599@xs4all.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2017-02-26 01:45, Petr Jelinek wrote:

Again, much better... :

-- out_20170226_0724.txt
25 -- pgbench -c 1 -j 8 -T 10 -P 5 -n
25 -- All is well.
-- out_20170226_0751.txt
25 -- pgbench -c 4 -j 8 -T 10 -P 5 -n
25 -- All is well.
-- out_20170226_0819.txt
25 -- pgbench -c 8 -j 8 -T 10 -P 5 -n
25 -- All is well.
-- out_20170226_0844.txt
25 -- pgbench -c 16 -j 8 -T 10 -P 5 -n
25 -- All is well.
-- out_20170226_0912.txt
25 -- pgbench -c 32 -j 8 -T 10 -P 5 -n
25 -- All is well.
-- out_20170226_0944.txt
25 -- scale 5 clients 1 INIT_WAIT 0 CLEAN_ONLY=
25 -- pgbench -c 1 -j 8 -T 10 -P 5 -n
25 -- All is well.

.... but not perfect: with the next scale up (pgbench scale 25) I got:

-- out_20170226_1001.txt
3 -- scale 25 clients 1 INIT_WAIT 0 CLEAN_ONLY=
3 -- pgbench -c 1 -j 8 -T 10 -P 5 -n
2 -- All is well.
1 -- Not good, but breaking out of wait (waited more than 60s)

It looks like something got stuck at DROP SUBSCRIPTION again which, I
think, derives from this line:
echo "drop subscription if exists sub1" | psql -qXp $port2

I don't know exactly what is useful/useless to report; below is the
state of some tables/views (note that this is from 31 minutes after the
fact (see 'duration' in the first query)), and a backtrace :

$ ./view.sh
select current_setting('port') as port;
port
------
6973
(1 row)

select
rpad(now()::text,19) as now
, pid as pid
, application_name as app
, state as state
, wait_event as wt_evt
, wait_event_type as wt_evt_type
, date_trunc('second', query_start::timestamp) as query_start
, substring((now() - query_start)::text, 1, position('.' in (now() -
query_start)::text)-1) as duration
, query
from pg_stat_activity
where query !~ 'pg_stat_activity'
;
         now         |  pid  |                     app                      | state  |         wt_evt         | wt_evt_type |     query_start     | duration |              query
---------------------+-------+----------------------------------------------+--------+------------------------+-------------+---------------------+----------+----------------------------------
 2017-02-26 10:42:43 | 28232 | logical replication worker 31929             | active | relation               | Lock        |                     |          |
 2017-02-26 10:42:43 | 28237 | logical replication worker 31929 sync 31906  |        | LogicalSyncStateChange | IPC         |                     |          |
 2017-02-26 10:42:43 | 28242 | logical replication worker 31929 sync 31909  |        | transactionid          | Lock        |                     |          |
 2017-02-26 10:42:43 | 32023 | psql                                         | active | BgWorkerShutdown       | IPC         | 2017-02-26 10:10:52 | 00:31:51 | drop subscription if exists sub1
(4 rows)

select * from pg_stat_replication;
 pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state
-----+----------+---------+------------------+-------------+-----------------+-------------+---------------+--------------+-------+---------------+----------------+----------------+-----------------+---------------+------------
(0 rows)

select * from pg_stat_subscription;
 subid | subname |  pid  | relid | received_lsn |      last_msg_send_time       |     last_msg_receipt_time     | latest_end_lsn |        latest_end_time
-------+---------+-------+-------+--------------+-------------------------------+-------------------------------+----------------+-------------------------------
 31929 | sub1    | 28242 | 31909 |              | 2017-02-26 10:07:05.723093+01 | 2017-02-26 10:07:05.723093+01 |                | 2017-02-26 10:07:05.723093+01
 31929 | sub1    | 28237 | 31906 |              | 2017-02-26 10:07:04.721229+01 | 2017-02-26 10:07:04.721229+01 |                | 2017-02-26 10:07:04.721229+01
 31929 | sub1    | 28232 |       | 1/73497468   |                               | 2017-02-26 10:07:47.781883+01 | 1/59A73EF8     | 2017-02-26 10:07:04.720595+01
(3 rows)

select * from pg_subscription;
 subdbid | subname | subowner | subenabled | subconninfo | subslotname | subpublications
---------+---------+----------+------------+-------------+-------------+-----------------
   16384 | sub1    |       10 | t          | port=6972   | sub1        | {pub1}
(1 row)

select * from pg_subscription_rel;
srsubid | srrelid | srsubstate | srsublsn
---------+---------+------------+------------
31929 | 31912 | i |
31929 | 31917 | i |
31929 | 31909 | d |
31929 | 31906 | w | 1/73498F90
(4 rows)

Dunno if a backtrace is useful

$ gdb -pid 32023 (from the DROP SUBSCRIPTION line of the first query)
[...]
(gdb) bt
#0 0x00000030fcee8f13 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1 0x00000000006fba86 in WaitEventSetWaitBlock (nevents=1,
occurred_events=0x7fff66e74dc0, cur_timeout=1000, set=0x2d15560) at
latch.c:997
#2 WaitEventSetWait (set=set(at)entry=0x2d15560,
timeout=timeout(at)entry=1000,
occurred_events=occurred_events(at)entry=0x7fff66e74dc0,
nevents=nevents(at)entry=1,
wait_event_info=wait_event_info(at)entry=134217728)
at latch.c:949
#3 0x00000000006fbefc in WaitLatchOrSocket (latch=0x7f74ec2d5cb4,
wakeEvents=wakeEvents(at)entry=25, sock=sock(at)entry=-1,
timeout=timeout(at)entry=1000,
wait_event_info=wait_event_info(at)entry=134217728)
at latch.c:349
#4 0x00000000006fbfd0 in WaitLatch (latch=<optimized out>,
wakeEvents=wakeEvents(at)entry=25, timeout=timeout(at)entry=1000,
wait_event_info=wait_event_info(at)entry=134217728) at latch.c:303
#5 0x00000000006c764f in logicalrep_worker_stop
(subid=subid(at)entry=31929, relid=relid(at)entry=0) at launcher.c:404
#6 0x00000000005c44fa in DropSubscription (stmt=stmt(at)entry=0x2d6fec0)
at subscriptioncmds.c:757
#7 0x0000000000724ab9 in ProcessUtilitySlow
(pstate=pstate(at)entry=0x2d0d978, pstmt=pstmt(at)entry=0x2d70220,
queryString=queryString(at)entry=0x2d6f488 "drop subscription if exists
sub1",
context=context(at)entry=PROCESS_UTILITY_TOPLEVEL,
params=params(at)entry=0x0,
completionTag=completionTag(at)entry=0x7fff66e757f0 "", dest=0x2d70318) at
utility.c:1620
#8 0x0000000000723f69 in standard_ProcessUtility (pstmt=0x2d70220,
queryString=0x2d6f488 "drop subscription if exists sub1",
context=PROCESS_UTILITY_TOPLEVEL, params=0x0, dest=0x2d70318,
completionTag=0x7fff66e757f0 "") at utility.c:922
#9 0x00007f74ecd6dd92 in pgss_ProcessUtility (pstmt=0x2d70220,
queryString=0x2d6f488 "drop subscription if exists sub1",
context=PROCESS_UTILITY_TOPLEVEL, params=0x0, dest=0x2d70318,
completionTag=0x7fff66e757f0 "") at pg_stat_statements.c:994
#10 0x0000000000721867 in PortalRunUtility (portal=0x2d05538,
pstmt=0x2d70220, isTopLevel=<optimized out>, setHoldSnapshot=<optimized
out>, dest=<optimized out>, completionTag=0x7fff66e757f0 "")
at pquery.c:1165
#11 0x0000000000722301 in PortalRunMulti (portal=portal(at)entry=0x2d05538,
isTopLevel=isTopLevel(at)entry=1 '\001',
setHoldSnapshot=setHoldSnapshot(at)entry=0 '\000',
dest=dest(at)entry=0x2d70318,
altdest=altdest(at)entry=0x2d70318,
completionTag=completionTag(at)entry=0x7fff66e757f0 "") at pquery.c:1315
#12 0x0000000000722fb8 in PortalRun (portal=portal(at)entry=0x2d05538,
count=count(at)entry=9223372036854775807, isTopLevel=isTopLevel(at)entry=1
'\001', dest=dest(at)entry=0x2d70318, altdest=altdest(at)entry=0x2d70318,
completionTag=completionTag(at)entry=0x7fff66e757f0 "") at pquery.c:788
#13 0x000000000071fefa in exec_simple_query (query_string=0x2d6f488
"drop subscription if exists sub1") at postgres.c:1101
#14 PostgresMain (argc=<optimized out>, argv=argv(at)entry=0x2d18ab8,
dbname=<optimized out>, username=<optimized out>) at postgres.c:4067
#15 0x000000000047571f in BackendRun (port=0x2d0cb40) at
postmaster.c:4310
#16 BackendStartup (port=0x2d0cb40) at postmaster.c:3982
#17 ServerLoop () at postmaster.c:1722
#18 0x00000000006b42a9 in PostmasterMain (argc=argc(at)entry=21,
argv=argv(at)entry=0x2ce5a50) at postmaster.c:1330
#19 0x0000000000636dbc in main (argc=21, argv=0x2ce5a50) at main.c:228

Not yet perfect, but we're getting there...


From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-02-26 15:05:28
Message-ID: 23a51be04c04ad3afecb45ec9a856fdb@xs4all.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2017-02-26 10:53, Erik Rijkers wrote:

> Not yet perfect, but we're getting there...

Sorry, I made a mistake: I was running the newest patches on master but
the older versions on the replica (or, more precisely: I didn't properly
shut down the replica, so the older version remained up and running during
subsequent testing).

So my last email mentioning the 'DROP SUBSCRIPTION' hang error is
hopefully wrong.

I'll get back when I've repeated these tests. This will take some hours
(at least).

Sorry to cause you these palpitations, perhaps unnecessarily...

Erik Rijkers


From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-02-27 10:07:43
Message-ID: e50d715f085e33195689a72c11a401e3@xs4all.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

With these patches:

-- 0416d87c-09a5-182e-4901-236aec103e9f(at)2ndquadrant(dot)com
Subject: Re: Logical Replication WIP
48.
https://www.postgresql.org/message-id/attachment/49886/0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch
49.
https://www.postgresql.org/message-id/attachment/49887/0002-Fix-after-trigger-execution-in-logical-replication.patch
50.
https://www.postgresql.org/message-id/attachment/49888/0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch

-- 51f65289-54f8-2256-d107-937d662d69f1(at)2ndquadrant(dot)com
Subject: Re: snapbuild woes
48.
https://www.postgresql.org/message-id/attachment/49995/snapbuild-v4-0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch
49.
https://www.postgresql.org/message-id/attachment/49996/snapbuild-v4-0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
50.
https://www.postgresql.org/message-id/attachment/49997/snapbuild-v4-0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch
51.
https://www.postgresql.org/message-id/attachment/49998/snapbuild-v4-0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
52.
https://www.postgresql.org/message-id/attachment/49999/snapbuild-v4-0005-Skip-unnecessary-snapshot-builds.patch

-- c0f90176-efff-0770-1e79-0249fb4b9b0c(at)2ndquadrant(dot)com
Subject: Re: Logical replication existing data copy
48.
https://www.postgresql.org/message-id/attachment/49977/0001-Logical-replication-support-for-initial-data-copy-v6.patch

logical replication now seems pretty stable, at least for the limited
testcase that I am using. I've done dozens of pgbench_derail2.sh runs
without failure. I am now changing the pgbench test to a larger scale
(pgbench -is) and longer periods (-T), which makes running the test slow
(both instances are running on a modest desktop with a single 7200 rpm
disk). It is quite a bit slower than I expected (a 5-minute pgbench at
scale 5, with 8 clients, takes, after it has finished on master, another
2-3 minutes to get synced on the replica). I suppose it's just a
hardware limitation. I set max_sync_workers_per_subscription to 6 (from
the default 2) but it doesn't help much (at all).
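
(A sketch of one way to set it, shown here with ALTER SYSTEM; putting it in
postgresql.conf works as well, and a reload or restart may be needed before
it takes effect:

  ALTER SYSTEM SET max_sync_workers_per_subscription = 6;
  SELECT pg_reload_conf();

)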

To be continued...

Thanks,

Erik Rijkers


From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: Erik Rijkers <er(at)xs4all(dot)nl>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-02-27 14:08:10
Message-ID: dea2f43b-d816-f7e2-3042-47fc0bc24013@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 27/02/17 11:07, Erik Rijkers wrote:
> With these patches:
>
> -- 0416d87c-09a5-182e-4901-236aec103e9f(at)2ndquadrant(dot)com
> Subject: Re: Logical Replication WIP
> 48.
> https://www.postgresql.org/message-id/attachment/49886/0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch
>
> 49.
> https://www.postgresql.org/message-id/attachment/49887/0002-Fix-after-trigger-execution-in-logical-replication.patch
>
> 50.
> https://www.postgresql.org/message-id/attachment/49888/0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch
>
>
> -- 51f65289-54f8-2256-d107-937d662d69f1(at)2ndquadrant(dot)com
> Subject: Re: snapbuild woes
> 48.
> https://www.postgresql.org/message-id/attachment/49995/snapbuild-v4-0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch
>
> 49.
> https://www.postgresql.org/message-id/attachment/49996/snapbuild-v4-0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
>
> 50.
> https://www.postgresql.org/message-id/attachment/49997/snapbuild-v4-0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch
>
> 51.
> https://www.postgresql.org/message-id/attachment/49998/snapbuild-v4-0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
>
> 52.
> https://www.postgresql.org/message-id/attachment/49999/snapbuild-v4-0005-Skip-unnecessary-snapshot-builds.patch
>
>
> -- c0f90176-efff-0770-1e79-0249fb4b9b0c(at)2ndquadrant(dot)com
> Subject: Re: Logical replication existing data copy
> 48.
> https://www.postgresql.org/message-id/attachment/49977/0001-Logical-replication-support-for-initial-data-copy-v6.patch
>
>
>
> logical replication now seems pretty stable, at least for the limited
> testcase that I am using. I've done dozens of pgbench_derail2.sh runs
> without failure. I am now changing the pgbench-test to larger scale
> (pgbench -is) and longer periods (-T) which makes running the test slow
> (both instances are running on a modest desktop with a single 7200
> disk). It is quite a bit slower than I expected (a 5-minute pgbench
> scale 5, with 8 clients, takes, after it has finished on master, another
> 2-3 minutes to get synced on the replica). I suppose it's just a
> hardware limitation. I set max_sync_workers_per_subscription to 6 (from
> default 2) but it doesn't help much (at all).
>
> To be continued...
>

Thanks,

The performance was why in the original patch I wanted the apply process to
default to synchronous_commit = off, as without it the apply performance
(due to applying transactions individually and in sequence) is quite
lackluster.

It can be worked around by using a user that has synchronous_commit = off set
via ALTER ROLE as the owner of the subscription.
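
For example (a minimal sketch; the role name is just a placeholder):

  ALTER ROLE subscription_owner SET synchronous_commit = off;

The idea being that the apply worker runs under the subscription owner's
role and so picks up the per-role setting.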

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-02-28 06:38:46
Message-ID: fd04d14826ca856494a9e9717b1f26e6@xs4all.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2017-02-27 15:08, Petr Jelinek wrote:
>
> The performance was why in original patch I wanted the apply process to
> default to synchronous_commit = off as without it the apply performance
> (due to applying transactions individually and in sequences) is quite
> lackluster.
>
> It can be worked around using user that has synchronous_commit = off
> set
> via ALTER ROLE as owner of the subscription.
>

Wow, that's a huge difference in speed.

I set
  ALTER ROLE aardvark SET synchronous_commit = off;

during the first iteration of a 10x pgbench test (so the first run was still
done with it 'on');
here are the pertinent grep | uniq -c lines:

-- out_20170228_0004.txt
10 -- pgbench -c 16 -j 8 -T 900 -P 180 -n -- scale 25
10 -- All is well.
1 -- 1325 seconds total.
9 -- 5 seconds total.

And that 5 seconds is a hardcoded wait; so it's probably even quicker.

This is a slowish machine, but that's a really spectacular difference.
It's the difference between keeping up and getting lost.

Would you remind me why synchronous_commit = on was deemed a better
default? Neither this thread nor the 'logical replication WIP' thread is
very clear about it.

thanks,

Erik Rijkers


From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-02-28 19:36:44
Message-ID: 9f9b45f3de725a1a2f3e19090f1c3e1a@xs4all.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2017-02-28 07:38, Erik Rijkers wrote:
> On 2017-02-27 15:08, Petr Jelinek wrote:

> 0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch
> +
> 0002-Fix-after-trigger-execution-in-logical-replication.patch
> +
> 0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch
> +
> snapbuild-v4-0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch
> +
> snapbuild-v4-0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch+
> snapbuild-v4-0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch
> +
> snapbuild-v4-0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
> +
> snapbuild-v4-0005-Skip-unnecessary-snapshot-builds.patch
> +
> 0001-Logical-replication-support-for-initial-data-copy-v6.patch

This is the most frequent error that happens while doing pgbench runs
over logical replication: I run it continuously all day, and every few
hours an error of the kind seen below occurs: a table (pgbench_history,
mostly) ends up 1 row short (673466 instead of 673467). I have the
script wait a long time before calling it an error, because in theory it
could still 'finish' and end successfully (although that has not
happened yet once the system got into this state).
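
(The a,b,t,h numbers below are the row counts of the four pgbench tables,
taken on both master and replica; roughly this kind of query, as a sketch --
the real check lives in pgbench_derail2.sh and also computes the md5 values
shown:

  SELECT (SELECT count(*) FROM pgbench_accounts) AS a
       , (SELECT count(*) FROM pgbench_branches) AS b
       , (SELECT count(*) FROM pgbench_tellers)  AS t
       , (SELECT count(*) FROM pgbench_history)  AS h;

)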

-- pgbench -c 16 -j 8 -T 120 -P 24 -n -M simple -- scale 25

[...]
6972 a,b,t,h: 2500000 25 250 673467 e53236c09 643235708 f952814c3 559d618cd master
6973 a,b,t,h: 2500000 25 250 673466 e53236c09 643235708 f952814c3 4b09337e3 replica NOK a22fb00a6
-- wait another 5 s (total 20 s) (unchanged 1)
-- getting md5 (cb)
6972 a,b,t,h: 2500000 25 250 673467 e53236c09 643235708 f952814c3 559d618cd master
6973 a,b,t,h: 2500000 25 250 673466 e53236c09 643235708 f952814c3 4b09337e3 replica NOK a22fb00a6
-- wait another 5 s (total 25 s) (unchanged 2)
-- getting md5 (cb)
6972 a,b,t,h: 2500000 25 250 673467 e53236c09 643235708 f952814c3 559d618cd master
6973 a,b,t,h: 2500000 25 250 673466 e53236c09 643235708 f952814c3 4b09337e3 replica NOK a22fb00a6
-- wait another 5 s (total 30 s) (unchanged 3)
-- getting md5 (cb)
6972 a,b,t,h: 2500000 25 250 673467 e53236c09 643235708 f952814c3 559d618cd master
6973 a,b,t,h: 2500000 25 250 673466 e53236c09 643235708 f952814c3 4b09337e3 replica NOK a22fb00a6
-- wait another 5 s (total 35 s) (unchanged 4)

I gathered some info in this (probably deadlocked) state in the hope
there is something suspicious in there:

UID       PID   PPID  C STIME TTY    STAT TIME CMD
rijkers  71203      1 0 20:06 pts/57 S    0:00 postgres -D /var/data1/pg_stuff/pg_installations/pgsql.logical_replication2/data -p 6973 -c wal_level=replica [...]
rijkers  71214  71203 0 20:06 ?      Ss   0:00  \_ postgres: logger process
rijkers  71216  71203 0 20:06 ?      Ss   0:00  \_ postgres: checkpointer process
rijkers  71217  71203 0 20:06 ?      Ss   0:00  \_ postgres: writer process
rijkers  71218  71203 0 20:06 ?      Ss   0:00  \_ postgres: wal writer process
rijkers  71219  71203 0 20:06 ?      Ss   0:00  \_ postgres: autovacuum launcher process
rijkers  71220  71203 0 20:06 ?      Ss   0:00  \_ postgres: stats collector process
rijkers  71221  71203 0 20:06 ?      Ss   0:00  \_ postgres: bgworker: logical replication launcher
rijkers  71222  71203 0 20:06 ?      Ss   0:00  \_ postgres: bgworker: logical replication worker 30042

rijkers  71201      1 0 20:06 pts/57 S    0:00 postgres -D /var/data1/pg_stuff/pg_installations/pgsql.logical_replication/data -p 6972 -c wal_level=logical [...]
rijkers  71206  71201 0 20:06 ?      Ss   0:00  \_ postgres: logger process
rijkers  71208  71201 0 20:06 ?      Ss   0:00  \_ postgres: checkpointer process
rijkers  71209  71201 0 20:06 ?      Ss   0:00  \_ postgres: writer process
rijkers  71210  71201 0 20:06 ?      Ss   0:00  \_ postgres: wal writer process
rijkers  71211  71201 0 20:06 ?      Ss   0:00  \_ postgres: autovacuum launcher process
rijkers  71212  71201 0 20:06 ?      Ss   0:00  \_ postgres: stats collector process
rijkers  71213  71201 0 20:06 ?      Ss   0:00  \_ postgres: bgworker: logical replication launcher
rijkers  71223  71201 0 20:06 ?      Ss   0:00  \_ postgres: wal sender process rijkers [local] idle

-- replica:
port | shared_buffers | work_mem | m_w_m | e_c_s
------+----------------+----------+-------+-------
6973 | 100MB | 50MB | 2GB | 64GB
(1 row)

select current_setting('port') as port
, datname as db
, to_char(pg_database_size(datname), '9G999G999G999G999')
|| ' (' || pg_size_pretty(pg_database_size(datname)) || ')' as
dbsize
, pid
, application_name as app
, xact_start
, query_start
, regexp_replace( cast(now() - query_start as text),
E'\.[[:digit:]]+\$', '') as duration
, state
, wait_event_type as wt_evt_type
, wait_event as wt_evt
from pg_stat_activity
where query !~ 'pg_stat_activity'
-- and query != '<IDLE>'
 port |   db   |        dbsize        |  pid  |               app                | xact_start | query_start | duration | state | wt_evt_type |      wt_evt
------+--------+----------------------+-------+----------------------------------+------------+-------------+----------+-------+-------------+------------------
 6973 | testdb | 817,243,955 (779 MB) | 71222 | logical replication worker 30042 |            |             |          | idle  | Activity    | LogicalApplyMain
(1 row)

select * from pg_locks;
  locktype  | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |  pid  |      mode       | granted | fastpath
------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-------+-----------------+---------+----------
 relation   |    16384 |    11547 |      |       |            |               |         |       |          | 4/568              | 77713 | AccessShareLock | t       | t
 virtualxid |          |          |      |       | 4/568      |               |         |       |          | 4/568              | 77713 | ExclusiveLock   | t       | t
(2 rows)

select * from pg_stat_subscription;
 subid | subname |  pid  | relid | received_lsn |      last_msg_send_time       |     last_msg_receipt_time     | latest_end_lsn |        latest_end_time
-------+---------+-------+-------+--------------+-------------------------------+-------------------------------+----------------+-------------------------------
 30042 | sub1    | 71222 |       | 29/6C91C980  | 2017-02-28 20:11:50.818774+01 | 2017-02-28 20:11:50.818845+01 | 29/6C91C980    | 2017-02-28 20:11:50.818774+01
(1 row)

select * from pg_subscription;
 subdbid | subname | subowner | subenabled | subconninfo | subslotname | subpublications
---------+---------+----------+------------+-------------+-------------+-----------------
   16384 | sub1    |       10 | t          | port=6972   | sub1        | {pub1}
(1 row)

select * from pg_subscription_rel;
srsubid | srrelid | srsubstate | srsublsn
---------+---------+------------+----------
30042 | 30022 | r |
30042 | 30019 | r |
30042 | 30030 | r |
30042 | 30025 | r |
(4 rows)

-- master
\c - - - 6972
select pid , usesysid /* , usename */ , application_name /* ,
client_addr */ /* , client_hostname */ /* , client_port */
, backend_start , backend_xmin , state , sent_location ,
write_location , flush_location
, replay_location , sync_priority , sync_state from
pg_stat_replication;
pid | usesysid | application_name | backend_start | backend_xmin | state | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state
-------+----------+------------------+-----------------------------+--------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------
71223 | 10 | sub1 | 2017-02-28 20:06:50.3357+01 | | streaming | 29/88BC9BB0 | 29/88BC9BB0 | 29/88BC9BB0 | 29/88BC9BB0 | 0 | async
(1 row)

Erik Rijkers


From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: Erik Rijkers <er(at)xs4all(dot)nl>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-01 19:16:28
Message-ID: 87b4b93e-7632-2ed5-9fa9-9acfebd68013@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 28/02/17 20:36, Erik Rijkers wrote:
> This is the most frequent error that happens while doing pgbench-runs
> over logical replication: I run it continuously all day, and every few
> hours an error occurs of the kind seen below: a table (pgbench_history,
> mostly) ends up 1 row short (673466 instead of 673467). I have the
> script wait a long time before calling it an error (because in theory it
> could still 'finish', and end successfully (although that has not
> happened yet, once the system got into this state).
>

Yeah, it's unlikely if it's just one row. It suggests there might still
be some snapbuild issue, but I don't immediately see one, and I've run a
couple thousand repeats of the test without getting any inconsistency.
Will continue digging.

>
> I gathered some info in this (proabably deadlocked) state in the hope
> there is something suspicious in there:
>

Hmm that didn't really reveal much :(

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Erik Rijkers <er(at)xs4all(dot)nl>
Cc: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-04 05:46:50
Message-ID: CA+TgmoaN2d5mJnd23Goo_ot-p_=+9zcXXu2z=4CqM0BGOnsFEQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Feb 28, 2017 at 12:08 PM, Erik Rijkers <er(at)xs4all(dot)nl> wrote:
> Would you remind me why synchronous_commit = on was deemed a better default?

I'm wondering about that, too. If you're trying to do logical
synchronous replication, then maybe there's some argument there,
although even in that case I am not sure it's actually necessary. But
if you're doing asynchronous logical replication, it seems not to make
much sense. I mean, walwriter is going to flush the WAL to disk
within a fraction of a second; why would we wait for that to happen
instead of getting on with replicating the next transaction meanwhile?

(There may well be an aspect to this I'm missing, so please forgive me
if the above is off-base.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-04 06:11:21
Message-ID: 3d55d096-d457-b7ce-071a-e0bde2be0098@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 04/03/17 06:46, Robert Haas wrote:
> On Tue, Feb 28, 2017 at 12:08 PM, Erik Rijkers <er(at)xs4all(dot)nl> wrote:
>> Would you remind me why synchronous_commit = on was deemed a better default?
>
> I'm wondering about that, too. If you're trying to do logical
> synchronous replication, then maybe there's some argument there,
> although even in that case I am not sure it's actually necessary. But
> if you're doing asynchronous logical replication, it seems not to make
> much sense. I mean, walwriter is going to flush the WAL to disk
> within a fraction of a second; why would we wait for that to happen
> instead of getting on with replicating the next transaction meanwhile?
>

I do agree. And even when doing synchronous replication it should not be a
problem. We already have code in place which takes care of reporting the
correct write/flush position in the situation where the upstream does
syncrep but the downstream has synchronous commit off.

The only anomaly I could think of with synchronous commit off on the
downstream is that, when doing asynchronous replication, some replicated
data that were visible will disappear after a crash, only to reappear
again once replication catches up to where it had been before. That does
not strike me as a good enough reason to have, by default, something like
an order of magnitude worse performance.

So I think we should do it, but it needs to be configurable. My original
patch added a GUC for it; Peter wanted it to be configurable per
subscription. I guess we could add it as another option to the list of
WITH (...) options for CREATE and ALTER SUBSCRIPTION.
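
Something along these lines, just as a sketch (the option name and exact
spelling below are placeholders, not a settled design):

    -- hypothetical spelling of a per-subscription synchronous commit option
    CREATE SUBSCRIPTION sub1
        CONNECTION 'port=6972 dbname=testdb'
        PUBLICATION pub1
        WITH (SYNCHRONOUS_COMMIT = OFF);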

I am not going to write the patch immediately, as I'd like to get the
initial data copy working correctly first (and I've sunk a lot of time
over the last couple of weeks into investigating and fixing various
existing bugs in the snapshot builder, so it's taking longer than I
expected). But the patch for this is trivial anyway. Agreement is what
we need first.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-04 06:46:55
Message-ID: CA+TgmoY7Lk2YKArcp4O=Qu=xOoR8j71MAd1OTEojAWmujE3k0w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Mar 4, 2017 at 11:41 AM, Petr Jelinek
<petr(dot)jelinek(at)2ndquadrant(dot)com> wrote:
> On 04/03/17 06:46, Robert Haas wrote:
>> On Tue, Feb 28, 2017 at 12:08 PM, Erik Rijkers <er(at)xs4all(dot)nl> wrote:
>>> Would you remind me why synchronous_commit = on was deemed a better default?
>> I'm wondering about that, too. If you're trying to do logical
>> synchronous replication, then maybe there's some argument there,
>> although even in that case I am not sure it's actually necessary. But
>> if you're doing asynchronous logical replication, it seems not to make
>> much sense. I mean, walwriter is going to flush the WAL to disk
>> within a fraction of a second; why would we wait for that to happen
>> instead of getting on with replicating the next transaction meanwhile?
>>
>
> I do agree. And even when doing synchronous replication it should not be
> problem. We do have code already in place which takes care of reporting
> correct write/flush position in the situation where upstream does
> syncrep but downstream has synccommit off.
>
> The only anomaly I could think of in terms of having sync commit off on
> downstream is that some replicated data that were visible will disappear
> after a crash only to reappear again once replication catches up to
> where it has been before when doing asynchronous replication. That does
> not strike me like good enough reason to have by default something like
> order of magnitude worse performance.

I see. That's a good reason to have an option, but I agree with your
conclusion.

> So I think we should do it, but it needs to be configurable, my original
> patch added GUC for it, Peter wanted it to be configurable per
> subscription. I guess we could add it as another option to the list of
> WITH (...) options for CREATE and ALTER SUBSCRIPTION.

I don't have a terribly well-considered opinion on this point just
yet, but my initial hunch is that Peter has the right idea.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: Erik Rijkers <er(at)xs4all(dot)nl>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-06 10:27:43
Message-ID: 837653af-ec47-dacb-ae33-4a52441fd9ab@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

updated and rebased version of the patch attached.

I changed a bit how the table handover is done between the tablesync
worker and the apply worker.

I also changed the tablesync worker to do the whole thing, including
catchup, in a single transaction. That might mean more bloat on very
active servers, but it's currently the only way to make sure that if it
fails it can restart; otherwise there would be leftover data from the
previous attempt with no way of continuing from the point of failure.
This means that some of the synchronization states now exist only in
memory and are never saved in the catalog; I separated those out to make
that clear. Hopefully this didn't break anything.

I also fixed a thinko in how we handle the LSN returned by
CREATE_REPLICATION_SLOT, which should fix the missing-row issue Erik
reported earlier (this time it wasn't a snapbuild.c issue :) ).

And lastly, I changed the automagic around exporting, not exporting, and
using the snapshot produced by CREATE_REPLICATION_SLOT into an explicit
parameter for the CREATE_REPLICATION_SLOT command (and added a simple
framework for adding more of those if needed in the future).
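
Concretely, on a replication connection the snapshot behaviour then becomes
an explicit option of the command itself, roughly like this (slot and plugin
names are placeholders, and the option keywords are shown only as a sketch):

    CREATE_REPLICATION_SLOT sub1_sync LOGICAL pgoutput USE_SNAPSHOT

with EXPORT_SNAPSHOT and NOEXPORT_SNAPSHOT as the other variants, depending
on whether the snapshot should be exported for other sessions or not used
at all.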

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
0001-Logical-replication-support-for-initial-data-copy-v6.patch text/plain 161.7 KB

From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-06 15:10:19
Message-ID: 50ef7b5a452c6f56328f8d1adba58c04@xs4all.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2017-03-06 11:27, Petr Jelinek wrote:
> Hi,
>
> updated and rebased version of the patch attached.
>

I compiled with /only/ this one latest patch:
0001-Logical-replication-support-for-initial-data-copy-v6.patch

Is that correct, or are other patches still needed on top, or
underneath?

Anyway, with that one patch, and even after
alter role ... set synchronous_commit = off;
the process is very slow (sufficiently slow that I haven't
had the patience to see it through to completion yet).

What am I doing wrong?

thanks,

Erik Rijkers


From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-06 22:40:04
Message-ID: 30adb64f40762a867a4935e57f1c2bb6@xs4all.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2017-03-06 16:10, Erik Rijkers wrote:
> On 2017-03-06 11:27, Petr Jelinek wrote:
>> Hi,
>>
>> updated and rebased version of the patch attached.
>>
>
> I compiled with /only/ this one latest patch:
> 0001-Logical-replication-support-for-initial-data-copy-v6.patch
>
> Is that correct, or are other patches still needed on top, or
> underneath?
>

TWIMC, I'll answer my own question: the correct patchset seems to be
these six:

0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch
0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch
0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
0005-Skip-unnecessary-snapshot-builds.patch
0001-Logical-replication-support-for-initial-data-copy-v6.patch

These compile, make check, and install fine. make check-world is also
without errors.

Logical replication tests are now running again (no errors yet); they'll
have to run for a few hours with varying parameters to gain some
confidence but it's looking good for the moment.

Erik Rijkers


From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: Erik Rijkers <er(at)xs4all(dot)nl>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-07 03:01:06
Message-ID: fe6a3311-c4fd-3195-6796-baf6c04891cd@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 06/03/17 23:40, Erik Rijkers wrote:
> On 2017-03-06 16:10, Erik Rijkers wrote:
>> On 2017-03-06 11:27, Petr Jelinek wrote:
>>> Hi,
>>>
>>> updated and rebased version of the patch attached.
>>>
>>
>> I compiled with /only/ this one latest patch:
>> 0001-Logical-replication-support-for-initial-data-copy-v6.patch
>>
>> Is that correct, or are other patches still needed on top, or underneath?
>>
>
> TWIMC, I'll answer my own question: the correct patchset seems to be
> these six:
>
> 0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch
> 0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
> 0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch
> 0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
> 0005-Skip-unnecessary-snapshot-builds.patch
> 0001-Logical-replication-support-for-initial-data-copy-v6.patch
>

Yes, the 0004 and 0005 patches improve the performance of snapshot creation
considerably. They're also the reason we uncovered all the snapbuilder
bugs now. Before the performance optimization it was very unlikely for
some of the race conditions to happen, as the snapshot would just not be
built until there was less concurrent traffic on the server.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-07 16:39:22
Message-ID: 69053629-a6a9-f00c-d502-2cdb25025414@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 3/4/17 01:46, Robert Haas wrote:
>> So I think we should do it, but it needs to be configurable, my original
>> patch added GUC for it, Peter wanted it to be configurable per
>> subscription. I guess we could add it as another option to the list of
>> WITH (...) options for CREATE and ALTER SUBSCRIPTION.
>
> I don't have a terribly well-considered opinion on this point just
> yet, but my initial hunch is that Peter has the right idea.

Basically, we need to have some way of setting this that makes sense in
the global scheme of things. We don't want a situation where "sometimes
it does this, sometimes it does that". I'm not set on any specific
mechanism.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Cc: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-07 16:44:02
Message-ID: CA+TgmoZ9_OYZE4BGhFy1B=ZgzqY-Oq6MSnZwQUM5MUq46yQa3g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Mar 7, 2017 at 11:39 AM, Peter Eisentraut
<peter(dot)eisentraut(at)2ndquadrant(dot)com> wrote:
> On 3/4/17 01:46, Robert Haas wrote:
>>> So I think we should do it, but it needs to be configurable, my original
>>> patch added GUC for it, Peter wanted it to be configurable per
>>> subscription. I guess we could add it as another option to the list of
>>> WITH (...) options for CREATE and ALTER SUBSCRIPTION.
>>
>> I don't have a terribly well-considered opinion on this point just
>> yet, but my initial hunch is that Peter has the right idea.
>
> Basically, we need to have some way of setting this that makes sense in
> the global scheme of things. We don't want a situation where "sometimes
> it does this, sometimes it does that". I'm not set on any specific
> mechanism.

I think we are on the same page.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-07 22:30:50
Message-ID: c56f1385eacf2a55a6e1aa1d93fe7d6e@xs4all.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2017-03-06 11:27, Petr Jelinek wrote:

> 0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch +
> 0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch+
> 0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch +
> 0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch +
> 0005-Skip-unnecessary-snapshot-builds.patch +
> 0001-Logical-replication-support-for-initial-data-copy-v6.patch

I use three different machines (2 desktops, 1 server) to test logical
replication, and all three have now at least once failed to correctly
synchronise a pgbench session (amidst many successful runs, of course).

I attach an output file from the test program, with the 2 logfiles
(master+replica) of the failed run. The output file
(out_20170307_1613.txt) contains the output of 5 runs of
pgbench_derail2.sh. The first run failed, the next 4 were ok.

But that's probably not very useful; perhaps pg_waldump output is more
useful? From what moment, or leading up to what moment, or over what
period, would a pg_waldump be useful? I can run it from the script,
repeatedly, and only keep the dumped files when things go awry. Would
that make sense?

Any other ideas welcome.

thanks,

Erik Rijkers

Attachment Content-Type Size
20170307_1613.tar.bz2 application/x-bzip2 4.7 KB

From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: Erik Rijkers <er(at)xs4all(dot)nl>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-08 09:36:21
Message-ID: 5604d895-8d91-af2d-3c41-5f82d27dadc0@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 07/03/17 23:30, Erik Rijkers wrote:
> On 2017-03-06 11:27, Petr Jelinek wrote:
>
>> 0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch +
>> 0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch+
>> 0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch +
>> 0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch +
>> 0005-Skip-unnecessary-snapshot-builds.patch +
>> 0001-Logical-replication-support-for-initial-data-copy-v6.patch
>
> I use three different machines (2 desktop, 1 server) to test logical
> replication, and all three have now at least once failed to correctly
> synchronise a pgbench session (amidst many succesful runs, of course)
>
> I attach an output-file from the test-program, with the 2 logfiles
> (master+replica) of the failed run. The outputfile
> (out_20170307_1613.txt) contains the output of 5 runs of
> pgbench_derail2.sh. The first run failed, the next 4 were ok.
>
> But that's probably not very useful; perhaps is pg_waldump more useful?
> From what moment, or leading up to what moment, or period, is a
> pg_waldump(s) useful? I can run it from the script, repeatedly, and
> only keep the dumped files when things go awry. Would that make sense?

Hi,

Yes, the pg_waldump output would be useful; the last segment should be
enough, but possibly all segments mentioned in the log.

The other useful thing would be to turn on log_connections and
log_replication_commands.

And finally, if you could dump the contents of pg_subscription_rel and
pg_replication_origin_status on the subscriber, and pg_replication_slots
on the publisher, at the end of the failed run, that would also help.
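
For example (just the straightforward queries over the views named above):

    -- on the subscriber
    SELECT * FROM pg_subscription_rel;
    SELECT * FROM pg_replication_origin_status;

    -- on the publisher
    SELECT * FROM pg_replication_slots;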

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-09 10:06:57
Message-ID: 93d02794068482f96d31b002e0eb248d@xs4all.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2017-03-08 10:36, Petr Jelinek wrote:
> On 07/03/17 23:30, Erik Rijkers wrote:
>> On 2017-03-06 11:27, Petr Jelinek wrote:
>>
>>> 0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch +
>>> 0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch+
>>> 0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch +
>>> 0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch +
>>> 0005-Skip-unnecessary-snapshot-builds.patch +
>>> 0001-Logical-replication-support-for-initial-data-copy-v6.patch
>>
>> I use three different machines (2 desktop, 1 server) to test logical
>> replication, and all three have now at least once failed to correctly
>> synchronise a pgbench session (amidst many succesful runs, of course)
>
> yes waldump would be useful, the last segment should be enough, but
> possibly all segments mentioned in the log.

I've inserted a pg_waldump call into the program for when the md5s remain
the same (on both master and replica) for 5 wait-cycles (= 5 * 5 seconds,
plus the time it takes to run the 8 md5s and the select count(*), which is
more than a minute on this slow disk, at least the first time).

>
> The other useful thing would be to turn on log_connections and
> log_replication_commands.

done. (so output will be in the log files.)

> pg_subscription_rel, pg_replication_origin_status on subscriber and
> pg_replication_slots on publisher

done. (added to logs)

The attached bz2 contains:
- an output file from pgbench_derail2.sh (the script is also attached, as
it changes somewhat all the time);

- the pg_waldump output from both master (the file with .1. in it) and
replica (.2.);

- the 2 logfiles.

File name:
logrep.20170309_1021.1.1043.scale_25.clients_64.NOK.log

20170309_1021 is the start-time of the script
1 is master (2 is replica)
1043 is the time, 10:43, just before the pg_waldump call

The tail of the logfiles and of the output file will be a little ahead
of the pg_waldump (not more than a few minutes) because I didn't stop
the script while gathering the files manually.

HTH! Let me know if more is needed (more wal, for instance)

thanks,

Erik Rijkers

PS
the attached script now contains some idiosyncrasies of my setup, so it
will probably complain here and there when run unaltered elsewhere.

Attachment Content-Type Size
sent_20170309_1044.tar.bz2 application/x-bzip2 115.6 KB
pgbench_derail2.sh text/x-shellscript 19.8 KB

From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-09 10:13:11
Message-ID: 787bfbda11438f1b40a118fabdf74d54@xs4all.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2017-03-09 11:06, Erik Rijkers wrote:

>
> file Name:
> logrep.20170309_1021.1.1043.scale_25.clients_64.NOK.log
>
> 20170309_1021 is the start-time of the script
> 1 is master (2 is replica)
> 1043 is the time, 10:43, just before the pg_waldump call

Sorry, that might be confusing. That 10:43 is the time when the script
renames and copies the logfiles (not the waldump).

I meant to show the name of the waldump file:

waldump.20170309_1021_1039.1.5.000000010000002700000069.txt.bz2

where:

20170309_1021 is the start-time of the script
1 is master (2 is replica)
5 is the number of wait-state cycles during which all 8 md5s remained the same
1039 is the time, 10:39, just before the pg_waldump call


From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-09 10:28:34
Message-ID: 7076dbd62bb07641cacd8f98cd06bfb6@xs4all.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2017-03-09 11:06, Erik Rijkers wrote:
> On 2017-03-08 10:36, Petr Jelinek wrote:
>> On 07/03/17 23:30, Erik Rijkers wrote:
>>> On 2017-03-06 11:27, Petr Jelinek wrote:
>>>
>>>> 0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch +
>>>> 0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch+
>>>> 0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch +
>>>> 0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch +
>>>> 0005-Skip-unnecessary-snapshot-builds.patch +
>>>> 0001-Logical-replication-support-for-initial-data-copy-v6.patch
>>>
>
> The attached bz2 contains
> - an output file from pgbench_derail2.sh (also attached, as it changes
> somewhat all the time); the
>
> - the pg_waldump output from both master (file with .1. in it) and
> replica (.2.).
>
> - the 2 logfiles.
>

I forgot to include the bash-output file. Now attached. This file
should have been in the bz2 I sent a few minutes ago.

Attachment Content-Type Size
out_20170309_1021.txt text/plain 45.2 KB

From: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-09 17:46:45
Message-ID: f1724d0e-5f00-2f6d-5535-69512b79e133@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 3/6/17 05:27, Petr Jelinek wrote:
> updated and rebased version of the patch attached.

Some comments on this patch:

Can we have a better explanation of the different snapshot options in
CREATE_REPLICATION_SLOT? We use all these variants in different
places, but it's not fully documented why. Think about interested
users reading this.

errmsg("subscription table %u in subscription %u does not exist",

Use names instead of IDs if possible.

+ libpqrcv_table_list,
+ libpqrcv_table_info,
+ libpqrcv_table_copy,

I don't think these functions belong in the WAL receiver API, since
they are specific to this particular logical replication
implementation. I would just make an API function libpqrc_exec_sql()
that runs a query, and then put the table_*() functions as wrappers
around that into tablesync.c.

Not sure what the worker init stuff in ApplyLauncherShmemInit() is
doing. Could be commented more.

There are a lot of places that do one of

MyLogicalRepWorker->relid == InvalidOid
OidIsValid(MyLogicalRepWorker->relid)

to check whether the current worker is a sync worker. Wrap that into
a function.

Not a fan of adding CommandCounterIncrement() calls at the end of
functions. In some cases, they are not necessary at all. In cases
where they are, the CCI call should be at a higher level between two
function calls with a comment for why the call below needs to see the
changes from the call above.

The index name pg_subscription_rel_map_index/SubscriptionRelMapIndexId
doesn't seem to match existing style, since there is no "map" column.
How about pg_subscription_rel_rel_sub_index? I see we have a
similarly named index for publications.

max_sync_workers_per_subscription could be PGC_SIGHUP, I think. And
the minimum could be 0, to stop any syncing?
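
For instance, if it were PGC_SIGHUP with a minimum of 0, pausing all table
synchronization without a restart could look roughly like this (assuming the
GUC keeps this name; just a sketch):

    ALTER SYSTEM SET max_sync_workers_per_subscription = 0;  -- 0 = no sync workers (assumed)
    SELECT pg_reload_conf();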

pg_subscription_rel.h: I'm not sure under which circumstances we need
to use BKI_ROWTYPE_OID.

Does pg_subscription_rel need an OID column? The index
SubscriptionRelOidIndexId is not used anywhere.

You removed this command from the tests:

ALTER SUBSCRIPTION testsub SET PUBLICATION testpub, testpub1;

I suppose because it causes a connection. We should have an option to
prevent that (NOCONNECT/NOREFRESH?).

Why was the test 'check replication origin was dropped on subscriber'
removed?

Attached also a small patch with some stylistic changes.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
0001-fixup-Logical-replication-support-for-initial-data-c.patch application/x-patch 18.2 KB

From: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-09 17:48:05
Message-ID: e807a237-e392-f78c-92ac-9d68fbe5dfd1@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 3/6/17 05:27, Petr Jelinek wrote:
> And lastly I changed the automagic around exporting, not exporting and
> using the snapshot produced by CREATE_REPLICATION_SLOT into explicit
> parameter for the CREATE_REPLICATION_SLOT command (and added simple
> framework for adding more of those if needed in the future).

It might be useful to make this into a separate patch, for clarity.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-10 04:51:59
Message-ID: 7a53e28a-1f54-2e84-fe0d-427d547abd4f@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 09/03/17 18:46, Peter Eisentraut wrote:
> On 3/6/17 05:27, Petr Jelinek wrote:
>> updated and rebased version of the patch attached.
>
> Some comments on this patch:
>
> Can we have a better explanation of different snapshot options in
> CREATE_REPLICATION_SLOT. We use all these variants in different
> places, but it's not fully documented why. Think about interested
> users reading this.
>

I am not quite sure what/where you want me to expand in the docs; the
protocol doc explains what the options do. Are you missing the reasoning
for why we use them in the calling code?

>
> errmsg("subscription table %u in subscription %u does not exist",
>
> Use names instead of IDs if possible.
>

Well, this message is more or less the equivalent of a cache lookup
failure; the problem is that neither the relation nor the subscription is
guaranteed to exist if this fails. I can write code that emits two
different messages depending on whether they do, but I'm not quite sure
it's worth it there.

>
> + libpqrcv_table_list,
> + libpqrcv_table_info,
> + libpqrcv_table_copy,
>
> I don't think these functions belong into the WAL receiver API, since
> they are specific to this particular logical replication
> implementation. I would just make an API function libpqrc_exec_sql()
> that runs a query, and then put the table_*() functions as wrappers
> around that into tablesync.c.
>

Yeah, I was wondering about that as well. The reason it's in
libpqwalreceiver is that if we create some libpqrc_exec_sql as you say,
we'll need code that converts the PQresult into something consumable by
backend code that has no access to the libpq API, and that seemed like
quite a bit of boilerplate work. I really don't want to write another
libpq-like or SPI-like interface for this. Maybe it would be acceptable to
return the result as a tuplestore?

>
> Not sure what the worker init stuff in ApplyLauncherShmemInit() is
> doing. Could be commented more.
>

Eh, I copy-pasted the comment there from a different place, will fix.

>
> max_sync_workers_per_subscription could be PGC_SIGHUP, I think. And
> the minimum could be 0, to stop any syncing?
>
>
> pg_subscription_rel.h: I'm not sure under which circumstances we need
> to use BKI_ROWTYPE_OID.
>
>
> Does pg_subscription_rel need an OID column? The index
> SubscriptionRelOidIndexId is not used anywhere.
>

Ah, leftover from the time it used dependencies for tracking.

>
> You removed this command from the tests:
>
> ALTER SUBSCRIPTION testsub SET PUBLICATION testpub, testpub1;
>
> I suppose because it causes a connection. We should have an option to
> prevent that (NOCONNECT/NOREFRESH?).

NOREFRESH makes more sense I guess.

>
> Why was the test 'check replication origin was dropped on subscriber'
> removed?
>

I don't know what you mean?

>
> Attached also a small patch with some stylistic changes.
>

Thanks.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-10 04:52:39
Message-ID: 3795f597-7cc3-ce45-071b-c138bb1e598f@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 09/03/17 18:48, Peter Eisentraut wrote:
> On 3/6/17 05:27, Petr Jelinek wrote:
>> And lastly I changed the automagic around exporting, not exporting and
>> using the snapshot produced by CREATE_REPLICATION_SLOT into explicit
>> parameter for the CREATE_REPLICATION_SLOT command (and added simple
>> framework for adding more of those if needed in the future).
>
> It might be useful to make this into a separate patch, for clarity.
>

Yes, already working on that (at least part of it), since it's useful
for other patches in CF.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-10 16:33:59
Message-ID: d50b2b3b-a01d-93d8-0f73-34de9e710a0b@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 09/03/17 18:48, Peter Eisentraut wrote:
> On 3/6/17 05:27, Petr Jelinek wrote:
>> And lastly I changed the automagic around exporting, not exporting and
>> using the snapshot produced by CREATE_REPLICATION_SLOT into explicit
>> parameter for the CREATE_REPLICATION_SLOT command (and added simple
>> framework for adding more of those if needed in the future).
>
> It might be useful to make this into a separate patch, for clarity.
>

Okay, here it is with this part split out. The first patch also helps with
Craig's logical decoding on standby, so it definitely makes sense to split
it out.

This is also rebased on top of the position tracking fix [1], which
together with this patch fixes the error reported by Erik in his
previous email.

I think I resolved all your notes from the review. The only exception is
that the SQL queries are still in libpqwalreceiver.c, as we first need to
agree on what the query interface should look like before moving them
around.

[1]
https://www.postgresql.org/message-id/b235fa69-147a-5e09-f8f3-3f780a1abb2e@2ndquadrant.com

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
0001-Add-option-to-control-snapshot-export-to-CREATE_REPL.patch binary/octet-stream 12.9 KB
0002-Logical-replication-support-for-initial-data-copy.patch binary/octet-stream 162.5 KB

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-13 02:57:28
Message-ID: CAMsr+YGBUvtObMFs5Po5MR4c0khwSkedkjwG4RM_uFMe5eZEXw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11 March 2017 at 00:33, Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com> wrote:
> On 09/03/17 18:48, Peter Eisentraut wrote:
>> On 3/6/17 05:27, Petr Jelinek wrote:
>>> And lastly I changed the automagic around exporting, not exporting and
>>> using the snapshot produced by CREATE_REPLICATION_SLOT into explicit
>>> parameter for the CREATE_REPLICATION_SLOT command (and added simple
>>> framework for adding more of those if needed in the future).
>>
>> It might be useful to make this into a separate patch, for clarity.
>>
>
> Okay here it is with this part split out. The first patch also help with
> Craig's logical decoding on standby so it definitely makes sense to be
> split.

Greatly appreciated.

Committing this in chunks makes sense anyway.

I've attached a slightly updated version that makes pg_recvlogical skip
slot export. The second patch is unchanged; use the copy from the
immediately prior mail.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
0001-Add-option-to-control-snapshot-export-to-CREATE_REPL.patch text/x-patch 15.7 KB

From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-14 16:12:24
Message-ID: 85b5661d-4e7a-9c09-33ec-e1e07c0f1ac3@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 13/03/17 03:57, Craig Ringer wrote:
> On 11 March 2017 at 00:33, Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com> wrote:
>> On 09/03/17 18:48, Peter Eisentraut wrote:
>>> On 3/6/17 05:27, Petr Jelinek wrote:
>>>> And lastly I changed the automagic around exporting, not exporting and
>>>> using the snapshot produced by CREATE_REPLICATION_SLOT into explicit
>>>> parameter for the CREATE_REPLICATION_SLOT command (and added simple
>>>> framework for adding more of those if needed in the future).
>>>
>>> It might be useful to make this into a separate patch, for clarity.
>>>
>>
>> Okay here it is with this part split out. The first patch also help with
>> Craig's logical decoding on standby so it definitely makes sense to be
>> split.
>
> Greatly appreciated.
>
> Committing this in chunks makes sense anyway.
>
> I've attached a slightly version that makes pg_recvlogical skip slot
> export. The second patch is unchanged, use the copy from the
> immediately prior mail.
>

Okay, I merged your changes in.

Here is that merge, plus the updated main patch, which replaces all the
SQL interfaces in libpqwalreceiver.c with a single exec function that uses
a tuplestore to send any tuple results back.

For correct working in every situation, it still needs the snapbuild
fixes and also the logical replication origin tracking fix.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
0001-Add-option-to-control-snapshot-export-to-CREATE_REPL.patch binary/octet-stream 15.7 KB
0002-Logical-replication-support-for-initial-data-copy.patch binary/octet-stream 171.1 KB

From: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-14 21:39:51
Message-ID: 6220e23a-422b-8433-6c03-0714e2fe4af5@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 3/14/17 12:12, Petr Jelinek wrote:
>> Committing this in chunks makes sense anyway.
>>
>> I've attached a slightly version that makes pg_recvlogical skip slot
>> export. The second patch is unchanged, use the copy from the
>> immediately prior mail.
>>
>
> Okay, I merged your changes in.
>
> Here is that merge, plus the updated main patch that replaces all the
> SQL interaces in libpqwalreceiver.c with single exec one which uses
> tuplestore to send any tuple results back.
>
> For correct working in every situation, it still needs to snapbuild
> fixes and also the logical replication origin tracking fix.

Committed 0001. Will look at 0002 next.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-18 12:28:55
Message-ID: 1b7aa90d-0475-714d-3583-f58eebf915ea@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 14/03/17 22:39, Peter Eisentraut wrote:
> On 3/14/17 12:12, Petr Jelinek wrote:
>>> Committing this in chunks makes sense anyway.
>>>
>>> I've attached a slightly version that makes pg_recvlogical skip slot
>>> export. The second patch is unchanged, use the copy from the
>>> immediately prior mail.
>>>
>>
>> Okay, I merged your changes in.
>>
>> Here is that merge, plus the updated main patch that replaces all the
>> SQL interaces in libpqwalreceiver.c with single exec one which uses
>> tuplestore to send any tuple results back.
>>
>> For correct working in every situation, it still needs to snapbuild
>> fixes and also the logical replication origin tracking fix.
>
> Committed 0001. Will look at 0002 next.
>

Cool, rebased the main patch on top and resolved the merge conflicts.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
0001-Logical-replication-support-for-initial-data-copy.patch text/x-patch 169.4 KB

From: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical replication existing data copy
Date: 2017-03-20 20:08:40
Message-ID: 92bb7401-54f7-fcea-a150-28be5a1e6ff6@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

The current patch causes a failure in the pg_dump tests, because the
generated CREATE SUBSCRIPTION commands make connection attempts that
don't work. We have the pg_dump option --no-create-subscription-slots
for this, but I suppose we should expand that to
--no-subscription-connect and use the new NOCONNECT option instead.

I'm a bit puzzled at the moment as to how it worked before, because the
pg_dump test suite does not use that option. But some change like that
is probably useful anyway.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Cc: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical replication existing data copy
Date: 2017-03-20 20:25:21
Message-ID: 20170320202521.GB9812@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Peter,

* Peter Eisentraut (peter(dot)eisentraut(at)2ndquadrant(dot)com) wrote:
> The current patch causes a failure in the pg_dump tests, because the
> generated CREATE SUBSCRIPTION commands make connection attempts that
> don't work. We have the pg_dump option --no-create-subscription-slots
> for this, but I suppose we should expand that to
> --no-subscription-connect and use the new NOCONNECT option instead.

I'll admit that I've not followed this very closely, so perhaps I'm
misunderstanding, but I thought our prior discussion led to the idea
that pg_dump would always dump out subscriptions with 'NOCONNECT', so
that a restore wouldn't automatically start trying to connect to another
system.

In which case, I don't quite understand why we need a
"--no-subscription-connect" option.

Thanks!

Stephen


From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Cc: Craig Ringer <craig(at)2ndquadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical replication existing data copy
Date: 2017-03-20 23:54:20
Message-ID: 3ab1504f-558a-c1d4-a088-aba0aa028f16@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 20/03/17 21:25, Stephen Frost wrote:
> Peter,
>
> * Peter Eisentraut (peter(dot)eisentraut(at)2ndquadrant(dot)com) wrote:
>> The current patch causes a failure in the pg_dump tests, because the
>> generated CREATE SUBSCRIPTION commands make connection attempts that
>> don't work. We have the pg_dump option --no-create-subscription-slots
>> for this, but I suppose we should expand that to
>> --no-subscription-connect and use the new NOCONNECT option instead.
>
> I'll admit that I've not followed this very closely, so perhaps I'm
> misunderstanding, but I thought our prior discussion lead to the idea
> that pg_dump would always dump out subscriptions with 'NOCONNECT', so
> that a restore wouldn't automatically start trying to connect to another
> system.
>
> In which case, I don't quite understand why we need a
> "--no-subscription-connect" option.
>

Yeah, I think that was the agreement as well.

Here is the fixed version, also rebased on top of all the changes to the
pg_dump tests. Subscriptions are dumped unless --no-subscriptions is
specified. They are dumped with the NOCONNECT option (which didn't exist
before this patch, so it was a bit harder to do the dump by default).
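
So a dumped subscription comes out along these lines (names and connection
string are placeholders, and other dumped properties are omitted here):

    CREATE SUBSCRIPTION sub1 CONNECTION 'port=6972 dbname=testdb'
        PUBLICATION pub1 WITH (NOCONNECT);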

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
0001-Logical-replication-support-for-initial-data-copy.patch text/x-patch 187.8 KB

From: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Craig Ringer <craig(at)2ndquadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical replication existing data copy
Date: 2017-03-22 01:36:45
Message-ID: 38ff17c1-a1ff-87b0-9b4a-65565e860b32@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 3/20/17 19:54, Petr Jelinek wrote:
> Here is fixed version, also rebased on top of all the changes to pg_dump
> tests. Subscriptions are dumped unless --no-subscriptions is specified.

The problem with that is that pg_dump will now fail for unprivileged
users. That's a separate problem to solve. I suggest not overloading
this patch with that.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-22 01:38:34
Message-ID: 6f97b492-d90b-bed0-b8c5-6c73484b4cfa@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

This patch is looking pretty good to me, modulo the failing pg_dump tests.

Attached is a fixup patch. I have mainly updated some comments and
variable naming for (my) clarity. No functional changes.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
0001-fixup-Logical-replication-support-for-initial-data-c.patch application/x-patch 24.8 KB

From: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical replication existing data copy
Date: 2017-03-22 12:36:26
Message-ID: 86793838-ab80-5ec2-447a-95a16d07bce2@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 3/20/17 16:08, Peter Eisentraut wrote:
> The current patch causes a failure in the pg_dump tests, because the
> generated CREATE SUBSCRIPTION commands make connection attempts that
> don't work. We have the pg_dump option --no-create-subscription-slots
> for this, but I suppose we should expand that to
> --no-subscription-connect and use the new NOCONNECT option instead.

Small top-up patch to accomplish that.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
0001-fixup-Logical-replication-support-for-initial-data-c.patch invalid/octet-stream 5.1 KB

From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical replication existing data copy
Date: 2017-03-22 12:46:02
Message-ID: fb786484-6783-3bfd-5495-9f25c4cb3008@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 22/03/17 13:36, Peter Eisentraut wrote:
> On 3/20/17 16:08, Peter Eisentraut wrote:
>> The current patch causes a failure in the pg_dump tests, because the
>> generated CREATE SUBSCRIPTION commands make connection attempts that
>> don't work. We have the pg_dump option --no-create-subscription-slots
>> for this, but I suppose we should expand that to
>> --no-subscription-connect and use the new NOCONNECT option instead.
>
> Small top-up patch to accomplish that.
>

Works for me. I'd maybe explicitly mention in the docs for
--no-subscription-connect that it dumps subscriptions as CREATE
SUBSCRIPTION ... WITH (NOCONNECT).

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-23 13:00:16
Message-ID: 28878ea1-7cb5-3d45-f64b-b806461832e5@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 3/21/17 21:38, Peter Eisentraut wrote:
> This patch is looking pretty good to me, modulo the failing pg_dump tests.
>
> Attached is a fixup patch. I have mainly updated some comments and
> variable naming for (my) clarity. No functional changes.

Committed all that.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Michael Banck <michael(dot)banck(at)credativ(dot)de>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Cc: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-23 22:06:46
Message-ID: 20170323220645.GA17115@nighthawk.caipicrew.dd-dns.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On Thu, Mar 23, 2017 at 09:00:16AM -0400, Peter Eisentraut wrote:
> On 3/21/17 21:38, Peter Eisentraut wrote:
> > This patch is looking pretty good to me, modulo the failing pg_dump tests.
> >
> > Attached is a fixup patch. I have mainly updated some comments and
> > variable naming for (my) clarity. No functional changes.
>
> Committed all that.

Maybe I'm doing something wrong, but I'm getting a segfault when trying to
start logical replication since that patch went in.

I've blown away my build tree and started over with this minimal
example; any obvious mistakes here? Quoting the publisher/subscriber
names doesn't seem to change things:

$ initdb --pgdata=data_pg1
$ sed -i -e 's/^#wal_level.=.replica/wal_level = logical/' data_pg1/postgresql.conf
$ pg_ctl --pgdata=data_pg1 -l pg1.log start
$ pg_basebackup --pgdata=data_pg2
$ sed -i -e 's/^#port.=.5432/port = 5433/' data_pg2/postgresql.conf
$ pg_ctl --pgdata=data_pg2 -l pg2.log start
$ psql -c "CREATE PUBLICATION pub1;"
$ psql --port=5433 -c "CREATE SUBSCRIPTION sub1 CONNECTION 'dbname=postgres' PUBLICATION pub1;"

Backtrace:

Program received signal SIGSEGV, Segmentation fault.
ExecSetSlotDescriptor (slot=slot(at)entry=0xfc4d38, tupdesc=tupdesc(at)entry=0x0)
at /home/mba/Projekte/OSS/PostgreSQL/git/postgresql/build/../src/backend/executor/execTuples.c:269
269 PinTupleDesc(tupdesc);
(gdb) bt
#0 ExecSetSlotDescriptor (slot=slot(at)entry=0xfc4d38, tupdesc=tupdesc(at)entry=0x0)
at /home/mba/Projekte/OSS/PostgreSQL/git/postgresql/build/../src/backend/executor/execTuples.c:269
#1 0x00000000005dc4fc in MakeSingleTupleTableSlot (tupdesc=0x0)
at /home/mba/Projekte/OSS/PostgreSQL/git/postgresql/build/../src/backend/executor/execTuples.c:203
#2 0x00000000005a16ff in fetch_table_list (wrconn=wrconn(at)entry=0xc00300000001,
publications=publications(at)entry=0xffb448)
at /home/mba/Projekte/OSS/PostgreSQL/git/postgresql/build/../src/backend/commands/subscriptioncmds.c:996
#3 0x00000000005a1efa in CreateSubscription (stmt=0x101cd40, isTopLevel=<optimized out>)
at /home/mba/Projekte/OSS/PostgreSQL/git/postgresql/build/../src/backend/commands/subscriptioncmds.c:396
#4 0x00000000006ecf96 in ProcessUtilitySlow (pstate=0x0, pstate(at)entry=0xfc4818, pstmt=0x0,
pstmt(at)entry=0x101d080, queryString=0xf9d248 <incomplete sequence \340>,
queryString(at)entry=0x101c0b8 "CREATE SUBSCRIPTION \"sub1\" CONNECTION 'dbname=postgres' PUBLICATION pub1 WITH (COPY DATA);", context=(unknown: 16534992), context(at)entry=PROCESS_UTILITY_TOPLEVEL,
params=0x7ffd7e6263d0, params(at)entry=0x0,
completionTag=0x45 <error: Cannot access memory at address 0x45>,
completionTag(at)entry=0x7ffd7e626b00 "", dest=<optimized out>)
at /home/mba/Projekte/OSS/PostgreSQL/git/postgresql/build/../src/backend/tcop/utility.c:1612
#5 0x00000000006ec4e9 in standard_ProcessUtility (pstmt=0x101d080,
queryString=0x101c0b8 "CREATE SUBSCRIPTION \"sub1\" CONNECTION 'dbname=postgres' PUBLICATION pub1 WITH (COPY DATA);", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, dest=0x101d160,
completionTag=0x7ffd7e626b00 "")
at /home/mba/Projekte/OSS/PostgreSQL/git/postgresql/build/../src/backend/tcop/utility.c:922
#6 0x00000000006e9f5f in PortalRunUtility (portal=0xfbb9d8, pstmt=0x101d080,
isTopLevel=<optimized out>, setHoldSnapshot=<optimized out>, dest=<optimized out>,
completionTag=0x7ffd7e626b00 "")
at /home/mba/Projekte/OSS/PostgreSQL/git/postgresql/build/../src/backend/tcop/pquery.c:1165
#7 0x00000000006ea977 in PortalRunMulti (portal=portal(at)entry=0xfbb9d8,
isTopLevel=isTopLevel(at)entry=1 '\001', setHoldSnapshot=setHoldSnapshot(at)entry=0 '\000',
dest=dest(at)entry=0x101d160, altdest=altdest(at)entry=0x101d160,
completionTag=completionTag(at)entry=0x7ffd7e626b00 "")
at /home/mba/Projekte/OSS/PostgreSQL/git/postgresql/build/../src/backend/tcop/pquery.c:1315
#8 0x00000000006eb4cb in PortalRun (portal=portal(at)entry=0xfbb9d8,
count=count(at)entry=9223372036854775807, isTopLevel=isTopLevel(at)entry=1 '\001',
dest=dest(at)entry=0x101d160, altdest=altdest(at)entry=0x101d160,
completionTag=completionTag(at)entry=0x7ffd7e626b00 "")
at /home/mba/Projekte/OSS/PostgreSQL/git/postgresql/build/../src/backend/tcop/pquery.c:788
#9 0x00000000006e7868 in exec_simple_query (
query_string=0x101c0b8 "CREATE SUBSCRIPTION \"sub1\" CONNECTION 'dbname=postgres' PUBLICATION pub1 WITH (COPY DATA);")
at /home/mba/Projekte/OSS/PostgreSQL/git/postgresql/build/../src/backend/tcop/postgres.c:1101
#10 0x00000000006e94a9 in PostgresMain (argc=1, argv=0x101c0b8, dbname=0xfc8478 "postgres",
username=0xfc8458 "postgres")
at /home/mba/Projekte/OSS/PostgreSQL/git/postgresql/build/../src/backend/tcop/postgres.c:4069
#11 0x00000000004770ad in BackendRun (port=0xfc2970)
at /home/mba/Projekte/OSS/PostgreSQL/git/postgresql/build/../src/backend/postmaster/postmaster.c:4317
#12 BackendStartup (port=0xfc2970)
at /home/mba/Projekte/OSS/PostgreSQL/git/postgresql/build/../src/backend/postmaster/postmaster.c:3989
#13 ServerLoop ()
at /home/mba/Projekte/OSS/PostgreSQL/git/postgresql/build/../src/backend/postmaster/postmaster.c:1729
#14 0x000000000068819c in PostmasterMain (argc=argc(at)entry=3, argv=argv(at)entry=0xf9c420)
at /home/mba/Projekte/OSS/PostgreSQL/git/postgresql/build/../src/backend/postmaster/postmaster.c:1337
#15 0x0000000000477f27 in main (argc=3, argv=0xf9c420)
at /home/mba/Projekte/OSS/PostgreSQL/git/postgresql/build/../src/backend/main/main.c:228

The TAP tests in src/test/subscriber pass for me, and going back to
707576b571f05ec5b89adb65964d55f3ccccbd1b also makes the above chain of
commands run fine.

Michael

--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer


From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: Michael Banck <michael(dot)banck(at)credativ(dot)de>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Cc: Craig Ringer <craig(at)2ndquadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-23 22:22:52
Message-ID: e85135e8-688c-9776-570e-73028454ebda@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On 23/03/17 23:06, Michael Banck wrote:
> Hi,
>
> On Thu, Mar 23, 2017 at 09:00:16AM -0400, Peter Eisentraut wrote:
>> On 3/21/17 21:38, Peter Eisentraut wrote:
>>> This patch is looking pretty good to me, modulo the failing pg_dump tests.
>>>
>>> Attached is a fixup patch. I have mainly updated some comments and
>>> variable naming for (my) clarity. No functional changes.
>>
>> Committed all that.
>
> Maybe I'm doing something wrong, but I'm getting a segfault trying to
> start logical replication since that patch went in.
>

Thanks for the report. It looks like we don't correctly handle an empty
result set when fetching the table list. I will post a fix tomorrow.
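
For context, the table list the subscriber fetches boils down to a query on
pg_publication_tables run over the walsender connection, roughly like the
following (a sketch; the exact statement is an assumption), and for a
publication without any tables it returns zero rows, which is the case that
apparently isn't handled:

SELECT DISTINCT t.schemaname, t.tablename
FROM pg_publication_tables t
WHERE t.pubname IN ('pub1');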

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-23 23:14:25
Message-ID: b1772c05-b761-1542-f8f8-42b5515cca06@catalyst.net.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 24/03/17 02:00, Peter Eisentraut wrote:
> On 3/21/17 21:38, Peter Eisentraut wrote:
>> This patch is looking pretty good to me, modulo the failing pg_dump tests.
>>
>> Attached is a fixup patch. I have mainly updated some comments and
>> variable naming for (my) clarity. No functional changes.
> Committed all that.
>

Testing now that this patch is in, I'm unable to create a subscription:

(master)

bench=# CREATE PUBLICATION pgbench
FOR TABLE pgbench_accounts , pgbench_branches,
pgbench_tellers, pgbench_history
WITH (PUBLISH INSERT, PUBLISH UPDATE, PUBLISH DELETE);

(slave)

bench=# CREATE SUBSCRIPTION pgbench
CONNECTION 'port=5447 user=postgres dbname=bench'
PUBLICATION pgbench
WITH (COPY DATA);
ERROR: duplicate key value violates unique constraint
"pg_subscription_rel_srrelid_srsubid_index"
DETAIL: Key (srrelid, srsubid)=(0, 16389) already exists.

This is a pair of freshly initdb'ed instances; the master has a scale-100
pgbench schema.

I'm guessing this is a different bug from the segfault that was also reported.

regards

Mark


From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-23 23:32:28
Message-ID: ef24fe09-ee62-a375-74da-71bfe26145c3@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 24/03/17 00:14, Mark Kirkwood wrote:
> On 24/03/17 02:00, Peter Eisentraut wrote:
>> On 3/21/17 21:38, Peter Eisentraut wrote:
>>> This patch is looking pretty good to me, modulo the failing pg_dump
>>> tests.
>>>
>>> Attached is a fixup patch. I have mainly updated some comments and
>>> variable naming for (my) clarity. No functional changes.
>> Committed all that.
>>
>
> Testing now this patch is in, I'm unable to create a subscription:
>
> (master)
>
> bench=# CREATE PUBLICATION pgbench
> FOR TABLE pgbench_accounts , pgbench_branches,
> pgbench_tellers, pgbench_history
> WITH (PUBLISH INSERT, PUBLISH UPDATE, PUBLISH DELETE);
>
> (slave)
>
> bench=# CREATE SUBSCRIPTION pgbench
> CONNECTION 'port=5447 user=postgres dbname=bench'
> PUBLICATION pgbench
> WITH (COPY DATA);
> ERROR: duplicate key value violates unique constraint
> "pg_subscription_rel_srrelid_srsubid_index"
> DETAIL: Key (srrelid, srsubid)=(0, 16389) already exists.
>
> This is a pair of freshly initdb'ed instances, the master has a size 100
> pgbench schema.
>
> I'm guessing this is a different bug from the segfault also reported
>

Yes, I also forgot to check whether the table actually exists on the
subscriber when fetching the table list in CREATE SUBSCRIPTION (we have a
check during replication, but not there).
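
To illustrate, with a published table that is missing on the subscriber the
intended behaviour after the fix is a clean error instead of the duplicate
key violation, roughly like this (the error wording below is only
illustrative):

CREATE SUBSCRIPTION pgbench
CONNECTION 'port=5447 user=postgres dbname=bench'
PUBLICATION pgbench
WITH (COPY DATA);
-- ERROR:  relation "public.pgbench_accounts" does not exist
-- (or similar), rather than the pg_subscription_rel unique violation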

Attached patches should fix both issues.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
0002-Check-that-published-table-exists-on-subscriber.patch text/x-patch 1.4 KB
0001-Always-return-tupleslot-and-tupledesc-from-libpqrcv_.patch text/x-patch 1.5 KB

From: Michael Banck <michael(dot)banck(at)credativ(dot)de>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-24 09:22:43
Message-ID: 20170324092242.GC17115@nighthawk.caipicrew.dd-dns.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On Fri, Mar 24, 2017 at 12:32:28AM +0100, Petr Jelinek wrote:
> Yes, I also forgot to check if the table actually exists on subscriber
> when fetching them in CREATE SUBSCRIPTION (we have check during
> replication but not there).
>
> Attached patches should fix both issues.

I no longer get a segfault, thanks.

However, replication also does not seem to work; I'm using the following
script right now:

--8<--
#!/bin/sh

set -x

#rm -rf data_* pg*.log

initdb --pgdata=data_pg1 1> /dev/null 2>&1
sed -i -e 's/^#wal_level.=.replica/wal_level = logical/' data_pg1/postgresql.conf
pg_ctl --pgdata=data_pg1 -l pg1.log start 1> /dev/null
psql -c "CREATE TABLE t1(id int);"
pg_basebackup --pgdata=data_pg2
sed -i -e 's/^#port.=.5432/port = 5433/' data_pg2/postgresql.conf
psql -c "INSERT INTO t1 VALUES(1);"
pg_ctl --pgdata=data_pg2 -l pg2.log start 1> /dev/null
psql -c "CREATE PUBLICATION pub1;"
psql --port=5433 -c "CREATE SUBSCRIPTION sub1 CONNECTION 'dbname=postgres' PUBLICATION pub1 WITH (COPY DATA);"
sleep 6
psql -c 'SELECT application_name, state FROM pg_stat_replication;'
psql --port=5433 -c "SELECT COUNT(*) FROM t1;"
--8<--

(I had to add the sleep 6 - 5 seconds wasn't always enough - in order to
get the subscription from 'catchup' to 'streaming', which was a bit
surprising to me.)
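
(A way to avoid the fixed sleep might be to poll the publisher until the
walsender for the subscription reports the streaming state, i.e. repeat a
query like the one below until it returns a row - just a sketch, reusing
the columns already queried in the script:)

SELECT application_name, state
FROM pg_stat_replication
WHERE application_name = 'sub1' AND state = 'streaming';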

This is the output:

--8<--
+ initdb --pgdata=data_pg1
+ sed -i -e s/^#wal_level.=.replica/wal_level = logical/ data_pg1/postgresql.conf
+ pg_ctl --pgdata=data_pg1 -l pg1.log start
+ psql -c CREATE TABLE t1(id int);
CREATE TABLE
+ pg_basebackup --pgdata=data_pg2
+ sed -i -e s/^#port.=.5432/port = 5433/ data_pg2/postgresql.conf
+ psql -c INSERT INTO t1 VALUES(1);
INSERT 0 1
+ pg_ctl --pgdata=data_pg2 -l pg2.log start
+ psql -c CREATE PUBLICATION pub1;
CREATE PUBLICATION
+ psql --port=5433 -c CREATE SUBSCRIPTION sub1 CONNECTION 'dbname=postgres' PUBLICATION pub1 WITH (COPY DATA);
NOTICE: created replication slot "sub1" on publisher
NOTICE: synchronized table states
CREATE SUBSCRIPTION
+ sleep 6
+ psql -c SELECT application_name, state FROM pg_stat_replication;
application_name | state
------------------+-----------
sub1 | streaming
(1 row)

+ psql --port=5433 -c SELECT COUNT(*) FROM t1;
count
-------
0
(1 row)
--8<--

I would have assumed the inserted value would be visible on the standby.
If I insert further values, they don't show up either.

Michael

--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer


From: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-24 09:45:53
Message-ID: 70068532-c368-d65f-172a-a5b8b83dc8b1@catalyst.net.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 24/03/17 12:32, Petr Jelinek wrote:

> On 24/03/17 00:14, Mark Kirkwood wrote:
>> On 24/03/17 02:00, Peter Eisentraut wrote:
>>> On 3/21/17 21:38, Peter Eisentraut wrote:
>>>> This patch is looking pretty good to me, modulo the failing pg_dump
>>>> tests.
>>>>
>>>> Attached is a fixup patch. I have mainly updated some comments and
>>>> variable naming for (my) clarity. No functional changes.
>>> Committed all that.
>>>
>> Testing now this patch is in, I'm unable to create a subscription:
>>
>> (master)
>>
>> bench=# CREATE PUBLICATION pgbench
>> FOR TABLE pgbench_accounts , pgbench_branches,
>> pgbench_tellers, pgbench_history
>> WITH (PUBLISH INSERT, PUBLISH UPDATE, PUBLISH DELETE);
>>
>> (slave)
>>
>> bench=# CREATE SUBSCRIPTION pgbench
>> CONNECTION 'port=5447 user=postgres dbname=bench'
>> PUBLICATION pgbench
>> WITH (COPY DATA);
>> ERROR: duplicate key value violates unique constraint
>> "pg_subscription_rel_srrelid_srsubid_index"
>> DETAIL: Key (srrelid, srsubid)=(0, 16389) already exists.
>>
>> This is a pair of freshly initdb'ed instances, the master has a size 100
>> pgbench schema.
>>
>> I'm guessing this is a different bug from the segfault also reported
>>
> Yes, I also forgot to check if the table actually exists on subscriber
> when fetching them in CREATE SUBSCRIPTION (we have check during
> replication but not there).
>
> Attached patches should fix both issues.
>
>

Yep, does seem to.

I note that (probably intentionally) specifying 'COPY DATA' does not
create the tables for you... ok, I probably didn't read something
somewhere... but anyway, after creating the tables it all seems to work.
Nice.
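
For reference, a minimal sketch of what has to exist on the subscriber
before CREATE SUBSCRIPTION, assuming the standard pgbench table definitions
(a schema-only dump from the master would of course work just as well):

CREATE TABLE pgbench_branches (bid int PRIMARY KEY, bbalance int, filler char(88));
CREATE TABLE pgbench_tellers  (tid int PRIMARY KEY, bid int, tbalance int, filler char(84));
CREATE TABLE pgbench_accounts (aid int PRIMARY KEY, bid int, abalance int, filler char(84));
CREATE TABLE pgbench_history  (tid int, bid int, aid int, delta int, mtime timestamp, filler char(22));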

However, one minor observation - as Michael Banck noted - the elapsed
time for the slave to catch up after running:

$ pgbench -c8 -T600 bench

on the master was (subjectively) much longer than for physical streaming
replication. Is this expected?

regards

Mark


From: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>, Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-24 14:05:19
Message-ID: eb57f282-b672-e732-0d5b-ce4c4b54a2ae@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 3/23/17 19:32, Petr Jelinek wrote:
> Yes, I also forgot to check if the table actually exists on subscriber
> when fetching them in CREATE SUBSCRIPTION (we have check during
> replication but not there).

I think for this we can probably just change the missing_ok argument of
RangeVarGetRelid() to false.

Unless we want the custom error message, in which case we need to change
AlterSubscription_refresh(), which right now errors out because
missing_ok = false.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>, Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-24 14:09:38
Message-ID: 63e9613f-258b-8f49-42ea-58dafb6e9ba3@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 24/03/17 15:05, Peter Eisentraut wrote:
> On 3/23/17 19:32, Petr Jelinek wrote:
>> Yes, I also forgot to check if the table actually exists on subscriber
>> when fetching them in CREATE SUBSCRIPTION (we have check during
>> replication but not there).
>
> I think for this we can probably just change the missing_ok argument of
> RangeVarGetRelid() to false.
>
> Unless we want the custom error message, in which case we need to change
> AlterSubscription_refresh(), because right now it errors out because
> missing_ok = false.
>

You are right, stupid me.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
0002-Check-that-published-table-exists-on-subscriber.patch text/x-patch 907 bytes

From: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>, Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-24 18:52:50
Message-ID: 5369132b-3c21-8817-0862-d83893041301@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 3/24/17 10:09, Petr Jelinek wrote:
> On 24/03/17 15:05, Peter Eisentraut wrote:
>> On 3/23/17 19:32, Petr Jelinek wrote:
>>> Yes, I also forgot to check if the table actually exists on subscriber
>>> when fetching them in CREATE SUBSCRIPTION (we have check during
>>> replication but not there).
>>
>> I think for this we can probably just change the missing_ok argument of
>> RangeVarGetRelid() to false.
>>
>> Unless we want the custom error message, in which case we need to change
>> AlterSubscription_refresh(), because right now it errors out because
>> missing_ok = false.
>>
>
> You are right, stupid me.

Committed this version.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
To: Michael Banck <michael(dot)banck(at)credativ(dot)de>, Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-24 18:57:52
Message-ID: 5901150e-9d27-dc8d-a098-c9bd47869547@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 3/24/17 05:22, Michael Banck wrote:
> However, replication also seems to not work, I'm using the following
> script right now:

The problem is that your publication does not include any tables.
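
Something like this in the script should do it (just a sketch):

CREATE PUBLICATION pub1 FOR TABLE t1;
-- or, to pick up every table in the database:
-- CREATE PUBLICATION pub1 FOR ALL TABLES;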

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Michael Banck <michael(dot)banck(at)credativ(dot)de>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Cc: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-24 20:42:59
Message-ID: 1490388179.24902.0.camel@credativ.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Am Freitag, den 24.03.2017, 14:57 -0400 schrieb Peter Eisentraut:
> On 3/24/17 05:22, Michael Banck wrote:
> > However, replication also seems to not work, I'm using the following
> > script right now:
>
> The problem is that your publication does not include any tables.

Oops, of course. That part must have gotten lost in a copy-paste, or when I
tried to dig further into why the CREATE SUBSCRIPTION segfaulted. Sorry for
the noise.

Michael

--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael(dot)banck(at)credativ(dot)de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer


From: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-25 04:45:14
Message-ID: 23540acf-526e-6f49-e94a-88692cc3d92f@catalyst.net.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On 25/03/17 07:52, Peter Eisentraut wrote:
> On 3/24/17 10:09, Petr Jelinek wrote:
>> On 24/03/17 15:05, Peter Eisentraut wrote:
>>> On 3/23/17 19:32, Petr Jelinek wrote:
>>>> Yes, I also forgot to check if the table actually exists on subscriber
>>>> when fetching them in CREATE SUBSCRIPTION (we have check during
>>>> replication but not there).
>>> I think for this we can probably just change the missing_ok argument of
>>> RangeVarGetRelid() to false.
>>>
>>> Unless we want the custom error message, in which case we need to change
>>> AlterSubscription_refresh(), because right now it errors out because
>>> missing_ok = false.
>>>
>> You are right, stupid me.
> Committed this version.
>

Minor niggle:

bench=# DROP PUBLICATION pgbench;
DROP STATISTICS <===

I'm guessing that notification is wrong.

regards

Mark


From: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
To: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>, Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-25 04:52:17
Message-ID: fba744fa-4727-950e-7745-48d2f988a61b@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 3/25/17 00:45, Mark Kirkwood wrote:
> Minor niggle:
>
> bench=# DROP PUBLICATION pgbench;
> DROP STATISTICS <===
>
> I'm guessing that notification is wrong.

Fixed.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-29 08:14:28
Message-ID: 03a1305e84a17ad729f4075fcf3f4767@xs4all.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2017-03-09 11:06, Erik Rijkers wrote:
>>>
>>> I use three different machines (2 desktop, 1 server) to test logical
>>> replication, and all three have now at least once failed to correctly
>>> synchronise a pgbench session (amidst many succesful runs, of course)
>>

(At the moment I am using these patches for tests:)

> 0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch +
> 0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch+
> 0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch +
> 0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch +
> 0005-Skip-unnecessary-snapshot-builds.patch +

The failed tests that I kept seeing (see the
pgbench-over-logical-replication tests upthread) were never really
'solved'.

But I have now finally figured out what caused these unexpected failed
tests: it was wal_sender_timeout, or rather its default of 60 s.

This caused 'terminating walsender process due to replication timeout'
on the primary (not strictly an error), and the concomitant ERROR on
the replica: 'could not receive data from WAL stream: server closed the
connection unexpectedly'.

Here is a typical example (primary and replica logs interleaved by time,
each line tagged 'primary' or 'replica'):

[...]
2017-03-24 16:21:38.129 CET [15002] primary LOG: using stale
statistics instead of current ones because stats collector is not
responding
2017-03-24 16:21:42.690 CET [27515] primary LOG: using stale
statistics instead of current ones because stats collector is not
responding
2017-03-24 16:21:42.965 CET [14999] replica LOG: using stale
statistics instead of current ones because stats collector is not
responding
2017-03-24 16:21:49.816 CET [14930] primary LOG: terminating
walsender process due to
2017-03-24 16:21:49.817 CET [14926] replica ERROR: could not
receive data from WAL stream: server closed the connection unexpectedly
2017-03-24 16:21:49.824 CET [27502] replica LOG: worker process:
logical replication worker for subscription 24864 (PID 14926) exited
with exit code 1
2017-03-24 16:21:49.824 CET [27521] replica LOG: starting logical
replication worker for subscription "sub1"
2017-03-24 16:21:49.828 CET [15008] replica LOG: logical
replication apply for subscription sub1 started
2017-03-24 16:21:49.832 CET [15009] primary LOG: received
replication command: IDENTIFY_SYSTEM
2017-03-24 16:21:49.832 CET [15009] primary LOG: received
replication command: START_REPLICATION SLOT "sub1" LOGICAL 3/FC976440
(proto_version '1', publication_names '"pub1"')
2017-03-24 16:21:49.833 CET [15009] primary DETAIL: streaming
transactions committing after 3/FC889810, reading WAL from 3/FC820FC0
2017-03-24 16:21:49.833 CET [15009] primary LOG: starting logical
decoding for slot "sub1"
2017-03-24 16:21:50.471 CET [15009] primary DETAIL: Logical
decoding will begin using saved snapshot.
2017-03-24 16:21:50.471 CET [15009] primary LOG: logical decoding
found consistent point at 3/FC820FC0
2017-03-24 16:21:51.169 CET [15008] replica DETAIL: Key
(hid)=(9014) already exists.
2017-03-24 16:21:51.169 CET [15008] replica ERROR: duplicate key
value violates unique constraint "pgbench_history_pkey"
2017-03-24 16:21:51.170 CET [27502] replica LOG: worker process:
logical replication worker for subscription 24864 (PID 15008) exited
with exit code 1
2017-03-24 16:21:51.170 CET [27521] replica LOG: starting logical
replication worker for subscription "sub1"
[...]

My primary and replica were always on a single machine (making it more
likely that that timeout is reached?). In my testing it seems that
reaching the timeout on the primary (and 'closing the connection
unexpectedly' on the replica) does not necessarily break the logical
replication. But almost all log-rep failures that I have seen were
started by this sequence of events.

After setting wal_sender_timeout to 3 minutes there were no more
failed tests.
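
(For completeness, one way to raise it - this is just a sketch, editing
postgresql.conf and reloading works equally well - is, on the primary:)

ALTER SYSTEM SET wal_sender_timeout = '3min';
SELECT pg_reload_conf();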

Perhaps it warrants setting wal_sender_timeout a bit higher than the
current default of 60 seconds? After all, I also saw the 'replication
timeout' / 'closed the connection' pair of messages rather often during
non-failing tests. (These also disappeared almost completely with a
higher setting of wal_sender_timeout.)

In any case it would be good to mention the setting (and its potentially
detrimental effect) somewhere near the logical replication documentation.

(I read about wal_sender_timeout and the keepalive ping; perhaps there's
(still) something amiss there? Just a guess, I don't know.)

As I said, I saw no more failures with the higher 3-minute setting, with
one exception: the one test that straddled the DST change (Saturday 24
March, 02:00 h). I am happy to discount that one failure, but strictly
speaking I suppose it should be able to take DST in its stride.

Thanks,

Erik Rijkers


From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: Erik Rijkers <er(at)xs4all(dot)nl>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-29 13:06:41
Message-ID: 1f810a1c-d57b-e8af-44cb-2e55defb30f8@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 29/03/17 10:14, Erik Rijkers wrote:
> [...]
>
> But I have now finally figured out what caused these unexpected failed
> tests: it was wal_sender_timeout, or rather its default of 60 s.
>
> [...]
>
> (I read about wal_sender_timeout and the keepalive ping; perhaps there's
> (still) something amiss there? Just a guess, I don't know.)
>
> [...]

Hi,

I think what you have seen is because of this:
https://www.postgresql.org/message-id/flat/b235fa69-147a-5e09-f8f3-3f780a1abb2e(at)2ndquadrant(dot)com#b235fa69-147a-5e09-f8f3-3f780a1abb2e@2ndquadrant.com

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-30 21:44:22
Message-ID: 9614f59cbb938b65dfa22c292eea4f85@xs4all.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


(At the moment using these patches for tests:)
0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch +
0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch+
0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch +
0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch +
0005-Skip-unnecessary-snapshot-builds.patch +
and now (Tuesday 30) added:
0001-Fix-remote-position-tracking-in-logical-replication.patch

> I think what you have seen is because of this:
> https://www.postgresql.org/message-id/flat/b235fa69-147a-5e09-f8f3-3f780a1abb2e(at)2ndquadrant(dot)com#b235fa69-147a-5e09-f8f3-3f780a1abb2e@2ndquadrant.com
>

You were right: with that 6th patch (and wal_sender_timeout back at its
default of 60s) there are no errors either (I tested on all 3 test
machines).

I must have missed that last patch when you posted it. Anyway, all seems
fine now; I hope the above patches can all be committed soon.

thanks,

Erik Rijkers