Re: Streaming replication on 9.1-beta2 after pg_restore is very slow

Lists: pgsql-general
From: David Hartveld <David(dot)Hartveld(at)mendix(dot)com>
To: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Streaming replication on 9.1-beta2 after pg_restore is very slow
Date: 2011-07-06 15:54:17
Message-ID: 0317654684C3CF48B06D8FF5AE5D2EE0CCE0@Win-Exchange-02.MENDIXDOMAIN.local
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Hi all,

I am experimenting with (synchronous) streaming replication on postgresql 9.1 beta 2 and am having performance problems. I have initially set up an (asynchronous) streaming replication master cluster with postgresql 9.0, which streamed to a single slave cluster. This seemed to work quite well. Then I've mostly copied the configuration to a 9.1 beta 2 cluster (master and slave) to see how synchronous replication would behave.

The master cluster, when empty after an initdb (pg_createcluster on debian) seems to properly stream changes to one or more slave clusters when correctly set up. I watch the master and slave with pg_current_xlog_location() on master and pg_last_xlog_receive_location() and pg_last_xlog_replay_location(). It seems that slaves pick up changes, such as a simple database creation, or updating a role password, or adding a role. But when I then do a restoredb on the master, the slaves quickly fall behind and only very slowly catch up (after maybe an hour or something...)

The log on the slave is filled with statements similar to the following:
LOG: streaming replication successfully connected to primary
LOG: record with zero length at 0/9B7A010
FATAL: terminating walreceiver process due to administrator command

The slave log file also contains the following line a number of times (with the numbers of course a bit different every time):
LOG: invalid magic number 0000 in log file 0, segment 9, offset 10878976

The log on the master contains several lines with:
LOG: could not send data to client: Connection reset by peer

Is there possibly a known issue with the beta, or do I have to configure my cluster differently for 9.1? I'm a bit lost, and would appreciate any comments. Below, I've added server configurations. I'm running postgresql from debian packages from the experimental suite.

Thanks,
David Hartveld
--
== Master configuration ==
"version";"PostgreSQL 9.1beta2 on x86_64-pc-linux-gnu, compiled by gcc-4.6.real (Debian 4.6.0-12) 4.6.1 20110608 (prerelease), 64-bit"
"archive_command";"cp %p /walshipping/9.1/sr-master/%f"
"archive_mode";"on"
"bytea_output";"escape"
"client_encoding";"UNICODE"
"external_pid_file";"/var/run/postgresql/9.1-sr-master.pid"
"lc_collate";"en_US.UTF-8"
"lc_ctype";"en_US.UTF-8"
"listen_addresses";"*"
"log_line_prefix";"%t "
"max_connections";"100"
"max_stack_depth";"2MB"
"max_wal_senders";"3"
"port";"5434"
"server_encoding";"UTF8"
"shared_buffers";"96MB"
"ssl";"on"
"synchronous_standby_names";"*"
"TimeZone";"localtime"
"unix_socket_directory";"/var/run/postgresql"
"wal_buffers";"3MB"
"wal_keep_segments";"32"
"wal_level";"hot_standby"

== Slave configuration ==
"version";"PostgreSQL 9.1beta2 on x86_64-pc-linux-gnu, compiled by gcc-4.6.real (Debian 4.6.0-12) 4.6.1 20110608 (prerelease), 64-bit"
"bytea_output";"escape"
"client_encoding";"UNICODE"
"external_pid_file";"/var/run/postgresql/9.1-sr-slave0.pid"
"hot_standby";"on"
"lc_collate";"en_US.UTF-8"
"lc_ctype";"en_US.UTF-8"
"listen_addresses";"*"
"log_line_prefix";"%t "
"max_connections";"100"
"max_stack_depth";"2MB"
"port";"5434"
"server_encoding";"UTF8"
"shared_buffers";"96MB"
"ssl";"on"
"TimeZone";"localtime"
"unix_socket_directory";"/var/run/postgresql"
"wal_buffers";"3MB"


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: David Hartveld <David(dot)Hartveld(at)mendix(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Streaming replication on 9.1-beta2 after pg_restore is very slow
Date: 2011-07-06 23:07:09
Message-ID: CA+U5nMKqSp50bF59dsKhYHsC9bTZtMJ8BhEgNfRbxpq_XjPaSA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Wed, Jul 6, 2011 at 4:54 PM, David Hartveld
<David(dot)Hartveld(at)mendix(dot)com> wrote:

> Is there possibly a known issue with the beta, or do I have to configure my
> cluster differently for 9.1?

Thanks for trying 9.1beta

No known bugs, no differences in configuration.

You haven't enabled any of the new 9.1 features either so they aren't
likely to be at issue.

So there's something pretty badly screwed up somewhere, though that
looks like pilot error at the moment, sorry.

I'd suggest starting again and see if you can get a reproduceable bug.
I'd be very grateful if you can narrow things down to produce a tight
bug report.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: David Hartveld <David(dot)Hartveld(at)mendix(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Streaming replication on 9.1-beta2 after pg_restore is very slow
Date: 2011-07-07 12:12:29
Message-ID: 0317654684C3CF48B06D8FF5AE5D2EE0D257@Win-Exchange-02.MENDIXDOMAIN.local
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

> -----Oorspronkelijk bericht-----
> Van: Simon Riggs [mailto:simon(at)2ndQuadrant(dot)com]
> Verzonden: donderdag 7 juli 2011 01:07
> Aan: David Hartveld
> CC: pgsql-general(at)postgresql(dot)org
> Onderwerp: Re: [GENERAL] Streaming replication on 9.1-beta2 after pg_restore
> is very slow
>
> On Wed, Jul 6, 2011 at 4:54 PM, David Hartveld <David(dot)Hartveld(at)mendix(dot)com>
> wrote:
>
> > Is there possibly a known issue with the beta, or do I have to
> > configure my cluster differently for 9.1?
>
> Thanks for trying 9.1beta
>
> No known bugs, no differences in configuration.
>
> You haven't enabled any of the new 9.1 features either so they aren't likely to be
> at issue.
>
> So there's something pretty badly screwed up somewhere, though that looks like
> pilot error at the moment, sorry.
>
> I'd suggest starting again and see if you can get a reproduceable bug.
> I'd be very grateful if you can narrow things down to produce a tight bug report.

I've just submitted bug report 6094, with a complete description of what I have done. The replication stream is reproducibly very slow. If you need more information, let me know.


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: David Hartveld <David(dot)Hartveld(at)mendix(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Streaming replication on 9.1-beta2 after pg_restore is very slow
Date: 2011-07-07 12:27:43
Message-ID: CA+U5nMJebiNrFULkjhdOoGS24QpF8rXmi5-cfhaoL_9aYrfrKQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Thu, Jul 7, 2011 at 1:12 PM, David Hartveld
<David(dot)Hartveld(at)mendix(dot)com> wrote:

>>
>> On Wed, Jul 6, 2011 at 4:54 PM, David Hartveld <David(dot)Hartveld(at)mendix(dot)com>
>> wrote:
>>
>> > Is there possibly a known issue with the beta, or do I have to
>> > configure my cluster differently for 9.1?
>>
>> Thanks for trying 9.1beta
>>
>> No known bugs, no differences in configuration.
>>
>> You haven't enabled any of the new 9.1 features either so they aren't likely to be
>> at issue.
>>
>> So there's something pretty badly screwed up somewhere, though that looks like
>> pilot error at the moment, sorry.
>>
>> I'd suggest starting again and see if you can get a reproduceable bug.
>> I'd be very grateful if you can narrow things down to produce a tight bug report.
>
> I've just submitted bug report 6094, with a complete description of what I have done. The replication stream is reproducibly very slow. If you need more information, let me know.

Bug 6094 contains no additional information and there is not yet a
confirmed bug.

Your output indicates that there is a problem in your replication
setup and this is why the slave does not catch up.

This is not a performance issue. It is either a bug in replication, or
a user configuration issue. Since few things have changed in 9.1 in
this area, at the moment the balance of probability is towards user
error. If you can provide a more isolated bug report we may be able to
investigate.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: David Hartveld <David(dot)Hartveld(at)mendix(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Streaming replication on 9.1-beta2 after pg_restore is very slow
Date: 2011-07-07 13:59:08
Message-ID: 0317654684C3CF48B06D8FF5AE5D2EE0D2B8@Win-Exchange-02.MENDIXDOMAIN.local
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

> > Your output indicates that there is a problem in your replication
> > setup and this is why the slave does not catch up.
> >
> > This is not a performance issue. It is either a bug in replication, or
> > a user configuration issue. Since few things have changed in 9.1 in
> > this area, at the moment the balance of probability is towards user
> > error. If you can provide a more isolated bug report we may be able to
> investigate.
>
> Apologies for the double post, I thought to have understood that in your
> previous message.
>
> I've read the online 9.1 manual and configured the clusters based on that
> information (and on the defaults provided by debian). I've attached the
> postgresql.conf files I'm using for master and slave. Do you need other
> information from my final setup? Log files, configuration files, the SQL script fed
> to psql, which shows the slow replication...?

I've been looking at my log files on master and slave a bit better, after having set log_min_messages = debug5. I can see that somehow the master and slave don't properly work together: the slave attempts to send some data ('sending write/flush/apply') (I'm assuming this is the slaves current location in the WAL?) and then 'terminates process due to administrator command', while the master is sending data ('write/flush/apply') (the next part of the WAL?), and then 'could not send data to the client: Connection reset by peer', after which the server process exits. I'm hoping this provides you with more information on what is going on. Do point me in the right direction if you need me to investigate further. I have attached two pieces of the master and slave log files, which should correspond w.r.t. their interaction, where you can see the above behavior.

Hoping that this will bring me a bit closer to a solution or a proper bug report,
David Hartveld


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: David Hartveld <David(dot)Hartveld(at)mendix(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Streaming replication on 9.1-beta2 after pg_restore is very slow
Date: 2011-07-07 14:21:11
Message-ID: CA+U5nMJD4LS2=orsOOgU+sKapTmn3-_fmox_u5fjvnc7JjAUeA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Thu, Jul 7, 2011 at 2:59 PM, David Hartveld
<David(dot)Hartveld(at)mendix(dot)com> wrote:

> I've been looking at my log files on master and slave a bit better, after having set log_min_messages = debug5. I can see that somehow the master and slave don't properly work together: the slave attempts to send some data ('sending write/flush/apply') (I'm assuming this is the slaves current location in the WAL?) and then 'terminates process due to administrator command', while the master is sending data ('write/flush/apply') (the next part of the WAL?), and then 'could not send data to the client: Connection reset by peer', after which the server process exits. I'm hoping this provides you with more information on what is going on. Do point me in the right direction if you need me to investigate further. I have attached two pieces of the master and slave log files, which should correspond w.r.t. their interaction, where you can see the above behavior.

Ah, so synchronous_standby_names is set on the standby.

Please reset that so we are operating asynchronously, then rerun tests
to see if that avoids the error. You'll probably need to fully
re-generate the standby server before doing this.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: David Hartveld <david(dot)hartveld(at)mendix(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Streaming replication on 9.1-beta2 after pg_restore is very slow
Date: 2011-07-07 14:45:06
Message-ID: CA+U5nMKMLRL9Pcy9Eg9NyBnGOUVeKLdwYvhQMO5oEECRj+V08Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Thu, Jul 7, 2011 at 3:37 PM, David Hartveld
<david(dot)hartveld(at)mendix(dot)com> wrote:

> Other suggestions?

If speed is your concern, a little performance tuning might help,
judging from your configs.
http://www.2ndQuadrant.com/books/ or other sources will help.

I'm interested in seeing some ERROR messages from either system,
showing their sequence between master/standby. At the moment its not
clear what the first error is. Subsequent messages are less
interesting.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: David Hartveld <David(dot)Hartveld(at)mendix(dot)com>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Streaming replication on 9.1-beta2 after pg_restore is very slow
Date: 2011-07-07 15:14:05
Message-ID: 14178.1310051645@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Looking back, I notice that you built with gcc 4.6.0. At least on Red
Hat machines, that gcc has a rather nasty optimization bug that breaks
WAL replay, with symptoms that seem to match what you have here ---
namely, the replay process quits and has to be restarted every few pages.
I'm betting Debian hasn't fixed that bug yet either and so you need this
post-beta2 patch:

http://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=c2ba0121c73b7461331104a46d140156e847572a

regards, tom lane


From: Karsten Hilbert <Karsten(dot)Hilbert(at)gmx(dot)net>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Streaming replication on 9.1-beta2 after pg_restore is very slow
Date: 2011-07-07 15:29:24
Message-ID: 20110707152924.GN3513@hermes.hilbert.loc
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Thu, Jul 07, 2011 at 11:14:05AM -0400, Tom Lane wrote:

> Looking back, I notice that you built with gcc 4.6.0. At least on Red
> Hat machines, that gcc has a rather nasty optimization bug that breaks
> WAL replay, with symptoms that seem to match what you have here ---
> namely, the replay process quits and has to be restarted every few pages.
> I'm betting Debian hasn't fixed that bug yet either and so you need this
> post-beta2 patch:
>
> http://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=c2ba0121c73b7461331104a46d140156e847572a

Do they know about this gcc bug ?

Karsten
--
GPG key ID E4071346 @ gpg-keyserver.de
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Karsten Hilbert <Karsten(dot)Hilbert(at)gmx(dot)net>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Streaming replication on 9.1-beta2 after pg_restore is very slow
Date: 2011-07-07 16:23:37
Message-ID: 15504.1310055817@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Karsten Hilbert <Karsten(dot)Hilbert(at)gmx(dot)net> writes:
> On Thu, Jul 07, 2011 at 11:14:05AM -0400, Tom Lane wrote:
>> I'm betting Debian hasn't fixed that bug yet either and so you need this
>> post-beta2 patch:
>>
>> http://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=c2ba0121c73b7461331104a46d140156e847572a

> Do they know about this gcc bug ?

Can't say about Debian in particular, but upstream gcc certainly knows
about it.
https://bugzilla.redhat.com/show_bug.cgi?id=712480
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49390

regards, tom lane


From: David Hartveld <david(dot)hartveld(at)mendix(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Karsten Hilbert <Karsten(dot)Hilbert(at)gmx(dot)net>, <pgsql-general(at)postgresql(dot)org>, Simon Riggs <simon(at)2ndQuadrant(dot)com>
Subject: Re: Streaming replication on 9.1-beta2 after pg_restore is very slow
Date: 2011-07-08 08:19:15
Message-ID: 4E16BD83.9080008@mendix.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Op 07-07-11 18:23, Tom Lane schreef:
> Karsten Hilbert<Karsten(dot)Hilbert(at)gmx(dot)net> writes:
>> On Thu, Jul 07, 2011 at 11:14:05AM -0400, Tom Lane wrote:
>>> I'm betting Debian hasn't fixed that bug yet either and so you need this
>>> post-beta2 patch:
>>>
>>> http://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=c2ba0121c73b7461331104a46d140156e847572a
>
>> Do they know about this gcc bug ?
>
> Can't say about Debian in particular, but upstream gcc certainly knows
> about it.
> https://bugzilla.redhat.com/show_bug.cgi?id=712480
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49390

I'm guessing this is probably the cause of my problem, then. I'll try to
rebuild with an earlier version of GCC, and let you know the results.

Thanks to all of you!
David


From: David Hartveld <david(dot)hartveld(at)mendix(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Karsten Hilbert <Karsten(dot)Hilbert(at)gmx(dot)net>, <pgsql-general(at)postgresql(dot)org>, Simon Riggs <simon(at)2ndQuadrant(dot)com>
Subject: Re: Streaming replication on 9.1-beta2 after pg_restore is very slow
Date: 2011-07-08 10:26:06
Message-ID: 4E16DB3E.302@mendix.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Op 08-07-11 10:19, David Hartveld schreef:
> Op 07-07-11 18:23, Tom Lane schreef:
>> Karsten Hilbert<Karsten(dot)Hilbert(at)gmx(dot)net> writes:
>>> On Thu, Jul 07, 2011 at 11:14:05AM -0400, Tom Lane wrote:
>>>> I'm betting Debian hasn't fixed that bug yet either and so you need
>>>> this
>>>> post-beta2 patch:
>>>>
>>>> http://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=c2ba0121c73b7461331104a46d140156e847572a
>>>>
>>
>>> Do they know about this gcc bug ?
>>
>> Can't say about Debian in particular, but upstream gcc certainly knows
>> about it.
>> https://bugzilla.redhat.com/show_bug.cgi?id=712480
>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49390
>
> I'm guessing this is probably the cause of my problem, then. I'll try to
> rebuild with an earlier version of GCC, and let you know the results.

I've rebuilt with debian gcc 4.4.5-8 and am running 9.1 now without any
of the above problems. I've reported a bug for debian postgresql-9.1.
I've also seen that the current gcc-4.6 version for debian is 4.6.1, so
that shouldn't create any problems anymore (or at least the gcc bug is
marked as applied in 4.6.1 in the gcc bugtracker).

Thanks for all your time, guys!
Greetings, David