Streaming Replication patch for CommitFest 2009-09

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Streaming Replication patch for CommitFest 2009-09
Date: 2009-09-14 11:24:45
Message-ID: 3f0b79eb0909140424q6bb8e6a3ka63b5816bb1d3c45@mail.gmail.com
Lists: pgsql-hackers

Hi,

Here is the latest version of Streaming Replication (SR) patch.

There were four major problems in the SR patch submitted for the last
CommitFest. The latest patch has overcome those problems:

> 1. Change the way synchronization is done when standby connects to
> primary. After authentication, standby should send a message to primary,
> stating the <begin> point (where <begin> is an XLogRecPtr, not a WAL
> segment name). Primary starts streaming WAL starting from that point,
> and keeps streaming forever. pg_read_xlogfile() needs to be removed.

In the latest version, the standby first performs archive recovery for as
long as WAL is available in pg_xlog or the archival area (the latter only
possible if restore_command is supplied). When recovery hits an error (e.g.,
no further WAL file is available), the standby starts the walreceiver process
and asks the primary server to ship the WAL records following the last
applied record. The primary then sends WAL records continuously, and the
standby continuously receives, writes, and replays them.

> 2. The primary should have no business reading back from the archive.
> The standby can read from the archive, as it can today.

I removed the capability to restore archived files from the primary. Also,
so that a WAL file still needed by the standby is not removed from pg_xlog
before it has been sent, I tweaked the checkpoint's WAL recycling policy.

> 3. Need to support multiple WALSenders. While multiple slave support
> isn't 1st priority right now, it's not acceptable that a new WALSender
> can't connect while one is active already. That can cause trouble in
> case of network problems etc.

In the latest version, more than one standby can establish a connection to
the primary, and the WAL is shipped to each of those standbys concurrently.
The maximum number of standbys can be specified with a GUC variable
(max_wal_senders: better name?).

> 4. It is not acceptable that normal backends have to wait for walsender
> to send data. That means that connecting a standby behind a slow
> connection to the primary can grind the primary to a halt. walsender
> needs to be able to read data from disk, not just from shared memory. (I
> raised this back in December
> http://archives.postgresql.org/message-id/495106FA.1050605@enterprisedb.com)

In the latest version, walsender reads the WAL records from disk instead
of from wal_buffers. So when a backend needs to evict old data from
wal_buffers to insert new records, it no longer has to wait until walsender
has read that data.
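
To illustrate the idea (this is only a sketch with illustrative names and
chunk size, not the patch's own code), reading a chunk of an already-written
segment from pg_xlog could look like this:

    #include <sys/types.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define SEND_CHUNK  (128 * 1024)    /* illustrative chunk size */

    /*
     * Read up to SEND_CHUNK bytes of WAL from the given segment file,
     * starting at 'offset'.  The caller hands the buffer to the send
     * routine.  Returns the number of bytes read, or -1 on error.
     */
    static ssize_t
    read_wal_chunk(const char *segpath, off_t offset, char *buf)
    {
        ssize_t nread;
        int     fd = open(segpath, O_RDONLY);

        if (fd < 0)
            return -1;
        if (lseek(fd, offset, SEEK_SET) < 0)
        {
            close(fd);
            return -1;
        }
        nread = read(fd, buf, SEND_CHUNK);
        close(fd);
        return nread;
    }

Since the segment was written only moments earlier, such a read will normally
be served from the OS cache rather than from disk.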

> As a hint, I think you'll find it a lot easier if you implement only
> asynchronous replication at first. That reduces the amount of
> inter-process communication a lot. You can then add synchronous
> capability in a later commitfest. I would also suggest that for point 4,
> you implement WAL sender so that it *only* reads from disk at first, and
> only add the capability send from wal_buffers later on, and only if
> performance testing shows that it's needed.

I'm advancing development of SR in stages, as Heikki suggested. So note
that the current patch provides only the core part of *asynchronous*
log-shipping. There are many TODO items for later CommitFests:
synchronous capability, more useful statistics for SR, some administration
features, and so on.

The attached tarball contains several files. A description of each file,
a brief procedure for setting up SR, and a functional overview are on the
wiki. I'm also going to add as much of the SR design description to the
wiki as possible.
http://wiki.postgresql.org/wiki/Streaming_Replication

If you notice anything, please feel free to comment!

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment Content-Type Size
SR_0914.tgz application/x-gzip 137.4 KB

From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming Replication patch for CommitFest 2009-09
Date: 2009-09-14 15:47:28
Message-ID: alpine.GSO.2.01.0909141133180.1786@westnet.com
Lists: pgsql-hackers

This is looking really neat now; making async replication really solid
first before even trying to move on to sync is the right way to go here
IMHO. I just cleaned up the docs on the Wiki page. When this patch is
closer to being committed I officially volunteer to do the same on the
internal SGML docs; someone should nudge me when the patch is at that
point if I don't take care of it before then.

Putting on my DBA hat for a minute, the first question I see people asking
is "how do I measure how far behind the slaves are?". Presumably you can
get that out of pg_controldata; my first question is whether that's
complete enough information? If not, what else should be monitored?

I don't think running that program is going to fly for a production-quality
integrated replication setup though. The UI admins are going to want is one
that allows querying this easily via a standard database query. Most
monitoring systems can issue psql queries but not necessarily run a remote
binary. I think that parts of pg_controldata need to get exposed via
some number of built-in UDFs instead, and whatever new internal state
makes sense too. I could help out writing those, if someone more familiar
with the replication internals can help me nail down a spec on what to
watch.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming Replication patch for CommitFest 2009-09
Date: 2009-09-14 16:05:24
Message-ID: 4AAE69C4.9030407@enterprisedb.com
Lists: pgsql-hackers

Greg Smith wrote:
> Putting on my DBA hat for a minute, the first question I see people
> asking is "how do I measure how far behind the slaves are?". Presumably
> you can get that out of pg_controldata; my first question is whether
> that's complete enough information? If not, what else should be monitored?
>
> I don't think running that program is going to fly for a production-quality
> integrated replication setup though. The UI admins are going to want is
> one that allows querying this easily via a standard database query. Most
> monitoring systems can issue psql queries but not necessarily run a
> remote binary. I think that parts of pg_controldata need to get
> exposed via some number of built-in UDFs instead, and whatever new
> internal state makes sense too. I could help out writing those, if
> someone more familiar with the replication internals can help me nail
> down a spec on what to watch.

Yep, assuming for a moment that hot standby goes into 8.5, status
functions that return such information are the natural interface. It
should be trivial to write them as soon as hot standby and streaming
replication are in place.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming Replication patch for CommitFest 2009-09
Date: 2009-09-14 16:06:30
Message-ID: 4AAE6A06.2070505@dunslane.net
Lists: pgsql-hackers

Greg Smith wrote:
> This is looking really neat now, making async replication really solid
> first before even trying to move on to sync is the right way to go
> here IMHO.

I agree with both of those sentiments.

One question I have is what the level of traffic is between the
master and the slave. I know a number of people have found the traffic
involved in shipping log files to be a pain, and thus we get things
like pglesslog.

cheers

andrew


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Fujii Masao" <masao(dot)fujii(at)gmail(dot)com>, "Greg Smith" <gsmith(at)gregsmith(dot)com>
Cc: "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming Replication patch for CommitFest 2009-09
Date: 2009-09-14 17:12:38
Message-ID: 4AAE3336020000250002AF06@gw.wicourts.gov
Lists: pgsql-hackers

Greg Smith <gsmith(at)gregsmith(dot)com> wrote:

> Putting on my DBA hat for a minute, the first question I see people
> asking is "how do I measure how far behind the slaves are?".
> Presumably you can get that out of pg_controldata; my first question
> is whether that's complete enough information? If not, what else
> should be monitored?
>
> I don't think running that program is going to fly for a production-
> quality integrated replication setup though. The UI admins are
> going to want is one that allows querying this easily via a standard
> database query. Most monitoring systems can issue psql queries but
> not necessarily run a remote binary. I think that parts of
> pg_controldata need to get exposed via some number of built-in UDFs
> instead, and whatever new internal state makes sense too. I could
> help out writing those, if someone more familiar with the
> replication internals can help me nail down a spec on what to watch.

IMO, it would be best if the status could be sent via NOTIFY. In my
experience, this results in monitoring which both has less overhead
and is more current. We tend to be almost as interested in metrics on
throughput as lag. Backlogged volume can be interesting, too, if it's
available.

-Kevin


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming Replication patch for CommitFest 2009-09
Date: 2009-09-14 17:36:55
Message-ID: 4AAE7F37.4030402@enterprisedb.com
Lists: pgsql-hackers

Kevin Grittner wrote:
> Greg Smith <gsmith(at)gregsmith(dot)com> wrote:
>> I don't think running that program is going to fly for a production-
>> quality integrated replication setup though. The UI admins are
>> going to want is one that allows querying this easily via a standard
>> database query. Most monitoring systems can issue psql queries but
>> not necessarily run a remote binary. I think that parts of
>> pg_controldata need to get exposed via some number of built-in UDFs
>> instead, and whatever new internal state makes sense too. I could
>> help out writing those, if someone more familiar with the
>> replication internals can help me nail down a spec on what to watch.
>
> IMO, it would be best if the status could be sent via NOTIFY.

To where?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: "Fujii Masao" <masao(dot)fujii(at)gmail(dot)com>, "Greg Smith" <gsmith(at)gregsmith(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming Replication patch for CommitFest 2009-09
Date: 2009-09-14 17:54:38
Message-ID: 4AAE3D0E020000250002AF11@gw.wicourts.gov
Lists: pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> Kevin Grittner wrote:

>> IMO, it would be best if the status could be sent via NOTIFY.
>
> To where?

To registered listeners?

I guess I should have worded that as "it would be best if a change in
replication status could be signaled via NOTIFY" -- does that satisfy,
or am I missing your point entirely?

-Kevin


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming Replication patch for CommitFest 2009-09
Date: 2009-09-14 17:54:49
Message-ID: 4AAE8369.6000303@enterprisedb.com
Lists: pgsql-hackers

Fujii Masao wrote:
> Here is the latest version of Streaming Replication (SR) patch.

The first thing that caught my eye is that I don't think "replication"
should be a real database. Rather, it should be a keyword in
pg_hba.conf, like the existing "all", "sameuser", "samerole" keywords
that you can put into the database column.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming Replication patch for CommitFest 2009-09
Date: 2009-09-14 19:41:44
Message-ID: 1252957305.431.111.camel@ebony.2ndQuadrant
Lists: pgsql-hackers


On Mon, 2009-09-14 at 20:24 +0900, Fujii Masao wrote:

> The latest patch has overcome those problems:

Well done. I hope to look at it myself in a few days time.

--
Simon Riggs www.2ndQuadrant.com


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: gsmith(at)gregsmith(dot)com
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming Replication patch for CommitFest 2009-09
Date: 2009-09-15 04:24:22
Message-ID: 3f0b79eb0909142124r332e0013i60d3b114846ed450@mail.gmail.com
Lists: pgsql-hackers

Hi,

On Tue, Sep 15, 2009 at 12:47 AM, Greg Smith <gsmith(at)gregsmith(dot)com> wrote:
> Putting on my DBA hat for a minute, the first question I see people asking
> is "how do I measure how far behind the slaves are?".  Presumably you can
> get that out of pg_controldata; my first question is whether that's complete
> enough information?  If not, what else should be monitored?

Currently the progress of replication is shown only in the ps display, so
the following three steps are necessary to measure the gap between the
servers.

1. execute pg_current_xlog_location() to check how far the primary has
written WAL.
2. execute 'ps' to check how far the standby has written WAL.
3. compare the above results.

This is very messy. A more user-friendly monitoring feature is necessary,
and developing one is a TODO item for a later CommitFest.

I'm thinking of something like pg_standbys_xlog_location(), which returns
one row per standby server, showing the pid of the walsender, the host
name/port number/user OID of the standby, and the location up to which the
standby has written/flushed WAL. A DBA could then measure the gap with one
query on the primary, by combining pg_current_xlog_location() and
pg_standbys_xlog_location(). Thoughts?
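
The gap itself is then just a byte difference. For example, assuming the
8.4-era XLogRecPtr layout (two uint32 fields, xlogid and xrecoff) and the
XLogFileSize macro from access/xlog_internal.h, it could be computed like
this (a sketch, not part of the patch):

    #include "postgres.h"
    #include "access/xlogdefs.h"
    #include "access/xlog_internal.h"

    /* Byte distance between two WAL locations (primary ahead of standby). */
    static uint64
    xlog_byte_lag(XLogRecPtr primary, XLogRecPtr standby)
    {
        uint64  p = (uint64) primary.xlogid * XLogFileSize + primary.xrecoff;
        uint64  s = (uint64) standby.xlogid * XLogFileSize + standby.xrecoff;

        return p - s;
    }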

But the problem might be what happens after the primary has gone
down. The current write location of the primary can then no longer be
checked via pg_current_xlog_location(), and might need to be calculated
from the WAL files on the primary. Is a tool which performs such a
calculation necessary?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming Replication patch for CommitFest 2009-09
Date: 2009-09-15 04:54:10
Message-ID: 3f0b79eb0909142154qaaea56di9e28eeca00bef99@mail.gmail.com
Lists: pgsql-hackers

Hi,

On Tue, Sep 15, 2009 at 1:06 AM, Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:
> One question I have is what is the level of traffic involved between the
> master and the slave. I know numbers of people have found the traffic
> involved in shipping of log files to be a pain, and thus we get things like
> pglesslog.

That is almost the same as the WAL write traffic on the primary. In fact,
the contents of the WAL files written on the standby are exactly the same
as those on the primary. Currently SR provides no compression of the
traffic. Should we introduce something like walsender_hook/walreceiver_hook
to cooperate with an add-on compression program like pglesslog?

If you always use PITR instead of normal recovery, full_page_writes = off
might be another solution.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming Replication patch for CommitFest 2009-09
Date: 2009-09-15 06:32:48
Message-ID: 3f0b79eb0909142332o199f5db3w8af2d10d3cd7e22a@mail.gmail.com
Lists: pgsql-hackers

Hi,

On Tue, Sep 15, 2009 at 2:54 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> The first thing that caught my eye is that I don't think "replication"
> should be a real database. Rather, it should be a keyword in
> pg_hba.conf, like the existing "all", "sameuser", "samerole" keywords
> that you can put into the database column.

I'll try that! It might only be necessary to prevent walsender from
accessing pg_database and checking whether the target database is present,
in InitPostgres().

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming Replication patch for CommitFest 2009-09
Date: 2009-09-15 10:49:02
Message-ID: 4AAF711E.5060904@enterprisedb.com
Lists: pgsql-hackers

Kevin Grittner wrote:
> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> Kevin Grittner wrote:
>
>>> IMO, it would be best if the status could be sent via NOTIFY.
>> To where?
>
> To registered listeners?
>
> I guess I should have worded that as "it would be best if a change in
> replication status could be signaled via NOTIFY" -- does that satisfy,
> or am I missing your point entirely?

Ok, makes more sense now.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming Replication patch for CommitFest 2009-09
Date: 2009-09-15 10:53:52
Message-ID: 4AAF7240.6050505@enterprisedb.com
Lists: pgsql-hackers

After playing with this a little bit, I think we need logic in the slave
to reconnect to the master if the connection is broken for some reason,
or can't be established in the first place. At the moment, that is
considered as the end of recovery, and the slave starts up. You have the
trigger file mechanism to stop that, but it only gives you a chance to
manually kill and restart the slave before it chooses a new timeline and
starts up, it doesn't reconnect automatically.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming Replication patch for CommitFest 2009-09
Date: 2009-09-16 02:37:20
Message-ID: 3f0b79eb0909151937v3884fc7ar2d6766a43d354c0a@mail.gmail.com
Lists: pgsql-hackers

Hi,

On Tue, Sep 15, 2009 at 7:53 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> After playing with this a little bit, I think we need logic in the slave
> to reconnect to the master if the connection is broken for some reason,
> or can't be established in the first place. At the moment, that is
> considered as the end of recovery, and the slave starts up. You have the
> trigger file mechanism to stop that, but it only gives you a chance to
> manually kill and restart the slave before it chooses a new timeline and
> starts up, it doesn't reconnect automatically.

I was thinking that the automatic reconnection capability is the TODO item
for the later CF. The infrastructure for it has already been introduced in the
current patch. Please see the macro MAX_WALRCV_RETRIES (backend/
postmaster/walreceiver.c). This is the maximum number of times to retry
walreceiver. In the current version, this is the fixed value, but we can make
this user-configurable (parameter of recovery.conf is suitable, I think).

Also a parameter like retries_interval might be necessary. This parameter
indicates the interval between each reconnection attempt.

Do you think that these parameters should be introduced right now? or
the later CF?

BTW, these parameters are provided in MySQL replication.
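
To make the idea concrete, here is a minimal sketch of such a retry loop;
the parameter names and walrcv_connect() are illustrative, not the patch's
actual code:

    #include "postgres.h"
    #include "miscadmin.h"          /* for pg_usleep() */

    extern bool walrcv_connect(const char *conninfo);   /* hypothetical */

    /* Hypothetical settings, e.g. read from recovery.conf */
    static int  max_retries    = 3;     /* like MAX_WALRCV_RETRIES today */
    static int  retry_interval = 5;     /* seconds between attempts */

    static bool
    connect_with_retries(const char *conninfo)
    {
        int     attempt;

        for (attempt = 0; attempt < max_retries; attempt++)
        {
            if (walrcv_connect(conninfo))   /* illustrative connect call */
                return true;
            pg_usleep(retry_interval * 1000000L);
        }
        return false;               /* give up; recovery ends as today */
    }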

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming Replication patch for CommitFest 2009-09
Date: 2009-09-16 10:41:15
Message-ID: 3f0b79eb0909160341n330693bdv43262d3115af538a@mail.gmail.com
Lists: pgsql-hackers

Hi,

On Wed, Sep 16, 2009 at 11:37 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> I was thinking that the automatic reconnection capability is the TODO item
> for the later CF. The infrastructure for it has already been introduced in the
> current patch. Please see the macro MAX_WALRCV_RETRIES (backend/
> postmaster/walreceiver.c). This is the maximum number of times to retry
> walreceiver. In the current version, this is the fixed value, but we can make
> this user-configurable (parameter of recovery.conf is suitable, I think).
>
> Also a parameter like retries_interval might be necessary. This parameter
> indicates the interval between each reconnection attempt.
>
> Do you think that these parameters should be introduced right now? or
> the later CF?

I updated the TODO list on the wiki, and marked the items that I'm going to
develop for the later CommitFest.
http://wiki.postgresql.org/wiki/Streaming_Replication#Todo_and_Claim

Do you have any other TODO item? How much is that priority?
And, is there already-listed TODO item which should be developed right
now (CommitFest 2009-09)?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming Replication patch for CommitFest 2009-09
Date: 2009-09-17 08:08:06
Message-ID: 4AB1EE66.6040804@enterprisedb.com
Lists: pgsql-hackers

Fujii Masao wrote:
> On Tue, Sep 15, 2009 at 7:53 PM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> After playing with this a little bit, I think we need logic in the slave
>> to reconnect to the master if the connection is broken for some reason,
>> or can't be established in the first place. At the moment, that is
>> considered as the end of recovery, and the slave starts up. You have the
>> trigger file mechanism to stop that, but it only gives you a chance to
>> manually kill and restart the slave before it chooses a new timeline and
>> starts up, it doesn't reconnect automatically.
>
> I was thinking that the automatic reconnection capability is the TODO item
> for the later CF. The infrastructure for it has already been introduced in the
> current patch. Please see the macro MAX_WALRCV_RETRIES (backend/
> postmaster/walreceiver.c). This is the maximum number of times to retry
> walreceiver. In the current version, this is the fixed value, but we can make
> this user-configurable (parameter of recovery.conf is suitable, I think).

Ah, I see.

Robert Haas suggested a while ago that walreceiver could be a
stand-alone utility, not requiring postmaster at all. That would allow
you to set up streaming replication as another way to implement WAL
archiving. Looking at how the processes interact, there really isn't
much communication between walreceiver and the rest of the system, so
that sounds pretty attractive.

Walreceiver only needs access to shared memory so that it can tell the
startup process how far it has replicated already. Even when we add the
synchronous capability, I don't think we need any more inter-process
communication. Only if we wanted to acknowledge to the master when a
piece of WAL has been successfully replayed would the startup process
need to tell walreceiver about it, but I think we're going to settle for
acknowledging when a piece of log has been fsync'd to disk.

Walreceiver is really a slave to the startup process. The startup
process decides when it's launched, and it's the startup process that
then waits for it to advance. But the way it's set up at the moment, the
startup process needs to ask the postmaster to start it up, and it
doesn't look very robust to me. For example, if launching walreceiver
fails for some reason, startup process will just hang waiting for it.

I'm thinking that walreceiver should be a stand-alone program that the
startup process launches, similar to how it invokes restore_command in
PITR recovery. Instead of using system(), though, it would use
fork+exec, and a pipe to communicate.
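
Roughly like this minimal sketch; the "pg_walreceiver" executable name and
its argument list are made up for illustration, they are not part of the
patch:

    #include <sys/types.h>
    #include <stdio.h>
    #include <unistd.h>

    /*
     * Launch a hypothetical stand-alone walreceiver via fork+exec, with a
     * pipe on which it reports how far WAL has been received.  Returns a
     * FILE* the startup process can read progress lines from, or NULL.
     */
    static FILE *
    launch_walreceiver(const char *conninfo, const char *startptr)
    {
        int     pipefd[2];
        pid_t   pid;

        if (pipe(pipefd) < 0)
            return NULL;

        pid = fork();
        if (pid < 0)
        {
            close(pipefd[0]);
            close(pipefd[1]);
            return NULL;
        }
        if (pid == 0)
        {
            /* child: send progress reports to stdout, then exec */
            dup2(pipefd[1], STDOUT_FILENO);
            close(pipefd[0]);
            close(pipefd[1]);
            execlp("pg_walreceiver", "pg_walreceiver",
                   conninfo, startptr, (char *) NULL);
            _exit(1);           /* exec failed */
        }

        /* parent (startup process): read progress from the pipe */
        close(pipefd[1]);
        return fdopen(pipefd[0], "r");
    }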

Also, when we get around to implementing the "fetch base backup
automatically via the TCP connection" feature, we can't use walreceiver
as it is now for that, because there's no hope of starting the system up
that far without a base backup. I'm not sure if it can or should be
merged with the walreceiver program, but it can't be a postmaster child
process, that's for sure.

Thoughts?

> Also a parameter like retries_interval might be necessary. This parameter
> indicates the interval between each reconnection attempt.

Yeah, maybe, although a hard-coded interval of a few seconds should be
enough to get us started.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming Replication patch for CommitFest 2009-09
Date: 2009-09-17 08:46:48
Message-ID: 9837222c0909170146g7721af7fte033c4a08349f407@mail.gmail.com
Lists: pgsql-hackers

On Thu, Sep 17, 2009 at 10:08, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> Fujii Masao wrote:
>> On Tue, Sep 15, 2009 at 7:53 PM, Heikki Linnakangas
>> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>>> After playing with this a little bit, I think we need logic in the slave
>>> to reconnect to the master if the connection is broken for some reason,
>>> or can't be established in the first place. At the moment, that is
>>> considered as the end of recovery, and the slave starts up. You have the
>>> trigger file mechanism to stop that, but it only gives you a chance to
>>> manually kill and restart the slave before it chooses a new timeline and
>>> starts up, it doesn't reconnect automatically.
>>
>> I was thinking that the automatic reconnection capability is the TODO item
>> for the later CF. The infrastructure for it has already been introduced in the
>> current patch. Please see the macro MAX_WALRCV_RETRIES (backend/
>> postmaster/walreceiver.c). This is the maximum number of times to retry
>> walreceiver. In the current version, this is the fixed value, but we can make
>> this user-configurable (parameter of recovery.conf is suitable, I think).
>
> Ah, I see.
>
> Robert Haas suggested a while ago that walreceiver could be a
> stand-alone utility, not requiring postmaster at all. That would allow
> you to set up streaming replication as another way to implement WAL
> archiving. Looking at how the processes interact, there really isn't
> much communication between walreceiver and the rest of the system, so
> that sounds pretty attractive.

Yes, that would be very very useful.

> Walreceiver is really a slave to the startup process. The startup
> process decides when it's launched, and it's the startup process that
> then waits for it to advance. But the way it's set up at the moment, the
> startup process needs to ask the postmaster to start it up, and it
> doesn't look very robust to me. For example, if launching walreceiver
> fails for some reason, startup process will just hang waiting for it.
>
> I'm thinking that walreceiver should be a stand-alone program that the
> startup process launches, similar to how it invokes restore_command in
> PITR recovery. Instead of using system(), though, it would use
> fork+exec, and a pipe to communicate.

Not having looked at all into the details, that sounds like a nice
improvement :-)

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/


From: Csaba Nagy <nagy(at)ecircle-ag(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming Replication patch for CommitFest 2009-09
Date: 2009-09-17 11:22:01
Message-ID: 1253186521.3295.85.camel@pcd12478
Lists: pgsql-hackers

On Thu, 2009-09-17 at 10:08 +0200, Heikki Linnakangas wrote:
> Robert Haas suggested a while ago that walreceiver could be a
> stand-alone utility, not requiring postmaster at all. That would allow
> you to set up streaming replication as another way to implement WAL
> archiving. Looking at how the processes interact, there really isn't
> much communication between walreceiver and the rest of the system, so
> that sounds pretty attractive.

Just a small comment in this direction: what if the archive were itself
a postgres DB, and it collected the WAL in some special place (together
with some metadata, snapshots, etc.), and then a slave could connect to
it just like to any other master? (Except maybe it could specify which
snapshot to start with, and possibly choose between different archived
WAL streams.)

Maybe what I'm saying is completely stupid, but I see the archive as
just another form of a postgres server, with the same protocol from the
POV of a slave. While I don't have a clue how to implement such a thing, I
thought it might be interesting as an idea while discussing the
walsender/receiver interface...

Cheers,
Csaba.


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming Replication patch for CommitFest 2009-09
Date: 2009-09-17 11:32:59
Message-ID: 4AB21E6B.1020602@enterprisedb.com
Lists: pgsql-hackers

Some random comments:

I don't think we need the new PM_SHUTDOWN_3 postmaster state. We can
treat walsenders the same as the archive process, and kill and wait for
both of them to die in PM_SHUTDOWN_2 state.

I think there's something wrong with the napping in walsender. When I
perform pg_switch_xlog(), it takes surprisingly long for it to trickle
to the standby. When I put a little proxy program in between the master
and slave that delays all messages from the slave to the master by one
second, it got worse, even though I would expect the master to still
keep sending WAL at full speed. I get logs like this:

2009-09-17 14:13:16.876 EEST LOG: xlog send request 0/38000000; send
0/3700006C; write 0/3700006C
2009-09-17 14:13:16.877 EEST LOG: xlog read request 0/37010000; send
0/37010000; write 0/3700006C
2009-09-17 14:13:17.077 EEST LOG: xlog send request 0/38000000; send
0/37010000; write 0/3700006C
2009-09-17 14:13:17.077 EEST LOG: xlog read request 0/37020000; send
0/37020000; write 0/3700006C
2009-09-17 14:13:17.078 EEST LOG: xlog read request 0/37030000; send
0/37030000; write 0/3700006C
2009-09-17 14:13:17.278 EEST LOG: xlog send request 0/38000000; send
0/37030000; write 0/3700006C
2009-09-17 14:13:17.279 EEST LOG: xlog read request 0/37040000; send
0/37040000; write 0/3700006C
...
2009-09-17 14:13:22.796 EEST LOG: xlog read request 0/37FD0000; send
0/37FD0000; write 0/376D0000
2009-09-17 14:13:22.896 EEST LOG: xlog send request 0/38000000; send
0/37FD0000; write 0/376D0000
2009-09-17 14:13:22.896 EEST LOG: xlog read request 0/37FE0000; send
0/37FE0000; write 0/376D0000
2009-09-17 14:13:22.896 EEST LOG: xlog read request 0/37FF0000; send
0/37FF0000; write 0/376D0000
2009-09-17 14:13:22.897 EEST LOG: xlog read request 0/38000000; send
0/38000000; write 0/376D0000
2009-09-17 14:14:09.932 EEST LOG: xlog send request 0/38000428; send
0/38000000; write 0/38000000
2009-09-17 14:14:09.932 EEST LOG: xlog read request 0/38000428; send
0/38000428; write 0/38000000

It looks like it's having 100 or 200 ms naps in between. Also, I
wouldn't expect to see so many "read request" acknowledgments from the
slave. The master doesn't really need to know how far the slave is,
except in synchronous replication when it has requested a flush to the
slave. Another reason why the master needs to know is so that it can
recycle old log files, but for that we'd really only need an
acknowledgment once per WAL file or even less.

Why does XLogSend() care about page boundaries? Perhaps it's a leftover
from the old approach that read from wal_buffers?

Do we really need the support for asynchronous backend libpq commands?
Could walsender just keep blasting WAL to the slave, and only try to
read an acknowledgment after it has requested one by setting the
XLOGSTREAM_FLUSH flag? Or maybe we should be putting the socket into
non-blocking mode.
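
For the non-blocking alternative, the usual fcntl() dance would do; just a
sketch:

    #include <fcntl.h>

    /*
     * Put a socket into non-blocking mode, so that a send() which cannot
     * complete immediately returns EAGAIN instead of stalling walsender.
     */
    static int
    set_nonblocking(int sock)
    {
        int     flags = fcntl(sock, F_GETFL, 0);

        if (flags < 0)
            return -1;
        return fcntl(sock, F_SETFL, flags | O_NONBLOCK);
    }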

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming Replication patch for CommitFest 2009-09
Date: 2009-09-18 05:47:24
Message-ID: 4AB31EEC.4000509@enterprisedb.com
Lists: pgsql-hackers

Heikki Linnakangas wrote:
> I'm thinking that walreceiver should be a stand-alone program that the
> startup process launches, similar to how it invokes restore_command in
> PITR recovery. Instead of using system(), though, it would use
> fork+exec, and a pipe to communicate.

Here's a WIP patch to do that, over your latest posted patch. I've also
pushed this to my git repository at
git://git.postgresql.org/git/users/heikki/postgres.git, "replication"
branch.

I'll continue reviewing...

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
replication-standalone-walreceiver.patch text/x-diff 66.7 KB

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming Replication patch for CommitFest 2009-09
Date: 2009-09-18 05:50:06
Message-ID: 3f0b79eb0909172250m71c942f8n820c94bc8a264176@mail.gmail.com
Lists: pgsql-hackers

Hi,

On Thu, Sep 17, 2009 at 8:32 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> Some random comments:

Thanks for the comments.

> I don't think we need the new PM_SHUTDOWN_3 postmaster state. We can
> treat walsenders the same as the archive process, and kill and wait for
> both of them to die in PM_SHUTDOWN_2 state.

OK, I'll use PM_SHUTDOWN_2 for walsender instead of PM_SHUTDOWN_3.

> I think there's something wrong with the napping in walsender. When I
> perform pg_switch_xlog(), it takes surprisingly long for it to trickle
> to the standby. When I put a little proxy program in between the master
> and slave that delays all messages from the slave to the master by one
> second, it got worse, even though I would expect the master to still
> keep sending WAL at full speed. I get logs like this:

Probably this is because the XLOG records following XLOG_SWITCH are
sent to the standby, too. Though those records are obviously not used
for recovery, they are sent because walsender doesn't know where the
XLOG_SWITCH is.

The difficulty is that there might be many XLOG_SWITCHes in the XLOG
files which are going to be sent by walsender. How should walsender
get to know those locations? One possible solution is to make walsender
parse the XLOG files and search for XLOG_SWITCH, but that is overkill,
I think.

I don't think that an XLOG switch is requested often, or is sensitive to
response time in many cases. So it's not worth changing walsender
to skip the XLOG following XLOG_SWITCH, I think. Thoughts?

> 2009-09-17 14:14:09.932 EEST LOG: xlog send request 0/38000428; send
> 0/38000000; write 0/38000000
> 2009-09-17 14:14:09.932 EEST LOG: xlog read request 0/38000428; send
> 0/38000428; write 0/38000000
>
> It looks like it's having 100 or 200 ms naps in between. Also, I
> wouldn't expect to see so many "read request" acknowledgments from the
> slave. The master doesn't really need to know how far the slave is,
> except in synchronous replication when it has requested a flush to
> slave. Another reason why master needs to know is so that the master can
> recycle old log files, but for that we'd really only need an
> acknowledgment once per WAL file or even less.

You mean that a new protocol for asking the standby about the completion
location of replication is required? In the synchronous case, the backend
should not have to wait for one acknowledgement per XLOG file, for
performance reasons.

> Why does XLogSend() care about page boundaries? Perhaps it's a leftover
> from the old approach that read from wal_buffers?

That is to avoid sending a partially-filled XLOG *record*, which simplifies
the logic by which the startup process waits for the next XLOG record to
become available; i.e., the startup process doesn't need to take care of a
partially-sent record.

> Do we really need the support for asynchronous backend libpq commands?
> Could walsender just keep blasting WAL to the slave, and only try to
> read an acknowledgment after it has requested one, by setting
> XLOGSTREAM_FLUSH flag. Or maybe we should be putting the socket into
> non-blocking mode.

Yes, that is required, especially for synchronous replication. Receiving
the acknowledgement should not keep the subsequent XLOG-sending waiting.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming Replication patch for CommitFest 2009-09
Date: 2009-09-18 07:11:18
Message-ID: 4AB33296.5050902@enterprisedb.com
Lists: pgsql-hackers

Heikki Linnakangas wrote:
> Heikki Linnakangas wrote:
>> I'm thinking that walreceiver should be a stand-alone program that the
>> startup process launches, similar to how it invokes restore_command in
>> PITR recovery. Instead of using system(), though, it would use
>> fork+exec, and a pipe to communicate.
>
> Here's a WIP patch to do that, over your latest posted patch. I've also
> pushed this to my git repository at
> git://git.postgresql.org/git/users/heikki/postgres.git, "replication"
> branch.
>
> I'll continue reviewing...

BTW, my modified patch doesn't correctly zero-fill new WAL segments.
Needs to be fixed...

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming Replication patch for CommitFest 2009-09
Date: 2009-09-18 09:51:51
Message-ID: 3f0b79eb0909180251v5ad09c82y35a4c8f26297b0f8@mail.gmail.com
Lists: pgsql-hackers

Hi,

On Fri, Sep 18, 2009 at 2:47 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> Heikki Linnakangas wrote:
>> I'm thinking that walreceiver should be a stand-alone program that the
>> startup process launches, similar to how it invokes restore_command in
>> PITR recovery. Instead of using system(), though, it would use
>> fork+exec, and a pipe to communicate.
>
> Here's a WIP patch to do that, over your latest posted patch. I've also
> pushed this to my git repository at
> git://git.postgresql.org/git/users/heikki/postgres.git, "replication"
> branch.

In my environment, I cannot use the git protocol for some reason.
Could you export your repository so that it can also be accessed via http?
BTW, I seem to be able to access http://git.postgresql.org/git/bucardo.git.
http://www.kernel.org/pub/software/scm/git/docs/user-manual.html#exporting-via-http

How should we advance development of SR?
Should I concentrate on the primary side, and leave the standby side to you?
When I change something, should I make a patch against the latest SR source
in your git repo, and submit it?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming Replication patch for CommitFest 2009-09
Date: 2009-09-18 10:09:00
Message-ID: 4AB35C3C.9010301@enterprisedb.com
Lists: pgsql-hackers

Fujii Masao wrote:
> Hi,
>
> On Fri, Sep 18, 2009 at 2:47 PM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> Heikki Linnakangas wrote:
>>> I'm thinking that walreceiver should be a stand-alone program that the
>>> startup process launches, similar to how it invokes restore_command in
>>> PITR recovery. Instead of using system(), though, it would use
>>> fork+exec, and a pipe to communicate.
>> Here's a WIP patch to do that, over your latest posted patch. I've also
>> pushed this to my git repository at
>> git://git.postgresql.org/git/users/heikki/postgres.git, "replication"
>> branch.
>
> In my environment, I cannot use git protocol for some reason.
> Could you export your repository so that it can be accessed also via http?

Sure, it should be accessible via HTTP as well:
http://git.postgresql.org/git/users/heikki/postgres.git

> How should we advance development of SR?
> Should I be concentrated on the primary side, and leave the standby side to you?
> When I change something, should I make a patch for the latest SR source in your
> git repo, and submit it?

Hmm, yeah, let's do that.

Right now, I'm trying to understand the page boundary stuff and partial
page handling in ReadRecord and walsender.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming Replication patch for CommitFest 2009-09
Date: 2009-09-18 10:34:07
Message-ID: 3f0b79eb0909180334j65c0902ved40e2484854e2f@mail.gmail.com
Lists: pgsql-hackers

Hi,

On Thu, Sep 17, 2009 at 5:08 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> I'm thinking that walreceiver should be a stand-alone program that the
> startup process launches, similar to how it invokes restore_command in
> PITR recovery. Instead of using system(), though, it would use
> fork+exec, and a pipe to communicate.

This approach is OK if the stand-alone walreceiver is managed by the
startup process as reliably as a child process under postmaster is:

* Handling of some interrupts: SIGHUP, SIGTERM?, SIGINT, SIGQUIT...
For example, the startup process would need to relay to walreceiver the
interrupts it receives from postmaster.

* Communication with other child processes: stats collector? syslogger?...
For example, the log messages generated by walreceiver should also
be collected by syslogger if requested.

For now, I think a pipe is enough for communication between the
startup process and walreceiver. There was the idea of passing
XLOG to the startup process via wal_buffers, for which a pipe is not
suitable, but I think that is overkill.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming Replication patch for CommitFest 2009-09
Date: 2009-09-21 07:51:53
Message-ID: 4AB73099.1060202@enterprisedb.com
Lists: pgsql-hackers

Having gone through the patch now in more detail, I think it's in pretty
good shape. I'm happy with the overall design, except that I haven't
been able to make up my mind if walreceiver should indeed be a
stand-alone program as discussed, or a postmaster child process as in
the patch you submitted. Putting that question aside for a moment,
here's some minor things, in no particular order:

- The async API in PQgetXLogData is quite different from the other
commands. It's close to the API from PQgetCopyData(), but doesn't return
a malloc'd buffer like PQgetCopyData does. I presume that's to optimize
away the extra memcpy step? I don't think that's really necessary, I
don't recall any complaints about that in PQgetCopyData(), and if it
does become an issue, it could be optimized away by mallocing the buffer
first and reading directly to that.

- Can we avoid sprinkling XLogStreamingAllowed() calls to places where
we check if WAL-logging is required (nbtsort.c, copy.c etc.). I think we
need a new macro to encapsulate (XLogArchivingActive() ||
XLogStreamingAllowed()).

- Is O_DIRECT ever a good idea in walreceiver? If it's really direct and
doesn't get cached, the startup process will need to read from disk.

- Can we replace read/write_conninfo with just a long-enough field in
shared mem? Would be simpler. (this is moot if we go with the
stand-alone walreceiver program and pass it as a command-line argument)

- walreceiver shouldn't die on connection error, just to be restarted by
startup process. Can we add error handling a la bgwriter and have a
retry loop within walreceiver? (again, if we go with a stand-alone
walreceiver program, it's probably better to have startup process
responsible to restart walreceiver, as it is now)

- pq_wait in backend waits until you can read or write at least 1 byte.
There is no guarantee that you can send or read the whole message
without blocking. We'd have to put the socket in non-blocking mode for
that. I'm not sure what the implications of this are.

- we should include system_identifier somewhere in the replication
startup handshake. Otherwise you can connect to server from a different
system and have logs shipped, if they happen to be roughly at the same
point in WAL. Replay will almost certainly fail, but we should error
earlier.

- I know I said we should have just asynchronous replication at first,
but looking ahead, how would you do synchronous? What kind of signaling
is needed between walreceiver and startup process for that?

- 'replication' shouldn't be a real database.

I found the paging logic in walsender confusing, and didn't like the
idea that walsender needs to set the XLOGSTREAM_END_SEG flag. Surely
walreceiver knows how to split the WAL into files without such a flag. I
reworked that logic, I think it's easier to understand now. I kept the
support for the flag in libpq and the protocol for now, but it should be
removed too, or repurposed to indicate that pg_switch_xlog() was done in
the master. I've pushed that to 'replication-orig' branch in my git
repository, attached is the same as a diff against your SR_0914.patch.

I need a break from this patch, so I'll take a closer look at Simon's
hot standby now. Meanwhile, can you work on the above items and submit a
new version, please?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
sr-paging-rework.patch text/x-diff 32.2 KB

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming Replication patch for CommitFest 2009-09
Date: 2009-09-24 08:20:56
Message-ID: 3f0b79eb0909240120t6a7d56b3gfc7119af8bcc287b@mail.gmail.com
Lists: pgsql-hackers

Hi,

Sorry for the delay.

On Mon, Sep 21, 2009 at 4:51 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> Having gone through the patch now in more detail, I think it's in pretty
> good shape. I'm happy with the overall design, except that I haven't
> been able to make up my mind if walreceiver should indeed be a
> stand-alone program as discussed, or a postmaster child process as in
> the patch you submitted. Putting that question aside for a moment,
> here's some minor things, in no particular order:

Thanks for the comments.

> - The async API in PQgetXLogData is quite different from the other
> commands. It's close to the API from PQgetCopyData(), but doesn't return
> a malloc'd buffer like PQgetCopyData does. I presume that's to optimize
> away the extra memcpy step?

Yes, this is to avoid the extra memcpy.

> I don't think that's really necessary, I
> don't recall any complaints about that in PQgetCopyData(), and if it
> does become an issue, it could be optimized away by mallocing the buffer
> first and reading directly to that.

OK. I'll change PQgetXLogData() to return a malloc'd buffer, and will
remove PQmarkConsumed().

> - Can we avoid sprinkling XLogStreamingAllowed() calls to places where
> we check if WAL-logging is required (nbtsort.c, copy.c etc.). I think we
> need a new macro to encapsulate (XLogArchivingActive() ||
> XLogStreamingAllowed()).

Yes. I'll introduce a new macro XLogIsNeeded() which encapsulates
(XLogArchivingActive() || XLogStreamingAllowed()).
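
Something like the obvious one-liner (a sketch, assuming the two existing
macros):

    /* True if WAL must be generated, for archiving or for streaming. */
    #define XLogIsNeeded() (XLogArchivingActive() || XLogStreamingAllowed())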

> - Is O_DIRECT ever a good idea in walreceiver? If it's really direct and
> doesn't get cached, the startup process will need to read from disk.

Good point. I agree that O_DIRECT is useless if walreceiver works
together with the startup process. It might be useful if only the
stand-alone walreceiver program is running on the standby.

> - Can we replace read/write_conninfo with just a long-enough field in
> shared mem? Would be simpler. (this is moot if we go with the
> stand-alone walreceiver program and pass it as a command-line argument)

Yes, if we can decide on the length of conninfo. Since I could not decide
that, I used read/write_conninfo to pass the conninfo to walreceiver. Is a
fixed size of 1024 bytes enough for conninfo?
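
For example, the walreceiver's shared-memory state could carry a
fixed-length field; a sketch with illustrative struct and field names:

    #include <sys/types.h>
    #include "access/xlogdefs.h"

    #define MAXCONNINFO 1024

    typedef struct
    {
        pid_t       pid;                    /* walreceiver's PID */
        XLogRecPtr  receivedUpto;           /* how far WAL has been received */
        char        conninfo[MAXCONNINFO];  /* connection string to primary */
    } WalRcvData;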

> - walreceiver shouldn't die on connection error, just to be restarted by
> startup process. Can we add error handling a la bgwriter and have a
> retry loop within walreceiver? (again, if we go with a stand-alone
> walreceiver program, it's probably better to have startup process
> responsible to restart walreceiver, as it is now)

Error handling a la bgwriter? You mean that PG_exception_stack
should be set up to handle an ERROR exception?

Anyway, I'll change walreceiver to retry connecting to the primary
after an error occurs in PQstartXLogStreaming()/PQgetXLogData()/
PQputXLogRecPtr(). Should we set an upper limit on the number of
retries?

> - pq_wait in backend waits until you can read or write at least 1 byte.
> There is no guarantee that you can send or read the whole message
> without blocking. We'd have to put the socket in non-blocking mode for
> that. I'm not sure what the implications of this are.

Umm... AFAIK, poll and select guarantee that at least the subsequent
recv will not block. If there is only 1 byte available in the buffer,
recv will read that 1 byte and return immediately. I'm not sure whether
send can still block even after poll succeeds; in my environment (RHEL5),
send does not seem to block.

> - we should include system_identifier somewhere in the replication
> startup handshake. Otherwise you can connect to server from a different
> system and have logs shipped, if they happen to be roughly at the same
> point in WAL. Replay will almost certainly fail, but we should error
> earlier.

Agreed. I'll do that.
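
For example, a sketch of the check on the standby side; primary_sysid comes
from the handshake reply, and ReadSystemIdentifier() stands in for however
walreceiver obtains the local pg_control value:

    /* Error out early if the two clusters are not actually related. */
    if (primary_sysid != ReadSystemIdentifier())
        ereport(FATAL,
                (errmsg("database system identifier differs between the "
                        "primary and standby"),
                 errdetail("The primary's identifier is " UINT64_FORMAT
                           ", the standby's identifier is " UINT64_FORMAT ".",
                           primary_sysid, ReadSystemIdentifier())));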

> - I know I said we should have just asynchronous replication at first,
> but looking ahead, how would you do synchronous?

As the previous patch did, I'm going to make walsender read the latest
XLOG from wal_buffers, introduce signaling between a backend
and walsender, and keep the backend waiting until the specified XLOG
has been written or fsynced on the standby.

> What kind of signaling
> is needed between walreceiver and startup process for that?

I was thinking that a synchronization mode in which a client waits
until XLOG has been applied is not necessary right now, so no
signaling is required between those processes yet, either. But does
HS require this capability?

> - 'replication' shouldn't be a real database.

Agreed. I'll remove that.

> I found the paging logic in walsender confusing, and didn't like the
> idea that walsender needs to set the XLOGSTREAM_END_SEG flag. Surely
> walreceiver knows how to split the WAL into files without such a flag. I
> reworked that logic, I think it's easier to understand now. I kept the
> support for the flag in libpq and the protocol for now, but it should be
> removed too, or repurposed to indicate that pg_switch_xlog() was done in
> the master. I've pushed that to 'replication-orig' branch in my git
> repository, attached is the same as a diff against your SR_0914.patch.
>
> I need a break from this patch, so I'll take a closer look at Simon's
> hot standby now. Meanwhile, can you work on the above items and submit a
> new version, please?

Yeah, sure.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming Replication patch for CommitFest 2009-09
Date: 2009-09-24 10:25:02
Message-ID: 3f0b79eb0909240325k2eed8b76v5535da68ae223941@mail.gmail.com
Lists: pgsql-hackers

Hi,

On Mon, Sep 21, 2009 at 4:51 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> I found the paging logic in walsender confusing, and didn't like the
> idea that walsender needs to set the XLOGSTREAM_END_SEG flag. Surely
> walreceiver knows how to split the WAL into files without such a flag. I
> reworked that logic, I think it's easier to understand now. I kept the
> support for the flag in libpq and the protocol for now, but it should be
> removed too, or repurposed to indicate that pg_switch_xlog() was done in
> the master. I've pushed that to 'replication-orig' branch in my git
> repository, attached is the same as a diff against your SR_0914.patch.

In the 'replication-orig' branch, walreceiver fsyncs the previous XLOG
file after receiving new XLOG records, before writing them. This would
increase the backend's waiting time for replication in the synchronous
case. Shouldn't the walreceiver instead fsync the XLOG file after
sending the ACK (if needed), before receiving the next XLOG records?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming Replication patch for CommitFest 2009-09
Date: 2009-09-24 10:41:57
Message-ID: 4ABB4CF5.4010908@enterprisedb.com
Lists: pgsql-hackers

Fujii Masao wrote:
> In the 'replication-orig' branch, walreceiver fsyncs the previous XLOG
> file after receiving new XLOG records, before writing them. This would
> increase the backend's waiting time for replication in the synchronous
> case. Shouldn't the walreceiver instead fsync the XLOG file after
> sending the ACK (if needed), before receiving the next XLOG records?

I don't follow. Walreceiver does fsync the file just after writing it if
the fsync_requested flag was set in the message. Surely that would be
set in synchronous mode; that's what the flag is for, right?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming Replication patch for CommitFest 2009-09
Date: 2009-09-24 10:55:49
Message-ID: 3f0b79eb0909240355t48eaad30u191b32a14739e847@mail.gmail.com
Lists: pgsql-hackers

Hi,

On Thu, Sep 24, 2009 at 7:41 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> Fujii Masao wrote:
>> In the 'replication-orig' branch, walreceiver fsyncs the previous XLOG
>> file after receiving new XLOG records, before writing them. This would
>> increase the backend's waiting time for replication in the synchronous
>> case. Shouldn't the walreceiver instead fsync the XLOG file after
>> sending the ACK (if needed), before receiving the next XLOG records?
>
> I don't follow. Walreceiver does fsync the file just after writing it if
> the fsync_requested flag was set in the message. Surely that would be
> set in synchronous mode; that's what the flag is for, right?

I mean the case where fsync is issued at the end of a segment.
In that case, since the fsync_requested flag is not set,
walreceiver doesn't perform the fsync in that loop. Only after
the next XLOG arrives does walreceiver fsync the previous file,
in XLogWalRcvWrite().

Am I missing something?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming Replication patch for CommitFest 2009-09
Date: 2009-09-24 10:57:22
Message-ID: 4ABB5092.8020502@enterprisedb.com
Lists: pgsql-hackers

Fujii Masao wrote:
> On Mon, Sep 21, 2009 at 4:51 PM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> - Can we replace read/write_conninfo with just a long-enough field in
>> shared mem? Would be simpler. (this is moot if we go with the
>> stand-alone walreceiver program and pass it as a command-line argument)
>
> Yes, if we can decide on the length of conninfo. Since I could not
> decide that, I used read/write_conninfo to tell walreceiver the
> conninfo. Is a fixed size of 1024 bytes enough for conninfo?

Yeah, that should be plenty.
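
Something along these lines is what I had in mind, as a sketch only
(the constant name and the exact fields are illustrative, not what the
patch has; "bool" as in c.h):

#define MAXCONNINFO     1024

typedef struct
{
    bool    in_progress;            /* walreceiver running or starting up? */
    char    conninfo[MAXCONNINFO];  /* primary_conninfo, NUL-terminated */
} WalRcvData;

The startup process could then just strlcpy() the GUC value into that
field before asking the postmaster to start walreceiver.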

>> - walreceiver shouldn't die on connection error, just to be restarted by
>> startup process. Can we add error handling a la bgwriter and have a
>> retry loop within walreceiver? (again, if we go with a stand-alone
>> walreceiver program, it's probably better to have startup process
>> responsible to restart walreceiver, as it is now)
>
> Error handling a la bgwriter? You mean that PG_exception_stack
> should be set up to handle an ERROR exception?

Yep.

> Anyway, I'll change walreceiver to retry connecting to the primary
> after an error occurs in PQstartXLogStreaming()/PQgetXLogData()/
> PQputXLogRecPtr(). Should we set an upper limit on the number of
> retries?

I don't think we need an upper limit.
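
To spell out what I mean by "a la bgwriter", roughly this shape (a
sketch only; the sleep interval and the loop body are placeholders,
not the actual patch):

/*
 * bgwriter-style error recovery plus an unbounded retry loop.  An
 * ereport(ERROR) anywhere in the main loop longjmps back here, where
 * we report it, clean up, sleep a bit, and try again.
 */
sigjmp_buf  local_sigjmp_buf;

if (sigsetjmp(local_sigjmp_buf, 1) != 0)
{
    EmitErrorReport();
    FlushErrorState();

    /* pause so we don't busy-loop on a persistent failure */
    pg_usleep(1000000L);
}

/* We can now handle ereport(ERROR) without exiting the process */
PG_exception_stack = &local_sigjmp_buf;

for (;;)
{
    /* (re)connect to the primary and stream WAL; any ERROR raised
     * in here takes us back to the recovery block above */
}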

>> - pq_wait in backend waits until you can read or write at least 1 byte.
>> There is no guarantee that you can send or read the whole message
>> without blocking. We'd have to put the socket in non-blocking mode for
>> that. I'm not sure what the implications of this are.
>
> Umm... AFAIK, poll and select guarantee that at least the subsequent
> recv will not be blocked. If there is only 1 byte available in the buffer,
> recv would read that 1 byte and return immediately. I'm not sure if send
> will get stuck even after poll is passed. In my environment (RHEL5),
> send seems not to be blocked.

Hmm, I guess you're right.

>> - I know I said we should have just asynchronous replication at first,
>> but looking ahead, how would you do synchronous?
>
> As the previous patch did, I'm going to make walsender read the latest
> XLOG from wal_buffers, introduce the signaling between a backend
> and walsender, and keep a backend waiting until the specified XLOG
> has been written or fsynced in the standby.

Ok. I don't think walsender needs to access wal_buffers even then,
though. Once the backend has written the WAL, walsender can well read it
from disk (it will surely be in OS cache still).

>> What kind of signaling
>> is needed between walreceiver and startup process for that?
>
> I was thinking that the synchronization mode in which a client waits
> until the XLOG has been applied is not necessary right now, so no
> signaling is required between those processes yet. But does HS
> require this capability?

Yeah, I think it will be important with hot standby. It's a much more
useful guarantee that once COMMIT returns, the transaction is visible
in the standby, than that it's merely fsync'd to disk there.

(don't need to solve it now, let's do just asynchronous mode now, but
it's something to keep in mind)

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming Replication patch for CommitFest 2009-09
Date: 2009-09-24 11:03:46
Message-ID: 4ABB5212.9050002@enterprisedb.com
Lists: pgsql-hackers

Fujii Masao wrote:
> On Thu, Sep 24, 2009 at 7:41 PM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> Fujii Masao wrote:
>>> In the 'replication-orig' branch, walreceiver fsyncs the previous XLOG
>>> file after receiving new XLOG records, before writing them. This would
>>> increase the backend's waiting time for replication in the synchronous
>>> case. Shouldn't the walreceiver instead fsync the XLOG file after
>>> sending the ACK (if needed), before receiving the next XLOG records?
>> I don't follow. Walreceiver does fsync the file just after writing it if
>> the fsync_requested flag was set in the message. Surely that would be
>> set in synchronous mode; that's what the flag is for, right?
>
> I mean the case where fsync is issued at the end of a segment.
> In that case, since the fsync_requested flag is not set,
> walreceiver doesn't perform the fsync in that loop. Only after
> the next XLOG arrives does walreceiver fsync the previous file,
> in XLogWalRcvWrite().

Ok. I don't see anything wrong with that. If the primary didn't set
fsync_requested, it's not in a hurry to get an acknowledgment.

I guess we could check, *after* writing, whether we just finished
filling the segment. If we did, we could fsync right away, since we're
going to fsync anyway as soon as we receive the next message. Not sure
if it's worth the trouble.
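
If we did do it, it would look something like this at the end of the
write path (a sketch; the flush call is a placeholder for whatever the
fsync routine in the branch ends up being):

/* recptr has already been advanced past the bytes just written;
 * XLogSegSize is the WAL segment size (16 MB by default) */
if (recptr.xrecoff % XLogSegSize == 0)
{
    /*
     * We just filled a segment.  It would be fsync'd when the next
     * message arrives anyway, so flush it now while we're idle.
     */
    XLogWalRcvFlush();      /* placeholder name */
}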

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming Replication patch for CommitFest 2009-09
Date: 2009-09-25 10:05:46
Message-ID: 3f0b79eb0909250305n66be8181t17c35b8a00b80d45@mail.gmail.com
Lists: pgsql-hackers

On Thu, Sep 24, 2009 at 7:57 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> Anyway, I'll change walreceiver to retry connecting to the primary
>> after an error occurs in PQstartXLogStreaming()/PQgetXLogData()/
>> PQputXLogRecPtr(). Should we set an upper limit on the number of
>> retries?
>
> I don't think we need an upper limit.

Without an upper limit, a mis-set primary_conninfo, for example, would
make walreceiver repeat PQstartXLogStreaming() forever. Is this OK?

>>> - I know I said we should have just asynchronous replication at first,
>>> but looking ahead, how would you do synchronous?
>>
>> As the previous patch did, I'm going to make walsender read the latest
>> XLOG from wal_buffers, introduce the signaling between a backend
>> and walsender, and keep a backend waiting until the specified XLOG
>> has been written or fsynced in the standby.
>
> Ok. I don't think walsender needs to access wal_buffers even then,
> though. Once the backend has written the WAL, walsender can well read it
> from disk (it will surely be in OS cache still).

For performance, I think that walsender should not delay sending the
XLOG until it has been written by the backend. Otherwise, the XLOG
write and send are performed serially, which would increase the
response time. Shouldn't they be performed in parallel?

>>> What kind of signaling
>>> is needed between walreceiver and startup process for that?
>>
>> I was thinking that the synchronization mode in which a client waits
>> until the XLOG has been applied is not necessary right now, so no
>> signaling is required between those processes yet. But does HS
>> require this capability?
>
> Yeah, I think it will be important with hot standby. It's a much more
> useful guarantee that once COMMIT returns, the transaction is visible
> in the standby, than that it's merely fsync'd to disk there.
>
> (don't need to solve it now, let's do just asynchronous mode now, but
> it's something to keep in mind)

Okay.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming Replication patch for CommitFest 2009-09
Date: 2009-09-25 10:10:00
Message-ID: 4ABC96F8.8090003@enterprisedb.com
Lists: pgsql-hackers

Fujii Masao wrote:
> On Thu, Sep 24, 2009 at 7:57 PM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>>>> - I know I said we should have just asynchronous replication at first,
>>>> but looking ahead, how would you do synchronous?
>>> As the previous patch did, I'm going to make walsender read the latest
>>> XLOG from wal_buffers, introduce the signaling between a backend
>>> and walsender, and keep a backend waiting until the specified XLOG
>>> has been written or fsynced in the standby.
>> Ok. I don't think walsender needs to access wal_buffers even then,
>> though. Once the backend has written the WAL, walsender can well read it
>> from disk (it will surely be in OS cache still).
>
> For performance, I think that walsender should not delay sending the
> XLOG until it has been written by the backend. Otherwise, the XLOG
> write and send are performed serially, which would increase the
> response time. Shouldn't they be performed in parallel?

Well, sure, performance is good, but let's keep it simple for now. The
write() to disk should normally be absorbed by the OS cache and return
quickly, so it's not a big delay.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming Replication patch for CommitFest 2009-09
Date: 2009-09-25 10:27:14
Message-ID: 3f0b79eb0909250327p7d2f1cf6gb6eea7624c3f7993@mail.gmail.com
Lists: pgsql-hackers

Hi,

On Fri, Sep 25, 2009 at 7:10 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> Fujii Masao wrote:
>> On Thu, Sep 24, 2009 at 7:57 PM, Heikki Linnakangas
>> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>>>>> - I know I said we should have just asynchronous replication at first,
>>>>> but looking ahead, how would you do synchronous?
>>>> As the previous patch did, I'm going to make walsender read the latest
>>>> XLOG from wal_buffers, introduce the signaling between a backend
>>>> and walsender, and keep a backend waiting until the specified XLOG
>>>> has been written or fsynced in the standby.
>>> Ok. I don't think walsender needs to access wal_buffers even then,
>>> though. Once the backend has written the WAL, walsender can well read it
>>> from disk (it will surely be in OS cache still).
>>
>> For performance, I think that walsender should not delay sending the
>> XLOG until it has been written by the backend. Otherwise, the XLOG
>> write and send are performed serially, which would increase the
>> response time. Shouldn't they be performed in parallel?
>
> Well, sure, performance is good, but let's keep it simple for now. The
> write() to disk should normally be absorbed by the OS cache and return
> quickly, so it's not a big delay.

Umm... a backend should at least tell walsender the location up to
which it has written the XLOG before issuing fsync. In the current
XLogWrite(), XLogCtl->LogwrtResult.Write is updated only after fsync
has been performed.
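
Roughly the reordering I have in mind inside XLogWrite(), as a sketch
(WalSndWakeup() is a placeholder for whatever backend-to-walsender
signaling we add; the rest follows the existing code):

/* ... the write() of the WAL pages has just completed ... */

{
    /* use volatile pointer to prevent code rearrangement */
    volatile XLogCtlData *xlogctl = XLogCtl;

    /* advertise the new Write position before fsync, so that
     * walsender can start sending these records in parallel */
    SpinLockAcquire(&xlogctl->info_lck);
    xlogctl->LogwrtResult.Write = LogwrtResult.Write;
    SpinLockRelease(&xlogctl->info_lck);
}
WalSndWakeup();             /* placeholder: signal walsender */

/* ... and only then flush, as today ... */
issue_xlog_fsync();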

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming Replication patch for CommitFest 2009-09
Date: 2009-10-01 01:33:20
Message-ID: 3f0b79eb0909301833l240031a3j96067c7b89d56ae1@mail.gmail.com
Lists: pgsql-hackers

Hi,

On Thu, Sep 24, 2009 at 5:20 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> Meanwhile, can you work on the above items and submit a
>> new version, please?
>
> Yeah, sure.

Attached is a patch tackling those items, against the 'replication-orig'
branch in your git repository.

Changes:
* Change PQgetXLogData() to return a malloc'd buffer instead of a
pointer to its internal buffer.
* Remove PQmarkConsumed().
* Introduce a new macro XLogIsNeeded() which encapsulates
(XLogArchivingActive() || XLogStreamingAllowed()); see the sketch
after this list.
* Replace read/write_conninfo with just a long-enough field in shared mem.
* Remove 'replication' database, and support a new keyword
'replication' for pg_hba.conf.
* Include system_identifier in the replication startup handshake.
* Add error handling a la bgwriter and have a retry loop within walreceiver.
* Prevent the startup process from getting stuck when launching
walreceiver fails.
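
For reference, the new macro boils down to the following (a sketch;
XLogStreamingAllowed() is shown as I plan to define it, based on
max_wal_senders, and XLogArchivingActive() is the existing predicate):

/* existing predicate: is WAL archiving enabled? */
#define XLogArchivingActive()   (XLogArchiveMode)

/* planned predicate: may streaming standbys connect? */
#define XLogStreamingAllowed()  (max_wal_senders > 0)

/* new convenience macro introduced by the patch */
#define XLogIsNeeded() \
    (XLogArchivingActive() || XLogStreamingAllowed())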

Since we might need to change the patch further, I haven't updated the
documentation yet.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment: sr_rework_1001.patch (application/octet-stream, 45.4 KB)

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming Replication patch for CommitFest 2009-09
Date: 2009-10-01 02:21:24
Message-ID: 3f0b79eb0909301921r2a516c46n5d6591be80d46d43@mail.gmail.com
Lists: pgsql-hackers

On Thu, Sep 17, 2009 at 5:08 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> Walreceiver is really a slave to the startup process. The startup
> process decides when it's launched, and it's the startup process that
> then waits for it to advance. But the way it's set up at the moment, the
> startup process needs to ask the postmaster to start it up, and it
> doesn't look very robust to me. For example, if launching walreceiver
> fails for some reason, startup process will just hang waiting for it.

I changed the postmaster to report a failed fork of the walreceiver to
the startup process by resetting WalRcv->in_progress, which prevents
the startup process from getting stuck when launching walreceiver fails.
http://archives.postgresql.org/pgsql-hackers/2009-09/msg01996.php
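
In outline, the postmaster side now does something like this (a sketch
only; apart from WalRcv->in_progress, the function names here are
illustrative, not necessarily what the patch uses):

switch ((pid = fork_process()))
{
    case -1:
        ereport(LOG,
                (errmsg("could not fork WAL receiver process: %m")));
        /* let the startup process notice the failure and retry */
        WalRcv->in_progress = false;
        break;

    case 0:
        /* in the child: become the walreceiver */
        WalReceiverMain();
        break;

    default:
        /* in the postmaster: remember the child's pid */
        break;
}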

Do you have any other concerns about the robustness? If so, I'll address them.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming Replication patch for CommitFest 2009-09
Date: 2009-10-06 08:45:37
Message-ID: 3f0b79eb0910060145ref434ecwa3cbdc78b8403b89@mail.gmail.com
Lists: pgsql-hackers

Hi,

On Mon, Sep 21, 2009 at 4:51 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> I've pushed that to 'replication-orig' branch in my git
> repository, attached is the same as a diff against your SR_0914.patch.

The following changes related to crossing an xlogid boundary seem wrong;
they would break the handling of some XLOG positions.

> ! /* Update state for read */
> ! tmp = recptr.xrecoff + byteswritten;
> ! if (tmp < recptr.xrecoff)
> ! recptr.xlogid++; /* overflow */
> ! recptr.xrecoff = tmp;

> ! endptr.xrecoff += MAX_SEND_SIZE;
> ! if(endptr.xrecoff < startptr.xrecoff)
> ! endptr.xlogid++; /* xrecoff overflowed */

> ! if (endptr.xlogid != startptr.xlogid)
> {
> ! Assert(endptr.xlogid == startptr.xlogid + 1);
> ! nbytes = (0xffffffff - endptr.xrecoff) + startptr.xrecoff;
> ! }

The size of a logical XLOG file is 0xff000000. So, even if xrecoff has
not overflowed yet, we might need to cross an xlogid boundary. I think
xrecoff should be compared with XLogFileSize instead. Can I fix those?
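
What I have in mind is roughly the following (a sketch; XLogFileSize
is 0xff000000 with the default 16 MB segments):

/* advance recptr by the bytes just written/sent, wrapping on
 * XLogFileSize rather than relying on uint32 overflow of xrecoff */
recptr.xrecoff += byteswritten;
if (recptr.xrecoff >= XLogFileSize)
{
    recptr.xlogid++;
    recptr.xrecoff -= XLogFileSize;
}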

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming Replication patch for CommitFest 2009-09
Date: 2009-10-06 13:42:01
Message-ID: 20091006134200.GC5929@alvh.no-ip.org
Lists: pgsql-hackers

Fujii Masao wrote:
> On Thu, Sep 17, 2009 at 5:08 PM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> > Walreceiver is really a slave to the startup process. The startup
> > process decides when it's launched, and it's the startup process that
> > then waits for it to advance. But the way it's set up at the moment, the
> > startup process needs to ask the postmaster to start it up, and it
> > doesn't look very robust to me. For example, if launching walreceiver
> > fails for some reason, startup process will just hang waiting for it.
>
> I changed the postmaster to report a failed fork of the walreceiver to
> the startup process by resetting WalRcv->in_progress, which prevents
> the startup process from getting stuck when launching walreceiver fails.
> http://archives.postgresql.org/pgsql-hackers/2009-09/msg01996.php
>
> Do you have any other concerns about the robustness? If so, I'll address them.

Hmm. Without looking at the patch at all, this seems similar to how
autovacuum does things: autovac launcher signals postmaster that a
worker needs to be started. Postmaster proceeds to fork a worker. This
could obviously fail for a lot of reasons.

Now, there is code in place to notify the user when forking fails, and
this is seen in the wild quite a bit more often than one would like :-(
I think it would be a good idea to have a retry mechanism in the
walreceiver startup path so that recovery does not get stuck due to
transient problems.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming Replication patch for CommitFest 2009-09
Date: 2009-10-07 01:36:33
Message-ID: 3f0b79eb0910061836u7174e9b9p5a727c7572253590@mail.gmail.com
Lists: pgsql-hackers

Hi,

On Tue, Oct 6, 2009 at 10:42 PM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:
> Hmm.  Without looking at the patch at all, this seems similar to how
> autovacuum does things: autovac launcher signals postmaster that a
> worker needs to be started.  Postmaster proceeds to fork a worker.  This
> could obviously fail for a lot of reasons.

Yeah, I drew upon the autovac code.

> Now, there is code in place to notify the user when forking fails, and
> this is seen in the wild quite a bit more often than one would like :-(
> I think it would be a good idea to have a retry mechanism in the
> walreceiver startup path so that recovery does not get stuck due to
> transient problems.

Agreed. The latest patch provides the retry mechanism.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center