Re: Synch Rep for CommitFest 2009-07

Lists: pgsql-hackers
From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Synch Rep for CommitFest 2009-07
Date: 2009-07-14 14:01:27
Message-ID: 3f0b79eb0907140701j2cfc4a7bla4770828e8f9433b@mail.gmail.com

Hi,

On Fri, Jul 3, 2009 at 1:32 PM, Fujii Masao<masao(dot)fujii(at)gmail(dot)com> wrote:
>> This patch no longer applies cleanly.  Can you rebase and resubmit it
>> for the upcoming CommitFest?  It might also be good to go through and
>> clean up the various places where you have trailing whitespace and/or
>> spaces preceding tabs.
>
> Sure. I'll resubmit the patch after fixing some bugs and finishing
> the documents.

Here is the updated version of Synch Rep patch. I adjusted the patch
against CVS HEAD, fixed some bugs and updated the documents.

The attached tarball contains several patches, split up so that they
are easier to review. A description of each patch, a brief procedure for
setting up Synch Rep, and a functional overview are on the wiki:
http://wiki.postgresql.org/wiki/NTT's_Development_Projects

If you notice anything, please feel free to comment!

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment Content-Type Size
synch_rep_0714.tgz application/x-gzip 171.3 KB

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Synch Rep for CommitFest 2009-07
Date: 2009-07-14 18:56:24
Message-ID: 4A5CD4D8.7030403@enterprisedb.com

Fujii Masao wrote:
> On Fri, Jul 3, 2009 at 1:32 PM, Fujii Masao<masao(dot)fujii(at)gmail(dot)com> wrote:
>>> This patch no longer applies cleanly. Can you rebase and resubmit it
>>> for the upcoming CommitFest? It might also be good to go through and
>>> clean up the various places where you have trailing whitespace and/or
>>> spaces preceding tabs.
>> Sure. I'll resubmit the patch after fixing some bugs and finishing
>> the documents.
>
> Here is the updated version of Synch Rep patch. I adjusted the patch
> against CVS HEAD, fixed some bugs and updated the documents.
>
> The attached tarball contains several patches, split up so that they
> are easier to review. A description of each patch, a brief procedure for
> setting up Synch Rep, and a functional overview are on the wiki:
> http://wiki.postgresql.org/wiki/NTT's_Development_Projects
>
> If you notice anything, please feel free to comment!

Here's one little thing in addition to all the stuff already discussed:

The only caller that doesn't pass XLogSyncReplication as the new 'mode'
argument to XLogFlush is this one in CreateCheckPoint:

***************
*** 6569,6575 ****
XLOG_CHECKPOINT_ONLINE,
&rdata);

! XLogFlush(recptr);

/*
* We mustn't write any new WAL after a shutdown checkpoint, or it will
--- 7667,7677 ----
XLOG_CHECKPOINT_ONLINE,
&rdata);

! /*
! * Don't shutdown until all outstanding xlog records are replicated and
! * fsynced on the standby, regardless of synchronization mode.
! */
! XLogFlush(recptr, shutdown ? REPLICATION_MODE_FSYNC :
XLogSyncReplication);

/*
* We mustn't write any new WAL after a shutdown checkpoint, or it will

If that's the only such caller, let's introduce a new function for that
and keep the XLogFlush() api unchanged.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Synch Rep for CommitFest 2009-07
Date: 2009-07-15 04:32:33
Message-ID: 3f0b79eb0907142132va77ccc2m87a3ca374f93dd58@mail.gmail.com

Hi,

On Wed, Jul 15, 2009 at 3:56 AM, Heikki
Linnakangas<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> Here's one little thing in addition to all the stuff already discussed:

Thanks for the comment!

> If that's the only such caller, let's introduce a new function for that
> and keep the XLogFlush() api unchanged.

OK. How about the following function?

------------------
/*
 * Ensure that shutdown-related XLOG data through the given position is
 * flushed to local disk, and also flushed to the disk in the standby
 * if replication is in progress.
 */
void
XLogShutdownFlush(XLogRecPtr record)
{
	int			save_mode = XLogSyncReplication;

	XLogSyncReplication = REPLICATION_MODE_FSYNC;
	XLogFlush(record);

	XLogSyncReplication = save_mode;
}
------------------

In the shutdown checkpoint case, CreateCheckPoint calls
XLogShutdownFlush; otherwise it calls XLogFlush. XLogFlush
then uses XLogSyncReplication directly instead of the
now-obsolete 'mode' argument.

If the above is OK, should I update the patch ASAP, or
hold off until more comments arrive? I'm concerned that
frequent small updates would interfere with review.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Synch Rep for CommitFest 2009-07
Date: 2009-07-15 11:15:28
Message-ID: 603c8f070907150415v46a0afc3l53ed6a166fbabc1d@mail.gmail.com

On Wed, Jul 15, 2009 at 12:32 AM, Fujii Masao<masao(dot)fujii(at)gmail(dot)com> wrote:
> If the above is OK, should I update the patch ASAP, or
> hold off until more comments arrive? I'm concerned that
> frequent small updates would interfere with review.

I decided (perhaps foolishly) to assign reviewers for the two smaller
patches that you extracted from this first, and to hold off on
assigning a reviewer for the main patch until those reviews were
completed:

http://archives.postgresql.org/message-id/3f0b79eb0907022341m1d36a841x19c3e2a5a6906b5b@mail.gmail.com
http://archives.postgresql.org/message-id/3f0b79eb0907030037g515f3337o9092279c62348dc@mail.gmail.com

So I think you should update ASAP in this case. As soon as we get
some reviewers freed up from the initial reviewing round, I will
assign one or more reviewers to the main Sync Rep patch.

...Robert


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Synch Rep for CommitFest 2009-07
Date: 2009-07-15 16:45:03
Message-ID: 3f0b79eb0907150945r1d157d38y58c3fc53347f30d1@mail.gmail.com

Hi,

On Wed, Jul 15, 2009 at 8:15 PM, Robert Haas<robertmhaas(at)gmail(dot)com> wrote:
> So I think you should update ASAP in this case.

I updated the patch as described in
http://archives.postgresql.org/pgsql-hackers/2009-07/msg00865.php

All the other parts are still the same.

>  As soon as we get
> some reviewers freed up from the initial reviewing round, I will
> assign one or more reviewers to the main Sync Rep patch.

Thanks!

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment Content-Type Size
synch_rep_0716.tgz application/x-gzip 169.6 KB

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Synch Rep for CommitFest 2009-07
Date: 2009-07-15 21:03:25
Message-ID: 4A5E441D.1080004@enterprisedb.com

Fujii Masao wrote:
> Hi,
>
> On Wed, Jul 15, 2009 at 8:15 PM, Robert Haas<robertmhaas(at)gmail(dot)com> wrote:
>> So I think you should update ASAP in this case.
>
> I updated the patch as described in
> http://archives.postgresql.org/pgsql-hackers/2009-07/msg00865.php
>
> All the other parts are still the same.
>
>> As soon as we get
>> some reviewers freed up from the initial reviewing round, I will
>> assign one or more reviewers to the main Sync Rep patch.
>
> Thanks!

I don't think there's much point assigning more reviewers to Synch Rep
at this point. I believe we have consensus on four major changes:

1. Change the way synchronization is done when standby connects to
primary. After authentication, standby should send a message to primary,
stating the <begin> point (where <begin> is an XLogRecPtr, not a WAL
segment name). Primary starts streaming WAL starting from that point,
and keeps streaming forever. pg_read_xlogfile() needs to be removed.

2. The primary should have no business reading back from the archive.
The standby can read from the archive, as it can today.

3. Need to support multiple WALSenders. While multiple slave support
isn't 1st priority right now, it's not acceptable that a new WALSender
can't connect while one is active already. That can cause trouble in
case of network problems etc.

4. It is not acceptable that normal backends have to wait for walsender
to send data. That means that connecting a standby behind a slow
connection to the primary can grind the primary to a halt. walsender
needs to be able to read data from disk, not just from shared memory. (I
raised this back in December
http://archives.postgresql.org/message-id/495106FA.1050605@enterprisedb.com)

Those 4 things are big enough changes that I don't think there's much
left to review that won't be affected by those changes.

As a hint, I think you'll find it a lot easier if you implement only
asynchronous replication at first. That reduces the amount of
inter-process communication a lot. You can then add synchronous
capability in a later commitfest. I would also suggest that for point 4,
you implement WAL sender so that it *only* reads from disk at first, and
only add the capability to send from wal_buffers later on, and only if
performance testing shows that it's needed.

I'll move this to the "returned with feedback" section, but if you get those
things done quickly we can still give it another round of review in this
commitfest.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Synch Rep for CommitFest 2009-07
Date: 2009-07-15 21:59:34
Message-ID: 3B521971-E556-4793-8C96-903A78903E16@hi-media.com

On 15 Jul 2009, at 23:03, Heikki Linnakangas wrote:
> 2. The primary should have no business reading back from the archive.
> The standby can read from the archive, as it can today.

Sorry to insist, but I'm not sold on your consensus here, yet:
http://archives.postgresql.org/pgsql-hackers/2009-07/msg00486.php

There's a true need for the solution to be simple to install, and
providing a side channel for the standby to go read the archives
itself isn't it. Furthermore, the counter-argument against having the
primary able to send data from the archives to some standby is that it
should still work when primary's dead, but as this is only done in the
setup phase, I don't see that being able to continue preparing a not-
yet-ready standby against a dead primary is buying us anything.

Now, I tried proposing to implement an archive server as a postmaster
child to have a reference implementation of an archive command for
"basic" cases, and provide the ability to give data from the archive
to slave(s). But this is getting too much into the implementation
details for my current understanding of them :)

Regards,
--
dim


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Synch Rep for CommitFest 2009-07
Date: 2009-07-16 06:07:16
Message-ID: 4A5EC394.6080905@enterprisedb.com

Dimitri Fontaine wrote:
> On 15 Jul 2009, at 23:03, Heikki Linnakangas wrote:
>> 2. The primary should have no business reading back from the archive.
>> The standby can read from the archive, as it can today.
>
> Sorry to insist, but I'm not sold on your consensus here, yet:
> http://archives.postgresql.org/pgsql-hackers/2009-07/msg00486.php
>
> There's a true need for the solution to be simple to install, and
> providing a side channel for the standby to go read the archives itself
> isn't it.

I think a better way to address that need is to provide a built-in
mechanism for the standby to request a base backup and have it sent over
the wire. That makes the initial setup very easy.

> Furthermore, the counter-argument against having the primary
> able to send data from the archives to some standby is that it should
> still work when primary's dead, but as this is only done in the setup
> phase, I don't see that being able to continue preparing a not-yet-ready
> standby against a dead primary is buying us anything.

The situation arises also when the standby falls badly behind. A simple
solution to that is to add a switch in the master to specify "always
keep X MB of WAL in pg_xlog". The standby will then still find it in
pg_xlog, making it harder for a standby to fall so much behind that it
can't find the WAL it needs in the primary anymore. Tom suggested that
we can just give up and re-sync with a new base backup, but that really
requires built-in base backup capability, and is only practical for
small databases.
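
As a rough illustration of the "always keep X MB of WAL in pg_xlog"
switch, here is a minimal sketch of the arithmetic involved. It assumes
16 MB segments and plain 64-bit segment numbers; the function name and
the constant are hypothetical and do not come from the patch.

```c
#include <stdint.h>

#define SEG_SIZE_MB 16			/* assumption: default 16 MB WAL segments */

/*
 * Return the oldest segment number that must stay in pg_xlog when the
 * primary is told to keep at least keep_mb megabytes of WAL behind the
 * segment currently being written.  Hypothetical helper, not patch code.
 */
uint64_t
oldest_segment_to_keep(uint64_t current_segno, uint64_t keep_mb)
{
	/* Round up: a request for part of a segment still keeps a whole file. */
	uint64_t	keep_segs = (keep_mb + SEG_SIZE_MB - 1) / SEG_SIZE_MB;

	if (keep_segs >= current_segno)
		return 0;				/* not enough history yet: keep everything */
	return current_segno - keep_segs;
}
```

With such a rule, a checkpoint would only remove or recycle segments
older than the returned number, regardless of how far the standby has
read.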

I think we should definitely have both those features, but it's not
urgent. The replication works without them, although it requires that you
set up traditional archiving as well.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Synch Rep for CommitFest 2009-07
Date: 2009-07-16 07:53:22
Message-ID: 87fxcxnjwt.fsf@hi-media-techno.com

Hi,

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> I think a better way to address that need is to provide a built-in
> mechanism for the standby to request a base backup and have it sent over
> the wire. That makes the initial setup very easy.

Great idea :)

So I'll reproduce the sketch I did in this other mail, adding the 'base'
state where the prerequisite base backup is handled, that will help
clarify the next points:

0. base: slave asks the master for a base-backup, at the end of this it
reaches the base-lsn

1. init: slave asks the master for the current LSN and starts streaming WAL

2. setup: slave asks the master for the missing WAL from its base-lsn up to
the LSN it just got, and applies it all to reach the initial LSN (this
happens in parallel with 1.)

3. catchup: slave has replayed the missing WAL and is now replaying the
stream it received in parallel, which applies from the init LSN
(just reached)

4. sync: slave is applying the stream as it gets it, either as part of
the master transaction or not depending on the GUC settings
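
The five states above can be written down as a small state machine. This
is only a sketch under stated assumptions: the enum and function names
are hypothetical, and the progression is shown as strictly linear even
though steps 1 and 2 overlap in time.

```c
#include <assert.h>

/* Hypothetical names for the standby states listed above. */
typedef enum
{
	STANDBY_BASE,				/* 0. receiving the base backup */
	STANDBY_INIT,				/* 1. got current LSN, stream is starting */
	STANDBY_SETUP,				/* 2. fetching and replaying missing WAL */
	STANDBY_CATCHUP,			/* 3. replaying the stream buffered so far */
	STANDBY_SYNC				/* 4. applying the stream as it arrives */
} StandbyState;

/* Advance to the following state; 'sync' is the steady state. */
StandbyState
standby_next_state(StandbyState s)
{
	switch (s)
	{
		case STANDBY_BASE:
			return STANDBY_INIT;
		case STANDBY_INIT:
			return STANDBY_SETUP;
		case STANDBY_SETUP:
			return STANDBY_CATCHUP;
		case STANDBY_CATCHUP:
			return STANDBY_SYNC;
		case STANDBY_SYNC:
			return STANDBY_SYNC;
	}
	assert(0 && "unknown standby state");
	return s;
}
```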

> The situation arises also when the standby falls badly behind. A simple
> solution to that is to add a switch in the master to specify "always
> keep X MB of WAL in pg_xlog". The standby will then still find it in
> pg_xlog, making it harder for a standby to fall so much behind that it
> can't find the WAL it needs in the primary anymore. Tom suggested that
> we can just give up and re-sync with a new base backup, but that really
> requires built-in base backup capability, and is only practical for
> small databases.

I think that when the standby is back in business after a connection
glitch (or any other transient error), its current internal state is
still 'sync' and walreceiver asks for next LSN (RedoPTR?). Now, 2 cases
are possible:

a. primary still has it handy, so the standby is still in sync but
lagging behind (and primary knows how much)

b. primary is not able to provide the requested WAL entry, so the slave
is back to 'setup' state, with base-lsn the point reached just
before losing sync (the one walreceiver just asked for).

Now, a standby in 'setup' state isn't ready (yet), and for example
synchronous replication won't be possible in this state: we can't ask
the primary to refuse to COMMIT any transaction (holding it, eg) while a
standby hasn't reached 'sync' state.
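
The two reconnect cases can be sketched as a single comparison. This
assumes LSNs are flat 64-bit byte positions and that the primary knows
the oldest position it can still serve; all names here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum
{
	RECONNECT_STILL_SYNC,		/* case a: in sync, merely lagging */
	RECONNECT_BACK_TO_SETUP		/* case b: missing WAL must be fetched first */
} ReconnectResult;

/* Decide which case applies when walreceiver asks for requested_lsn. */
ReconnectResult
classify_reconnect(uint64_t requested_lsn, uint64_t oldest_lsn_on_primary)
{
	if (requested_lsn >= oldest_lsn_on_primary)
		return RECONNECT_STILL_SYNC;
	return RECONNECT_BACK_TO_SETUP;
}

/* In case a, the primary can tell how far behind the standby is. */
uint64_t
standby_lag_bytes(uint64_t primary_lsn, uint64_t requested_lsn)
{
	assert(primary_lsn >= requested_lsn);
	return primary_lsn - requested_lsn;
}

/* COMMIT may only be held for a standby that is in 'sync' state. */
bool
commit_may_wait_for_standby(ReconnectResult r)
{
	return r == RECONNECT_STILL_SYNC;
}
```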

The way you're talking about the issue makes me think there's a mix-up between
how to handle a lagging standby and an out-of-sync standby. For clarity,
I think we should have very distinct states and responses. And yes, as
Tom and you keep saying, a synced standby by definition should not need
any access to its primary's archives. So if it does, it's no longer in sync.

> I think we should definitely have both those features, but it's not
> urgent. The replication works without them, although it requires that you
> set up traditional archiving as well.

Agreed, it's not essential for the feature as far as hackers are
concerned.

Regards,
--
dim


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Synch Rep for CommitFest 2009-07
Date: 2009-07-16 08:28:42
Message-ID: 3f0b79eb0907160128h53d0c5feh7c4e4815c8471e67@mail.gmail.com

Hi,

On Thu, Jul 16, 2009 at 6:03 AM, Heikki
Linnakangas<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> I don't think there's much point assigning more reviewers to Synch Rep
> at this point. I believe we have consensus on four major changes:

Thanks for clarifying the issues! OK, I'll rework the patch.

> 1. Change the way synchronization is done when standby connects to
> primary. After authentication, standby should send a message to primary,
> stating the <begin> point (where <begin> is an XLogRecPtr, not a WAL
> segment name). Primary starts streaming WAL starting from that point,
> and keeps streaming forever. pg_read_xlogfile() needs to be removed.

I assume that <begin> should indicate the location of the last valid record.
In other words, the standby first tries to recover using only the XLOG
files that exist in its archive or pg_xlog. When it has reached the last valid
record, it requests the XLOG records that follow <begin> from the primary.
Is my understanding OK?

http://archives.postgresql.org/pgsql-hackers/2009-07/msg00475.php
As I described before, the XLOG file which the standby creates should be
recoverable. So, when <begin> points into the middle of an XLOG file, the
primary should start sending records from the head of the file that contains
<begin>. Is this OK?

Or should the primary start from <begin>? In that case, since we can
expect the incomplete file containing <begin> to exist on the standby
as well, the records following <begin> need to be appended to it.
And if that incomplete file was restored from the archive, it would need
to be renamed from its temporary name before being appended to.

A timeline/backup history file is also required for recovery, but it is not
found on the standby. So these files need to be shipped from the primary, and
this capability is provided by pg_read_xlogfile(). If we remove that function,
how should we transfer the history files? Is a function similar to
pg_read_xlogfile(), one that takes the filename as an argument, still
necessary?

> 2. The primary should have no business reading back from the archive.
> The standby can read from the archive, as it can today.

In this case, a backup history file should be kept in pg_xlog for a while,
because the standby might request it. Currently pg_start_backup()
removes the previous backup history file right away. Should we introduce
a new GUC parameter to determine how many backup history files are
kept in pg_xlog?

CHECKPOINT should not recycle the XLOG files following the file which
is requested by the standby in that moment. So, we need to tweak the
recycling policy.
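
A minimal sketch of such a tweak, assuming segments are identified by
plain 64-bit numbers and that the checkpointer can see which segment the
standby is currently requesting. The function names and parameters are
hypothetical, not taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Segments numbered below the returned cutoff may be recycled.  Without a
 * standby the cutoff is the checkpoint's redo segment; with a standby it
 * is pulled back to the segment the standby is requesting, if that is older.
 */
uint64_t
recycle_cutoff(uint64_t checkpoint_redo_segno,
			   bool standby_connected,
			   uint64_t standby_requested_segno)
{
	if (standby_connected && standby_requested_segno < checkpoint_redo_segno)
		return standby_requested_segno;
	return checkpoint_redo_segno;
}

/* True if the given segment is old enough to be recycled. */
bool
segment_may_be_recycled(uint64_t segno, uint64_t cutoff)
{
	return segno < cutoff;
}
```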

> 3. Need to support multiple WALSenders. While multiple slave support
> isn't 1st priority right now, it's not acceptable that a new WALSender
> can't connect while one is active already. That can cause trouble in
> case of network problems etc.

Sorry, I didn't get your point. If multiple slave support isn't a first
priority, why is a multiple-walsender mechanism necessary?
Can you describe the problem cases in more detail?

> 4. It is not acceptable that normal backends have to wait for walsender
> to send data.

Umm... this is true in the asynchronous replication case. It is also true
while the standby is catching up with the primary. Once the servers are in
sync, the backend should wait for walsender to send data (and
also for walreceiver to write/fsync it) before returning "success" for COMMIT
to the client. Is my understanding right?

In the current Synch Rep, the backend basically doesn't wait for walsender in
asynchronous mode. Only when wal_buffers is filled with unsent data does
the backend wait for walsender to send data, because there is no room to
insert new data. Are you suggesting that only this problem case needs to be
solved?

> That means that connecting a standby behind a slow
> connection to the primary can grind the primary to a halt.

This is the fate of *synchronous* replication, isn't it? If a user wants to get
around such a problem, asynchronous mode should be chosen, I think.

> walsender
> needs to be able to read data from disk, not just from shared memory. (I
> raised this back in December
> http://archives.postgresql.org/message-id/495106FA.1050605@enterprisedb.com)

OK, I'll try it.

> As a hint, I think you'll find it a lot easier if you implement only
> asynchronous replication at first. That reduces the amount of
> inter-process communication a lot. You can then add synchronous
> capability in a later commitfest. I would also suggest that for point 4,
> you implement WAL sender so that it *only* reads from disk at first, and
> only add the capability to send from wal_buffers later on, and only if
> performance testing shows that it's needed.

Sounds good. I'll advance development in stages as you suggested.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Synch Rep for CommitFest 2009-07
Date: 2009-07-16 09:00:07
Message-ID: 4A5EEC17.6010903@enterprisedb.com

Fujii Masao wrote:
> On Thu, Jul 16, 2009 at 6:03 AM, Heikki
> Linnakangas<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> 1. Change the way synchronization is done when standby connects to
>> primary. After authentication, standby should send a message to primary,
>> stating the <begin> point (where <begin> is an XLogRecPtr, not a WAL
>> segment name). Primary starts streaming WAL starting from that point,
>> and keeps streaming forever. pg_read_xlogfile() needs to be removed.
>
> I assume that <begin> should indicate the location of the last valid record.
> In other words, at first the standby tries to recover by using only the XLOG
> files which exist in its archive or pg_xlog. When it has reached the last valid
> record, it requests the XLOG records which follow <begin> to the primary.
> Is my understanding OK?

Yes.

> http://archives.postgresql.org/pgsql-hackers/2009-07/msg00475.php
> As I described before, the XLOG file which the standby creates should be
> recoverable. So, when <begin> indicates the middle of the XLOG file, the
> primary should start sending the records from the head of the file including
> <begin>. Is this OK?
>
> Or, the primary should start from <begin>? In this case, since we can
> expect that the incomplete file including <begin> would exist in also the
> standby, the records following <begin> need to be appended into it.

I would expect the standby to append to the partial XLOG file.
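
A minimal sketch of what appending at <begin> involves: finding the
segment file that contains the position and the offset at which writing
resumes. It assumes a flat 64-bit byte position instead of the real
XLogRecPtr type and 16 MB segments; all names are hypothetical.

```c
#include <stdint.h>

/* Assumption: 16 MB WAL segments, the default build-time size. */
#define SEG_SIZE ((uint64_t) 16 * 1024 * 1024)

/* Hypothetical result type: which segment file, and where in it to write. */
typedef struct
{
	uint64_t	segno;			/* segment number, counted from position 0 */
	uint64_t	offset;			/* byte offset of <begin> inside the segment */
} SegPos;

/* Map a flat byte position to the segment and offset that contain it. */
SegPos
locate_begin(uint64_t begin)
{
	SegPos		pos;

	pos.segno = begin / SEG_SIZE;
	pos.offset = begin % SEG_SIZE;
	return pos;
}

/* Bytes still missing between <begin> and the end of its segment. */
uint64_t
bytes_left_in_segment(uint64_t begin)
{
	return SEG_SIZE - (begin % SEG_SIZE);
}
```

The standby would open the partial file for segment `segno`, seek to
`offset`, and write the streamed bytes from there.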

> And, if that incomplete file is the restored one from archive, it would need
> to be renamed from a temporary name before being appended.

The archive should not normally contain partial XLOG files, only if you
manually copy one there after primary has crashed. So I don't think
that's something we need to support.

> A timeline/backup history file is also required for recovery, but it's not
> found in the standby. So, they need to be shipped from the primary, and
> this capability is provided by pg_read_xlogfile(). If removing the function,
> how should we transfer those history files? The function similar to
> pg_read_xlogfile() with which the filename needs to be specified is still
> necessary?

Hmm. You only need the timeline history file if the base backup was
taken in an earlier timeline. That situation would only arise if you
(manually) take a base backup, restore to a server (which creates a new
timeline), and then create a slave against that server. At least in the
1st phase, I think we can assume that the standby has access to the same
archive, and will find the history file from there. If not, throw an
error. We can add more bells and whistles later.

> CHECKPOINT should not recycle the XLOG files following the file which
> is requested by the standby in that moment. So, we need to tweak the
> recycling policy.

Yep.

>> 3. Need to support multiple WALSenders. While multiple slave support
>> isn't 1st priority right now, it's not acceptable that a new WALSender
>> can't connect while one is active already. That can cause trouble in
>> case of network problems etc.
>
> Sorry, I didn't get your point. You think multiple slave support isn't 1st
> priority, and yet why should multiple walsender mechanism be necessary?
> Can you describe the problem cases in more detail?

As the patch stands, new walsender connections are refused when one is
active already. What if the walsender connection is in a zombie state?
For example, it's trying to send WAL to the slave, but the network
connection is down, and the packets are going to a black hole. It will
take a while for the TCP layer to declare the connection dead, and close
the socket. During that time, you can't connect a new slave to the
master, or the same slave using a better network connection.

The most robust way to fix that is to support multiple walsenders. The
zombie walsender can take its time to die, while the new walsender
serves the new connection. You could tweak SO_TIMEOUTs and stuff, but
even then the standby process could be in some weird hung state.

And of course, when we get around to adding support for multiple slaves,
we'll have to do that anyway. Better get it right to begin with.
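
To make the zombie scenario concrete, here is a minimal sketch of a
fixed array of walsender slots. The slot structure, the limit of four
and the function names are all assumptions made for illustration; real
code would keep this in shared memory under a lock.

```c
#include <stdbool.h>

#define MAX_WALSENDERS 4		/* hypothetical limit */

typedef struct
{
	bool		in_use;
	int			pid;			/* process serving this connection */
} WalSenderSlot;

/* Mark every slot as free. */
void
walsender_slots_init(WalSenderSlot *slots, int nslots)
{
	for (int i = 0; i < nslots; i++)
	{
		slots[i].in_use = false;
		slots[i].pid = 0;
	}
}

/* Claim the first free slot; return its index, or -1 if all are taken. */
int
walsender_slot_acquire(WalSenderSlot *slots, int nslots, int pid)
{
	for (int i = 0; i < nslots; i++)
	{
		if (!slots[i].in_use)
		{
			slots[i].in_use = true;
			slots[i].pid = pid;
			return i;
		}
	}
	return -1;
}

/* Free a slot once its walsender has exited. */
void
walsender_slot_release(WalSenderSlot *slots, int nslots, int slot)
{
	if (slot >= 0 && slot < nslots)
		slots[slot].in_use = false;
}
```

A zombie walsender keeps holding its slot until TCP gives up, while a
reconnecting standby simply takes the next free one.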

>> 4. It is not acceptable that normal backends have to wait for walsender
>> to send data.
>
> Umm... this is true in asynchronous replication case. Also true while the
> standby is catching up with the primary. After those servers get into
> synchronization, the backend should wait for walsender to send data (and
> also walreceiver to write/fsync data) before returning "success" of COMMIT
> to the client. Is my understanding right?

Even in synchronous replication, a backend should only have to wait when
it commits. You would only see the difference with very large
transactions that write more WAL than fits in wal_buffers, though, like
data loading.
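
Stated as a predicate, the rule looks roughly like this. It is a sketch
only: the mode enum, the flat 64-bit LSNs and the notion of an
acknowledged position are assumptions, not the patch's actual interface.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum
{
	REPL_ASYNC,					/* hypothetical: never wait for the standby */
	REPL_SYNC					/* hypothetical: wait at commit only */
} ReplMode;

/*
 * A backend has to wait only when it is committing in synchronous mode and
 * the standby has not yet acknowledged WAL up to the commit record.
 * Filling wal_buffers in the middle of a transaction must never block here.
 */
bool
backend_must_wait(ReplMode mode, bool is_commit,
				  uint64_t commit_lsn, uint64_t acked_lsn)
{
	if (mode != REPL_SYNC || !is_commit)
		return false;
	return acked_lsn < commit_lsn;
}
```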

> In current Synch Rep, the backend basically doesn't wait for walsender in
> asynchronous mode. But only when wal_buffers is filled with unsent data,
> the backend waits for walsender to send data because there is no room to
> insert new data. You suggest only that this problem case should be solved?

Right, that is the problem.

>> That means that connecting a standby behind a slow
>> connection to the primary can grind the primary to a halt.
>
> This is the fate of *synchronous* replication, isn't it? If a user wants to get
> around such a problem, asynchronous mode should be chosen, I think.

Right. But as the patch stands, asynchronous mode has the same problem,
which is not acceptable.

> Sounds good. I'll advance development in stages as you suggested.

Thanks!

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Rick Gigger <rick(at)alpinenetworking(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Synch Rep for CommitFest 2009-07
Date: 2009-07-16 09:34:56
Message-ID: DA346E39-AE5A-4DB3-A3AA-60566DE1978C@alpinenetworking.com

On Jul 16, 2009, at 12:07 AM, Heikki Linnakangas wrote:

> Dimitri Fontaine wrote:
>> On 15 Jul 2009, at 23:03, Heikki Linnakangas wrote:
>> Furthermore, the counter-argument against having the primary
>> able to send data from the archives to some standby is that it should
>> still work when primary's dead, but as this is only done in the setup
>> phase, I don't see that being able to continue preparing a not-yet-ready
>> standby against a dead primary is buying us anything.
>
> The situation arises also when the standby falls badly behind. A simple
> solution to that is to add a switch in the master to specify "always
> keep X MB of WAL in pg_xlog". The standby will then still find it in
> pg_xlog, making it harder for a standby to fall so much behind that it
> can't find the WAL it needs in the primary anymore. Tom suggested that
> we can just give up and re-sync with a new base backup, but that really
> requires built-in base backup capability, and is only practical for
> small databases.

If you use an rsync-like algorithm for doing the base backups, wouldn't
that increase the size of the database for which it would still be
practical to just re-sync? Couldn't you in fact sync a very large
database if the amount of actual change in the files was a small
percentage of the total size?


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Rick Gigger <rick(at)alpinenetworking(dot)com>
Cc: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Synch Rep for CommitFest 2009-07
Date: 2009-07-16 15:41:37
Message-ID: 4A5F4A31.5050701@enterprisedb.com
Lists: pgsql-hackers

Rick Gigger wrote:
> If you use an rsync-like algorithm for doing the base backups, wouldn't
> that increase the size of the database for which it would still be
> practical to just re-sync? Couldn't you in fact sync a very large
> database if the amount of actual change in the files was a small
> percentage of the total size?

It would certainly help to reduce the network traffic, though you'd
still have to scan all the data to see what has changed.
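To illustrate the trade-off, here is a minimal sketch of a block-wise comparison. It is not rsync's actual rolling-checksum algorithm, only a simplified fixed-block variant; the 8 kB block size and the function names are assumptions made for the example. Only the changed blocks would cross the network, but both sides still hash every block they hold.

```python
import hashlib

BLOCK_SIZE = 8192  # assumption: compare in fixed 8 kB blocks


def block_digests(data, block_size=BLOCK_SIZE):
    """Hash every block of `data`. Note that this reads all of it."""
    return [
        hashlib.sha1(data[off:off + block_size]).digest()
        for off in range(0, len(data), block_size)
    ]


def changed_blocks(source, target, block_size=BLOCK_SIZE):
    """Return the indexes of source blocks that differ from target."""
    src = block_digests(source, block_size)
    dst = block_digests(target, block_size)
    return [
        i for i, digest in enumerate(src)
        if i >= len(dst) or digest != dst[i]
    ]


# Four identical blocks, then a single byte changed in the third one.
old = bytes(BLOCK_SIZE * 4)
new = bytearray(old)
new[BLOCK_SIZE * 2] = 1
print(changed_blocks(bytes(new), old))  # [2]
```

Only one block out of four needs to be transferred here, yet all four were read and hashed on each side.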

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Rick Gigger <rick(at)alpinenetworking(dot)com>, Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Synch Rep for CommitFest 2009-07
Date: 2009-07-16 17:09:19
Message-ID: 407d949e0907161009s122c95b3ia567316ab50ff7be@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jul 16, 2009 at 4:41 PM, Heikki
Linnakangas<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> Rick Gigger wrote:
>> If you use an rsync-like algorithm for doing the base backups, wouldn't
>> that increase the size of the database for which it would still be
>> practical to just re-sync?  Couldn't you in fact sync a very large
>> database if the amount of actual change in the files was a small
>> percentage of the total size?
>
> It would certainly help to reduce the network traffic, though you'd
> still have to scan all the data to see what has changed.

The fundamental problem with pushing users to start over with a new
base backup is that there's no relationship between the size of the
WAL and the size of the database.

You can plausibly have a system with extremely high transaction rate
generating WAL very quickly, but where the whole database fits in a
few hundred megabytes. In that case you could be behind by only a few
minutes and have it be faster to take a new base backup.

Or you could have a petabyte database which is rarely updated. In
which case it might be faster to apply weeks' worth of logs than to
try to take a base backup.

Only the sysadmin is actually going to know which makes more sense.
Unless we start tying WAL parameters to the database size or
something like that.
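The two cases can be put into a rough back-of-the-envelope estimate. This is only a sketch: the function name, the throughput figures and the backlog sizes are all made-up assumptions, and it ignores the WAL that still has to be replayed after a fresh base backup.

```python
def faster_option(db_bytes, wal_backlog_bytes, copy_rate, replay_rate):
    """Return whichever catch-up method is estimated to finish first.

    All rates are in bytes per second and are hypothetical inputs that
    only the sysadmin can really supply.
    """
    backup_secs = db_bytes / copy_rate
    replay_secs = wal_backlog_bytes / replay_rate
    return "base backup" if backup_secs < replay_secs else "wal replay"


# Small but very busy database: 300 MB of data, 10 GB of WAL backlog.
print(faster_option(300e6, 10e9, 100e6, 20e6))  # base backup

# Petabyte database that is rarely updated: 50 GB of WAL backlog.
print(faster_option(1e15, 50e9, 100e6, 20e6))   # wal replay
```

Neither input can be derived from the other, which is why a fixed rule built into the server would be wrong for one of these two systems.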

--
greg
http://mit.edu/~gsstark/resume.pdf


From: Rick Gigger <rick(at)alpinenetworking(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Synch Rep for CommitFest 2009-07
Date: 2009-07-16 18:45:26
Message-ID: 2493CEC8-2752-42E9-85C8-5D75F27D9C3A@alpinenetworking.com
Lists: pgsql-hackers

On Jul 16, 2009, at 11:09 AM, Greg Stark wrote:

> On Thu, Jul 16, 2009 at 4:41 PM, Heikki
> Linnakangas<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> Rick Gigger wrote:
>>> If you use an rsync-like algorithm for doing the base backups, wouldn't
>>> that increase the size of the database for which it would still be
>>> practical to just re-sync? Couldn't you in fact sync a very large
>>> database if the amount of actual change in the files was a small
>>> percentage of the total size?
>>
>> It would certainly help to reduce the network traffic, though you'd
>> still have to scan all the data to see what has changed.
>
> The fundamental problem with pushing users to start over with a new
> base backup is that there's no relationship between the size of the
> WAL and the size of the database.
>
> You can plausibly have a system with extremely high transaction rate
> generating WAL very quickly, but where the whole database fits in a
> few hundred megabytes. In that case you could be behind by only a few
> minutes and have it be faster to take a new base backup.
>
> Or you could have a petabyte database which is rarely updated. In
> which case it might be faster to apply weeks' worth of logs than to
> try to take a base backup.
>
> Only the sysadmin is actually going to know which makes more sense.
> Unless we start tying WAL parameters to the database size or
> something like that.

Once again, wouldn't an rsync-like algorithm help here? Couldn't you
have the default be to just create a new base backup for them, but
then allow them to specify an existing base backup if they've already
got one?


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Rick Gigger <rick(at)alpinenetworking(dot)com>, Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Synch Rep for CommitFest 2009-07
Date: 2009-07-16 19:47:54
Message-ID: 603c8f070907161247y486bba9di8178d9dcf681f367@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jul 16, 2009 at 1:09 PM, Greg Stark<gsstark(at)mit(dot)edu> wrote:
> On Thu, Jul 16, 2009 at 4:41 PM, Heikki
> Linnakangas<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> Rick Gigger wrote:
>>> If you use an rsync-like algorithm for doing the base backups wouldn't
>>> that increase the size of the database for which it would still be
>>> practical to just re-sync?  Couldn't you in fact sync a very large
>>> database if the amount of actual change in the files was a small
>>> percentage of the total size?
>>
>> It would certainly help to reduce the network traffic, though you'd
>> still have to scan all the data to see what has changed.
>
> The fundamental problem with pushing users to start over with a new
> base backup is that there's no relationship between the size of the
> WAL and the size of the database.
>
> You can plausibly have a system with extremely high transaction rate
> generating WAL very quickly, but where the whole database fits in a
> few hundred megabytes. In that case you could be behind by only a few
> minutes and have it be faster to take a new base backup.
>
> Or you could have a petabyte database which is rarely updated. In
> which case it might be faster to apply weeks' worth of logs than to
> try to take a base backup.
>
> Only the sysadmin is actually going to know which makes more sense.
> Unless we start tying WAL parameters to the database size or
> something like that.

I think we need a way for the master to know who its slaves are and
keep any given bit of WAL available until all slaves have successfully
read it, just as we keep each WAL file until we successfully copy it
to the archive. Otherwise, there's no way to be sure that a
connection break won't result in the need for a new base backup. (In
a way, a slave is very similar to an additional archive.)
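As a sketch of that retention rule (not code from the patch): the master would track the WAL position each slave has confirmed, treat the archiver as one more consumer, and recycle only segments that lie entirely below the lowest of those positions. The function names, the dictionary of slave positions and the integer segment numbering are assumptions made for illustration; 16 MB is the default WAL segment size.

```python
SEGMENT_SIZE = 16 * 1024 * 1024  # default WAL segment size, in bytes


def oldest_needed_position(archived_upto, slave_positions):
    """Lowest WAL byte position that some consumer has not read yet."""
    return min([archived_upto] + list(slave_positions.values()))


def removable_segments(segments, archived_upto, slave_positions):
    """Segment numbers that every consumer has finished reading.

    Segment s covers bytes [s * SEGMENT_SIZE, (s + 1) * SEGMENT_SIZE).
    """
    keep_from = oldest_needed_position(archived_upto, slave_positions) // SEGMENT_SIZE
    return [s for s in segments if s < keep_from]


segments = [0, 1, 2, 3, 4, 5]
archived_upto = 5 * SEGMENT_SIZE
slaves = {"a": 3 * SEGMENT_SIZE + 100, "b": 4 * SEGMENT_SIZE}

# Slave "a" is still reading segment 3, so only 0..2 may be recycled.
print(removable_segments(segments, archived_upto, slaves))  # [0, 1, 2]
```

The cost of this rule is that a slave that never reconnects pins WAL on the master indefinitely, so some upper bound or a way to drop a slave would still be needed.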

...Robert


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Synch Rep for CommitFest 2009-07
Date: 2009-07-17 09:15:11
Message-ID: 3f0b79eb0907170215n5765442bw4ea99b031199ba5b@mail.gmail.com
Lists: pgsql-hackers

Hi,

On Thu, Jul 16, 2009 at 6:00 PM, Heikki
Linnakangas<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> The archive should not normally contain partial XLOG files, only if you
> manually copy one there after primary has crashed. So I don't think
> that's something we need to support.

You are right. And, if the last valid record is in the middle of the
restored file (e.g. because of an XLOG_SWITCH record), <begin> should
point to the start of the next file.

> Hmm. You only need the timeline history file if the base backup was
> taken in an earlier timeline. That situation would only arise if you
> (manually) take a base backup, restore to a server (which creates a new
> timeline), and then create a slave against that server. At least in the
> 1st phase, I think we can assume that the standby has access to the same
> archive, and will find the history file from there. If not, throw an
> error. We can add more bells and whistles later.

Okay, I'll set the history file problem aside for possible later consideration.

> As the patch stands, new walsender connections are refused when one is
> active already. What if the walsender connection is in a zombie state?
> For example, it's trying to send WAL to the slave, but the network
> connection is down, and the packets are going to a black hole. It will
> take a while for the TCP layer to declare the connection dead, and close
> the socket. During that time, you can't connect a new slave to the
> master, or the same slave using a better network connection.
>
> The most robust way to fix that is to support multiple walsenders. The
> zombie walsender can take its time to die, while the new walsender
> serves the new connection. You could tweak SO_TIMEOUTs and stuff, but
> even then the standby process could be in some weird hung state.
>
> And of course, when we get around to add support for multiple slaves,
> we'll have to do that anyway. Better get it right to begin with.

Thanks for the detailed description! I was thinking that a new GUC
replication_timeout and some keepalive parameters would be enough to
help with such trouble. But I agree that supporting multiple walsenders
is the better solution, so I'll work on this problem.
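For reference, the keepalive approach amounts to something like the sketch below. It is an illustration only, not part of the patch: the helper name and the timing values are assumptions, and the TCP_KEEP* options are platform-specific, so they are applied only where the socket module exposes them.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Ask the TCP layer to detect a dead peer sooner than its defaults.

    idle:     seconds of silence before the first probe is sent
    interval: seconds between probes
    count:    unanswered probes before the connection is declared dead
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    tuning = (
        ("TCP_KEEPIDLE", idle),
        ("TCP_KEEPINTVL", interval),
        ("TCP_KEEPCNT", count),
    )
    for name, value in tuning:
        option = getattr(socket, name, None)  # missing on some platforms
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
sock.close()
```

Even with these settings, a dead connection is noticed only after roughly idle + interval * count seconds, and a hung standby process whose kernel still answers the probes is not detected at all.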

> Even in synchronous replication, a backend should only have to wait when
> it commits. You would only see the difference with very large
> transactions that write more WAL than fits in wal_buffers, though, like
> data loading.

That's right.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Rick Gigger <rick(at)alpinenetworking(dot)com>, Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Synch Rep for CommitFest 2009-07
Date: 2009-07-17 09:37:01
Message-ID: 3f0b79eb0907170237p5d34176k5be97adcc7e3d6d7@mail.gmail.com
Lists: pgsql-hackers

Hi,

On Fri, Jul 17, 2009 at 2:09 AM, Greg Stark<gsstark(at)mit(dot)edu> wrote:
> Only the sysadmin is actually going to know which makes more sense.
> Unless we start tying WAL parameters to the database size or
> something like that.

Agreed. And, if a user doesn't want to make a new base backup because
of a large database, s/he can manually copy the archived WAL files to the
standby before starting it, and make it use them for its recovery.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center