Streaming replication, retrying from archive

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Streaming replication, retrying from archive
Date: 2010-01-14 14:15:14
Message-ID: 4B4F26F2.3060001@enterprisedb.com
Lists: pgsql-hackers

Imagine this scenario:

1. Master is up and running, standby is connected and streaming happily
2. Network goes down, connection is broken.
3. Standby falls behind a lot. Old WAL files that the standby needs are
archived, and deleted from master.
4. Network is restored. Standby reconnects
5. Standby will get an error because the WAL file it needs is not in the
master anymore.

What will currently happen is:

6. Standby retries connecting and failing indefinitely, until the admin
restarts it.

What we would *like* to happen is:

6. Standby fetches the missing WAL files from archive, then reconnects
and continues streaming.

Can we fix that?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming replication, retrying from archive
Date: 2010-01-14 14:36:11
Message-ID: 603c8f071001140636p2551a086w451a9bb0a96fa2c4@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jan 14, 2010 at 9:15 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> Imagine this scenario:
>
> 1. Master is up and running, standby is connected and streaming happily
> 2. Network goes down, connection is broken.
> 3. Standby falls behind a lot. Old WAL files that the standby needs are
> archived, and deleted from master.
> 4. Network is restored. Standby reconnects
> 5. Standby will get an error because the WAL file it needs is not in the
> master anymore.
>
> What will currently happen is:
>
> 6. Standby retries connecting and failing indefinitely, until the admin
> restarts it.
>
> What we would *like* to happen is:
>
> 6. Standby fetches the missing WAL files from archive, then reconnects
> and continues streaming.
>
> Can we fix that?

Just MHO here, but this seems like a bigger project than we should be
starting at this stage of the game.

...Robert


From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming replication, retrying from archive
Date: 2010-01-14 14:39:59
Message-ID: 9837222c1001140639q365ca452p8daf7301133d5317@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jan 14, 2010 at 15:36, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Thu, Jan 14, 2010 at 9:15 AM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> Imagine this scenario:
>>
>> 1. Master is up and running, standby is connected and streaming happily
>> 2. Network goes down, connection is broken.
>> 3. Standby falls behind a lot. Old WAL files that the standby needs are
>> archived, and deleted from master.
>> 4. Network is restored. Standby reconnects
>> 5. Standby will get an error because the WAL file it needs is not in the
>> master anymore.
>>
>> What will currently happen is:
>>
>> 6. Standby retries connecting and failing indefinitely, until the admin
>> restarts it.
>>
>> What we would *like* to happen is:
>>
>> 6. Standby fetches the missing WAL files from archive, then reconnects
>> and continues streaming.
>>
>> Can we fix that?
>
> Just MHO here, but this seems like a bigger project than we should be
> starting at this stage of the game.

+1.

We want this eventually (heck, it'd be awesome!), but let's get what
we have now stable first.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming replication, retrying from archive
Date: 2010-01-14 15:23:26
Message-ID: 4B4F36EE.5070401@enterprisedb.com
Lists: pgsql-hackers

Magnus Hagander wrote:
> On Thu, Jan 14, 2010 at 15:36, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Thu, Jan 14, 2010 at 9:15 AM, Heikki Linnakangas
>> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>>> Imagine this scenario:
>>>
>>> 1. Master is up and running, standby is connected and streaming happily
>>> 2. Network goes down, connection is broken.
>>> 3. Standby falls behind a lot. Old WAL files that the standby needs are
>>> archived, and deleted from master.
>>> 4. Network is restored. Standby reconnects
>>> 5. Standby will get an error because the WAL file it needs is not in the
>>> master anymore.
>>>
>>> What will currently happen is:
>>>
>>> 6. Standby retries connecting and failing indefinitely, until the admin
>>> restarts it.
>>>
>>> What we would *like* to happen is:
>>>
>>> 6. Standby fetches the missing WAL files from archive, then reconnects
>>> and continues streaming.
>>>
>>> Can we fix that?
>> Just MHO here, but this seems like a bigger project than we should be
>> starting at this stage of the game.
>
> +1.
>
> We want this eventually (heck, it'd be awesome!), but let's get what
> we have now stable first.

If we don't fix that within the server, we will need to document that
caveat and every installation will need to work around that one way or
another. Maybe with some monitoring software and an automatic restart. Ugh.

I wasn't really asking if it's possible to fix, I meant "Let's think
about *how* to fix that".

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming replication, retrying from archive
Date: 2010-01-14 16:06:36
Message-ID: 87d41c7jsj.fsf@hi-media-techno.com
Lists: pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> If we don't fix that within the server, we will need to document that
> caveat and every installation will need to work around that one way or
> another. Maybe with some monitoring software and an automatic restart. Ugh.
>
> I wasn't really asking if it's possible to fix, I meant "Let's think
> about *how* to fix that".

Did I mention my viewpoint on that already?
http://archives.postgresql.org/pgsql-hackers/2009-07/msg00943.php

It could well be I'm talking about things that have no relation at all
to what is in the patch currently, and that make no sense for where we
want the patch to go. But I'd like to know about that so that I'm not
banging my head on the nearest wall each time the topic surfaces.

Regards,
--
dim


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming replication, retrying from archive
Date: 2010-01-14 16:28:03
Message-ID: 3f0b79eb1001140828ne2f6235q86b1604adbe80339@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jan 15, 2010 at 12:23 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> If we don't fix that within the server, we will need to document that
> caveat and every installation will need to work around that one way or
> another. Maybe with some monitoring software and an automatic restart. Ugh.
>
> I wasn't really asking if it's possible to fix, I meant "Let's think
> about *how* to fix that".

OK. How about the following (though it's a rough design)?

(1) If walsender cannot read the WAL file because of ENOENT, it sends a
    special message indicating that error to walreceiver. This message is
    shipped on the COPY protocol.

(2-a) If the message arrives, walreceiver exits by using proc_exit().

(3-a) If the startup process detects the exit of walreceiver in
    WaitNextXLogAvailable(), it switches back to a normal archive recovery
    mode, closes the currently opened WAL file, resets some variables
    (readId, readSeg, etc.), and calls FetchRecord() again. Then it tries
    to restore the WAL file from the archive if the restore_command is
    supplied, and switches to a streaming recovery mode again when invalid
    WAL is found.

Or

(2-b) If the message arrives, walreceiver executes restore_command, and
    then sets receivedUpto to the end location of the restored WAL file.
    The restored file is expected to be fully filled, because it doesn't
    exist in the primary's pg_xlog, so that update of receivedUpto is OK.

(3-b) After one WAL file is restored, walreceiver tries to connect to the
    primary and starts replication again. If the ENOENT error occurs
    again, we go back to (1).

I like the latter approach since it's simpler. Thoughts?
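
For (1), the walsender side could be roughly like the following sketch; the
message format and the helper name are just made up for illustration, this
is not actual patch code:

    #include "libpq/libpq.h"        /* pq_flush() */
    #include "libpq/pqformat.h"     /* pq_beginmessage() and friends */

    /*
     * Hypothetical helper for walsender: tell the standby that the WAL
     * segment it asked for has already been removed.  Called when the
     * open()/read() of the segment fails with ENOENT.
     */
    static void
    WalSndSendWalRemovedMessage(void)
    {
        StringInfoData buf;

        pq_beginmessage(&buf, 'd');     /* CopyData */
        pq_sendbyte(&buf, 'e');         /* invented code: "WAL file removed" */
        pq_endmessage(&buf);
        pq_flush();
    }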

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming replication, retrying from archive
Date: 2010-01-14 16:43:51
Message-ID: 3f0b79eb1001140843r42df0b9bt765991a048181f07@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jan 15, 2010 at 1:06 AM, Dimitri Fontaine
<dfontaine(at)hi-media(dot)com> wrote:
> Did I mention my viewpoint on that already?
>  http://archives.postgresql.org/pgsql-hackers/2009-07/msg00943.php

> 0. base: slave asks the master for a base-backup, at the end of this it
> reaches the base-lsn

What if the WAL file containing the archive recovery starting location has
been removed from the primary's pg_xlog before the end of the online backup
(i.e., step 0)? Where should the standby get such a WAL file from, and how?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming replication, retrying from archive
Date: 2010-01-14 19:36:07
Message-ID: m2ska8zdg8.fsf@hi-media.com
Lists: pgsql-hackers

Fujii Masao <masao(dot)fujii(at)gmail(dot)com> writes:
> On Fri, Jan 15, 2010 at 1:06 AM, Dimitri Fontaine <dfontaine(at)hi-media(dot)com> wrote:
>> 0. base: slave asks the master for a base-backup, at the end of this it
>> reaches the base-lsn
>
> What if the WAL file including the archive recovery starting location has
> been removed from the primary's pg_xlog before the end of online-backup
> (i.e., the procedure 0)? Where should the standby get such a WAL file from?
> How?

I guess it would be perfectly sensible for 8.5, given the timeframe, to
not implement this as part of SR, but tell our users they need to make a
base backup themselves.

If, after that, the first WAL we need from the master isn't available, 8.5
SR should maybe only issue an ERROR with a HINT explaining how to avoid
running into the problem when trying again.
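
Something in this spirit, I mean (the wording and the variable are only an
example, not what the code should literally say):

    /* xlogfname here stands for whatever segment we could not obtain */
    ereport(ERROR,
            (errmsg("WAL file \"%s\" has already been removed from the primary",
                    xlogfname),
             errhint("Take a new base backup of the primary and try again.")));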

But how we handle failures when transitioning from one state to the
other should be a lot easier to discuss and decide as soon as we have
the possible states and the transitions we want to allow and support. I
think.

My guess is that those states and transitions are in the code, but not
explicit, so that each time we talk about how to handle the error cases
we have to be extra verbose and we risk not talking about exactly the
same thing. Naming the states should make those arrangements easier, I
should think. Not sure if it would help with the time constraint now,
though.

Regards,
--
dim


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming replication, retrying from archive
Date: 2010-01-14 20:13:46
Message-ID: 603c8f071001141213p6926df04xc4e5ac489c54ec04@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jan 14, 2010 at 10:23 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> I wasn't really asking if it's possible to fix, I meant "Let's think
> about *how* to fix that".

Well... maybe if it doesn't require too MUCH thought.

I'm thinking that HS+SR are going to be a bit like the Windows port -
they're going to require a few releases before they really work as
well as we'd like them to.

...Robert


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming replication, retrying from archive
Date: 2010-01-14 21:02:50
Message-ID: 29569.1263502970@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> I'm thinking that HS+SR are going to be a bit like the Windows port -
> they're going to require a few releases before they really work as
> well as we'd like them to.

I've assumed that from the get-go ;-). It's one of the reasons that
we ought to label this release 9.0 if those features get in. Such a
number would help clue folks in that there might be some less than
entirely stable things about it.

regards, tom lane


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming replication, retrying from archive
Date: 2010-01-14 22:19:48
Message-ID: 4B4F9884.3090009@enterprisedb.com
Lists: pgsql-hackers

Fujii Masao wrote:
> On Fri, Jan 15, 2010 at 12:23 AM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> If we don't fix that within the server, we will need to document that
>> caveat and every installation will need to work around that one way or
>> another. Maybe with some monitoring software and an automatic restart. Ugh.
>>
>> I wasn't really asking if it's possible to fix, I meant "Let's think
>> about *how* to fix that".
>
> OK. How about the following (though it's a rough design)?
>
> (1) If walsender cannot read the WAL file because of ENOENT, it sends a
>     special message indicating that error to walreceiver. This message is
>     shipped on the COPY protocol.
>
> (2-a) If the message arrives, walreceiver exits by using proc_exit().
>
> (3-a) If the startup process detects the exit of walreceiver in
>     WaitNextXLogAvailable(), it switches back to a normal archive recovery
>     mode, closes the currently opened WAL file, resets some variables
>     (readId, readSeg, etc.), and calls FetchRecord() again. Then it tries
>     to restore the WAL file from the archive if the restore_command is
>     supplied, and switches to a streaming recovery mode again when invalid
>     WAL is found.
>
> Or
>
> (2-b) If the message arrives, walreceiver executes restore_command, and
>     then sets receivedUpto to the end location of the restored WAL file.
>     The restored file is expected to be fully filled, because it doesn't
>     exist in the primary's pg_xlog, so that update of receivedUpto is OK.
>
> (3-b) After one WAL file is restored, walreceiver tries to connect to the
>     primary and starts replication again. If the ENOENT error occurs
>     again, we go back to (1).
>
> I like the latter approach since it's simpler. Thoughts?

Hmm. Executing restore_command in the walreceiver process doesn't feel
right somehow. I'm thinking of:

Let's introduce a new boolean variable in shared memory that the
walreceiver can set to tell startup process if it's connected or
streaming, or disconnected. When startup process sees that walreceiver
is connected, it waits for receivedUpto to advance. Otherwise, it polls
the archive using restore_command.
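
Roughly, I'm imagining something like this (just a sketch to illustrate the
idea; these are not the actual names in the patch):

    #include "access/xlogdefs.h"    /* XLogRecPtr */
    #include "storage/spin.h"       /* slock_t, SpinLockAcquire() */

    typedef struct
    {
        slock_t     mutex;          /* protects the fields below */
        bool        isStreaming;    /* walreceiver connected and streaming? */
        XLogRecPtr  receivedUpto;   /* end of WAL received so far */
    } WalRcvSharedState;

    /*
     * Used by the startup process when it needs more WAL to continue replay.
     * "walrcv" is assumed to point at WalRcvSharedState in shared memory.
     */
    static bool
    WalRcvIsStreaming(WalRcvSharedState *walrcv)
    {
        bool    streaming;

        SpinLockAcquire(&walrcv->mutex);
        streaming = walrcv->isStreaming;
        SpinLockRelease(&walrcv->mutex);
        return streaming;
    }

    /* ... then, in the WAL-read loop:
     *
     *    if (WalRcvIsStreaming(walrcv))
     *        wait for walrcv->receivedUpto to advance;
     *    else
     *        run restore_command for the next segment;
     */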

To actually implement that requires some refactoring of the
ReadRecord/FetchRecord logic in xlog.c. However, it always felt a bit
hacky to me anyway, so that's not necessarily a bad thing.

Now, one problem with this is that under the right conditions, walreceiver
might just succeed in reconnecting while the startup process starts to
restore the file from the archive. That's OK: the streamed file will simply
be ignored, and the file restored from the archive uses a temporary filename
that won't clash with the streamed one. But it feels a bit strange to have
the same file copied to the server via both mechanisms.

See the "replication-xlogrefactor" branch in my git repository for a
prototype of that. We could also combine that with your 1st design, and
add the special message to indicate "WAL already deleted", and change
the walreceiver restart logic as you suggested. Some restructuring of
Read/FetchRecord is probably required for that anyway.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming replication, retrying from archive
Date: 2010-01-15 02:38:28
Message-ID: 3f0b79eb1001141838u1518eb0fm46a854c5d81b0051@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jan 15, 2010 at 7:19 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> Let's introduce a new boolean variable in shared memory that the
> walreceiver can set to tell startup process if it's connected or
> streaming, or disconnected. When startup process sees that walreceiver
> is connected, it waits for receivedUpto to advance. Otherwise, it polls
> the archive using restore_command.

Seems OK.

> See the "replication-xlogrefactor" branch in my git repository for a
> prototype of that. We could also combine that with your 1st design, and
> add the special message to indicate "WAL already deleted", and change
> the walreceiver restart logic as you suggested. Some restructuring of
> Read/FetchRecord is probably required for that anyway.

Though I haven't read your branch much yet, there seems to be a corner
case in which a partially-filled WAL file might be restored wrongly, which
would cause a PANIC error. So the primary should tell the standby the last
WAL file that has been completely filled. And when that file has been
restored on the standby, the startup process should stop restoring any more
files and try to wait for streaming again.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming replication, retrying from archive
Date: 2010-01-15 18:11:00
Message-ID: 4B50AFB4.4060902@enterprisedb.com
Lists: pgsql-hackers

Dimitri Fontaine wrote:
> But how we handle failures when transitioning from one state to the
> other should be a lot easier to discuss and decide as soon as we have
> the possible states and the transitions we want to allow and support. I
> think.
>
> My guess is that those states and transitions are in the code, but not
> explicit, so that each time we talk about how to handle the error cases
> we have to be extra verbose and we risk not talking about exactly the
> same thing. Naming the states should make those arrangements easier, I
> should think. Not sure if it would help follow the time constraint now
> though.

I agree, a state machine is a useful way of thinking about this. I
recall that mail of yours from last summer :-).

The states we have at the moment in standby are:

1. Archive recovery. Standby fetches WAL files from archive using
restore_command. When a file is not found in archive, we switch to state 2

2. Streaming replication. Standby connects (and reconnects if the
connection is lost for any reason) to the primary, starts streaming, and
applies WAL as it arrives. We stay in this state until trigger file is
found or server is shut down.

The states with my suggested ReadRecord/FetchRecord refactoring, the
code I have in the replication-xlogrefactor branch in my git repo, are:

1. Initial archive recovery. Standby fetches WAL files from archive
using restore_command. When a file is not found in archive, we start
walreceiver and switch to state 2

2. Retrying to restore from archive. When the connection to primary is
established and replication is started, we switch to state 3

3. Streaming replication. Connection to primary is established, and WAL
is applied as it arrives. When the connection is dropped, we go back to
state 2

Although the state transitions between 2 and 3 are a bit fuzzy in
that version; walreceiver runs concurrently, trying to reconnect, while
startup process retries restoring from archive. Fujii-san's suggestion
to have walreceiver stop while startup process retries restoring from
archive (or have walreceiver run restore_command in approach #2) would
make that clearer.
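
If we want to make those states explicit in the code, per Dimitri's
suggestion, it could be as simple as this (names invented here, not what's
in the branch):

    typedef enum
    {
        STANDBY_INITIAL_ARCHIVE,  /* 1: restoring from archive, no walreceiver */
        STANDBY_RETRY_ARCHIVE,    /* 2: walreceiver started, still restoring   */
        STANDBY_STREAMING         /* 3: connected, applying WAL as it arrives  */
    } StandbyWALSource;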

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming replication, retrying from archive
Date: 2010-01-15 18:56:41
Message-ID: 1263581801.26654.37627.camel@ebony
Lists: pgsql-hackers

On Fri, 2010-01-15 at 20:11 +0200, Heikki Linnakangas wrote:

> The states we have at the moment in standby are:
>
> 1. Archive recovery. Standby fetches WAL files from archive using
> restore_command. When a file is not found in archive, we switch to state 2
>
> 2. Streaming replication. Standby connects (and reconnects if the
> connection is lost for any reason) to the primary, starts streaming, and
> applies WAL as it arrives. We stay in this state until trigger file is
> found or server is shut down.

> The states with my suggested ReadRecord/FetchRecord refactoring, the
> code I have in the replication-xlogrefactor branch in my git repo, are:
>
> 1. Initial archive recovery. Standby fetches WAL files from archive
> using restore_command. When a file is not found in archive, we start
> walreceiver and switch to state 2
>
> 2. Retrying to restore from archive. When the connection to primary is
> established and replication is started, we switch to state 3
>
> 3. Streaming replication. Connection to primary is established, and WAL
> is applied as it arrives. When the connection is dropped, we go back to
> state 2
>
> Although the state transitions between 2 and 3 are a bit fuzzy in
> that version; walreceiver runs concurrently, trying to reconnect, while
> startup process retries restoring from archive. Fujii-san's suggestion
> to have walreceiver stop while startup process retries restoring from
> archive (or have walreceiver run restore_command in approach #2) would
> make that clearer.

The one-way state transitions between 1->2 in both cases seem to make
this a little more complex, rather than simpler.

If the connection did drop then WAL will be in the archive, so the path
for data is archive->primary->standby. There already needs to be a
network path between archive and standby, so why not drop back from
state 3 -> 1 rather than from 3 -> 2? That way we could have just 2
states on each side, rather than 3.

--
Simon Riggs www.2ndQuadrant.com


From: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming replication, retrying from archive
Date: 2010-01-16 10:36:19
Message-ID: m2ljfyqqu4.fsf@hi-media.com
Lists: pgsql-hackers


Thanks for stating it this way, it really helps figuring out what it is
we're talking about!

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> The states with my suggested ReadRecord/FetchRecord refactoring, the
> code I have in the replication-xlogrefactor branch in my git repo,
> are:

They look like you're trying to solve a specific issue that is a
consequence of another one, without fixing the cause. I hope I'm wrong,
once more :)

> 1. Initial archive recovery. Standby fetches WAL files from archive
> using restore_command. When a file is not found in archive, we start
> walreceiver and switch to state 2
>
> 2. Retrying to restore from archive. When the connection to primary is
> established and replication is started, we switch to state 3

When does the master know about this new slave being there? I'd say not
until 3 is ok, and then, the actual details between 1 and 2 look
strange, partly because it's more about processes than states.

I'd propose to have 1 and 2 started in parallel from the beginning, and
as Simon proposes, being able to get back to 1. at any time:

0. start from a base backup, determine the first WAL / LSN we need to
start streaming, call it SR_LSN. That means asking the master its
current xlog location. The LSN we're at now, after replaying the base
backup and maybe the initial recovery from local WAL files, let's
call it BASE_LSN.

1. Get the missing WAL to get from BASE_LSN to SR_LSN from the archive,
with restore_command, apply them as we receive them, and start
2. possibly in parallel

2. Streaming replication: we connect to the primary and walreceiver gets
the WALs from the connection. It either stores them if current
standby's position < SR_LSN or apply them directly if we were already
streaming.

Local storage would be either standby's archiving or a specific
temporary location. I guess it's more or less what you want to do
with retrying from the master's archives, but I'm not sure your line
of thought makes it simpler.

But that's more a process view, not a state view. As 1 and 2 run in
parallel, we're missing some state names. Let's name the states now that
we have the processes.

base: start from a base backup; we don't know how we got it

catch-up: getting the WALs [from archive] to get from base to being able
to apply the streaming

wanna-sync: receiving primary's wal while not being able to replay them

do-sync: applying the wals we got in wanna-sync state

sync: replaying what's being sent as it arrives

So the current problem is what happens when we're not able to start
streaming from the primary, yet or again. And your question is how it
will get simpler with all those details.

What I propose is to always have a walreceiver running and getting WALs
from the master. Depending on the current state, it's applying them (sync)
or keeping them for later (wanna-sync). We need some more code for it to
apply the WALs it's been keeping for later (do-sync); that depends on how
we keep the WALs.

Your problem is getting out of catch-up up to sync, and which process is
doing what in between. I hope my proposal makes that clearer to think
about, and I would go as far as to say that the startup process only cares
about getting the WALs from BASE_LSN to SR_LSN; that's called catch-up.

Having another process to handle wanna-sync is neat, but it can be done
sequentially too.

When you lose the connection, you get out of sync back to another state
depending on the missing WALs; to know which, you need to contact the
primary again.

The master only considers a standby in sync if its walsender process is
up-to-date or lagging by only the last emitted WAL. If it's lagging more,
that means the standby is catching up, or replaying more than the current
WAL, so it is in the wanna-sync or do-sync state. Not in sync.

The details about when a slave is in sync will get more important as
soon as we have synchronous streaming.
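
In code form, just to fix the vocabulary, the states above could be
something like this (illustrative only):

    typedef enum
    {
        STANDBY_BASE,        /* restoring the base backup                       */
        STANDBY_CATCH_UP,    /* replaying archived WAL from BASE_LSN to SR_LSN  */
        STANDBY_WANNA_SYNC,  /* receiving streamed WAL, storing it for later    */
        STANDBY_DO_SYNC,     /* replaying the WAL stored during wanna-sync      */
        STANDBY_SYNC         /* replaying streamed WAL as it arrives            */
    } StandbySyncState;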

Regards,
--
dim


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming replication, retrying from archive
Date: 2010-01-20 19:26:37
Message-ID: 4B5758ED.1060703@enterprisedb.com
Lists: pgsql-hackers

Dimitri Fontaine wrote:
> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>> 1. Initial archive recovery. Standby fetches WAL files from archive
>> using restore_command. When a file is not found in archive, we start
>> walreceiver and switch to state 2
>>
>> 2. Retrying to restore from archive. When the connection to primary is
>> established and replication is started, we switch to state 3
>
> When does the master know about this new slave being there? I'd say not
> until 3 is ok, and then, the actual details between 1 and 2 look
> strange, partly because it's more about processes than states.

Right. The master doesn't need to know about the slave.

> I'd propose to have 1 and 2 started in parallel from the beginning, and
> as Simon proposes, being able to get back to 1. at any time:
>
> 0. start from a base backup, determine the first WAL / LSN we need to
> start streaming, call it SR_LSN. That means asking the master its
> current xlog location.

What if the master can't be contacted?

> The LSN we're at now, after replaying the base
> backup and maybe the initial recovery from local WAL files, let's
> call it BASE_LSN.
>
> 1. Get the missing WAL to get from BASE_LSN to SR_LSN from the archive,
> with restore_command, apply them as we receive them, and start
> 2. possibly in parallel
>
> 2. Streaming replication: we connect to the primary and walreceiver gets
> the WALs from the connection. It either stores them if current
> standby's position < SR_LSN or applies them directly if we were already
> streaming.
>
> Local storage would be either standby's archiving or a specific
> temporary location. I guess it's more or less what you want to do
> with retrying from the master's archives, but I'm not sure your line
> of thought makes it simpler.

Seems complicated...

> <snip>
> The details about when a slave is in sync will get more important as
> soon as we have synchronous streaming.

Yeah, a lot of that logic and states is completely unnecessary until we
have a synchronous mode. Even then, it seems complex.

Here's what I've been hacking:

First of all, walreceiver no longer retries the connection on error, and
postmaster no longer tries to relaunch it if it dies. So when walreceiver
is launched, it tries to connect once and, if successful, streams until an
error occurs or it's killed.

When the startup process needs more WAL to continue replay, the logic is,
in pseudocode:

while (<need more WAL>)
{
    if (<walreceiver is alive>)
    {
        wait for WAL to arrive, or for walreceiver to die
    }
    else
    {
        run restore_command
        if (restore_command succeeded)
            break;
        else
        {
            sleep 5 seconds
            start walreceiver
        }
    }
}

So there's just two states:

1. Recovering from archive
2. Streaming

We start from 1, and switch state at error.

This gives nice behavior from a user point of view. Standby tries to
make progress using either the archive or streaming, whichever becomes
available first.
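
To illustrate, a standby could then run with a recovery.conf along these
lines (just an example; see the patch/docs for the exact parameter names,
which may still change):

    # fetch WAL from the archive when streaming is not available (state 1)
    restore_command = 'cp /mnt/archive/%f "%p"'

    # stream from the primary whenever it is reachable (state 2)
    standby_mode = 'on'
    primary_conninfo = 'host=primary.example.com port=5432 user=replication'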

Attached is a WIP patch implementing that, also available in the
'replication-xlogrefactor' branch in my git repository. It includes the
Read/FetchRecord refactoring I mentioned earlier; that's a pre-requisite
for this.

The code implementing the above retry logic is in XLogReadPage(), in xlog.c.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment: sr-retry-refactor-1.patch (text/x-diff, 47.2 KB)

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming replication, retrying from archive
Date: 2010-01-20 23:32:11
Message-ID: 1264030331.4043.5666.camel@ebony
Lists: pgsql-hackers

On Wed, 2010-01-20 at 21:26 +0200, Heikki Linnakangas wrote:

> So there's just two states:
>
> 1. Recovering from archive
> 2. Streaming
>
> We start from 1, and switch state at error.
>
> This gives nice behavior from a user point of view. Standby tries to
> make progress using either the archive or streaming, whichever becomes
> available first.

Sounds good. Easier to drive if we have two gears.

--
Simon Riggs www.2ndQuadrant.com


From: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming replication, retrying from archive
Date: 2010-01-21 22:09:53
Message-ID: m2zl47jej2.fsf@hi-media.com
Lists: pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> Yeah, a lot of that logic and states is completely unnecessary until we
> have a synchronous mode. Even then, it seems complex.

I hope we'll find something less complex; what I proposed is heavily
inspired by londiste's (Skytools) table addition to a replication set
(parallel COPY), which works fine.

> Here's what I've been hacking:
[...]
> So there's just two states:
>
> 1. Recovering from archive
> 2. Streaming
>
> We start from 1, and switch state at error.

Oh yes, that's even simpler!

> This gives nice behavior from a user point of view. Standby tries to
> make progress using either the archive or streaming, whichever becomes
> available first.

So tools like pitrtools or walmgr.py will certainly continue being
necessary to use in 9.0, right?

--
dim


From: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
To: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming replication, retrying from archive
Date: 2010-01-21 22:15:25
Message-ID: 4B58D1FD.3080304@catalyst.net.nz
Lists: pgsql-hackers

Dimitri Fontaine wrote:
> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>
>> Yeah, a lot of that logic and states is completely unnecessary until we
>> have a synchronous mode. Even then, it seems complex.
>>
>
> I hope we'll find something less complex, what I proposed is heavily
> inspired from londiste (Skytools) table addition to a replication set
> (parallel COPY), which works fine.
>
>
>> Here's what I've been hacking:
>>
> [...]
>
>> So there's just two states:
>>
>> 1. Recovering from archive
>> 2. Streaming
>>
>> We start from 1, and switch state at error.
>>
>
> Oh yes that's even more simple!
>
>
>> This gives nice behavior from a user point of view. Standby tries to
>> make progress using either the archive or streaming, whichever becomes
>> available first.
>>
>
> So tools like pitrtools or walmgr.py will certainly continue being
> necessary to use in 9.0, right?
>
>
Right now Streaming Replication does not do your backup for you, which
some of the tools (e.g. walmgr.py) do... Thinking about walmgr.py for a
moment - it should be pretty easy (or possible anyway) to make it use
streaming replication for server versions >= 9.0.

Cheers

Mark