Quick Links

Re: Fast promotion failure

Lists:	pgsql-hackers

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject:	Fast promotion failure
Date:	2013-05-07 09:57:14
Message-ID:	5188CFFA.3020209@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

While testing the bug from the "Assertion failure at standby promotion",
I bumped into a different bug in fast promotion. When the first
checkpoint after fast promotion is performed, there is no guarantee that
the checkpointer process is running with the correct, new,
ThisTimeLineID. In CreateCheckPoint(), we have this:

> /*
> * An end-of-recovery checkpoint is created before anyone is allowed to
> * write WAL. To allow us to write the checkpoint record, temporarily
> * enable XLogInsertAllowed. (This also ensures ThisTimeLineID is
> * initialized, which we need here and in AdvanceXLInsertBuffer.)
> */
> if (flags & CHECKPOINT_END_OF_RECOVERY)
> LocalSetXLogInsertAllowed();

That ensures that ThisTimeLineID is updated when performing an
end-of-recovery checkpoint, but it doesn't get executed with fast
promotion. The consequence is that the checkpoint is created with the
old timeline, and subsequent recovery from it will fail.

I ran into this with the attached script. It sets up a master (M), a
standby (B), and a cascading standby (C). I'm not sure why, but when I
tried to simplify the script by removing the cascading standby, it
started to work. The bug occurs in standby B, so I'm not sure why the
presence of the cascading standby makes any difference. Maybe it just
affects the timing.

- Heikki

Attachment	Content-Type	Size
fast-promotion-bug.sh	application/x-shellscript	1.8 KB

From:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Fast promotion failure
Date:	2013-05-09 00:59:11
Message-ID:	CAHGQGwEhs36HoVthw2ts7bDTVSO-ZjhSFBm9R2_DcAjH3wAMtw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, May 7, 2013 at 6:57 PM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> While testing the bug from the "Assertion failure at standby promotion", I
> bumped into a different bug in fast promotion. When the first checkpoint
> after fast promotion is performed, there is no guarantee that the
> checkpointer process is running with the correct, new, ThisTimeLineID. In
> CreateCheckPoint(), we have this:
>
>> /*
>> * An end-of-recovery checkpoint is created before anyone is
>> allowed to
>> * write WAL. To allow us to write the checkpoint record,
>> temporarily
>> * enable XLogInsertAllowed. (This also ensures ThisTimeLineID is
>> * initialized, which we need here and in AdvanceXLInsertBuffer.)
>> */
>> if (flags & CHECKPOINT_END_OF_RECOVERY)
>> LocalSetXLogInsertAllowed();
>
>
> That ensures that ThisTimeLineID is updated when performing an
> end-of-recovery checkpoint, but it doesn't get executed with fast promotion.
> The consequence is that the checkpoint is created with the old timeline, and
> subsequent recovery from it will fail.
>
> I ran into this with the attached script. It sets up a master (M), a standby
> (B), and a cascading standby (C). I'm not sure why, but when I tried to
> simplify the script by removing the cascading standby, it started to work.
> The bug occurs in standby B, so I'm not sure why the presence of the
> cascading standby makes any difference. Maybe it just affects the timing.

Can this really happen? ISTM that the checkpointer should detect that
the recovery mode ends and call RecoveryInProgress()->InitXLOGAccess()
before calling CreateCheckPoint().

Regards,

--
Fujii Masao

From:	Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
To:	"'Fujii Masao'" <masao(dot)fujii(at)gmail(dot)com>, "'Heikki Linnakangas'" <hlinnakangas(at)vmware(dot)com>
Cc:	"'PostgreSQL-development'" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Fast promotion failure
Date:	2013-05-09 05:40:23
Message-ID:	005501ce4c77$b45f0b70$1d1d2250$@kapila@huawei.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thursday, May 09, 2013 6:29 AM Fujii Masao wrote:
> On Tue, May 7, 2013 at 6:57 PM, Heikki Linnakangas
> <hlinnakangas(at)vmware(dot)com> wrote:
> > While testing the bug from the "Assertion failure at standby
> promotion", I
> > bumped into a different bug in fast promotion. When the first
> checkpoint
> > after fast promotion is performed, there is no guarantee that the
> > checkpointer process is running with the correct, new,
> ThisTimeLineID. In
> > CreateCheckPoint(), we have this:
> >
> >> /*
> >> * An end-of-recovery checkpoint is created before anyone is
> >> allowed to
> >> * write WAL. To allow us to write the checkpoint record,
> >> temporarily
> >> * enable XLogInsertAllowed. (This also ensures
> ThisTimeLineID is
> >> * initialized, which we need here and in
> AdvanceXLInsertBuffer.)
> >> */
> >> if (flags & CHECKPOINT_END_OF_RECOVERY)
> >> LocalSetXLogInsertAllowed();
> >
> >
> > That ensures that ThisTimeLineID is updated when performing an
> > end-of-recovery checkpoint, but it doesn't get executed with fast
> promotion.
> > The consequence is that the checkpoint is created with the old
> timeline, and
> > subsequent recovery from it will fail.
> >
> > I ran into this with the attached script. It sets up a master (M), a
> standby
> > (B), and a cascading standby (C). I'm not sure why, but when I tried
> to
> > simplify the script by removing the cascading standby, it started to
> work.
> > The bug occurs in standby B, so I'm not sure why the presence of the
> > cascading standby makes any difference. Maybe it just affects the
> timing.
>
> Can this really happen? ISTM that the checkpointer should detect that
> the recovery mode ends and call RecoveryInProgress()->InitXLOGAccess()
> before calling CreateCheckPoint().

Without fast-promotion, it will request/perform End of Recovery checkpoint
while still in recovery (before setting xlogctl->SharedRecoveryInProgress),
So I think before any new operation can start, it can make sure that
Checkpoint with new timeline is performed.

However with fast promotion, the request for checkpoint is done after
recovery; so some operations can happen before checkpoint with new timeline.
I think it can so happen that last checkpoint is with old timeline and there
are operations with new timeline which might have caused the problem Heikki
has seen.

With Regards,
Amit Kapila.

From:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To:	amit(dot)kapila(at)huawei(dot)com
Cc:	masao(dot)fujii(at)gmail(dot)com, hlinnakangas(at)vmware(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Fast promotion failure
Date:	2013-05-09 08:06:26
Message-ID:	20130509.170626.263454661.horiguchi.kyotaro@lab.ntt.co.jp
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hello,

> I think it can so happen that last checkpoint is with old timeline and there
> are operations with new timeline which might have caused the problem Heikki
> has seen.

I suppose to have seen that.

After adding an SQL command to modify the DB on standby-B after
passive(propagated?) promotion of standby-C,

< pg_ctl -D data-standbyB promote
---
>
> pg_ctl -D data-standbyB -w promote
> sleep 20
> psql -p 5433 postgres -c 'create table t (a int); checkpoint;'

I saw following complaint from standby-C.

> B 2013-05-09 17:00:06.019 JST 30268 LOG: checkpoint starting: immediate force wait
> C 2013-05-09 17:00:06.211 JST 30277 LOG: invalid magic number 0000 in log segment 000000010000000000000005, offset 4169728
> C 2013-05-09 17:00:06.211 JST 30295 FATAL: terminating walreceiver process due to administrator command
> B 2013-05-09 17:00:06.219 JST 30268 LOG: checkpoint complete: wrote 18 buffers (0.1%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.000 s, sync=0.157 s, total=0.199 s; sync files=13, longest=0.041 s, average=0.012 s
> CHECKPOINT
> C 2013-05-09 16:41:33.974 JST 29624 LOG: record with zero length at 0/53F8B90
> C 2013-05-09 16:41:33.974 JST 29624 LOG: record with zero length at 0/53F8B90
> C 2013-05-09 16:41:38.980 JST 29624 LOG: record with zero length at 0/53F8B90
.. repeats forever ..

This seems to be caused from a kind of timeline inconsistency at
first glance.. Explicit checkpoint before the modification on
Sby-B does not help.

> psql -p 5433 postgres -c 'checkpoint;create table t (a int); checkpoint;'

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

From:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To:	amit(dot)kapila(at)huawei(dot)com
Cc:	masao(dot)fujii(at)gmail(dot)com, hlinnakangas(at)vmware(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Fast promotion failure
Date:	2013-05-09 08:44:08
Message-ID:	20130509.174408.198415395.horiguchi.kyotaro@lab.ntt.co.jp
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

With printing some additinal logs, the situation should be more
clear..

It seems that Sby-B failes to promote to TLI= 2; nevertheless the
history file for TLI = 2 is somehow sent to sby-C. So sby-B
remains on TLI=1 but sby-C solely switches onto TLI=2.

# Come to think of this, I suspect that the additional logs is not so useful :(

> B 2013-05-09 17:29:53.380 JST 32258 ERROR: server switched off timeline 1 at 0/53F8B60, but walsender already streamed up to 0/53FA000
> C 2013-05-09 17:29:53.380 JST 32257 FATAL: could not receive data from WAL stream: ERROR: server switched off timeline 1 at 0/53F8B60, but walsender already streamed up to 0/53FA000
>
> B 2013-05-09 17:29:53.380 JST 32244 LOG: database system is ready to accept connections
..
> C 2013-05-09 17:30:08.395 JST 32256 LOG: Reading page on Timeline ID = 1
> C 2013-05-09 17:30:08.398 JST 32274 LOG: fetching timeline history file for timeline 2 from primary server
> C 2013-05-09 17:30:08.448 JST 32274 LOG: started streaming WAL from primary at 0/5000000 on timeline 1
> C 2013-05-09 17:30:08.452 JST 32274 LOG: replication terminated by primary server
> C 2013-05-09 17:30:08.452 JST 32274 DETAIL: End of WAL reached on timeline 1 at 0/53F8B60
> C 2013-05-09 17:30:08.452 JST 32256 LOG: new target timeline is 2
> C 2013-05-09 17:30:08.452 JST 32256 LOG: Reading page on Timeline ID = 1
> C 2013-05-09 17:30:08.452 JST 32256 LOG: Reading page on Timeline ID = 1
> C 2013-05-09 17:30:08.453 JST 32274 LOG: restarted WAL streaming at 0/5000000 on timeline 2
> B 2013-05-09 17:30:10.913 JST 32248 LOG: This checkpoint record is on TimelineID = 1, loc is about 0/53F8C30
> B 2013-05-09 17:30:10.953 JST 32248 LOG: checkpoint complete: wrote 637 buffers (3.9%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=13.502 s, sync=0.105 s, total=13.733 s; sync files=2, longest=0.089 s, average=0.052 s
> B 2013-05-09 17:30:10.953 JST 32248 LOG: checkpoint starting: immediate force wait
> B 2013-05-09 17:30:10.963 JST 32248 LOG: This checkpoint record is on TimelineID = 1, loc is about 0/53F8CD0
> B 2013-05-09 17:30:11.003 JST 32248 LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.000 s, sync=0.000 s, total=0.049 s; sync files=0, longest=0.000 s, average=0.000 s
> B 2013-05-09 17:30:11.096 JST 32248 LOG: checkpoint starting: immediate force wait
> B 2013-05-09 17:30:11.909 JST 32248 LOG: This checkpoint record is on TimelineID = 1, loc is about 0/540BEF8
> C 2013-05-09 17:30:11.929 JST 32256 LOG: invalid magic number 0000 in log segment 000000010000000000000005, offset 4169728
> C 2013-05-09 17:30:11.929 JST 32274 FATAL: terminating walreceiver process due to administrator command
> B 2013-05-09 17:30:11.951 JST 32248 LOG: checkpoint complete: wrote 18 buffers (0.1%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.017 s, sync=0.785 s, total=0.855 s; sync files=13, longest=0.235 s, average=0.060 s
> CHECKPOINT
> C 2013-05-09 17:30:13.931 JST 32256 LOG: Reading page on Timeline ID = 2
> C 2013-05-09 17:30:13.931 JST 32256 LOG: record with zero length at 0/53F8B90
> C 2013-05-09 17:30:13.931 JST 32256 LOG: Reading page on Timeline ID = 2
> C 2013-05-09 17:30:13.931 JST 32256 LOG: record with zero length at 0/53F8B90
> C 2013-05-09 17:30:18.936 JST 32256 LOG: Reading page on Timeline ID = 2
> C 2013-05-09 17:30:18.936 JST 32256 LOG: Reading page on Timeline ID = 2
> C 2013-05-09 17:30:18.936 JST 32256 LOG: record with zero length at 0/53F8B90

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

From:	Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
To:	"'Kyotaro HORIGUCHI'" <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc:	<masao(dot)fujii(at)gmail(dot)com>, <hlinnakangas(at)vmware(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Fast promotion failure
Date:	2013-05-09 13:08:36
Message-ID:	001001ce4cb6$51ca14c0$f55e3e40$@kapila@huawei.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thursday, May 09, 2013 2:14 PM Kyotaro HORIGUCHI wrote:
> With printing some additinal logs, the situation should be more
> clear..
>
> It seems that Sby-B failes to promote to TLI= 2; nevertheless the
> history file for TLI = 2 is somehow sent to sby-C. So sby-B
> remains on TLI=1 but sby-C solely switches onto TLI=2.
>
> # Come to think of this, I suspect that the additional logs is not so
> useful :(

It seems to me, it is the same problem as discussed and fixed in below
thread.
http://www.postgresql.org/message-id/51894942.4080500@vmware.com

Could you try with fixes given by heikki.

With Regards,
Amit Kapila.

From:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To:	amit(dot)kapila(at)huawei(dot)com
Cc:	masao(dot)fujii(at)gmail(dot)com, hlinnakangas(at)vmware(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Fast promotion failure
Date:	2013-05-10 08:36:55
Message-ID:	20130510.173655.159816643.horiguchi.kyotaro@lab.ntt.co.jp
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Thank you for noticing me of that.

> It seems to me, it is the same problem as discussed and fixed in below
> thread.
> http://www.postgresql.org/message-id/51894942.4080500@vmware.com
>
> Could you try with fixes given by heikki.

The first one settles the timeline transition problem for the
present. Besides, I have no longer found any recovery failure
except unrepeatable one shown below.

> C 2013-05-10 15:30:29.589 JST 9242 LOG: restarted WAL streaming at 0/5000000 on timeline 2
> B 2013-05-10 15:30:33.789 JST 9233 LOG: restartpoint complete: (snip.)
> B 2013-05-10 15:30:33.789 JST 9233 LOG: recovery restart point at 0/3000028
> B 2013-05-10 15:30:33.789 JST 9233 LOG: checkpoint starting:
> C 2013-05-10 15:32:32.170 JST 9242 FATAL: could not receive data from WAL stream:

I'm get confused, the patch seems to me ensureing the "first
checkpoint after fast promotion is performed" to use the
"correct, new, ThisTimeLineID".

At Thu, 9 May 2013 11:10:23 +0530, Amit Kapila <amit(dot)kapila(at)huawei(dot)com> wrote in <005501ce4c77$b45f0b70$1d1d2250$(at)kapila@huawei.com>
> Without fast-promotion, it will request/perform End of Recovery checkpoint
> while still in recovery (before setting xlogctl->SharedRecoveryInProgress),
> So I think before any new operation can start, it can make sure that
> Checkpoint with new timeline is performed.
>
> However with fast promotion, the request for checkpoint is done after
> recovery; so some operations can happen before checkpoint with new timeline.
> I think it can so happen that last checkpoint is with old timeline and there
> are operations with new timeline which might have caused the problem Heikki
> has seen.

I don't see any problem (in my view :) that something writes WAL
runs before the first checkpoint starts - as described in the
comment just above the checkpoint request..

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From:	Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
To:	"'Kyotaro HORIGUCHI'" <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc:	<masao(dot)fujii(at)gmail(dot)com>, <hlinnakangas(at)vmware(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Fast promotion failure
Date:	2013-05-10 11:00:40
Message-ID:	006e01ce4d6d$9c8503a0$d58f0ae0$@kapila@huawei.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Friday, May 10, 2013 2:07 PM Kyotaro HORIGUCHI wrote:
> Thank you for noticing me of that.
>
> > It seems to me, it is the same problem as discussed and fixed in
> below
> > thread.
> > http://www.postgresql.org/message-id/51894942.4080500@vmware.com
> >
> > Could you try with fixes given by heikki.
>
> The first one settles the timeline transition problem for the
> present. Besides, I have no longer found any recovery failure
> except unrepeatable one shown below.
>
> > C 2013-05-10 15:30:29.589 JST 9242 LOG: restarted WAL streaming at
> 0/5000000 on timeline 2
> > B 2013-05-10 15:30:33.789 JST 9233 LOG: restartpoint complete:
> (snip.)
> > B 2013-05-10 15:30:33.789 JST 9233 LOG: recovery restart point at
> 0/3000028
> > B 2013-05-10 15:30:33.789 JST 9233 LOG: checkpoint starting:
> > C 2013-05-10 15:32:32.170 JST 9242 FATAL: could not receive data
> from WAL stream:

Is there any chance, that there is any network glitch caused this one time
error.

> I'm get confused, the patch seems to me ensureing the "first
> checkpoint after fast promotion is performed" to use the
> "correct, new, ThisTimeLineID".

What is your confusion?

With Regards,
Amit Kapila.

--
Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To:	amit(dot)kapila(at)huawei(dot)com
Cc:	masao(dot)fujii(at)gmail(dot)com, hlinnakangas(at)vmware(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Fast promotion failure
Date:	2013-05-13 00:23:52
Message-ID:	20130513.092352.30755878.horiguchi.kyotaro@lab.ntt.co.jp
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

2013/05/10 20:01 "Amit Kapila" <amit(dot)kapila(at)huawei(dot)com>:
> > > C 2013-05-10 15:32:32.170 JST 9242 FATAL: could not receive data
> > from WAL stream:
>
> Is there any chance, that there is any network glitch caused this one time
> error.

Unix domam sockets are hardly likely to have such troubles. This
test ran within single host.

> > I'm get confused, the patch seems to me ensureing the "first
> > checkpoint after fast promotion is performed" to use the
> > "correct, new, ThisTimeLineID".
>
> What is your confusion?

Heikki said in the fist message in this thread that he suspected
the cause of the failure he had seen to be wrong TLI on whitch
checkpointer runs. Nevertheless, the patch you suggested for me
looks fixing it. Moreover (one of?) the failure from the same
cause looks fixed with the patch.

Is the point of this discussion that the patch may leave out some
glich about timing of timeline-related changing and Heikki saw an
egress of that?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

From:	Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
To:	"'Kyotaro HORIGUCHI'" <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc:	<masao(dot)fujii(at)gmail(dot)com>, <hlinnakangas(at)vmware(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Fast promotion failure
Date:	2013-05-13 03:07:27
Message-ID:	006801ce4f87$005569e0$01003da0$@kapila@huawei.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Monday, May 13, 2013 5:54 AM Kyotaro HORIGUCHI wrote:
> 2013/05/10 20:01 "Amit Kapila" <amit(dot)kapila(at)huawei(dot)com>:
> > > > C 2013-05-10 15:32:32.170 JST 9242 FATAL: could not receive data
> > > from WAL stream:
> >
> > Is there any chance, that there is any network glitch caused this one
> time
> > error.
>
> Unix domam sockets are hardly likely to have such troubles. This
> test ran within single host.
>
> > > I'm get confused, the patch seems to me ensureing the "first
> > > checkpoint after fast promotion is performed" to use the
> > > "correct, new, ThisTimeLineID".
> >
> > What is your confusion?
>
> Heikki said in the fist message in this thread that he suspected
> the cause of the failure he had seen to be wrong TLI on whitch
> checkpointer runs. Nevertheless, the patch you suggested for me
> looks fixing it. Moreover (one of?) the failure from the same
> cause looks fixed with the patch.

There were 2 problems:
1. There was some issue in walsender logic due to which after promotion in
some cases it hits assertion or error
2. During fast promotion, checkpoint gets created with wrong TLI

He has provided 2 different patches
fix-standby-promotion-assert-fail-2.patch and
fast-promotion-quick-fix.patch.
Among 2, he has already committed fix-standby-promotion-assert-fail-2.patch
(http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=2ffa66f49
75c99e52984f7ee81b47d137b5b4751)

> Is the point of this discussion that the patch may leave out some
> glich about timing of timeline-related changing and Heikki saw an
> egress of that?

AFAIU, the committed patch has some gap in overall scenario which is the
fast promotion issue.

With Regards,
Amit Kapila.

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	'Kyotaro HORIGUCHI' <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, masao(dot)fujii(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Fast promotion failure
Date:	2013-05-13 07:43:20
Message-ID:	51909998.3010202@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 13.05.2013 06:07, Amit Kapila wrote:
> On Monday, May 13, 2013 5:54 AM Kyotaro HORIGUCHI wrote:
>> Heikki said in the fist message in this thread that he suspected
>> the cause of the failure he had seen to be wrong TLI on whitch
>> checkpointer runs. Nevertheless, the patch you suggested for me
>> looks fixing it. Moreover (one of?) the failure from the same
>> cause looks fixed with the patch.
>
> There were 2 problems:
> 1. There was some issue in walsender logic due to which after promotion in
> some cases it hits assertion or error
> 2. During fast promotion, checkpoint gets created with wrong TLI
>
> He has provided 2 different patches
> fix-standby-promotion-assert-fail-2.patch and
> fast-promotion-quick-fix.patch.
> Among 2, he has already committed fix-standby-promotion-assert-fail-2.patch
> (http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=2ffa66f49
> 75c99e52984f7ee81b47d137b5b4751)

That's correct.

>> Is the point of this discussion that the patch may leave out some
>> glich about timing of timeline-related changing and Heikki saw an
>> egress of that?
>
> AFAIU, the committed patch has some gap in overall scenario which is the
> fast promotion issue.

Right, the fast promotion issue is still there.

Just to get us all on the same page again: Does anyone see a problem
with a fresh git checkout, with the fast-promotion-quick-fix.patch
applied?
(http://www.postgresql.org/message-id/51894942.4080500@vmware.com). If
you do, please speak up. As far as I know, the already-committed patch,
together with fast-promotion-quick-fix.patch, should fix all known
issues (*).

I haven't committed a fix for the issue I reported in this thread,
because I'm not 100% on what the right fix for it would be.
fast-promotion-quick-fix.patch seems to do the trick, but at least the
comments need to be updated, and I'm not sure if there some related
corner cases that it doesn't handle. Simon?

(*) Well, almost. This one is still pending:
http://www.postgresql.org/message-id/CAB7nPqRhuCuuD012GCB_tAAFrixx2WioN_zfXQcvLuRab8DN2g@mail.gmail.com

- Heikki

From:	Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
To:	"'Heikki Linnakangas'" <hlinnakangas(at)vmware(dot)com>, "'Simon Riggs'" <simon(at)2ndquadrant(dot)com>
Cc:	"'Kyotaro HORIGUCHI'" <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, <masao(dot)fujii(at)gmail(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Fast promotion failure
Date:	2013-05-13 09:41:34
Message-ID:	007201ce4fbe$0fa8bbe0$2efa33a0$@kapila@huawei.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Monday, May 13, 2013 1:13 PM Heikki Linnakangas wrote:
> On 13.05.2013 06:07, Amit Kapila wrote:
> > On Monday, May 13, 2013 5:54 AM Kyotaro HORIGUCHI wrote:
> >> Heikki said in the fist message in this thread that he suspected
> >> the cause of the failure he had seen to be wrong TLI on whitch
> >> checkpointer runs. Nevertheless, the patch you suggested for me
> >> looks fixing it. Moreover (one of?) the failure from the same
> >> cause looks fixed with the patch.
> >
> > There were 2 problems:
> > 1. There was some issue in walsender logic due to which after
> promotion in
> > some cases it hits assertion or error
> > 2. During fast promotion, checkpoint gets created with wrong TLI
> >
> > He has provided 2 different patches
> > fix-standby-promotion-assert-fail-2.patch and
> > fast-promotion-quick-fix.patch.
> > Among 2, he has already committed fix-standby-promotion-assert-fail-
> 2.patch
> >
> (http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=2ffa
> 66f49
> > 75c99e52984f7ee81b47d137b5b4751)
>
> That's correct.
>
> >> Is the point of this discussion that the patch may leave out some
> >> glich about timing of timeline-related changing and Heikki saw an
> >> egress of that?
> >
> > AFAIU, the committed patch has some gap in overall scenario which is
> the
> > fast promotion issue.
>
> Right, the fast promotion issue is still there.
>
> Just to get us all on the same page again: Does anyone see a problem
> with a fresh git checkout, with the fast-promotion-quick-fix.patch
> applied?
> (http://www.postgresql.org/message-id/51894942.4080500@vmware.com). If
> you do, please speak up. As far as I know, the already-committed patch,
> together with fast-promotion-quick-fix.patch, should fix all known
> issues (*).
>
> I haven't committed a fix for the issue I reported in this thread,
> because I'm not 100% on what the right fix for it would be.
> fast-promotion-quick-fix.patch seems to do the trick, but at least the
> comments need to be updated, and I'm not sure if there some related
> corner cases that it doesn't handle. Simon?

The patch provided will un-necessarily call InitXLOGAccess() 2 times for End
of recovery checkpoint, it doesn't matter w.r.t performance but actually the
purpose will
be almost same for calling LocalSetXLogInsertAllowed() and InitXLOGAccess(),
or am I missing something.

One more thing, I think after fast promotion, either it should set timeline
or give error in CreateCheckPoint() function before it reaches the check
mentioned by you in your initial mail.
if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
elog(ERROR, "can't create a checkpoint during recovery");
Shouldn't it set timeline in above check (RecoveryInProgress()) or when
RecoveryInProgress() is called before CreateCheckPoint()?

With Regards,
Amit Kapila.

From:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To:	amit(dot)kapila(at)huawei(dot)com
Cc:	hlinnakangas(at)vmware(dot)com, simon(at)2ndquadrant(dot)com, masao(dot)fujii(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Fast promotion failure
Date:	2013-05-16 06:02:42
Message-ID:	20130516.150242.153333292.horiguchi.kyotaro@lab.ntt.co.jp
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hello,

> > >> Is the point of this discussion that the patch may leave out some
> > >> glich about timing of timeline-related changing and Heikki saw an
> > >> egress of that?
> > >
> > > AFAIU, the committed patch has some gap in overall scenario which is
> > the
> > > fast promotion issue.
> >
> > Right, the fast promotion issue is still there.
> >
> > Just to get us all on the same page again: Does anyone see a problem
> > with a fresh git checkout, with the fast-promotion-quick-fix.patch
> > applied?
> > (http://www.postgresql.org/message-id/51894942.4080500@vmware.com). If
> > you do, please speak up. As far as I know, the already-committed patch,
> > together with fast-promotion-quick-fix.patch, should fix all known
> > issues (*).

Shared XLogCtl->ThisTimeLineID is written and read without
fencing by spinlock unlike some other XLogCtl members. Can this
break coherency of its memory between different processors? It
is quite reasonable that I cannot find the trouble if it is the
cause. I didn't see the issue even without
fast-promotion-quick-fix.patch.

> The patch provided will un-necessarily call InitXLOGAccess() 2 times for End
> of recovery checkpoint, it doesn't matter w.r.t performance but actually the
> purpose will
> be almost same for calling LocalSetXLogInsertAllowed() and InitXLOGAccess(),
> or am I missing something.
>
> One more thing, I think after fast promotion, either it should set timeline
> or give error in CreateCheckPoint() function before it reaches the check
> mentioned by you in your initial mail.
> if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
> elog(ERROR, "can't create a checkpoint during recovery");
> Shouldn't it set timeline in above check (RecoveryInProgress()) or when
> RecoveryInProgress() is called before CreateCheckPoint()?

Thinking of checkpointer, it does RecoveryInProgress() far
earlier to there, in waiting loop in CheckPointerMain where to
decide which to do between checkpoint and restartpoint. So
InitXLogAccess() has been already done when checkpoint is choosed
there for the first time. And before that, ThisTimeLineID in the
startup process gets incremented and is copied onto
XLogCtl->ThisTimeLineID before xlogctl->SharedRecoveryInProgress
becomes false in StartupXLog(). I don't think it is possible for
checkpointer to run on older timeline id on codition that all
processes looks exactly the same memory image.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc:	Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Fast promotion failure
Date:	2013-05-16 07:13:00
Message-ID:	CA+U5nM+U4i0d2gjYRw6OBCU_75W4TNRvpK_oU4xOjSMqmdzrEA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 16 May 2013 07:02, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:

>> > > fast promotion issue.

Excuse me for not joining the thread earlier. I'm not available today,
but will join in later in my evening.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
To:	"'Kyotaro HORIGUCHI'" <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc:	<hlinnakangas(at)vmware(dot)com>, <simon(at)2ndquadrant(dot)com>, <masao(dot)fujii(at)gmail(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Fast promotion failure
Date:	2013-05-16 09:16:47
Message-ID:	008301ce5216$17fda4e0$47f8eea0$@kapila@huawei.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thursday, May 16, 2013 11:33 AM Kyotaro HORIGUCHI wrote:
> Hello,
>
> > > >> Is the point of this discussion that the patch may leave out
> some
> > > >> glich about timing of timeline-related changing and Heikki saw
> an
> > > >> egress of that?
> > > >
> > > > AFAIU, the committed patch has some gap in overall scenario which
> is
> > > the
> > > > fast promotion issue.
> > >
> > > Right, the fast promotion issue is still there.
> > >
> > > Just to get us all on the same page again: Does anyone see a
> problem
> > > with a fresh git checkout, with the fast-promotion-quick-fix.patch
> > > applied?
> > > (http://www.postgresql.org/message-id/51894942.4080500@vmware.com).
> If
> > > you do, please speak up. As far as I know, the already-committed
> patch,
> > > together with fast-promotion-quick-fix.patch, should fix all known
> > > issues (*).
>
> Shared XLogCtl->ThisTimeLineID is written and read without
> fencing by spinlock unlike some other XLogCtl members. Can this
> break coherency of its memory between different processors? It
> is quite reasonable that I cannot find the trouble if it is the
> cause. I didn't see the issue even without
> fast-promotion-quick-fix.patch.
>
> > The patch provided will un-necessarily call InitXLOGAccess() 2 times
> for End
> > of recovery checkpoint, it doesn't matter w.r.t performance but
> actually the
> > purpose will
> > be almost same for calling LocalSetXLogInsertAllowed() and
> InitXLOGAccess(),
> > or am I missing something.
> >
> > One more thing, I think after fast promotion, either it should set
> timeline
> > or give error in CreateCheckPoint() function before it reaches the
> check
> > mentioned by you in your initial mail.
> > if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) ==
> 0)
> > elog(ERROR, "can't create a checkpoint during
> recovery");
> > Shouldn't it set timeline in above check (RecoveryInProgress()) or
> when
> > RecoveryInProgress() is called before CreateCheckPoint()?
>
> Thinking of checkpointer, it does RecoveryInProgress() far
> earlier to there, in waiting loop in CheckPointerMain where to
> decide which to do between checkpoint and restartpoint. So
> InitXLogAccess() has been already done when checkpoint is choosed
> there for the first time. And before that, ThisTimeLineID in the
> startup process gets incremented and is copied onto
> XLogCtl->ThisTimeLineID before xlogctl->SharedRecoveryInProgress
> becomes false in StartupXLog(). I don't think it is possible for
> checkpointer to run on older timeline id on codition that all
> processes looks exactly the same memory image.

I also think the same, but now one difference with fast promotion is the
request for checkpoint is done after recovery; so some operations can happen
before checkpoint with new timeline.

With Regards,
Amit Kapila.

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Fast promotion failure
Date:	2013-05-19 14:25:08
Message-ID:	CA+U5nMKp9KvhKTCEgQK2MMRU8do9HJBj-jhorfNpsAepApzQEQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 7 May 2013 10:57, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> wrote:
> While testing the bug from the "Assertion failure at standby promotion", I
> bumped into a different bug in fast promotion. When the first checkpoint
> after fast promotion is performed, there is no guarantee that the
> checkpointer process is running with the correct, new, ThisTimeLineID. In
> CreateCheckPoint(), we have this:
>
>> /*
>> * An end-of-recovery checkpoint is created before anyone is
>> allowed to
>> * write WAL. To allow us to write the checkpoint record,
>> temporarily
>> * enable XLogInsertAllowed. (This also ensures ThisTimeLineID is
>> * initialized, which we need here and in AdvanceXLInsertBuffer.)
>> */
>> if (flags & CHECKPOINT_END_OF_RECOVERY)
>> LocalSetXLogInsertAllowed();
>
>
> That ensures that ThisTimeLineID is updated when performing an
> end-of-recovery checkpoint, but it doesn't get executed with fast promotion.
> The consequence is that the checkpoint is created with the old timeline, and
> subsequent recovery from it will fail.

LocalSetXLogInsertAllowed() is called by CreateEndOfRecoveryRecord(),
but in fast promotion this is called by Startup process, not
Checkpointer process. So there is no longer an explicit call to
InitXLOGAccess() in the checkpointer process via a checkpoint with
flag CHECKPOINT_END_OF_RECOVERY.

However, there is a call to RecoveryInProgress() at the top of the
main loop of the checkpointer, which does explicitly state that it
"initializes TimeLineID if it's not set yet." The checkpointer makes
the decision about whether to run a restartpoint or a checkpoint
directly from the answer to that, modified only in the case of an
CHECKPOINT_END_OF_RECOVERY.

So the appropiate call is made and I don't agree with the statement
that this "doesn't get executed with fast promotion", or the title of
the thread.

So while I believe that the checkpointer might have an incorrect TLI
and that you've seen a bug, what isn't clear is that the checkpointer
is the only process that would see an incorrect TLI, or why such
processes see an incorrect TLI. It seems equally likely at this point
that the TLI may be set incorrectly somehow and that is why it is
being read incorrectly.

I see that the comment in InitXLOGAccess() is incorrect
"ThisTimeLineID doesn't change so we need no lock to copy it", which
seems worth correcting since there's no need to save time in a once
only function.

Continuing to investigate further.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Fast promotion failure
Date:	2013-05-20 17:47:06
Message-ID:	519A619A.7070609@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 19.05.2013 17:25, Simon Riggs wrote:
> However, there is a call to RecoveryInProgress() at the top of the
> main loop of the checkpointer, which does explicitly state that it
> "initializes TimeLineID if it's not set yet." The checkpointer makes
> the decision about whether to run a restartpoint or a checkpoint
> directly from the answer to that, modified only in the case of an
> CHECKPOINT_END_OF_RECOVERY.
>
> So the appropiate call is made and I don't agree with the statement
> that this "doesn't get executed with fast promotion", or the title of
> the thread.

Hmm, I see.

> So while I believe that the checkpointer might have an incorrect TLI
> and that you've seen a bug, what isn't clear is that the checkpointer
> is the only process that would see an incorrect TLI, or why such
> processes see an incorrect TLI. It seems equally likely at this point
> that the TLI may be set incorrectly somehow and that is why it is
> being read incorrectly.

Yeah. I spent some more time debugging this, and found the culprit. In
CreateRestartPoint, we do this:

> /*
> * Update ThisTimeLineID to the timeline we're currently replaying,
> * so that we install any recycled segments on that timeline.
> *
> * There is no guarantee that the WAL segments will be useful on the
> * current timeline; if recovery proceeds to a new timeline right
> * after this, the pre-allocated WAL segments on this timeline will
> * not be used, and will go wasted until recycled on the next
> * restartpoint. We'll live with that.
> */
> (void) GetXLogReplayRecPtr(&ThisTimeLineID);

Here's what happens:

0. System is in recovery
1. checkpointer process starts creating a restartpoint.
2. Startup process ends recovery, creates end-of-recovery record, sets
the shared ThisTimeLineID value to 2, and exits.
3. checkpointer process calls RecoveryInProgess() while performing the
restartpoint (there are RecoveryInProgress() calls in many places, e.g
in XLogNeedsFlush()). It sets LocalRecoveryInProgress = true, and
initializes ThisTimeLineID to 2.
4. At the end of restartpoint, the checkpointer process does the above
GetXLogReplayRecPtr(&ThisTimeLineID). That overwrites ThisTimeLineID
back to 1.
5. Checkpointer calls RecoveryInProgress, which returns true
immediately, as LocalRecoveryInProgress is already set.
5. Checkpointer performs the first checkpoint, with ThisTimeLineID set
incorrectly to 1.

Not sure what the best fix would be. Perhaps change the code in
CreateRestartPoint() to do something like this instead:

GetXLogReplayRecPtr(&replayTLI);
if (RecoveryInProgress())
ThisTimeLineID = replayTLI;

> I see that the comment in InitXLOGAccess() is incorrect
> "ThisTimeLineID doesn't change so we need no lock to copy it", which
> seems worth correcting since there's no need to save time in a once
> only function.

Hmm, the code seems right to me, XLogCtl->ThisTimeLineID indeed doesn't
change, once it's set. And InitXLOGAccess() isn't called until it is
set. The comment could explain that better, though.

- Heikki

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Fast promotion failure
Date:	2013-05-20 19:18:12
Message-ID:	CA+U5nMJBhPRxg=UfLoyJ3S7jC1CMM-xsnZZb+_00wySGAFLG0Q@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 20 May 2013 18:47, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> wrote:
> On 19.05.2013 17:25, Simon Riggs wrote:

>> So while I believe that the checkpointer might have an incorrect TLI
>> and that you've seen a bug, what isn't clear is that the checkpointer
>> is the only process that would see an incorrect TLI, or why such
>> processes see an incorrect TLI. It seems equally likely at this point
>> that the TLI may be set incorrectly somehow and that is why it is
>> being read incorrectly.
>
>
> Yeah. I spent some more time debugging this, and found the culprit. In
> CreateRestartPoint, we do this:
>
>> /*
>> * Update ThisTimeLineID to the timeline we're currently
>> replaying,
>> * so that we install any recycled segments on that
>> timeline.
>> *
>> * There is no guarantee that the WAL segments will be
>> useful on the
>> * current timeline; if recovery proceeds to a new
>> timeline right
>> * after this, the pre-allocated WAL segments on this
>> timeline will
>> * not be used, and will go wasted until recycled on the
>> next
>> * restartpoint. We'll live with that.
>> */
>> (void) GetXLogReplayRecPtr(&ThisTimeLineID);
>
>
> Here's what happens:
>
> 0. System is in recovery
> 1. checkpointer process starts creating a restartpoint.
> 2. Startup process ends recovery, creates end-of-recovery record, sets the
> shared ThisTimeLineID value to 2, and exits.
> 3. checkpointer process calls RecoveryInProgess() while performing the
> restartpoint (there are RecoveryInProgress() calls in many places, e.g in
> XLogNeedsFlush()). It sets LocalRecoveryInProgress = true, and initializes
> ThisTimeLineID to 2.
> 4. At the end of restartpoint, the checkpointer process does the above
> GetXLogReplayRecPtr(&ThisTimeLineID). That overwrites ThisTimeLineID back to
> 1.
> 5. Checkpointer calls RecoveryInProgress, which returns true immediately, as
> LocalRecoveryInProgress is already set.
> 6. Checkpointer performs the first checkpoint, with ThisTimeLineID set
> incorrectly to 1.

That description looks correct to me from the code.

> Not sure what the best fix would be. Perhaps change the code in
> CreateRestartPoint() to do something like this instead:
>
> GetXLogReplayRecPtr(&replayTLI);
> if (RecoveryInProgress())
> ThisTimeLineID = replayTLI;

Installing a few extra WAL files doesn't seem to be a good reason to
mess with such an important variable as ThisTimeLineID, especially
since its such a rare event and hardly worth optimising for.

I would prefer it if we didn't set ThisTimeLineID at all in that way,
or anywhere else. If we do, it needs to be done safely, otherwise any
caller could make the same mistake.

Suggested patch attached.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment	Content-Type	Size
avoid_setting_tli.v1.patch	application/octet-stream	1.5 KB

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Fast promotion failure
Date:	2013-05-20 19:40:14
Message-ID:	519A7C1E.2050604@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 20.05.2013 22:18, Simon Riggs wrote:
> On 20 May 2013 18:47, Heikki Linnakangas<hlinnakangas(at)vmware(dot)com> wrote:
>> Not sure what the best fix would be. Perhaps change the code in
>> CreateRestartPoint() to do something like this instead:
>>
>> GetXLogReplayRecPtr(&replayTLI);
>> if (RecoveryInProgress())
>> ThisTimeLineID = replayTLI;
>
> Installing a few extra WAL files doesn't seem to be a good reason to
> mess with such an important variable as ThisTimeLineID, especially
> since its such a rare event and hardly worth optimising for.
>
> I would prefer it if we didn't set ThisTimeLineID at all in that way,
> or anywhere else. If we do, it needs to be done safely, otherwise any
> caller could make the same mistake.

> Suggested patch attached.

> @@ -7466,14 +7468,6 @@ CreateRestartPoint(int flags)
> * not be used, and will go wasted until recycled on the next
> * restartpoint. We'll live with that.
> */
> - (void) GetXLogReplayRecPtr(&ThisTimeLineID);
> -
> - RemoveOldXlogFiles(_logSegNo, endptr);
> -
> - /*
> - * Make more log segments if needed. (Do this after recycling old log
> - * segments, since that may supply some of the needed files.)
> - */
> PreallocXlogFiles(endptr);
> }

That's essentially reverting commit 343ee00b7, resurrecting the bug that
that fixed.

> @@ -9098,10 +9092,10 @@ GetXLogReplayRecPtr(TimeLineID *replayTLI)
> SpinLockAcquire(&xlogctl->info_lck);
> recptr = xlogctl->lastReplayedEndRecPtr;
> tli = xlogctl->lastReplayedTLI;
> + if (replayTLI && xlogctl->SharedRecoveryInProgress)
> + *replayTLI = tli;
> SpinLockRelease(&xlogctl->info_lck);
>
> - if (replayTLI)
> - *replayTLI = tli;
> return recptr;
> }

That breaks the only remaining caller that passes a replayTLI pointer,
GetStandbyFlushRecPtr(); *replayTLI points to a local variable in that
function, which is left uninitialized if !xlogctl->SharedRecoveryInProgress.

- Heikki

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Fast promotion failure
Date:	2013-05-20 21:00:59
Message-ID:	CA+U5nM+Lp8ryxRz-qen6Jr7v+2N40ViarvLo6mMzQ16hYoTzcQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 20 May 2013 20:40, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> wrote:
> On 20.05.2013 22:18, Simon Riggs wrote:
>>
>> On 20 May 2013 18:47, Heikki Linnakangas<hlinnakangas(at)vmware(dot)com> wrote:
>>>
>>> Not sure what the best fix would be. Perhaps change the code in
>>>
>>> CreateRestartPoint() to do something like this instead:
>>>
>>> GetXLogReplayRecPtr(&replayTLI);
>>> if (RecoveryInProgress())
>>> ThisTimeLineID = replayTLI;
>>
>>
>> Installing a few extra WAL files doesn't seem to be a good reason to
>> mess with such an important variable as ThisTimeLineID, especially
>> since its such a rare event and hardly worth optimising for.
>>
>> I would prefer it if we didn't set ThisTimeLineID at all in that way,
>> or anywhere else. If we do, it needs to be done safely, otherwise any
>> caller could make the same mistake.
>
>
>> Suggested patch attached.
>
>
>> @@ -7466,14 +7468,6 @@ CreateRestartPoint(int flags)
>>
>> * not be used, and will go wasted until recycled on the
>> next
>> * restartpoint. We'll live with that.
>> */
>> - (void) GetXLogReplayRecPtr(&ThisTimeLineID);
>> -
>> - RemoveOldXlogFiles(_logSegNo, endptr);
>> -
>> - /*
>> - * Make more log segments if needed. (Do this after
>> recycling old log
>> - * segments, since that may supply some of the needed
>> files.)
>> - */
>> PreallocXlogFiles(endptr);
>> }
>
>
> That's essentially reverting commit 343ee00b7, resurrecting the bug that
> that fixed.

Hmm, OK.

Then we should using the correct value, not the one that patch set. It
certainly worked at that time, but not in a principled way.

When we set the new timeline we should be updating all values that
might be used elsewhere. If we do that, then no matter when or how we
run GetXLogReplayRecPtr, it can't ever get it wrong in any backend.

Patch attached.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment	Content-Type	Size
set_lastReplayedTLI.v1.patch	application/octet-stream	799 bytes

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Fast promotion failure
Date:	2013-05-21 06:46:31
Message-ID:	519B1847.3040708@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 21.05.2013 00:00, Simon Riggs wrote:
> When we set the new timeline we should be updating all values that
> might be used elsewhere. If we do that, then no matter when or how we
> run GetXLogReplayRecPtr, it can't ever get it wrong in any backend.
>
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -5838,8 +5838,16 @@ StartupXLOG(void)
> }
>
> /* Save the selected TimeLineID in shared memory, too */
> - XLogCtl->ThisTimeLineID = ThisTimeLineID;
> - XLogCtl->PrevTimeLineID = PrevTimeLineID;
> + {
> + /* use volatile pointer to prevent code rearrangement */
> + volatile XLogCtlData *xlogctl = XLogCtl;
> +
> + SpinLockAcquire(&xlogctl->info_lck);
> + XLogCtl->ThisTimeLineID = ThisTimeLineID;
> + XLogCtl->lastReplayedTLI = ThisTimeLineID;
> + XLogCtl->PrevTimeLineID = PrevTimeLineID;
> + SpinLockRelease(&xlogctl->info_lck);
> + }

Hmm. lastReplayedTLI is supposed to be the timeline of the last record
that was replayed, ie. the timeline corresponding lastReplayedEndRecPtr.
I think it would work, but it feels like it could be an easy source of
bugs in the future.

It might be a good idea to change walsender to not store that in
ThisTimeLineID, though. It used to make sense for ThisTimeLineID to
track the current recovery timeline in 9.2 and below, but now that
walsender can be sending any older timeline, using ThisTimeLineID to
store the latest one seems confusing.

- Heikki

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Fast promotion failure
Date:	2013-05-21 08:26:01
Message-ID:	CA+U5nM+zJjyA7aKXYT4eMD5npdEfbiqx3nBaAFvmJKLEeytGrQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 21 May 2013 07:46, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> wrote:
> On 21.05.2013 00:00, Simon Riggs wrote:
>>
>> When we set the new timeline we should be updating all values that
>> might be used elsewhere. If we do that, then no matter when or how we
>> run GetXLogReplayRecPtr, it can't ever get it wrong in any backend.
>>
>> --- a/src/backend/access/transam/xlog.c
>> +++ b/src/backend/access/transam/xlog.c
>> @@ -5838,8 +5838,16 @@ StartupXLOG(void)
>> }
>>
>> /* Save the selected TimeLineID in shared memory, too */
>> - XLogCtl->ThisTimeLineID = ThisTimeLineID;
>> - XLogCtl->PrevTimeLineID = PrevTimeLineID;
>> + {
>> + /* use volatile pointer to prevent code rearrangement */
>> + volatile XLogCtlData *xlogctl = XLogCtl;
>> +
>> + SpinLockAcquire(&xlogctl->info_lck);
>> + XLogCtl->ThisTimeLineID = ThisTimeLineID;
>> + XLogCtl->lastReplayedTLI = ThisTimeLineID;
>> + XLogCtl->PrevTimeLineID = PrevTimeLineID;
>> + SpinLockRelease(&xlogctl->info_lck);
>> + }
>
>
> Hmm. lastReplayedTLI is supposed to be the timeline of the last record that
> was replayed, ie. the timeline corresponding lastReplayedEndRecPtr. I think
> it would work, but it feels like it could be an easy source of bugs in the
> future.

I'm OK with that principle, as long as we don't touch ThisTimeLineID,
which has been the source of multiple bugs.

So we should set the TLI explicitly before installing, like attached patch.

Otherwise we'd need multiple permanent TLIs which would be overkill.

I feel there are problems because we set the newly selected TLI from
startup process into shared memory, then some time later we set
SharedRecoveryInProgress = false. That timing window isn't good, but I
don't see a different way.

> It might be a good idea to change walsender to not store that in
> ThisTimeLineID, though. It used to make sense for ThisTimeLineID to track
> the current recovery timeline in 9.2 and below, but now that walsender can
> be sending any older timeline, using ThisTimeLineID to store the latest one
> seems confusing.

Agreed, but looks like too much code to touch that lightly.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment	Content-Type	Size
install_xlog_right.v1.patch	application/octet-stream	1.7 KB

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Fast promotion failure
Date:	2013-05-21 20:33:17
Message-ID:	CA+U5nM+pWjAYCtTBpYCi-UtwtdJru6LiM7O2D4mFiabY=i194Q@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 21 May 2013 09:26, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> I'm OK with that principle...

Well, after fighting some more with that, I've gone with the, er,
principle of slightly less ugliness.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services