Skip checkpoint on promoting from streaming replication

Lists: pgsql-hackers
From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Skip checkpoint on promoting from streaming replication
Date: 2012-06-08 08:22:01
Message-ID: 20120608.172201.126345187.horiguchi.kyotaro@lab.ntt.co.jp

Hello,

I have a problem with promotion from hot standby: the exclusive
checkpoint delays completion of the promotion.

This checkpoint is a "shutdown checkpoint" by convention, in
relation to the TLI increment, according to the comment shown
below. I take "shutdown checkpoint" to mean an exclusive
checkpoint - in other words, a checkpoint with no WAL inserts in
the meantime.

> * one. This is not particularly critical, but since we may be
> * assigning a new TLI, using a shutdown checkpoint allows us to have
> * the rule that TLI only changes in shutdown checkpoints, which
> * allows some extra error checking in xlog_redo.

I depend on this and suppose we can omit the checkpoint if the
latest checkpoint is recent enough to allow crash recovery
afterward. This condition could be secured by my other patch for
checkpoint_segments on standby.

After applying this patch, the checkpoint after archive recovery
near the end of StartupXLOG() will be skipped under the following
conditions:

- The WAL receiver has been launched at some point (checked with
  WalRcvStarted()).

- XLogCheckpointNeeded(), applied to replayEndRecPtr, reports that
  no checkpoint is needed.

What do you think about this?

This patch needs the WalRcvStarted() function introduced by my other patch:

http://archives.postgresql.org/pgsql-hackers/2012-06/msg00287.php

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

== My e-mail address has been changed since Apr. 1, 2012.

Attachment Content-Type Size
skip_ckpt_after_rcv_on_stby_20120608_1.patch text/x-patch 1.6 KB

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Skip checkpoint on promoting from streaming replication
Date: 2012-06-08 09:28:12
Message-ID: CA+U5nMJ-C6PdN6eqD6JFZTViGvCW1xV_v38=+CWR0vnQoKO3Mg@mail.gmail.com

On 8 June 2012 09:22, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:

> I have a problem with promotion from hot standby: the exclusive
> checkpoint delays completion of the promotion.

Agreed, we have that problem.

> I depend on this and suppose we can omit the checkpoint if the
> latest checkpoint is recent enough to allow crash recovery
> afterward.

I don't see any reason to special case this. If a checkpoint has no
work to do, then it will go very quickly. Why seek to speed it up even
further?

> This condition could be secured by my other patch for
> checkpoint_segments on standby.

More frequent checkpoints are very unlikely to secure a condition that
no checkpoint at all is required at failover.

Making a change that has a negative effect for everybody, in the hope
of sometimes improving performance for something that is already fast,
doesn't seem a good trade-off to me.

Regrettably, the line of thought explained here does not seem useful to me.

As you know, I was working on avoiding shutdown checkpoints completely
myself. You are welcome to work on the approach Fujii and I discussed.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: simon(at)2ndQuadrant(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Skip checkpoint on promoting from streaming replication
Date: 2012-06-12 02:38:31
Message-ID: 20120612.113831.63694928.horiguchi.kyotaro@lab.ntt.co.jp

Hello, sorry for being vague.

> > I depend on this and suppose we can omit the checkpoint if the
> > latest checkpoint is recent enough to allow crash recovery
> > afterward.
>
> I don't see any reason to special case this. If a checkpoint has no
> work to do, then it will go very quickly. Why seek to speed it up even
> further?

I want the standby to start serving as soon as possible on failover
in an HA cluster, even if it saves only a few seconds.

> > This condition could be secured by my other patch for
> > checkpoint_segments on standby.
>
> More frequent checkpoints are very unlikely to secure a condition that
> no checkpoint at all is required at failover.

I understand that the checkpoint at the end of recovery is
indispensable to ensure that crash recovery is available afterward.
Putting aside the convention tying TLI increments to shutdown
checkpoints, the shutdown checkpoint there seems to me to be
omittable if (and not 'only if', I suppose) crash recovery is
available at the time.

The shutdown checkpoint itself seems dispensable to me, but, taking
the TLI convention into consideration, I am ashamed to say I am not
yet convinced of that.

> Making a change that has a negative effect for everybody, in the hope
> of sometimes improving performance for something that is already fast,
> doesn't seem a good trade-off to me.

Hmm.. I suppose the negative effect you've pointed out is the
possible slowing down of the hot standby by the extra checkpoints
discussed in the other thread - is that correct? Could you accept
this kind of modification if it could be turned off by, say, a GUC?

> Regrettably, the line of thought explained here does not seem useful to me.
>
> As you know, I was working on avoiding shutdown checkpoints completely
> myself. You are welcome to work on the approach Fujii and I discussed.

Sorry, I'm afraid I've failed to find that discussion. Could you
give me a pointer to it? Of course I'd be very happy if the
checkpoints were completely avoided with that approach.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

== My e-mail address has been changed since Apr. 1, 2012.


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Skip checkpoint on promoting from streaming replication
Date: 2012-06-12 08:52:43
Message-ID: CA+U5nMLwSdwUpqkF0F4WnWdURSLn1WrMfTvjWN0gkz9QbFraRw@mail.gmail.com

On 12 June 2012 03:38, Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> Hello, sorry for being vague.
>
>> > I depend on this and suppose we can omit the checkpoint if the
>> > latest checkpoint is recent enough to allow crash recovery
>> > afterward.
>>
>> I don't see any reason to special case this. If a checkpoint has no
>> work to do, then it will go very quickly. Why seek to speed it up even
>> further?
>
> I want the standby to start serving as soon as possible on failover
> in an HA cluster, even if it saves only a few seconds.

Please implement a prototype and measure how many seconds we are discussing.

>> > This condition could be secured by my other patch for
>> > checkpoint_segments on standby.
>>
>> More frequent checkpoints are very unlikely to secure a condition that
>> no checkpoint at all is required at failover.
>
> I understand that the checkpoint at the end of recovery is
> indispensable to ensure that crash recovery is available afterward.
> Putting aside the convention tying TLI increments to shutdown
> checkpoints, the shutdown checkpoint there seems to me to be
> omittable if (and not 'only if', I suppose) crash recovery is
> available at the time.
>
> The shutdown checkpoint itself seems dispensable to me, but, taking
> the TLI convention into consideration, I am ashamed to say I am not
> yet convinced of that.
>
>
>> Making a change that has a negative effect for everybody, in the hope
>> of sometimes improving performance for something that is already fast,
>> doesn't seem a good trade-off to me.
>
> Hmm.. I suppose the negative effect you've pointed out is the
> possible slowing down of the hot standby by the extra checkpoints
> discussed in the other thread - is that correct? Could you accept
> this kind of modification if it could be turned off by, say, a GUC?

This proposal is for a performance enhancement. We normally require
some proof that the enhancement is real and that it doesn't have a
poor effect on people not using it. Please make measurements.

It's easy to force more frequent checkpoints if you wish, so
please compare against the case of more frequent checkpoints.
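For reference, checkpoint frequency on a 9.1/9.2-era server can be raised with settings like the following (the values are purely illustrative, not recommendations):

```ini
# postgresql.conf - force more frequent checkpoints
checkpoint_segments = 3      # checkpoint after this many WAL segments
checkpoint_timeout  = 1min   # and at least this often (default 5min)
```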

>> Regrettably, the line of thought explained here does not seem useful to me.
>>
>> As you know, I was working on avoiding shutdown checkpoints completely
>> myself. You are welcome to work on the approach Fujii and I discussed.
>
> Sorry, I'm afraid I've failed to find that discussion. Could you
> give me a pointer to it? Of course I'd be very happy if the
> checkpoints were completely avoided with that approach.

Discussion on a patch submitted to the January 2012 CommitFest
to reduce failover time.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: simon(at)2ndQuadrant(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Skip checkpoint on promoting from streaming replication
Date: 2012-06-12 11:43:07
Message-ID: 20120612.204307.31812587.horiguchi.kyotaro@lab.ntt.co.jp

Hello. Thank you for pointing me to the previous discussion. I'll
study it now.

> > I want the standby to start serving as soon as possible on failover
> > in an HA cluster, even if it saves only a few seconds.
>
> Please implement a prototype and measure how many seconds we
> are discussing.

I'm sorry to have omitted measurement data. (But some may be found
in the previous discussion.)

Our previous measurement of failover of PostgreSQL 9.1 + Pacemaker
under a certain workload showed that the shutdown checkpoint took 8
seconds out of 42 seconds of total failover time (about 20%).

OS : RHEL6.1-64
DBMS : PostgreSQL 9.1.1
HA : pacemaker-1.0.11-1.2.2 x64
Repl : sync
Workload : master : pgbench / scale factor = 100 (approx. 1.5GB)
standby: none (warm-standby)

shared_buffers = 2.5GB
wal_buffers = 4MB
checkpoint_segments = 300
checkpoint_timeout = 15min
checkpoint_completion_target = 0.7
archive_mode = on

WAL segment consumption was about 310 segments / 15 min under
the conditions above.

> This proposal is for a performance enhancement. We normally require
> some proof that the enhancement is real and that it doesn't have a
> poor effect on people not using it. Please make measurements.

In the benchmark above, the extra load from more frequent
checkpoints (at the same frequency as the master's) was not a
problem. On the other hand, failover time is expected to be
shortened from 42 seconds to 34 seconds by omitting the shutdown
checkpoint. (But I have not measured that yet.)

> Discussion on a patch submitted to the January 2012 CommitFest
> to reduce failover time.

Thank you, and I'm sorry for missing it. I've found those
discussions and am reading them now.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

== My e-mail address has been changed since Apr. 1, 2012.


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: simon(at)2ndQuadrant(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Skip checkpoint on promoting from streaming replication
Date: 2012-06-18 08:42:17
Message-ID: 20120618.174217.155445557.horiguchi.kyotaro@lab.ntt.co.jp

Hello, This is the new version of the patch.

Your patch introduced the new WAL record type XLOG_END_OF_RECOVERY
to mark the change point of the TLI. But I think that information
is already stored in the history files and is ready to use in the
current code.

I looked into your first patch and over the discussion on it, and
found that my understanding - that the TLI switch works for crash
recovery as well as for archive recovery - was half wrong. The
correct half is that it can work for crash recovery if we properly
set TimeLineID in StartupXLOG().

To achieve this, I added a new field 'latestTLI' (a better name is
welcome) and made it always track the latest TLI, independently of
checkpoints. The recovery target in StartupXLOG() is then set from
it. Additionally, the previous patch checked only checkpoint
intervals, but this had no effect, as you said, because the WAL
files in pg_xlog are preserved as far back as required for crash
recovery, as I should have known...

The new patch seems to work correctly for a TLI change without a
following checkpoint. Archive recovery and PITR also seem to work
correctly. A test script for the former is attached too.

The new patch consists of two parts. These should perhaps be
treated as two separate patches:

1. Allow_TLI_Increment_without_Checkpoint_20120618.patch

   Removes the assumption, following the 'convention', that the TLI
   should be incremented only on a shutdown checkpoint. This
   actually seems to cause no problem, as the comment says ('This is
   not particularly critical').

2. Skip_Checkpoint_on_Promotion_20120618.patch

   Skips the checkpoint if the redo record can be read in place.

3. Test script for the TLI increment patch.

   This is only to show how the patch was tested. The point is to
   create a TLI increment point not followed by any kind of
   checkpoint. pg_controldata shows the following after running this
   test script; 'Latest timeline ID' is the new field.

> pg_control version number: 923
> Database cluster state: in production
!> Latest timeline ID: 2
> Latest checkpoint location: 0/2000058
> Prior checkpoint location: 0/2000058
> Latest checkpoint's REDO location: 0/2000020
!> Latest checkpoint's TimeLineID: 1

We will see this change as follows after crash recovery:

> Latest timeline ID: 2
> Latest checkpoint location: 0/54D9918
> Prior checkpoint location: 0/2000058
> Latest checkpoint's REDO location: 0/54D9918
> Latest checkpoint's TimeLineID: 2

Then we should see both 'ABCDE...' rows and both 'VWXYZ...' rows in
the table after the crash recovery.

What do you think about this?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

== My e-mail address has been changed since Apr. 1, 2012.

Attachment Content-Type Size
Allow_TLI_Increment_without_Checkpoint_20120618.patch text/x-patch 3.8 KB
Skip_Checkpoint_on_Promotion_20120618.patch text/x-patch 852 bytes

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: simon(at)2ndquadrant(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Skip checkpoint on promoting from streaming replication
Date: 2012-06-18 16:31:51
Message-ID: CAHGQGwFFoQEOaUV65Brgr3mhGOY+bcCpgWWyCzjp1SDyRctKyQ@mail.gmail.com

On Mon, Jun 18, 2012 at 5:42 PM, Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> What do you think about this?

What happens if the server skips the end-of-recovery checkpoint, is
promoted to master, runs some write transactions, and then crashes
and restarts automatically before it completes a checkpoint? In this
case, the server needs to do crash recovery from the last checkpoint
record, which has the old timeline ID, to the latest WAL record,
which has the new timeline ID. How does crash recovery recover
across the timeline change?

Regards,

--
Fujii Masao


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: masao(dot)fujii(at)gmail(dot)com
Cc: simon(at)2ndquadrant(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Skip checkpoint on promoting from streaming replication
Date: 2012-06-19 08:30:46
Message-ID: 20120619.173046.88698848.horiguchi.kyotaro@lab.ntt.co.jp

Thank you.

> What happens if the server skips the end-of-recovery checkpoint,
> is promoted to master, runs some write transactions, and then
> crashes and restarts automatically before it completes a
> checkpoint? In this case, the server needs to do crash recovery
> from the last checkpoint record, which has the old timeline ID, to
> the latest WAL record, which has the new timeline ID. How does
> crash recovery recover across the timeline change?

Basically the same as archive recovery, as far as I saw; it is
already implemented to work that way.

With this patch applied, StartupXLOG() gets its recoveryTargetTLI
from the new field latestTLI in the control file instead of from
the latest checkpoint. The latest checkpoint record still carries
its TLI and WAL location as before, but its TLI no longer has a
significant meaning in the recovery sequence.

Consider the following case:

      |seg 1     | seg 2    |
TLI 1 |...c......|....000000|
          C           P  X
TLI 2            |........00|

* C - checkpoint, P - promotion, X - crash just after here

This shows a situation where the latest checkpoint (restartpoint)
was taken at TLI=1/SEG=1/OFF=4, promotion happened at
TLI=1/SEG=2/OFF=5, and a crash occurred just after
TLI=2/SEG=2/OFF=8. Promotion itself inserts no WAL records but
creates a copy of the current segment for the new TLI; the file for
TLI=2/SEG=1 should not exist. (Who would create it?)

The control file will look as follows:

latest checkpoint : TLI=1/SEG=1/OFF=4
latest TLI        : 2

So the crash recovery sequence starts from SEG=1/LOC=4.
expectedTLIs will be (2, 1), so TLI 1 will naturally be selected
for SEG1 and TLI 2 for SEG2 in XLogFileReadAnyTLI().

In a closer view, startup constructs expectedTLIs by reading the
timeline history file corresponding to recoveryTargetTLI. It then
runs the recovery sequence from the redo point of the latest
checkpoint, using, in XLogPageRead(), the WALs with the largest
TLI - distinguished by file name, not header - within expectedTLIs.
The only difference from archive recovery is that
XLogFileReadAnyTLI() reads only the WAL files already sitting in
the pg_xlog directory and does not reach into the archive. The
pages with the new TLI will naturally be picked up in this
sequence, as mentioned above, and recovery will then stop at the
last readable record.

The latestTLI field in the control file is updated just after the
TLI is incremented and the new WAL file with the new TLI is
created. So the crash recovery sequence won't stop before reaching
the WAL with the new TLI designated in the control file.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

== My e-mail address has been changed since Apr. 1, 2012.


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: simon(at)2ndquadrant(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Skip checkpoint on promoting from streaming replication
Date: 2012-06-19 15:57:49
Message-ID: CAHGQGwH8-Ju018zo6KNakqFZCDans_jU5AjF+JUTR7YsjUjD0g@mail.gmail.com

On Tue, Jun 19, 2012 at 5:30 PM, Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> Thank you.
>
>> What happens if the server skips the end-of-recovery checkpoint,
>> is promoted to master, runs some write transactions, and then
>> crashes and restarts automatically before it completes a
>> checkpoint? In this case, the server needs to do crash recovery
>> from the last checkpoint record, which has the old timeline ID, to
>> the latest WAL record, which has the new timeline ID. How does
>> crash recovery recover across the timeline change?
>
> Basically the same as archive recovery, as far as I saw; it is
> already implemented to work that way.
>
> With this patch applied, StartupXLOG() gets its recoveryTargetTLI
> from the new field latestTLI in the control file instead of from
> the latest checkpoint. The latest checkpoint record still carries
> its TLI and WAL location as before, but its TLI no longer has a
> significant meaning in the recovery sequence.
>
> Consider the following case:
>
>       |seg 1     | seg 2    |
> TLI 1 |...c......|....000000|
>           C           P  X
> TLI 2            |........00|
>
> * C - checkpoint, P - promotion, X - crash just after here
>
> This shows a situation where the latest checkpoint (restartpoint)
> was taken at TLI=1/SEG=1/OFF=4, promotion happened at
> TLI=1/SEG=2/OFF=5, and a crash occurred just after
> TLI=2/SEG=2/OFF=8. Promotion itself inserts no WAL records but
> creates a copy of the current segment for the new TLI; the file for
> TLI=2/SEG=1 should not exist. (Who would create it?)
>
> The control file will look as follows:
>
> latest checkpoint : TLI=1/SEG=1/OFF=4
> latest TLI        : 2
>
> So the crash recovery sequence starts from SEG=1/LOC=4.
> expectedTLIs will be (2, 1), so TLI 1 will naturally be selected
> for SEG1 and TLI 2 for SEG2 in XLogFileReadAnyTLI().
>
> In a closer view, startup constructs expectedTLIs by reading the
> timeline history file corresponding to recoveryTargetTLI. It then
> runs the recovery sequence from the redo point of the latest
> checkpoint, using, in XLogPageRead(), the WALs with the largest
> TLI - distinguished by file name, not header - within expectedTLIs.
> The only difference from archive recovery is that
> XLogFileReadAnyTLI() reads only the WAL files already sitting in
> the pg_xlog directory and does not reach into the archive. The
> pages with the new TLI will naturally be picked up in this
> sequence, as mentioned above, and recovery will then stop at the
> last readable record.
>
> The latestTLI field in the control file is updated just after the
> TLI is incremented and the new WAL file with the new TLI is
> created. So the crash recovery sequence won't stop before reaching
> the WAL with the new TLI designated in the control file.

Is it guaranteed that all the files (e.g., the latest timeline history file)
required for such crash recovery exist in pg_xlog? If yes, your
approach might work well.

Regards,

--
Fujii Masao


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: masao(dot)fujii(at)gmail(dot)com
Cc: simon(at)2ndquadrant(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Skip checkpoint on promoting from streaming replication
Date: 2012-06-22 04:03:16
Message-ID: 20120622.130316.106246558.horiguchi.kyotaro@lab.ntt.co.jp

Hello,

> Is it guaranteed that all the files (e.g., the latest timeline history file)
> required for such crash recovery exist in pg_xlog? If yes, your
> approach might work well.

Particularly regarding promotion, the files required are the
history file of the latest timeline, the WAL file including the
redo location of the latest restartpoint, and all WAL files after
that one, each of the appropriate timeline.

In the current (9.2/9.3devel) implementation, as far as I know,
archive recovery and streaming replication create the regular WAL
files required during the recovery sequence in the standby's
pg_xlog directory, and only a restartpoint removes the ones older
than the segment on which the restartpoint took place. If so, all
the required files mentioned above should be in the pg_xlog
directory. Is there something I've forgotten?

However, it would be more robust if we could check that all the
required files are available at promotion. I can think of two
approaches that might accomplish that.

1. Record the ID of any WAL segment that was read but is not present
   in pg_xlog as a regular WAL file.

   For example, if we modify archive recovery so that fetched WAL
   files are kept out of pg_xlog, or are given a special name that
   cannot be referred to in crash recovery afterward, we record the
   ID of each such segment. The shutdown checkpoint at promotion or
   at the end of recovery cannot be skipped if this recorded segment
   ID is equal to or larger than the redo point of the latest
   checkpoint. This approach of course gives fewer chances to skip
   the shutdown checkpoint than forcing all required files to be
   copied into pg_xlog, but it still seems effective for the most
   common case: promotion happening enough minutes after WAL
   streaming started for a restartpoint to have been taken on a WAL
   segment in pg_xlog.

   I hope this is promising.

   What about a temporary WAL file for streaming? That seems to me
   to make the shutdown checkpoint mandatory, since no WAL files
   from before the promotion would be accessible at that moment. On
   the other hand, somehow preserving the WALs after the latest
   restartpoint seems to make no significant difference from the
   current way in terms of disk consumption.

2. Check for all the required WAL files at promotion or at the end
   of recovery.

   We could check for the existence of all required files at
   promotion, scanning in a manner similar to recovery. But this
   requires adding code similar to what already exists, or the labor
   of weaving a new function into the existing code. Furthermore, it
   seems likely to take a certain amount of time at promotion (or at
   the end of recovery).

   The discussion about temporary WAL files is the same as for 1.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

== My e-mail address has been changed since Apr. 1, 2012.


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: masao(dot)fujii(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Skip checkpoint on promoting from streaming replication
Date: 2012-08-09 09:45:55
Message-ID: CA+U5nMKRFuFNWSQ2-J4cLr2UoE+2PbAnZHprm2xfs8qOpSnh=A@mail.gmail.com

On 22 June 2012 05:03, Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:

> I hope this is promising.

I've reviewed this and thought about it over some time.

At first I was unhappy that you'd removed the restriction that
timelines only change on a shutdown checkpoint. But the reality is
that timelines can change at any point in the WAL stream - the only
way to tell the end of WAL from a timeline change is by looking for
later timelines.

The rest of the logic seems OK, but it's a big thing we're doing
here, and so it will take longer yet. Putting all the theory into
comments in the code would certainly help.

I don't have much else to say on this right now. I'm not committing
anything on this now since I'm about to go on holiday, but I'll be
looking at this when I get back.

For now, I'm going to mark this as Returned With Feedback, but please
don't be discouraged by that.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: simon(at)2ndQuadrant(dot)com
Cc: masao(dot)fujii(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Skip checkpoint on promoting from streaming replication
Date: 2012-08-30 07:05:49
Message-ID: 20120830.160549.230735413.horiguchi.kyotaro@lab.ntt.co.jp

Hello, sorry for the long absence.

> At first I was unhappy that you'd removed the restriction that
> timelines only change on a shutdown checkpoint. But the reality is
> that timelines can change at any point in the WAL stream - the only
> way to tell the end of WAL from a timeline change is by looking for
> later timelines.

Yes, I felt uncomfortable about that point. The overall picture of
timeline evolution in the WAL stream seems obscure, and it should
have been made clear before doing this. I couldn't show a clear
picture for this CF.

> The rest of the logic seems OK, but it's a big thing we're doing
> here, and so it will take longer yet. Putting all the theory into
> comments in the code would certainly help.

OK, agreed.

> I don't have much else to say on this right now. I'm not committing
> anything on this now since I'm about to go on holiday, but I'll be
> looking at this when I get back.

Have a nice holiday.

> For now, I'm going to mark this as Returned With Feedback, but please
> don't be discouraged by that.

I think we still have enough time to think about it, and I believe
this will be worth doing.

Thank you.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

== My e-mail address has been changed since Apr. 1, 2012.


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: simon(at)2ndQuadrant(dot)com, masao(dot)fujii(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Skip checkpoint on promoting from streaming replication
Date: 2012-10-18 20:22:06
Message-ID: 20121018202206.GR3763@alvh.no-ip.org

This patch seems to have been neglected by both its submitter and the
reviewer. Also, Simon said he was going to set it
returned-with-feedback on his last reply, but I see it as needs-review
still in the CF app. Is this something that is going to be reconsidered
and resubmitted for the next commitfest? If so, please close it up in
the current one.

Thanks.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, masao(dot)fujii(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Skip checkpoint on promoting from streaming replication
Date: 2012-10-18 22:19:02
Message-ID: CA+U5nMKh1q_68p_us4cLukGcW9NgKghqZqaefMnjJkVEn+FKvg@mail.gmail.com

On 18 October 2012 21:22, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote:

> This patch seems to have been neglected by both its submitter and the
> reviewer. Also, Simon said he was going to set it
> returned-with-feedback on his last reply, but I see it as needs-review
> still in the CF app. Is this something that is going to be reconsidered
> and resubmitted for the next commitfest? If so, please close it up in
> the current one.

I burned time on the unlogged-table problems, so I haven't got
round to this yet. I'm happier than I was with this.

I'm also conscious this is very important and there are no later patch
dependencies, so there's no rush to commit it and every reason to make
sure it happens without any mistakes. It will be there for 9.3.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: masao(dot)fujii(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Skip checkpoint on promoting from streaming replication
Date: 2013-01-06 21:58:49
Message-ID: CA+U5nMKvzeUWdD7FMyTG+gyg=xiP+9G_8A8fmhod2L0om+wVRA@mail.gmail.com

On 9 August 2012 10:45, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On 22 June 2012 05:03, Kyotaro HORIGUCHI
> <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>
>> I hope this is promising.
>
> I've reviewed this and thought about it over some time.

I've been torn between the need to remove the checkpoint for speed and
being worried about the implications of doing so.

We promote in multiple use cases. When we end a PITR, or are
performing a switchover, it doesn't really matter how long the
shutdown checkpoint takes, so I'm inclined to leave it there in those
cases. For failover, we need fast promotion.

So my thinking is to make pg_ctl promote -m fast
be the way to initiate a fast failover that skips the shutdown checkpoint.

That way all existing applications work the same as before, while new
users that explicitly choose to do so will gain from the new option.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: masao(dot)fujii(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Skip checkpoint on promoting from streaming replication
Date: 2013-01-24 16:24:28
Message-ID: CA+U5nMK8gjvWsJo9EvirYvpBa12x7DfW+9xO5T4FuFS1QSLM-w@mail.gmail.com
Lists: pgsql-hackers

On 6 January 2013 21:58, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On 9 August 2012 10:45, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> On 22 June 2012 05:03, Kyotaro HORIGUCHI
>> <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>>
>>> I hope this is promising.
>>
>> I've reviewed this and thought about it over some time.
>
> I've been torn between the need to remove the checkpoint for speed and
> being worried about the implications of doing so.
>
> We promote in multiple use cases. When we end a PITR, or are
> performing a switchover, it doesn't really matter how long the
> shutdown checkpoint takes, so I'm inclined to leave it there in those
> cases. For failover, we need fast promotion.
>
> So my thinking is to make pg_ctl promote -m fast
> be the way to initiate a fast failover that skips the shutdown checkpoint.
>
> That way all existing applications work the same as before, while new
> users that explicitly choose to do so will gain from the new option.

Here's a patch to skip the checkpoint when we do

pg_ctl promote -m fast

We keep the end-of-recovery checkpoint in all other cases.

The only thing left from Kyotaro's patch is a single line of code -
the call to ReadCheckpointRecord() that checks to see if the WAL
records for the last two restartpoints are on disk, which was an
important line of code.

Patch implements a new record type XLOG_END_OF_RECOVERY that behaves
on replay like a shutdown checkpoint record. I put this back in from
my patch because I believe it's important that we have a clear place
where the WAL history changes timelineId. WAL format change bump
required.

So far this is only barely tested, but since time is moving on, I
thought people might want to comment on it.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
fast_promote.v3.patch application/octet-stream 13.8 KB

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, masao(dot)fujii(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Skip checkpoint on promoting from streaming replication
Date: 2013-01-24 16:52:09
Message-ID: 510166B9.6080705@vmware.com
Lists: pgsql-hackers

On 24.01.2013 18:24, Simon Riggs wrote:
> On 6 January 2013 21:58, Simon Riggs<simon(at)2ndquadrant(dot)com> wrote:
>> I've been torn between the need to remove the checkpoint for speed and
>> being worried about the implications of doing so.
>>
>> We promote in multiple use cases. When we end a PITR, or are
>> performing a switchover, it doesn't really matter how long the
>> shutdown checkpoint takes, so I'm inclined to leave it there in those
>> cases. For failover, we need fast promotion.
>>
>> So my thinking is to make pg_ctl promote -m fast
>> be the way to initiate a fast failover that skips the shutdown checkpoint.
>>
>> That way all existing applications work the same as before, while new
>> users that explicitly choose to do so will gain from the new option.
>
> Here's a patch to skip checkpoint when we do
>
> pg_ctl promote -m fast
>
> We keep the end of recovery checkpoint in all other cases.

Hmm, there seems to be no way to do a "fast" promotion with a trigger file.

I'm a bit confused why there needs to be a special mode for this. Can't we
just always do the "fast" promotion? I agree that there's no urgency
when you're doing PITR, but it shouldn't do any harm either. Or perhaps
always do "fast" promotion when starting up from standby mode, and
"slow" otherwise.

Are we comfortable enough with this to skip the checkpoint after crash
recovery?

I may be missing something, but it looks like after a "fast" promotion,
you don't request a new checkpoint. So it can take quite a while for the
next checkpoint to be triggered by checkpoint_timeout/segments. That
shouldn't be a problem, but I feel that it'd be prudent to request a new
checkpoint immediately (not necessarily an "immediate" checkpoint, though).

> The only thing left from Kyotaro's patch is a single line of code -
> the call to ReadCheckpointRecord() that checks to see if the WAL
> records for the last two restartpoints are on disk, which was an
> important line of code.

Why's that important, just for paranoia? If the last two restartpoints
have disappeared, something's seriously wrong, and you will be in
trouble, e.g. if you crash at that point. Do we need to be extra paranoid
when doing a "fast" promotion?

> Patch implements a new record type XLOG_END_OF_RECOVERY that behaves
> on replay like a shutdown checkpoint record. I put this back in from
> my patch because I believe it's important that we have a clear place
> where the WAL history changes timelineId. WAL format change bump
> required.

Agreed, such a WAL record is essential.

At replay, an end-of-recovery record should be a signal to the hot
standby mechanism that there are no transactions running in the master
at that point, same as a shutdown checkpoint.

- Heikki


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, masao(dot)fujii(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Skip checkpoint on promoting from streaming replication
Date: 2013-01-24 17:44:26
Message-ID: CA+U5nMLfun0KcusiEn7iB1o-HYcGQ8B0VG0HE7U9byJtf9r6Sg@mail.gmail.com
Lists: pgsql-hackers

On 24 January 2013 16:52, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> wrote:
> On 24.01.2013 18:24, Simon Riggs wrote:
>>
>> On 6 January 2013 21:58, Simon Riggs<simon(at)2ndquadrant(dot)com> wrote:
>>>
>>> I've been torn between the need to remove the checkpoint for speed and
>>>
>>> being worried about the implications of doing so.
>>>
>>> We promote in multiple use cases. When we end a PITR, or are
>>> performing a switchover, it doesn't really matter how long the
>>> shutdown checkpoint takes, so I'm inclined to leave it there in those
>>> cases. For failover, we need fast promotion.
>>>
>>> So my thinking is to make pg_ctl promote -m fast
>>> be the way to initiate a fast failover that skips the shutdown
>>> checkpoint.
>>>
>>> That way all existing applications work the same as before, while new
>>> users that explicitly choose to do so will gain from the new option.
>>
>>
>> Here's a patch to skip checkpoint when we do
>>
>> pg_ctl promote -m fast
>>
>> We keep the end of recovery checkpoint in all other cases.
>
>
> Hmm, there seems to be no way to do a "fast" promotion with a trigger file.

True. I thought we were moving away from trigger files to using "promote".

> I'm a bit confused why there needs to be a special mode for this. Can't we
> just always do the "fast" promotion? I agree that there's no urgency when
> you're doing PITR, but it shouldn't do any harm either. Or perhaps always do
> "fast" promotion when starting up from standby mode, and "slow" otherwise.
>
> Are we comfortable enough with this to skip the checkpoint after crash
> recovery?

I'm not. Maybe if we get no bugs we can make it do this always, in the
next release.

It's fast when it needs to be and safe otherwise.

> I may be missing something, but it looks like after a "fast" promotion, you
> don't request a new checkpoint. So it can take quite a while for the next
> checkpoint to be triggered by checkpoint_timeout/segments. That shouldn't be
> a problem, but I feel that it'd be prudent to request a new checkpoint
> immediately (not necessarily an "immediate" checkpoint, though).

I thought of that and there is a long comment to explain why I didn't.

Two problems:

1) an immediate checkpoint can cause a disk/resource usage spike,
which is definitely not what you need just when a spike of connections
and new SQL hits the system.

2) If we did that, we would have an EndOfRecovery record, some other
records, and then a shutdown checkpoint.
As I write this, I realise that (2) is wrong, because we shouldn't do a
shutdown checkpoint anyway.

But I still think (1) is a valid concern.

>> The only thing left from Kyotaro's patch is a single line of code -
>> the call to ReadCheckpointRecord() that checks to see if the WAL
>> records for the last two restartpoints are on disk, which was an
>> important line of code.
>
>
> Why's that important, just for paranoia? If the last two restartpoints have
> disappeared, something's seriously wrong, and you will be in trouble, e.g. if
> you crash at that point. Do we need to be extra paranoid when doing a "fast"
> promotion?

The check is cheap, so what do we gain by skipping the check?

>> Patch implements a new record type XLOG_END_OF_RECOVERY that behaves
>> on replay like a shutdown checkpoint record. I put this back in from
>> my patch because I believe it's important that we have a clear place
>> where the WAL history changes timelineId. WAL format change bump
>> required.
>
>
> Agreed, such a WAL record is essential.
>
> At replay, an end-of-recovery record should be a signal to the hot standby
> mechanism that there are no transactions running in the master at that
> point, same as a shutdown checkpoint.

I had a reason why I didn't do that, but it seems to have slipped my mind.

If I can't remember, I'll add it.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, masao(dot)fujii(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Skip checkpoint on promoting from streaming replication
Date: 2013-01-24 18:54:52
Message-ID: CA+U5nM+6EAhQnqV9EnhsML9TBnMScA7eJ4FMLm9AbaR80ByfPg@mail.gmail.com
Lists: pgsql-hackers

On 24 January 2013 17:44, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

>> At replay, an end-of-recovery record should be a signal to the hot standby
>> mechanism that there are no transactions running in the master at that
>> point, same as a shutdown checkpoint.
>
> I had a reason why I didn't do that, but it seems to have slipped my mind.
>
> If I can't remember, I'll add it.

I think it was simply to keep things simple and avoid bugs in this release.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, masao(dot)fujii(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Skip checkpoint on promoting from streaming replication
Date: 2013-01-25 12:15:12
Message-ID: 51027750.5070307@vmware.com
Lists: pgsql-hackers

On 24.01.2013 19:44, Simon Riggs wrote:
> On 24 January 2013 16:52, Heikki Linnakangas<hlinnakangas(at)vmware(dot)com> wrote:
>> I may be missing something, but it looks like after a "fast" promotion, you
>> don't request a new checkpoint. So it can take quite a while for the next
>> checkpoint to be triggered by checkpoint_timeout/segments. That shouldn't be
>> a problem, but I feel that it'd be prudent to request a new checkpoint
>> immediately (not necessarily an "immediate" checkpoint, though).
>
> I thought of that and there is a long comment to explain why I didn't.
>
> Two problems:
>
> 1) an immediate checkpoint can cause a disk/resource usage spike,
> which is definitely not what you need just when a spike of connections
> and new SQL hits the system.

It doesn't need to be an "immediate" checkpoint, i.e. you don't need to
rush through it with checkpoint_completion_target=0. I think you should
initiate a regular, slow checkpoint right after writing the
end-of-recovery record. It can take some time for it to finish, which is OK.

There's no hard correctness reason here for any particular behavior, I
just feel that that would make most sense. It seems prudent to initiate
a checkpoint right after timeline switch, so that you get a new
checkpoint on the new timeline fairly soon - it could take up to
checkpoint_timeout otherwise, but there's no terrible rush to finish it
ASAP.

- Heikki


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, masao(dot)fujii(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Skip checkpoint on promoting from streaming replication
Date: 2013-01-25 13:26:30
Message-ID: CA+U5nMJzBCsutE+ZBqsDSgNf_NNq983WdxSFt3=rj6ppbUuEXA@mail.gmail.com
Lists: pgsql-hackers

On 25 January 2013 12:15, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> wrote:

>> 1) an immediate checkpoint can cause a disk/resource usage spike,
>> which is definitely not what you need just when a spike of connections
>> and new SQL hits the system.
>
>
> It doesn't need to be an "immediate" checkpoint, i.e. you don't need to rush
> through it with checkpoint_completion_target=0. I think you should initiate
> a regular, slow checkpoint right after writing the end-of-recovery record.
> It can take some time for it to finish, which is OK.

OK, will add.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, masao(dot)fujii(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Skip checkpoint on promoting from streaming replication
Date: 2013-01-25 15:20:07
Message-ID: 11546.1359127207@sss.pgh.pa.us
Lists: pgsql-hackers

Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> writes:
> There's no hard correctness reason here for any particular behavior, I
> just feel that that would make most sense. It seems prudent to initiate
> a checkpoint right after timeline switch, so that you get a new
> checkpoint on the new timeline fairly soon - it could take up to
> checkpoint_timeout otherwise, but there's no terrible rush to finish it
> ASAP.

+1. The way I would think about it is that we're switching from a
checkpointing regime appropriate to a slave to one appropriate to a
master. If the last restartpoint was far back, compared to the
configured checkpoint timing for master operation, we're at risk that a
crash could take longer than desired to recover. So we ought to embark
right away on a fresh checkpoint, but do it in the same way it would be
done in normal master operation (thus, not immediate). Once it's done
we'll be in the expected checkpointing state for a master.

regards, tom lane