time-delayed standbys

Lists: pgsql-hackers
From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: time-delayed standbys
Date: 2011-04-20 02:47:16
Message-ID: BANLkTi==TTzHDqWzwJDjmOf__8YuA7L1jw@mail.gmail.com

While I was out at the MySQL conference last week, I heard that one of
the forthcoming MySQL features is "time-delayed replication":

http://forge.mysql.com/worklog/task.php?id=344

That is, a standby configured such that replay lags a prescribed
amount of time behind the master.

This seemed easy to implement, so I did. Patch (for 9.2, obviously) attached.
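For concreteness, a standby configured this way might look like the following recovery.conf sketch (the parameter name recovery_time_delay comes up later in the thread; the connection string and the units of the delay are assumptions, not confirmed by the patch text):

```
# recovery.conf on the standby (sketch; values illustrative)
standby_mode = 'on'
primary_conninfo = 'host=master port=5432 user=replicator'
recovery_time_delay = 300000    # hypothetical: replay lags the master by ~5 minutes
```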

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
time-delayed-standby.patch application/octet-stream 5.8 KB

From: Greg Stark <gsstark(at)mit(dot)edu>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: time-delayed standbys
Date: 2011-04-20 11:24:25
Message-ID: BANLkTi=QScpDTkUw3iw6JaO15uUvx_P4qg@mail.gmail.com

On Wed, Apr 20, 2011 at 3:47 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> While I was out at the MySQL conference last week, I heard that one of
> the forthcoming MySQL features is "time-delayed replication":

Incidentally, this is a popular Oracle feature. It's a poor man's
"flashback" and similar to how some filesystems automatically create
regular snapshots of every home directory so you can get back stuff at
some arbitrary point in the past.

I haven't read the patch but are you delaying delivering the log or
delaying replaying it? I think you actually want the latter so in case
of a real failure you can choose between replaying the last 5 minutes
and recovering everything or intentionally dropping that history if
the failure was caused by an application problem.

--
greg


From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: time-delayed standbys
Date: 2011-04-20 11:29:49
Message-ID: BANLkTiko1m--vOO6xURWzLyPZyuMbrrYaw@mail.gmail.com

Hello

yesterday I was asked about this feature

+1

Regards

Pavel

2011/4/20 Robert Haas <robertmhaas(at)gmail(dot)com>:
> While I was out at the MySQL conference last week, I heard that one of
> the forthcoming MySQL features is "time-delayed replication":
>
> http://forge.mysql.com/worklog/task.php?id=344
>
> That is, a standby configured such that replay lags a prescribed
> amount of time behind the master.
>
> This seemed easy to implement, so I did.  Patch (for 9.2, obviously) attached.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: time-delayed standbys
Date: 2011-04-20 11:42:56
Message-ID: BANLkTimarB9eLhHpW73ja2oGvCaWNQuD6g@mail.gmail.com

On Wed, Apr 20, 2011 at 12:24 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> I haven't read the patch but are you delaying delivering the log or
> delaying replaying it? I think you actually want the latter so in case
> of a real failure you can choose between replaying the last 5 minutes
> and recovering everything or intentionally dropping that history if
> the failure was caused by an application problem.
>

Ok, so I read the patch. That wasn't so hard.

You did implement it the way I described, so great.

However you did it by comparing the server timestamp and the slave
timestamp. I think we should avoid that. Synchronizing clocks is a
surprisingly hard problem and it would really be great if the whole
system kept working if the clocks were out of sync.

I don't think that's impossible, though it's a bit inconvenient. We
would need to keep track of the most recent timestamp received from
the master and pause recovery when we reach a record that's less than
five minutes older than that most recent timestamp. This means we're
comparing only master timestamps to each other. It does mean that if
the log transfer lags, recovery will lag unnecessarily. And it means
the process of receiving logs is complicated by having to keep track
of this timestamp.
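The bookkeeping described above can be sketched in a few lines (hypothetical names, not the patch's actual code; Python used for brevity):

```python
class DelayedReplayGate:
    """Gate WAL replay using only master timestamps, so the standby's
    clock never needs to agree with the master's."""

    def __init__(self, delay_secs):
        self.delay = delay_secs
        self.newest_master_ts = None  # newest master timestamp received so far

    def note_received(self, master_ts):
        # Called by the receiving side for each record that carries a
        # master timestamp; keeps a high-water mark.
        if self.newest_master_ts is None or master_ts > self.newest_master_ts:
            self.newest_master_ts = master_ts

    def ok_to_replay(self, record_master_ts):
        # Replay pauses until the master is known to have advanced at
        # least `delay` seconds past this record's own timestamp.
        return (self.newest_master_ts is not None
                and self.newest_master_ts - record_master_ts >= self.delay)
```

As noted above, if log transfer stalls the high-water mark stops advancing and replay lags more than strictly necessary.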

--
greg


From: Jerry Sievers <gsievers19(at)comcast(dot)net>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: time-delayed standbys
Date: 2011-04-20 12:49:10
Message-ID: 87wriphyux.fsf@comcast.net

Greg Stark <gsstark(at)mit(dot)edu> writes:

> On Wed, Apr 20, 2011 at 3:47 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
>> While I was out at the MySQL conference last week, I heard that one of
>> the forthcoming MySQL features is "time-delayed replication":
>
> Incidentally, this is a popular Oracle feature. It's a poor man's
> "flashback" and similar to how some filesystems automatically create
> regular snapshots of every home directory so you can get back stuff at
> some arbitrary point in the past.

Yup.

One of the big bosses where I work went asking for this a couple of
years ago. We're multi-platform: Oracle, MySQL, EnterpriseDB, Vertica.

They put a 6-hour delay on the critical Oracle boxes. Nothing was done
for MySQL or Vertica, since there was no feature support.

My C-foo being rusty, I elected to add more hacks to our home-spun
Python version of pg_standby rather than adding the feature to
pg_standby itself.

We've been running with delayed WAL apply ever since. Shipping happens
immediately on log creation at the master, and we simply wait for the
files to age the configured time before applying them.

In a few cases, we have 2 or more standbys off the same prod master.
One in real-time mode and the others lagged.

Thanks all!

--
Jerry Sievers
Postgres DBA/Development Consulting
e: gsievers19(at)comcast(dot)net
p: 305.321.1144


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: time-delayed standbys
Date: 2011-04-20 13:19:50
Message-ID: BANLkTi=hR3vGUDX-SQf=nFcL9ycmrQ6Fcg@mail.gmail.com

On Wed, Apr 20, 2011 at 7:42 AM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> On Wed, Apr 20, 2011 at 12:24 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
>> I haven't read the patch but are you delaying delivering the log or
>> delaying replaying it? I think you actually want the latter so in case
>> of a real failure you can choose between replaying the last 5 minutes
>> and recovering everything or intentionally dropping that history if
>> the failure was caused by an application problem.
>>
>
> Ok, so I read the patch. That wasn't so hard.
>
> You did implement it the way I described so great.
>
> However you did it by comparing the server timestamp and the slave
> timestamp. I think we should avoid that. Synchronizing clocks is a
> surprisingly hard problem and it would really be great if the whole
> system kept working if the clocks were out of sync.
>
> I don't think that's impossible, though it's a bit inconvenient. We
> would need to keep track of the most recent timestamp received from
> the master and pause recovery when we reach a record that's less than
> five minutes older than that most recent timestamp. This means we're
> comparing only master timestamps to each other. It does mean if the
> log transfer lags then recovery will lag unnecessarily. And it means
> the process of receiving logs is complicated by having to keep track
> of this timestamp.

How would we keep track of the most recent timestamp received from the
master without replaying the WAL records?

What happens if the connection to the master is lost?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: time-delayed standbys
Date: 2011-04-20 14:07:38
Message-ID: BANLkTikPdjn98yjSR-VYywyjMe64wdrdDQ@mail.gmail.com

On Wed, Apr 20, 2011 at 2:19 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> How would we keep track of the most recent timestamp received from the
> master without replaying the WAL records?

Well, as we receive them we would have to peek at them to see the
time. Or we could have the master send its time to the slave, as I
believe Tom proposed in the past but chose not to do? I haven't really
been following all the twists and turns of streaming replication and
sync replication. I read the e-mails but fell behind a few times, so
I'm not sure what the end result was sometimes.

> What happens if the connection to the master is lost?

We're writing these records to disk, no? I don't understand the
problem. It doesn't even matter if the records get lost and have to be
retrieved again, as long as we gate the replay of the records on a
time we know the master has seen at some point.

Oh, I think I see what you're getting at. We would never replay the
last five minutes if there's no connection. We definitely want to make
sure the admin is aware of the lag and can disable this feature before
bringing up the slave.

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: time-delayed standbys
Date: 2011-04-20 14:19:35
Message-ID: 26007.1303309175@sss.pgh.pa.us

Greg Stark <gsstark(at)mit(dot)edu> writes:
> On Wed, Apr 20, 2011 at 2:19 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> How would we keep track of the most recent timestamp received from the
>> master without replaying the WAL records?

> Well as we receive them we would have to peek at them to see the time.
> Or we could have the master send its time to the slave as I believe
> Tom has already proposed in the past but I believe didn't choose to
> do?

My idea of how to manage it would be to have walreceiver explicitly
track the clock difference from the master, which it can do since
walsender puts its current time into every message header. You can use
the slave's clock for comparisons so long as you add the appropriate
offset. (The theory here is that the skew won't change very fast ...)
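Tom's scheme could be sketched like this (hypothetical names, not walreceiver internals; the smoothing factor is an illustrative choice):

```python
class SkewTracker:
    """Estimate (master clock - standby clock) from the send timestamp
    that walsender puts in every message header."""

    def __init__(self):
        self.offset = 0.0  # estimated master minus standby, in seconds

    def observe(self, master_send_ts, local_recv_ts):
        # A single sample is polluted by network/processing lag, so
        # smooth over many messages rather than trusting any one of them.
        sample = master_send_ts - local_recv_ts
        self.offset = 0.9 * self.offset + 0.1 * sample

    def master_now(self, local_now):
        # The standby's clock plus the estimated offset approximates the
        # master's current clock, even with no message in flight.
        return local_now + self.offset


def ok_to_replay(record_master_ts, local_now, tracker, delay_secs):
    # Delay replay against the (offset-corrected) standby clock.
    return tracker.master_now(local_now) >= record_master_ts + delay_secs
```

This keeps replay moving even when no messages are arriving, which is the point of involving the standby's clock at all.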

regards, tom lane


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: time-delayed standbys
Date: 2011-04-20 14:25:09
Message-ID: BANLkTimTm48ZKYrj4jGDCYuqWBUqZb+q2Q@mail.gmail.com

On Wed, Apr 20, 2011 at 3:19 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> My idea of how to manage it would be to have walreceiver explicitly
> track the clock difference from the master, which it can do since
> walsender puts its current time into every message header.  You can use
> the slave's clock for comparisons so long as you add the appropriate
> offset.  (The theory here is that the skew won't change very fast ...)
>

Ah, so we did put the master's clock in every message? Then this
should be simple, no? Just compare the master's timestamp from the
record to the last master's clock seen in the messages. That sounds
equivalent but a lot safer than trying to keep a conversion between
them.

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: time-delayed standbys
Date: 2011-04-20 14:41:31
Message-ID: 26404.1303310491@sss.pgh.pa.us

Greg Stark <gsstark(at)mit(dot)edu> writes:
> Ah, so we did put the master's clock in every message?

Yes, we did.

> Then this
> should be simple, no? Just compare the master's timestamp from the
> record to the last master's clock seen in the messages. That sounds
> equivalent but a lot safer than trying to keep a conversion between
> them.

Well, the question is what happens after you stop receiving master
messages. If you don't make use of the slave's clock somehow,
application of WAL will stop dead in the water, which seems unlikely
to be what's wanted.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: time-delayed standbys
Date: 2011-04-20 15:05:43
Message-ID: BANLkTikqGottQvyBzrXsARkaWMB2OEMGEA@mail.gmail.com

On Wed, Apr 20, 2011 at 10:19 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Greg Stark <gsstark(at)mit(dot)edu> writes:
>> On Wed, Apr 20, 2011 at 2:19 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> How would we keep track of the most recent timestamp received from the
>>> master without replaying the WAL records?
>
>> Well as we receive them we would have to peek at them to see the time.
>> Or we could have the master send its time to the slave as I believe
>> Tom has already proposed in the past but I believe didn't choose to
>> do?
>
> My idea of how to manage it would be to have walreceiver explicitly
> track the clock difference from the master, which it can do since
> walsender puts its current time into every message header.  You can use
> the slave's clock for comparisons so long as you add the appropriate
> offset.  (The theory here is that the skew won't change very fast ...)

I am a bit concerned about the reliability of this approach. If there
is some network lag, or some lag in processing from the master, we
could easily get the idea that there is time skew between the machines
when there really isn't. And our perception of the time skew could
easily bounce around from message to message, as the lag varies. I
think it would be tremendously ironic if the two machines were
actually synchronized to the microsecond, but by trying to be clever
about it we managed to make the lag-time accurate only to within
several seconds.

It looks like the MySQL version of this feature is spec'd to have some
sort of time skew compensation built into it, but I don't see anything
that explains how it actually works.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: time-delayed standbys
Date: 2011-04-20 15:15:06
Message-ID: 27149.1303312506@sss.pgh.pa.us

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> I am a bit concerned about the reliability of this approach. If there
> is some network lag, or some lag in processing from the master, we
> could easily get the idea that there is time skew between the machines
> when there really isn't. And our perception of the time skew could
> easily bounce around from message to message, as the lag varies. I
> think it would be tremendously ironic if the two machines were
> actually synchronized to the microsecond, but by trying to be clever
> about it we managed to make the lag-time accurate only to within
> several seconds.

Well, if walreceiver concludes that there is no more than a few seconds'
difference between the clocks, it'd probably be OK to take the master
timestamps at face value. The problem comes when the skew gets large
(compared to the configured time delay, I guess).

regards, tom lane


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: time-delayed standbys
Date: 2011-04-20 15:52:22
Message-ID: BANLkTimQG+58K2g04benfc4OLW=AhJo2cw@mail.gmail.com

On Wed, Apr 20, 2011 at 3:41 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Greg Stark <gsstark(at)mit(dot)edu> writes:
>> Ah, so we did put the master's clock in every message?
>
> Yes, we did.

And by "we" I mean "you".... I realize I'm tossing in comments from
the peanut gallery to you and especially Robert who worked on this
stuff a lot already.

>> Then this
>> should be simple, no? Just compare the master's timestamp from the
>> record to the last master's clock seen in the messages. That sounds
>> equivalent but a lot safer than trying to keep a conversion between
>> them.
>
> Well, the question is what happens after you stop receiving master
> messages.  If you don't make use of the slave's clock somehow,
> application of WAL will stop dead in the water, which seems unlikely
> to be what's wanted.

I'm not convinced that's so bad. But even so the logic could be:

wait until (master.last_time_seen > this_record.master_timestamp + n minutes ||
            gettimeofday() > this_record.local_time_when_received + n minutes)
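That condition, sketched out with hypothetical field names (Python for concreteness):

```python
import time

def ok_to_replay(record, last_master_ts_seen, delay_secs, now=None):
    """Replay once EITHER the master is known to have advanced delay_secs
    past the record's master timestamp, OR the record has sat on the
    standby for delay_secs of local wall-clock time. The second arm keeps
    replay moving if the connection to the master is lost."""
    if now is None:
        now = time.time()
    return (last_master_ts_seen > record["master_ts"] + delay_secs
            or now > record["received_at"] + delay_secs)
```

The local-clock arm only ever shortens the wait, so clock skew can make the delay too short after a disconnect, but it can never stall replay forever.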

--
greg


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: time-delayed standbys
Date: 2011-04-21 03:18:42
Message-ID: BANLkTin=22tyhK5zaNm7Nt-Q+MHcsNHdLg@mail.gmail.com

On Wed, Apr 20, 2011 at 11:15 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> I am a bit concerned about the reliability of this approach.  If there
>> is some network lag, or some lag in processing from the master, we
>> could easily get the idea that there is time skew between the machines
>> when there really isn't.  And our perception of the time skew could
>> easily bounce around from message to message, as the lag varies.  I
>> think it would be tremendously ironic if the two machines were
>> actually synchronized to the microsecond, but by trying to be clever
>> about it we managed to make the lag-time accurate only to within
>> several seconds.
>
> Well, if walreceiver concludes that there is no more than a few seconds'
> difference between the clocks, it'd probably be OK to take the master
> timestamps at face value.  The problem comes when the skew gets large
> (compared to the configured time delay, I guess).

I suppose. Any bound on how much lag there can be before we start
applying the skew correction is going to be fairly arbitrary.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Jaime Casanova <jaime(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: time-delayed standbys
Date: 2011-04-24 01:46:54
Message-ID: BANLkTikHcD8H=iS=xZCTnyNAYMbDu3Tj1Q@mail.gmail.com

On Tue, Apr 19, 2011 at 9:47 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> That is, a standby configured such that replay lags a prescribed
> amount of time behind the master.
>
> This seemed easy to implement, so I did.  Patch (for 9.2, obviously) attached.
>

This crashes when stopping recovery to a target (I tried with a named
restore point and with a point in time) after executing
pg_xlog_replay_resume(). Here is the backtrace. I will try to check
later, but I wanted to report it first...

#0 0xb7777537 in raise () from /lib/libc.so.6
#1 0xb777a922 in abort () from /lib/libc.so.6
#2 0x08393a19 in errfinish (dummy=0) at elog.c:513
#3 0x083944ba in elog_finish (elevel=22, fmt=0x83d5221 "wal receiver
still active") at elog.c:1156
#4 0x080f04cb in StartupXLOG () at xlog.c:6691
#5 0x080f2825 in StartupProcessMain () at xlog.c:10050
#6 0x0811468f in AuxiliaryProcessMain (argc=2, argv=0xbfa326a8) at
bootstrap.c:417
#7 0x0827c2ea in StartChildProcess (type=StartupProcess) at postmaster.c:4488
#8 0x08280b85 in PostmasterMain (argc=3, argv=0xa4c17e8) at postmaster.c:1106
#9 0x0821730f in main (argc=3, argv=0xa4c17e8) at main.c:199

--
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: PostgreSQL support and training


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jaime Casanova <jaime(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: time-delayed standbys
Date: 2011-05-07 13:48:06
Message-ID: BANLkTinfrgsVK_8o9+mw6kWRcC4BxiR4jw@mail.gmail.com

On Sat, Apr 23, 2011 at 9:46 PM, Jaime Casanova <jaime(at)2ndquadrant(dot)com> wrote:
> On Tue, Apr 19, 2011 at 9:47 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>
>> That is, a standby configured such that replay lags a prescribed
>> amount of time behind the master.
>>
>> This seemed easy to implement, so I did.  Patch (for 9.2, obviously) attached.
>>
>
> This crashes when stopping recovery to a target (I tried with a named
> restore point and with a point in time) after executing
> pg_xlog_replay_resume(). Here is the backtrace. I will try to check
> later, but I wanted to report it first...
>
> #0  0xb7777537 in raise () from /lib/libc.so.6
> #1  0xb777a922 in abort () from /lib/libc.so.6
> #2  0x08393a19 in errfinish (dummy=0) at elog.c:513
> #3  0x083944ba in elog_finish (elevel=22, fmt=0x83d5221 "wal receiver
> still active") at elog.c:1156
> #4  0x080f04cb in StartupXLOG () at xlog.c:6691
> #5  0x080f2825 in StartupProcessMain () at xlog.c:10050
> #6  0x0811468f in AuxiliaryProcessMain (argc=2, argv=0xbfa326a8) at
> bootstrap.c:417
> #7  0x0827c2ea in StartChildProcess (type=StartupProcess) at postmaster.c:4488
> #8  0x08280b85 in PostmasterMain (argc=3, argv=0xa4c17e8) at postmaster.c:1106
> #9  0x0821730f in main (argc=3, argv=0xa4c17e8) at main.c:199

Sorry for the slow response on this - I was on vacation for a week and
my schedule got a big hole in it.

I was able to reproduce something very like this in unpatched master,
just by letting recovery pause at a named restore point, and then
resuming it.

LOG: recovery stopping at restore point "stop", time 2011-05-07
09:28:01.652958-04
LOG: recovery has paused
HINT: Execute pg_xlog_replay_resume() to continue.
(at this point I did pg_xlog_replay_resume())
LOG: redo done at 0/5000020
PANIC: wal receiver still active
LOG: startup process (PID 38762) was terminated by signal 6: Abort trap
LOG: terminating any other active server processes

I'm thinking that this code is wrong:

    if (recoveryPauseAtTarget && standbyState == STANDBY_SNAPSHOT_READY)
    {
        SetRecoveryPause(true);
        recoveryPausesHere();
    }
    reachedStopPoint = true;        /* see below */
    recoveryContinue = false;

I think the recoveryContinue = false assignment should not happen if
we decide to pause. That is, we should say:

    if (recoveryPauseAtTarget && standbyState == STANDBY_SNAPSHOT_READY)
    {
        /* same as now */
    }
    else
        recoveryContinue = false;

I haven't tested that, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jaime Casanova <jaime(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: time-delayed standbys
Date: 2011-05-11 05:29:48
Message-ID: BANLkTimDUswEE5nAjr31DQ=6GxRPU758kQ@mail.gmail.com

On Sat, May 7, 2011 at 10:48 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I was able to reproduce something very like this in unpatched master,
> just by letting recovery pause at a named restore point, and then
> resuming it.

I was able to reproduce the same problem even in 9.0. When the standby
reaches the recovery target, it always tries to end the recovery even
though walreceiver is still running, which causes the problem. This seems
to be an oversight in streaming replication. I should have considered how
the standby should work when recovery_target is specified.

What about the attached patch? It stops walreceiver instead of
emitting a PANIC there, but only if we've reached the recovery target.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment Content-Type Size
recovery_target_v1.patch application/octet-stream 1.1 KB

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jaime Casanova <jaime(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: time-delayed standbys
Date: 2011-05-11 09:27:02
Message-ID: 4DCA5666.50506@enterprisedb.com

On 07.05.2011 16:48, Robert Haas wrote:
> I was able to reproduce something very like this in unpatched master,
> just by letting recovery pause at a named restore point, and then
> resuming it.
>
> LOG: recovery stopping at restore point "stop", time 2011-05-07
> 09:28:01.652958-04
> LOG: recovery has paused
> HINT: Execute pg_xlog_replay_resume() to continue.
> (at this point I did pg_xlog_replay_resume())
> LOG: redo done at 0/5000020
> PANIC: wal receiver still active
> LOG: startup process (PID 38762) was terminated by signal 6: Abort trap
> LOG: terminating any other active server processes
>
> I'm thinking that this code is wrong:
>
>     if (recoveryPauseAtTarget && standbyState == STANDBY_SNAPSHOT_READY)
>     {
>         SetRecoveryPause(true);
>         recoveryPausesHere();
>     }
>     reachedStopPoint = true;        /* see below */
>     recoveryContinue = false;
>
> I think that recoveryContinue = false assignment should not happen if
> we decide to pause. That is, we should say if (recoveryPauseAtTarget
> && standbyState == STANDBY_SNAPSHOT_READY) { same as now } else
> recoveryContinue = false.

No, recovery stops at that point whether or not you pause. Resuming
after stopping at the recovery target doesn't mean that you resume
recovery; it means that you proceed to end recovery and start up the
server (see the second-to-last paragraph at
http://www.postgresql.org/docs/9.1/static/recovery-target-settings.html).
It would probably be more useful to allow a new stopping target to be
set and continue recovery, but the current pause/resume functions
don't allow that.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jaime Casanova <jaime(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: time-delayed standbys
Date: 2011-05-11 11:16:11
Message-ID: BANLkTin4bCFrFX-n8JVG7QfzdYhVDBWnJQ@mail.gmail.com

On Wed, May 11, 2011 at 6:50 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> I think we can just always call ShutdownWalRcv(). It should be gone if the
> server was promoted while streaming, but that's just an implementation
> detail of what the promotion code does. There's no hard reason why it
> shouldn't be running at that point anymore, as long as we kill it before
> going any further.

Okay. But I'd like to add the following assertion check just before
the ShutdownWalRcv() call you added, in order to catch this kind of
bug, i.e., one that causes an unexpected end of recovery. Thoughts?

Assert(reachedStopPoint || !WalRcvInProgress());

> Committed a patch to do that.

Thanks. Should we backport it to 9.0? 9.0 has the same problem.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jaime Casanova <jaime(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: time-delayed standbys
Date: 2011-05-11 11:53:41
Message-ID: 4DCA78C5.4040803@enterprisedb.com

On 11.05.2011 14:16, Fujii Masao wrote:
> On Wed, May 11, 2011 at 6:50 PM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> I think we can just always call ShutdownWalRcv(). It should be gone if the
>> server was promoted while streaming, but that's just an implementation
>> detail of what the promotion code does. There's no hard reason why it
>> shouldn't be running at that point anymore, as long as we kill it before
>> going any further.
>
> Okay. But I'd like to add the following assertion check just before
> ShutdownWalRcv() which you added, in order to detect such a bug
> that we found this time, i.e., the bug which causes unexpected end
> of recovery. Thought?
>
> Assert(reachedStopPoint || !WalRcvInProgress())

There's no unexpected end of recovery here. The recovery ends when we
reach the target, as it should. It was the assumption that WAL receiver
can't be running at that point anymore that was wrong.

That assertion would work, AFAICS, but I don't think it's something we
need to assert. There isn't any harm done if WAL receiver is still
running, as long as we shut it down at that point.

>> Committed a patch to do that.
>
> Thanks. Should we backport it to 9.0? 9.0 has the same problem.

Ah, thanks, I missed that. Cherry-picked to 9.0 now as well.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: time-delayed standbys
Date: 2011-06-15 05:58:43
Message-ID: BANLkTikFhVOhmf_XKERe=MkU3Ht_K7posQ@mail.gmail.com

On Thu, Apr 21, 2011 at 12:18 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, Apr 20, 2011 at 11:15 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>>> I am a bit concerned about the reliability of this approach.  If there
>>> is some network lag, or some lag in processing from the master, we
>>> could easily get the idea that there is time skew between the machines
>>> when there really isn't.  And our perception of the time skew could
>>> easily bounce around from message to message, as the lag varies.  I
>>> think it would be tremendously ironic if the two machines were
>>> actually synchronized to the microsecond, but by trying to be clever
>>> about it we managed to make the lag-time accurate only to within
>>> several seconds.
>>
>> Well, if walreceiver concludes that there is no more than a few seconds'
>> difference between the clocks, it'd probably be OK to take the master
>> timestamps at face value.  The problem comes when the skew gets large
>> (compared to the configured time delay, I guess).
>
> I suppose.  Any bound on how much lag there can be before we start
> applying the skew correction is going to be fairly arbitrary.

When the replication connection is terminated, the standby tries to read
WAL files from the archive. In this case, there is no walreceiver process,
so how does the standby calculate the clock difference?

> errmsg("parameter \"%s\" requires a temporal value", "recovery_time_delay"),

Should we s/"a temporal"/"an integer"/?

After we run "pg_ctl promote", shouldn't time-delayed replication be disabled?
Otherwise, failover might take a very long time when recovery_time_delay is
set to a high value.

http://forge.mysql.com/worklog/task.php?id=344
According to the above page, one purpose of time-delayed replication is to
protect against user mistakes on the master. But when a user notices his wrong
operation on the master, what should he do next? The WAL records of that wrong
operation might have already arrived at the standby, so neither "promote" nor
"restart" cancels it. Instead, he should probably shut down the standby,
investigate the timestamp of the XID of the operation he'd like to cancel,
set recovery_target_time, and restart the standby. Should a procedure like
this be documented? Or should we implement a new "promote" mode which finishes
recovery as soon as "promote" is requested (i.e., does not replay all the
available WAL records)?
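Concretely, the shutdown-and-rewind procedure above would amount to a
recovery.conf along these lines; the archive path and timestamp are
placeholders, not values from the patch:

```
# recovery.conf on the restarted standby (values are placeholders)
restore_command = 'cp /path/to/archive/%f %p'
# a timestamp just before the mistaken operation's commit
recovery_target_time = '2011-06-15 09:00:00'
```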

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Jaime Casanova <jaime(at)2ndquadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: time-delayed standbys
Date: 2011-06-15 06:14:42
Message-ID: BANLkTi=OuMtTXwj9j9iVasb6Ee+EnSHaVQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jun 15, 2011 at 12:58 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>
> http://forge.mysql.com/worklog/task.php?id=344
> According to the above page, one purpose of time-delayed replication is to
> protect against user mistakes on master. But, when an user notices his wrong
> operation on master, what should he do next? The WAL records of his wrong
> operation might have already arrived at the standby, so neither "promote" nor
> "restart" doesn't cancel that wrong operation. Instead, probably he should
> shutdown the standby, investigate the timestamp of XID of the operation
> he'd like to cancel, set recovery_target_time and restart the standby.
> Something like this procedures should be documented? Or, we should
> implement new "promote" mode which finishes a recovery as soon as
> "promote" is requested (i.e., not replay all the available WAL records)?
>

I would prefer something like "pg_ctl promote -m immediate" that
terminates the recovery.

--
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: time-delayed standbys
Date: 2011-06-16 18:29:43
Message-ID: BANLkTi=COgk+8cieWxhM7YN9guYXOrDb_g@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jun 15, 2011 at 1:58 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> When the replication connection is terminated, the standby tries to read
> WAL files from the archive. In this case, there is no walreceiver process,
> so how does the standby calculate the clock difference?

Good question. Also, just because we have streaming replication
available doesn't mean that we should force people to use it. It's
still perfectly legitimate to set up a standby that uses only
archive_command and restore_command, and it would be nice if this
feature could still work in such an environment. I anticipate that
most people want to use streaming replication, but a time-delayed
standby is a good example of a case where you might decide you don't
need it. It could be useful to have all the WAL present (but not yet
applied) if you're thinking you might want to promote that standby -
but my guess is that in many cases, the time-delayed standby will be
*in addition* to one or more regular standbys that would be the
primary promotion candidates. So I can see someone deciding that
they'd rather not have the load of another walsender on the master,
and just let the time-delayed standby read from the archive.

Even if that were not an issue, I'm still more or less of the opinion
that trying to solve the time synchronization problem is a rathole
anyway. To really solve this problem well, you're going to need the
standby to send a message containing a timestamp, get a reply back
from the master that contains that timestamp and a master timestamp,
and then compute based on those two timestamps plus the reply
timestamp the maximum and minimum possible lag between the two
machines. Then you're going to need to guess, based on several cycles
of this activity, what the actual lag is, and adjust it over time (but
not too quickly, unless of course a large manual step has occurred) as
the clocks potentially drift apart from each other. This is basically
what ntpd does, except that it can be virtually guaranteed that our
implementation will suck by comparison. Time synchronization is
neither easy nor our core competency, and I think trying to include it
in this feature is going to result in a net loss of reliability.

>> errmsg("parameter \"%s\" requires a temporal value", "recovery_time_delay"),
>
> We should s/"a temporal"/"an Integer"?

It seems strange to ask for an integer when what we want is an amount
of time in seconds or minutes...

> After we run "pg_ctl promote", time-delayed replication should be disabled?
> Otherwise, failover might take very long time when we set recovery_time_delay
> to high value.

Yeah, I think so.

> http://forge.mysql.com/worklog/task.php?id=344
> According to the above page, one purpose of time-delayed replication is to
> protect against user mistakes on master. But, when an user notices his wrong
> operation on master, what should he do next? The WAL records of his wrong
> operation might have already arrived at the standby, so neither "promote" nor
> "restart" doesn't cancel that wrong operation. Instead, probably he should
> shutdown the standby, investigate the timestamp of XID of the operation
> he'd like to cancel, set recovery_target_time and restart the standby.
> Something like this procedures should be documented? Or, we should
> implement new "promote" mode which finishes a recovery as soon as
> "promote" is requested (i.e., not replay all the available WAL records)?

I like the idea of a new promote mode; and documenting the other
approach you mention doesn't sound bad either. Either one sounds like
a job for a separate patch, though.

The other option is to pause recovery and run pg_dump...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: time-delayed standbys
Date: 2011-06-17 02:10:24
Message-ID: BANLkTikSyWhLtK2eUE=e0Sg0dcBRnjongw@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jun 17, 2011 at 3:29 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Even if that were not an issue, I'm still more or less of the opinion
> that trying to solve the time synchronization problem is a rathole
> anyway.  To really solve this problem well, you're going to need the
> standby to send a message containing a timestamp, get a reply back
> from the master that contains that timestamp and a master timestamp,
> and then compute based on those two timestamps plus the reply
> timestamp the maximum and minimum possible lag between the two
> machines.  Then you're going to need to guess, based on several cycles
> of this activity, what the actual lag is, and adjust it over time (but
> not too quckly, unless of course a large manual step has occurred) as
> the clocks potentially drift apart from each other.  This is basically
> what ntpd does, except that it can be virtually guaranteed that our
> implementation will suck by comparison.  Time synchronization is
> neither easy nor our core competency, and I think trying to include it
> in this feature is going to result in a net loss of reliability.

Agreed. You've already added the note about time synchronization into
the document. That's enough, I think.

>>> errmsg("parameter \"%s\" requires a temporal value", "recovery_time_delay"),
>>
>> We should s/"a temporal"/"an Integer"?
>
> It seems strange to ask for an integer when what we want is an amount
> of time in seconds or minutes...

OK.

>> http://forge.mysql.com/worklog/task.php?id=344
>> According to the above page, one purpose of time-delayed replication is to
>> protect against user mistakes on master. But, when an user notices his wrong
>> operation on master, what should he do next? The WAL records of his wrong
>> operation might have already arrived at the standby, so neither "promote" nor
>> "restart" doesn't cancel that wrong operation. Instead, probably he should
>> shutdown the standby, investigate the timestamp of XID of the operation
>> he'd like to cancel, set recovery_target_time and restart the standby.
>> Something like this procedures should be documented? Or, we should
>> implement new "promote" mode which finishes a recovery as soon as
>> "promote" is requested (i.e., not replay all the available WAL records)?
>
> I like the idea of a new promote mode;

Are you going to implement that mode in this CF? or next one?

> and documenting the other
> approach you mention doesn't sound bad either.  Either one sounds like
> a job for a separate patch, though.
>
> The other option is to pause recovery and run pg_dump...

Yes, please.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: time-delayed standbys
Date: 2011-06-17 02:34:17
Message-ID: BANLkTikuQGq-LVEOa9bne5Ne7HyW5UtunA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jun 16, 2011 at 10:10 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>>> According to the above page, one purpose of time-delayed replication is to
>>> protect against user mistakes on master. But, when an user notices his wrong
>>> operation on master, what should he do next? The WAL records of his wrong
>>> operation might have already arrived at the standby, so neither "promote" nor
>>> "restart" doesn't cancel that wrong operation. Instead, probably he should
>>> shutdown the standby, investigate the timestamp of XID of the operation
>>> he'd like to cancel, set recovery_target_time and restart the standby.
>>> Something like this procedures should be documented? Or, we should
>>> implement new "promote" mode which finishes a recovery as soon as
>>> "promote" is requested (i.e., not replay all the available WAL records)?
>>
>> I like the idea of a new promote mode;
>
> Are you going to implement that mode in this CF? or next one?

I wasn't really planning on it - I thought you might want to take a
crack at it. The feature is usable without that, just maybe a bit
less cool. Certainly, it's too late for any more formal submissions
to this CF, but I wouldn't mind reviewing a patch if you want to write
one.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: time-delayed standbys
Date: 2011-06-20 01:57:28
Message-ID: BANLkTinzm-kiO8vd1AZ9A0YkUdu-NV0OiQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jun 17, 2011 at 11:34 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Thu, Jun 16, 2011 at 10:10 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>>>> According to the above page, one purpose of time-delayed replication is to
>>>> protect against user mistakes on master. But, when an user notices his wrong
>>>> operation on master, what should he do next? The WAL records of his wrong
>>>> operation might have already arrived at the standby, so neither "promote" nor
>>>> "restart" doesn't cancel that wrong operation. Instead, probably he should
>>>> shutdown the standby, investigate the timestamp of XID of the operation
>>>> he'd like to cancel, set recovery_target_time and restart the standby.
>>>> Something like this procedures should be documented? Or, we should
>>>> implement new "promote" mode which finishes a recovery as soon as
>>>> "promote" is requested (i.e., not replay all the available WAL records)?
>>>
>>> I like the idea of a new promote mode;
>>
>> Are you going to implement that mode in this CF? or next one?
>
> I wasn't really planning on it - I thought you might want to take a
> crack at it.  The feature is usable without that, just maybe a bit
> less cool.

Right.

> Certainly, it's too late for any more formal submissions
> to this CF, but I wouldn't mind reviewing a patch if you want to write
> one.

Okay, I'll add that to my TODO list, but I might not have enough time
to develop it. So, everyone, please feel free to implement it if you want!

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: time-delayed standbys
Date: 2011-06-29 02:14:56
Message-ID: BANLkTimga2ysehLSH=us9mTOWBQw1tk+8Q@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jun 15, 2011 at 1:58 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> After we run "pg_ctl promote", time-delayed replication should be disabled?
> Otherwise, failover might take very long time when we set recovery_time_delay
> to high value.

PFA a patch that I believe will disable recovery_time_delay after
promotion. The only change from the previous version is:

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog
index 1dbf792..41b3ae9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5869,7 +5869,7 @@ pg_is_xlog_replay_paused(PG_FUNCTION_ARGS)
static void
recoveryDelay(void)
{
- while (1)
+ while (!CheckForStandbyTrigger())
{
long secs;
int microsecs;

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
time-delayed-standby-v2.patch application/octet-stream 6.0 KB

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: time-delayed standbys
Date: 2011-06-29 07:46:05
Message-ID: BANLkTikHVwbCRjGkuMCXQcJnW7RYVP0zAA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jun 15, 2011 at 6:58 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:

> Or, we should
> implement new "promote" mode which finishes a recovery as soon as
> "promote" is requested (i.e., not replay all the available WAL records)?

That's not a new feature. We had it in 8.4, but it was removed.

Originally, we supported "fast" failover via a trigger file.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: time-delayed standbys
Date: 2011-06-29 08:00:40
Message-ID: BANLkTingeD+XrxN-gDv0pzUMHrPcHWSF+A@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jun 16, 2011 at 7:29 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, Jun 15, 2011 at 1:58 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> When the replication connection is terminated, the standby tries to read
>> WAL files from the archive. In this case, there is no walreceiver process,
>> so how does the standby calculate the clock difference?
>
> Good question.  Also, just because we have streaming replication
> available doesn't mean that we should force people to use it.  It's
> still perfectly legit to set up a standby that only use
> archive_command and restore_command, and it would be nice if this
> feature could still work in such an environment.  I anticipate that
> most people want to use streaming replication, but a time-delayed
> standby is a good example of a case where you might decide you don't
> need it.  It could be useful to have all the WAL present (but not yet
> applied) if you're thinking you might want to promote that standby -
> but my guess is that in many cases, the time-delayed standby will be
> *in addition* to one or more regular standbys that would be the
> primary promotion candidates.  So I can see someone deciding that
> they'd rather not have the load of another walsender on the master,
> and just let the time-delayed standby read from the archive.
>
> Even if that were not an issue, I'm still more or less of the opinion
> that trying to solve the time synchronization problem is a rathole
> anyway.  To really solve this problem well, you're going to need the
> standby to send a message containing a timestamp, get a reply back
> from the master that contains that timestamp and a master timestamp,
> and then compute based on those two timestamps plus the reply
> timestamp the maximum and minimum possible lag between the two
> machines.  Then you're going to need to guess, based on several cycles
> of this activity, what the actual lag is, and adjust it over time (but
> not too quckly, unless of course a large manual step has occurred) as
> the clocks potentially drift apart from each other.  This is basically
> what ntpd does, except that it can be virtually guaranteed that our
> implementation will suck by comparison.  Time synchronization is
> neither easy nor our core competency, and I think trying to include it
> in this feature is going to result in a net loss of reliability.

This begs the question of why we need this feature at all, in the way proposed.

Streaming replication is designed for immediate transfer of WAL. File
based is more about storing them for some later use.

It seems strange to pollute the *immediate* transfer route with a
delay, when that is easily possible with a small patch to pg_standby
that can wait until the filetime delay is > X before returning.

The main practical problem with this is that most people's WAL
partitions aren't big enough to store the delayed WAL files, which is
why we provide the file archiving route anyway. So in practical terms
this will be unusable, or at least dangerous to use.

+1 for the feature concept, but -1 for adding this to streaming replication.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: time-delayed standbys
Date: 2011-06-29 12:24:31
Message-ID: BANLkTim8rKuUbDP=TxWmtfvZk8C1vmgmsw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jun 29, 2011 at 4:00 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Thu, Jun 16, 2011 at 7:29 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Wed, Jun 15, 2011 at 1:58 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>>> When the replication connection is terminated, the standby tries to read
>>> WAL files from the archive. In this case, there is no walreceiver process,
>>> so how does the standby calculate the clock difference?
>>
>> Good question.  Also, just because we have streaming replication
>> available doesn't mean that we should force people to use it.  It's
>> still perfectly legit to set up a standby that only use
>> archive_command and restore_command, and it would be nice if this
>> feature could still work in such an environment.  I anticipate that
>> most people want to use streaming replication, but a time-delayed
>> standby is a good example of a case where you might decide you don't
>> need it.  It could be useful to have all the WAL present (but not yet
>> applied) if you're thinking you might want to promote that standby -
>> but my guess is that in many cases, the time-delayed standby will be
>> *in addition* to one or more regular standbys that would be the
>> primary promotion candidates.  So I can see someone deciding that
>> they'd rather not have the load of another walsender on the master,
>> and just let the time-delayed standby read from the archive.
>>
>> Even if that were not an issue, I'm still more or less of the opinion
>> that trying to solve the time synchronization problem is a rathole
>> anyway.  To really solve this problem well, you're going to need the
>> standby to send a message containing a timestamp, get a reply back
>> from the master that contains that timestamp and a master timestamp,
>> and then compute based on those two timestamps plus the reply
>> timestamp the maximum and minimum possible lag between the two
>> machines.  Then you're going to need to guess, based on several cycles
>> of this activity, what the actual lag is, and adjust it over time (but
>> not too quckly, unless of course a large manual step has occurred) as
>> the clocks potentially drift apart from each other.  This is basically
>> what ntpd does, except that it can be virtually guaranteed that our
>> implementation will suck by comparison.  Time synchronization is
>> neither easy nor our core competency, and I think trying to include it
>> in this feature is going to result in a net loss of reliability.
>
>
> This begs the question of why we need this feature at all, in the way proposed.
>
> Streaming replication is designed for immediate transfer of WAL. File
> based is more about storing them for some later use.
>
> It seems strange to pollute the *immediate* transfer route with a
> delay, when that is easily possible with a small patch to pg_standby
> that can wait until the filetime delay is > X before returning.
>
> The main practical problem with this is that most people's WAL
> partitions aren't big enough to store the delayed WAL files, which is
> why we provide the file archiving route anyway. So in practical terms
> this will be unusable, or at least dangerous to use.
>
> +1 for the feature concept, but -1 for adding this to streaming replication.

As implemented, the feature will work with either streaming
replication or with file-based replication. I don't see any value in
restricting it to work ONLY with file-based replication.

Also, if we were to do it by making pg_standby wait, then the whole
thing would be much less accurate, and the delay would become much
harder to predict, because you'd be operating on the level of entire
WAL segments, rather than individual commit records.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: time-delayed standbys
Date: 2011-06-29 17:50:22
Message-ID: BANLkTi=ufQk9i2Ria1U-c0VDUMUNxKjPgg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jun 29, 2011 at 1:24 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, Jun 29, 2011 at 4:00 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> On Thu, Jun 16, 2011 at 7:29 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> On Wed, Jun 15, 2011 at 1:58 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>>>> When the replication connection is terminated, the standby tries to read
>>>> WAL files from the archive. In this case, there is no walreceiver process,
>>>> so how does the standby calculate the clock difference?
>>>
>>> Good question.  Also, just because we have streaming replication
>>> available doesn't mean that we should force people to use it.  It's
>>> still perfectly legit to set up a standby that only use
>>> archive_command and restore_command, and it would be nice if this
>>> feature could still work in such an environment.  I anticipate that
>>> most people want to use streaming replication, but a time-delayed
>>> standby is a good example of a case where you might decide you don't
>>> need it.  It could be useful to have all the WAL present (but not yet
>>> applied) if you're thinking you might want to promote that standby -
>>> but my guess is that in many cases, the time-delayed standby will be
>>> *in addition* to one or more regular standbys that would be the
>>> primary promotion candidates.  So I can see someone deciding that
>>> they'd rather not have the load of another walsender on the master,
>>> and just let the time-delayed standby read from the archive.
>>>
>>> Even if that were not an issue, I'm still more or less of the opinion
>>> that trying to solve the time synchronization problem is a rathole
>>> anyway.  To really solve this problem well, you're going to need the
>>> standby to send a message containing a timestamp, get a reply back
>>> from the master that contains that timestamp and a master timestamp,
>>> and then compute based on those two timestamps plus the reply
>>> timestamp the maximum and minimum possible lag between the two
>>> machines.  Then you're going to need to guess, based on several cycles
>>> of this activity, what the actual lag is, and adjust it over time (but
>>> not too quckly, unless of course a large manual step has occurred) as
>>> the clocks potentially drift apart from each other.  This is basically
>>> what ntpd does, except that it can be virtually guaranteed that our
>>> implementation will suck by comparison.  Time synchronization is
>>> neither easy nor our core competency, and I think trying to include it
>>> in this feature is going to result in a net loss of reliability.
>>
>>
>> This begs the question of why we need this feature at all, in the way proposed.
>>
>> Streaming replication is designed for immediate transfer of WAL. File
>> based is more about storing them for some later use.
>>
>> It seems strange to pollute the *immediate* transfer route with a
>> delay, when that is easily possible with a small patch to pg_standby
>> that can wait until the filetime delay is > X before returning.
>>
>> The main practical problem with this is that most people's WAL
>> partitions aren't big enough to store the delayed WAL files, which is
>> why we provide the file archiving route anyway. So in practical terms
>> this will be unusable, or at least dangerous to use.
>>
>> +1 for the feature concept, but -1 for adding this to streaming replication.
>
> As implemented, the feature will work with either streaming
> replication or with file-based replication.

That sounds like the exact opposite of your and Fujii's comments
above. Please explain.

> I don't see any value in
> restricting to work ONLY with file-based replication.

As explained above, it won't work in practice because of the amount of
file space required.

Or, an alternative question: what will you do when it waits so long
that the standby runs out of disk space?

If you hard-enforce the specified time delay then you just make
replication fail during heavy loads.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: time-delayed standbys
Date: 2011-06-29 18:11:34
Message-ID: BANLkTi=33oiYV36cM_nOW7_Pi3t4E=P32g@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jun 29, 2011 at 1:50 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> As implemented, the feature will work with either streaming
>> replication or with file-based replication.
>
> That sounds like the exact opposite of yours and Fujii's comments
> above. Please explain.

I think our comments above were addressing the issue of whether it's
feasible to correct for time skew between the master and the slave.
Tom was arguing that we should try, but I was arguing that any system
we put together is likely to be pretty unreliable (since good time
synchronization algorithms are quite complex, and to my knowledge no
one here is an expert on implementing them, nor do I think we want
that much complexity in the backend) and Fujii was pointing out that
it won't work at all if the WAL files are going through the archive
rather than through streaming replication, which (if I understand you
correctly) will be a more common case than I had assumed.

>> I don't see any value in
>> restricting to work ONLY with file-based replication.
>
> As explained above, it won't work in practice because of the amount of
> file space required.

I guess it depends on how busy your system is and how much disk space
you have. If using streaming replication causes pg_xlog to fill up on
your standby, then you can either (1) put pg_xlog on a larger file
system or (2) configure only restore_command and not primary_conninfo,
so that only the archive is used.
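Option (2) is just a recovery.conf with no connection string; a sketch
under the 9.x conventions, with a placeholder archive path and with the
exact syntax of the proposed recovery_time_delay parameter deferred to
the patch:

```
# recovery.conf for an archive-only time-delayed standby
standby_mode = 'on'
restore_command = 'cp /path/to/archive/%f %p'
# no primary_conninfo, so no extra walsender load on the master
```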

> Or, an alternative question: what will you do when it waits so long
> that the standby runs out of disk space?

I don't really see how that's any different from what happens now. If
(for whatever reason) the master is generating WAL faster than a
streaming standby can replay it, then the excess WAL is going to pile
up someplace, and you might run out of disk space. Time-delaying the
standby creates an additional way for that to happen, but I don't
think it's an entirely new problem.

I am not sure exactly how walreceiver handles it if the disk is full.
I assume it craps out and eventually retries, so probably what will
happen is that, after the standby's pg_xlog directory fills up,
walreceiver will sit there and error out until replay advances enough
to remove a WAL file and thus permit some more data to be streamed.
If the standby gets far enough behind the master that the required
files are no longer there, then it will switch to the archive, if
available. It might be nice to have a mode that only allows streaming
replication when the amount of disk space on the standby is greater
than or equal to some threshold, but that seems like a topic for
another patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: time-delayed standbys
Date: 2011-06-30 01:54:19
Message-ID: 4E0BD74B.7020105@agliodbs.com
Lists: pgsql-hackers

Robert,

> I don't really see how that's any different from what happens now. If
> (for whatever reason) the master is generating WAL faster than a
> streaming standby can replay it, then the excess WAL is going to pile
> up someplace, and you might run out of disk space. Time-delaying the
> standby creates an additional way for that to happen, but I don't
> think it's an entirely new problem.

Not remotely new. A full xlog partition currently accounts for 75% of the
emergency support calls PGX gets from clients on 9.0 (if only they'd pay
attention to their Nagios alerts!)

> I am not sure exactly how walreceiver handles it if the disk is full.
> I assume it craps out and eventually retries, so probably what will
> happen is that, after the standby's pg_xlog directory fills up,
> walreceiver will sit there and error out until replay advances enough
> to remove a WAL file and thus permit some more data to be streamed.

Nope, it gets stuck and stops there. Replay doesn't advance unless you
can somehow clear out some space manually; if the disk is full, the disk
is full, and PostgreSQL doesn't remove WAL files without being able to
write files first.

Manual (or scripted) intervention is always necessary if you reach disk
100% full.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: time-delayed standbys
Date: 2011-06-30 01:55:53
Message-ID: 4E0BD7A9.2050307@agliodbs.com
Lists: pgsql-hackers

On 6/29/11 11:11 AM, Robert Haas wrote:
> If the standby gets far enough behind the master that the required
> files are no longer there, then it will switch to the archive, if
> available.

One more thing:

As I understand it (and my testing shows this), the standby *prefers*
the archive logs, and won't switch to streaming until it reaches the end
of the archive logs. This is desirable behavior, as it minimizes the
load on the master.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: time-delayed standbys
Date: 2011-06-30 01:56:41
Message-ID: BANLkTimntX0B4ao_YKZemo7VtF8C3vy3-g@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jun 29, 2011 at 9:54 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>> I am not sure exactly how walreceiver handles it if the disk is full.
>> I assume it craps out and eventually retries, so probably what will
>> happen is that, after the standby's pg_xlog directory fills up,
>> walreceiver will sit there and error out until replay advances enough
>> to remove a WAL file and thus permit some more data to be streamed.
>
> Nope, it gets stuck and stops there.  Replay doesn't advance unless you
> can somehow clear out some space manually; if the disk is full, the disk
> is full, and PostgreSQL doesn't remove WAL files without being able to
> write files first.
>
> Manual (or scripted) intervention is always necessary if you reach disk
> 100% full.

Wow, that's a pretty crappy failure mode... but I don't think we need
to fix it just on account of this patch. It would be nice to fix, of
course.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: time-delayed standbys
Date: 2011-06-30 03:14:17
Message-ID: BANLkTinK04O8MZ7K8U8tguT1Oc6UjGYEQw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jun 29, 2011 at 11:14 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, Jun 15, 2011 at 1:58 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> After we run "pg_ctl promote", time-delayed replication should be disabled?
>> Otherwise, failover might take very long time when we set recovery_time_delay
>> to high value.
>
> PFA a patch that I believe will disable recovery_time_delay after
> promotion.  The only change from the previous version is:
>
> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog
> index 1dbf792..41b3ae9 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -5869,7 +5869,7 @@ pg_is_xlog_replay_paused(PG_FUNCTION_ARGS)
>  static void
>  recoveryDelay(void)
>  {
> -       while (1)
> +       while (!CheckForStandbyTrigger())
>        {
>                long    secs;
>                int             microsecs;

Thanks for updating the patch! I have a few comments:

ISTM recoveryDelayUntilTime needs to be calculated also when replaying
the compact commit WAL record (i.e., record_info == XLOG_XACT_COMMIT_COMPACT).

When the user uses only two-phase commit on the master, ISTM he or she
cannot use this feature, because recoveryDelayUntilTime is never set in
that case. Is this intentional?

Shouldn't we also disable this feature after recovery reaches the stop
point (specified in recovery_target_xxx)?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: time-delayed standbys
Date: 2011-06-30 03:23:58
Message-ID: BANLkTimrDU3zuRU5KEHbYWf7we3qVttEoQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jun 30, 2011 at 10:56 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> Nope, it gets stuck and stops there.  Replay doesn't advance unless you
>> can somehow clear out some space manually; if the disk is full, the disk
>> is full, and PostgreSQL doesn't remove WAL files without being able to
>> write files first.
>>
>> Manual (or scripted) intervention is always necessary if you reach disk
>> 100% full.
>
> Wow, that's a pretty crappy failure mode... but I don't think we need
> to fix it just on account of this patch.  It would be nice to fix, of
> course.

Yeah, we need to fix that as a separate patch. The difficult point is that
we cannot delete WAL files until we replay a checkpoint record and a
restartpoint occurs. But if the disk is full, there is no space left to
receive that checkpoint record, so we can never free the space needed to
make progress.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: time-delayed standbys
Date: 2011-06-30 03:27:59
Message-ID: BANLkTi=8jm4hniKMauE4z9VYDrtYGd5sAw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jun 30, 2011 at 12:14 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> We should disable this feature also after recovery reaches the stop
> point (specified in recovery_target_xxx)?

Another comment: it would be very helpful to document the behavior of a
delayed standby when it is promoted or after it reaches the stop point.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: time-delayed standbys
Date: 2011-06-30 09:00:21
Message-ID: BANLkTi=nvWefCymUz8_mT_UK16trw-si6Q@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jun 30, 2011 at 2:56 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, Jun 29, 2011 at 9:54 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>>> I am not sure exactly how walreceiver handles it if the disk is full.
>>> I assume it craps out and eventually retries, so probably what will
>>> happen is that, after the standby's pg_xlog directory fills up,
>>> walreceiver will sit there and error out until replay advances enough
>>> to remove a WAL file and thus permit some more data to be streamed.
>>
>> Nope, it gets stuck and stops there.  Replay doesn't advance unless you
>> can somehow clear out some space manually; if the disk is full, the disk
>> is full, and PostgreSQL doesn't remove WAL files without being able to
>> write files first.
>>
>> Manual (or scripted) intervention is always necessary if you reach disk
>> 100% full.
>
> Wow, that's a pretty crappy failure mode... but I don't think we need
> to fix it just on account of this patch.  It would be nice to fix, of
> course.

How is that different to running out of space in the main database?

If I try to pour a pint of milk into a small cup, I don't blame the cup.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: time-delayed standbys
Date: 2011-06-30 10:45:56
Message-ID: BANLkTinSWseoHAYp7u+-3RHjQekd=5M2oA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jun 29, 2011 at 7:11 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> I don't really see how that's any different from what happens now.  If
> (for whatever reason) the master is generating WAL faster than a
> streaming standby can replay it, then the excess WAL is going to pile
> up someplace, and you might run out of disk space.   Time-delaying the
> standby creates an additional way for that to happen, but I don't
> think it's an entirely new problem.

The only way to control this is with a time delay that can be changed
while the server is running. A recovery.conf parameter doesn't allow
that, so another way is preferable.

I think the time problems are more complex than stated. The patch relies
upon transaction completion times, but not all WAL records have a time
attached to them. Plus you only used commits anyway, not sure why.
Some actions aren't even transactional, such as DROP DATABASE, amongst
others. Consecutive records can be hours apart, so it would be
possible to delay on some WAL records but then replay records that
happened minutes ago, then wait hours for the next apply. So this
patch doesn't do what it claims in all cases.

Similar discussion on max_standby_delay covered exactly that ground
and went on for weeks in 9.0. IIRC I presented the same case you just
did and we agreed in the end that was not acceptable. I'm not going to
repeat it. Please check the archives.

So, again +1 for the feature, but -1 for the currently proposed
implementation, based upon review.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: time-delayed standbys
Date: 2011-06-30 17:00:51
Message-ID: 4E0CABC3.7000101@agliodbs.com
Lists: pgsql-hackers

On 6/30/11 2:00 AM, Simon Riggs wrote:
>>> Manual (or scripted) intervention is always necessary if you reach disk
>>> >> 100% full.
>> >
>> > Wow, that's a pretty crappy failure mode... but I don't think we need
>> > to fix it just on account of this patch. It would be nice to fix, of
>> > course.
> How is that different to running out of space in the main database?
>
> If I try to pour a pint of milk into a small cup, I don't blame the cup.

I have to agree with Simon here. ;-)

We can do some things to make this easier for administrators, but
there's no way to "solve" the problem. And the things we could do would
have to be advanced optional modes which aren't on by default, so they
wouldn't really help the DBA with poor planning skills.  Here are my
suggestions:

1) Have a utility (pg_archivecleanup?) which checks whether we have more
than a specific setting's worth of archive logs, and breaks replication
and deletes the archive logs if we hit that number. This would also
require some way for the standby to stop replicating *without* becoming a
standalone server, which I don't think we currently have.

2) Have a setting where, regardless of standby_delay settings, the
standby will interrupt any running queries and start applying logs as
fast as possible if it hits a certain number of unapplied archive logs.
Of course, given the issues we had with standby_delay, I'm not sure I
want to complicate it further.

I think we've already fixed the biggest issue in 9.1, since we now have
a limit on the number of WALs the master will keep if archiving is
failing ... yes? That's the only big *avoidable* failure mode we have,
where a failing standby effectively shuts down the master.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: time-delayed standbys
Date: 2011-06-30 17:05:48
Message-ID: BANLkTin4OhdSywpOP2n2gt+vt+fTBYBKHA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jun 30, 2011 at 1:00 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> On 6/30/11 2:00 AM, Simon Riggs wrote:
>>>> Manual (or scripted) intervention is always necessary if you reach disk
>>>> >> 100% full.
>>> >
>>> > Wow, that's a pretty crappy failure mode... but I don't think we need
>>> > to fix it just on account of this patch.  It would be nice to fix, of
>>> > course.
>> How is that different to running out of space in the main database?
>>
>> If I try to pour a pint of milk into a small cup, I don't blame the cup.
>
> I have to agree with Simon here.  ;-)
>
> We can do some things to make this easier for administrators, but
> there's no way to "solve" the problem.  And the things we could do would
> have to be advanced optional modes which aren't on by default, so they
> wouldn't really help the DBA with poor planning skills.  Here's my
> suggestions:
>
> 1) Have a utility (pg_archivecleanup?) which checks if we have more than
> a specific settings's worth of archive_logs, and breaks replication and
> deletes the archive logs if we hit that number.  This would also require
> some way for the standby to stop replicating *without* becoming a
> standalone server, which I don't think we currently have.
>
> 2) Have a setting where, regardless of standby_delay settings, the
> standby will interrupt any running queries and start applying logs as
> fast as possible if it hits a certain number of unapplied archive logs.
>  Of course, given the issues we had with standby_delay, I'm not sure I
> want to complicate it further.
>
> I think we've already fixed the biggest issue in 9.1, since we now have
> a limit on the number of WALs the master will keep if archiving is
> failing ... yes?  That's the only big *avoidable* failure mode we have,
> where a failing standby effectively shuts down the master.

I'm not sure we changed anything in this area for 9.1. Am I wrong?
wal_keep_segments was present in 9.0. Using that instead of archiving
is a reasonable way to bound the amount of disk space that can get
used, at the cost of possibly needing to rebuild the standby if things
get too far behind. Of course, in any version, you could also use an
archive_command that will remove old files to make space if the disk
is full, with the same downside: if the standby isn't done with those
files, you're now in for a rebuild.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: time-delayed standbys
Date: 2011-06-30 17:25:13
Message-ID: BANLkTimGRb-cytLnibp1oxiUTkW=otrckg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jun 30, 2011 at 6:45 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> The only way to control this is with a time delay that can be changed
> while the server is running. A recovery.conf parameter doesn't allow
> that, so another way is preferable.

True. We've talked about making the recovery.conf parameters into
GUCs, which would address that concern (and some others).

> I think the time problems are more complex than said. The patch relies
> upon transaction completion times, but not all WAL records have a time
> attached to them. Plus you only used commits anyway, not sure why.

For the same reason we do that with the recovery_target_* code -
replaying something like a heap insert or heap update doesn't change
the user-visible state of the database, because the records aren't
visible anyway until the commit record is replayed.

> Some actions aren't even transactional, such as DROP DATABASE, amongst

Good point. We'd probably need to add a timestamp to the drop
database record, as that's a case that people would likely want to
defend against with this feature.

> others. Consecutive records can be hours apart, so it would be
> possible to delay on some WAL records but then replay records that
> happened minutes ago, then wait hours for the next apply. So this
> patch doesn't do what it claims in all cases.
>
> Similar discussion on max_standby_delay covered exactly that ground
> and went on for weeks in 9.0. IIRC I presented the same case you just
> did and we agreed in the end that was not acceptable. I'm not going to
> repeat it. Please check the archives.

I think this case is a bit different. First, max_standby_delay is
relevant for any installation using Hot Standby, whereas this is a
feature that specifically involves time. Saying that you have to have
time synchronization for Hot Standby to work as designed is more of a
burden than saying you need time synchronization *if you want to use
the time-delayed recovery feature*. Second, and maybe more
importantly, no one has come up with an idea for how to make this work
reliably in the presence of time skew. Perhaps we could provide a
simple time-skew correction feature that would work in the streaming
case (though probably not nearly as well as running ntpd), but as I
understand your argument, you're saying that most people will want to
use this with archiving. I don't see how to make that work without
time synchronization. In the max_standby_delay case, the requirement
is that queries not get cancelled too aggressively while at the same
time letting the standby get too far behind the master, which leaves
some flexibility in terms of how we actually make that trade-off, and
we eventually found a way that didn't require time synchronization,
which was an improvement. But for a time-delayed standby, the
requirement at least AIUI is that the state of the standby lag the
master by a certain time interval, and I don't see any way to do that
without comparing slave timestamps with master timestamps. If we can
find a similar clever trick here, great! But I'm not seeing how to do
it.

Now, another option here is to give up on the idea of a time-delayed
standby altogether and instead allow the standby to lag the master by
a certain number of WAL segments or XIDs. Of course, if we do that,
then we will not have a feature called "time-delayed standbys".
Instead, we will have a feature called "standbys delayed by a certain
number of WAL segments (or XIDs)". That certainly caters to some of
the same use cases, but I think it is severely lacking in the usability
department. I bet the first thing most people will do is to try to
figure out how to translate between those metrics and time, and I bet
we'll get complaints on systems where the activity load is variable
and therefore the time lag for a fixed WAL-segment lag or XID-lag is
unpredictable. So I think keeping it defined in terms of time is the
right way forward, even though the need for external time
synchronization is, certainly, not ideal.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: time-delayed standbys
Date: 2011-06-30 17:38:06
Message-ID: 4E0CB47E.7070909@agliodbs.com
Lists: pgsql-hackers

On 6/30/11 10:25 AM, Robert Haas wrote:
> So I think keeping it defined in terms of time is the
> right way forward, even though the need for external time
> synchronization is, certainly, not ideal.

Actually, when we last had the argument about time synchronization,
Kevin Grittner (I believe) pointed out that unsynchronized replication
servers have an assortment of other issues ... like any read query
involving now(). As the person who originally brought up this hurdle, I
felt that his argument defeated mine.

Certainly I can't see any logical way to have time delay in the absence
of clock synchronization of some kind. Also, I kinda feel like this
discussion seems aimed at overcomplicating a feature which only a small
fraction of our users will ever use. Let's keep it as simple as possible.

As for delay on streaming replication, I'm for it. I think that
post-9.1, thanks to pg_basebackup, the number of our users who are doing
archive log shipping is going to drop tremendously.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Josh Berkus" <josh(at)agliodbs(dot)com>,<pgsql-hackers(at)postgresql(dot)org>
Subject: Re: time-delayed standbys
Date: 2011-06-30 17:51:25
Message-ID: 4E0C714D020000250003ED89@gw.wicourts.gov
Lists: pgsql-hackers

Josh Berkus <josh(at)agliodbs(dot)com> wrote:

> when we last had the argument about time synchronization,
> Kevin Grittner (I believe) pointed out that unsynchronized
> replication servers have an assortment of other issues ... like
> any read query involving now().

I don't remember making that point, although I think it's a valid
one.

What I'm sure I pointed out is that we have one central router which
synchronizes to a whole bunch of atomic clocks around the world
using the normal "discard the outliers and average the rest"
algorithm, and then *every single server and workstation on our
network synchronizes to that router*. Our database servers are all
running on Linux using ntpd. Our monitoring spams us with email if
any of the clocks falls outside nominal bounds. (It's been many
years since we had a misconfigured server which triggered that.)

I think doing anything in PostgreSQL around this beyond allowing
DBAs to trust their server clocks is insane. The arguments for
using and trusting ntpd are pretty much identical to the arguments
for using and trusting the OS file systems.

-Kevin


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: time-delayed standbys
Date: 2011-06-30 18:01:53
Message-ID: 4E0CBA11.2080402@agliodbs.com
Lists: pgsql-hackers

Kevin,

> I think doing anything in PostgreSQL around this beyond allowing
> DBAs to trust their server clocks is insane. The arguments for
> using and trusting ntpd is pretty much identical to the arguments
> for using and trusting the OS file systems.

Oh, you don't want to implement our own NTP? Coward!

;-)

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: time-delayed standbys
Date: 2011-06-30 18:25:54
Message-ID: BANLkTim=jJbEu--A6SEEkAzweH6X7x6kdw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jun 30, 2011 at 1:51 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> I think doing anything in PostgreSQL around this beyond allowing
> DBAs to trust their server clocks is insane.  The arguments for
> using and trusting ntpd is pretty much identical to the arguments
> for using and trusting the OS file systems.

Except that implementing our own file system would likely have more
benefit and be less work than implementing our own time
synchronization, at least if we want it to be reliable.

Again, I am not trying to pretend that this is any great shakes.
MySQL's version of this feature apparently does somehow compensate for
time skew, which I assume must mean that their replication works
differently than ours - inter alia, it probably requires a TCP socket
connection between the servers. Since we don't require that, it
limits our options in this area, but also gives us more options in
other areas. Still, if I could think of a way to do this that didn't
depend on time synchronization, then I'd be in favor of eliminating
that requirement. I just can't; and I'm inclined to think it isn't
possible.

I wouldn't be opposed to having an option to try to detect time skew
between the master and the slave and, say, display that information in
pg_stat_replication. It might be useful to have that data for
monitoring purposes, and it probably wouldn't even be that much code.
However, I'd be a bit hesitant to use that data to "correct" the
amount of time we spend waiting for time-delayed replication, because
it would doubtless be extremely imprecise compared to real time
synchronization, and considerably more error-prone. IOW, what you
said.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: time-delayed standbys
Date: 2011-07-01 00:15:11
Message-ID: BANLkTi=+C3xP6M9TJZ6gBmQV8=97ZEQ_hg@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jul 1, 2011 at 2:25 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> Some actions aren't even transactional, such as DROP DATABASE, amongst
>
> Good point.  We'd probably need to add a timestamp to the drop
> database record, as that's a case that people would likely want to
> defend against with this feature.

This means that recovery_target_* code would also need to deal with
DROP DATABASE case.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: time-delayed standbys
Date: 2011-07-01 00:24:30
Message-ID: BANLkTi=p7fMm7eOQmxgzFZFJTuV5ax8i=w@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jul 1, 2011 at 3:25 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Thu, Jun 30, 2011 at 1:51 PM, Kevin Grittner
> <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
>> I think doing anything in PostgreSQL around this beyond allowing
>> DBAs to trust their server clocks is insane.  The arguments for
>> using and trusting ntpd is pretty much identical to the arguments
>> for using and trusting the OS file systems.
>
> Except that implementing our own file system would likely have more
> benefit and be less work than implementing our own time
> synchronization, at least if we want it to be reliable.
>
> Again, I am not trying to pretend that this is any great shakes.
> MySQL's version of this feature apparently does somehow compensate for
> time skew, which I assume must mean that their replication works
> differently than ours - inter alia, it probably requires a TCP socket
> connection between the servers.  Since we don't require that, it
> limits our options in this area, but also gives us more options in
> other areas.  Still, if I could think of a way to do this that didn't
> depend on time synchronization, then I'd be in favor of eliminating
> that requirement.  I just can't; and I'm inclined to think it isn't
> possible.
>
> I wouldn't be opposed to having an option to try to detect time skew
> between the master and the slave and, say, display that information in
> pg_stat_replication.  It might be useful to have that data for
> monitoring purposes, and it probably wouldn't even be that much code.
> However, I'd be a bit hesitant to use that data to "correct" the
> amount of time we spend waiting for time-delayed replication, because
> it would doubtless be extremely imprecise compared to real time
> synchronization, and considerably more error-prone.  IOW, what you
> said.

I agree with Robert. It's difficult to implement a time-synchronization
feature which can deal with all the cases, and I'm not sure that's really
worth our time.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: "Jaime Casanova" <jaime(at)2ndquadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: time-delayed standbys
Date: 2011-07-01 03:32:48
Message-ID: 87fwmqy8tb.fsf@casanova1.SEINGALT
Lists: pgsql-hackers

Fujii Masao <masao(dot)fujii(at)gmail(dot)com> writes:

> On Fri, Jul 1, 2011 at 2:25 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> Some actions aren't even transactional, such as DROP DATABASE, amongst
>>
>> Good point.  We'd probably need to add a timestamp to the drop
>> database record, as that's a case that people would likely want to
>> defend against with this feature.
>
> This means that recovery_target_* code would also need to deal with
> DROP DATABASE case.
>

There is no problem if you use "restore point" names... but of course you
lose flexibility (i.e., you can't restore to five minutes before now).

Mmm... a lazy idea: can't we just create a restore-point WAL record *before*
we actually drop the database? Then we wouldn't need to modify the
recovery_target_* logic (if DROP DATABASE is the only such case, maybe that's
enough to avoid complicating the code), and we could provide that protection
starting in 9.1.
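
For the record, the restore-point route is already available by hand in 9.1:
take a named restore point just before the destructive action, and recover to
that name if it turns out to be a mistake. A sketch (the database and
restore-point names are made up):

```sql
-- On the master, immediately before the destructive action:
SELECT pg_create_restore_point('before_drop_appdb');
DROP DATABASE appdb;

-- Later, when restoring from a base backup plus archives, stop
-- replay just before the drop via recovery.conf:
--   recovery_target_name = 'before_drop_appdb'
```

The catch, of course, is that this only protects against mistakes you thought
to bracket with a restore point in advance.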

--
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL
24x7 support, development, training, and services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: time-delayed standbys
Date: 2011-07-02 12:32:04
Message-ID: CA+U5nMLeJMHnyfnSY-ffG++UCD0Z37GcmGAQZ3ZLkn9s-i8pWw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jun 30, 2011 at 6:25 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

>> I think the time problems are more complex than said. The patch relies
>> upon transaction completion times, but not all WAL records have a time
>> attached to them. Plus you only used commits anyway, not sure why.
>
> For the same reason we do that with the recovery_target_* code -
> replaying something like a heap insert or heap update doesn't change
> the user-visible state of the database, because the records aren't
> visible anyway until the commit record is replayed.
>
>> Some actions aren't even transactional, such as DROP DATABASE, amongst
>
> Good point.  We'd probably need to add a timestamp to the drop
> database record, as that's a case that people would likely want to
> defend against with this feature.
>
>> others. Consecutive records can be hours apart, so it would be
>> possible to delay on some WAL records but then replay records that
>> happened minutes ago, then wait hours for the next apply. So this
>> patch doesn't do what it claims in all cases.

You misread my words above, neglecting the "amongst others" part.

I don't believe you'll be able to do this just by relying on
timestamps on WAL records because not all records carry timestamps and
we're not going to add them just for this.

It's easier to make this work usefully using pg_standby.
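
At the archive-recovery layer, one way to approximate a delay without touching
WAL timestamps at all is a restore_command wrapper that refuses to hand over a
segment until its archive copy is old enough. A rough sketch, where
WAL_ARCHIVE and WAL_DELAY are made-up knobs for this example, not PostgreSQL
settings, and the delay keys off file modification time rather than commit
time, so it is only approximate:

```shell
# delayed_restore SEGMENT DEST
# Hold back each archived WAL file until its archive copy is at least
# WAL_DELAY seconds old, giving a rough time delay during archive recovery.
delayed_restore() {
    archive="${WAL_ARCHIVE:-/var/lib/pgsql/wal_archive}"
    delay="${WAL_DELAY:-300}"     # seconds to lag behind the master
    seg="$archive/$1"

    # Segment not archived yet: fail so recovery retries later.
    [ -f "$seg" ] || return 1

    # Wait until the file's mtime is far enough in the past.
    # (stat -c %Y is GNU coreutils; BSD systems spell it stat -f %m.)
    while [ $(( $(date +%s) - $(stat -c %Y "$seg") )) -lt "$delay" ]; do
        sleep 10
    done

    cp "$seg" "$2"
}
```

A standby would then point recovery.conf at a script wrapping this function,
e.g. restore_command = 'delayed_restore.sh %f %p'. This works today without
any server changes, which is presumably Simon's point, though it only applies
to archive-based standbys, not streaming replication.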

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services