Re: Keepalive for max_standby_delay

From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Keepalive for max_standby_delay
Date: 2010-06-02 22:59:23
Message-ID: AANLkTiljJpCHPLhJqcAJvYevskZ5RBeqt-JqBAchX2AL@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Jun 2, 2010 at 8:14 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Indeed, but nothing we do can prevent that, if the slave is just plain
> slower than the master.  You have to assume that the slave is capable of
> keeping up in the absence of query-caused delays, or you're hosed.

I was assuming the walreceiver only requests more wal in relatively
small chunks and only when replay has caught up and needs more data. I
haven't actually read this code so if that assumption is wrong then
I'm off-base. But if my assumption is right then it's not merely the
master running faster than the slave that can cause you to fall
arbitrarily far behind. The "receive time" is delayed by how long the
slave waited on a previous lock, so if we wait for a few minutes, then
proceed and find we need more wal data we'll read data from the master
that it could have generated those few minutes ago.

> The sticky point is that once in a blue moon you do have a conflicting
> query sitting on a buffer lock for a long time, or even more likely a
> series of queries keeping the WAL replay process from obtaining buffer
> cleanup lock.

So I think this isn't necessarily such a blue moon event. As I
understand it, all it would take is a single long-running report and a
vacuum or HOT cleanup occurring on the master. If I want to set
max_standby_delay to 60min to allow reports to run for up to an hour
then any random HOT cleanup on the master will propagate to the slave
and cause a WAL stall until the transactions which have that xid in
their snapshot end.

Even the buffer locks are could potentially be blocked for a long time
if you happen to run the right kind of query (say a nested loop with
the buffer in question on the outside and a large table on the
inside). That's a rarer event though; is that what you were thinking
of?

--
greg

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2010-06-02 23:11:08 Re: Keepalive for max_standby_delay
Previous Message Robert Haas 2010-06-02 22:58:29 Re: recovery getting interrupted is not so unusual as it used to be