Re: walsender doesn't send keepalives when writes are pending

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: walsender doesn't send keepalives when writes are pending
Date: 2014-03-05 16:26:13
Message-ID: 53175025.9040805@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 02/25/2014 06:41 PM, Robert Haas wrote:
> On Tue, Feb 25, 2014 at 11:23 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> Usually that state will be reached very quickly because before
>> that we're writing data to the network as fast as it can be read from
>> disk.
>
> I'm unimpressed. Even if that is in practice true, making the code
> self-consistent is a goal of non-trivial value. The timing of sending
> keep-alives has no business depending on the state of the write queue,
> and right now it doesn't. Your patch would make it depend on that,
> mostly by accident AFAICS.

Yeah, the timeout should should be checked regardless of the status of
the write queue. Per the attached patch.

While looking at this, I noticed that the time we sleep between each
iteration is calculated in a funny way. It's not this patch's fault, but
moving the code made it more glaring. After the patch, it looks like this:

> TimestampTz timeout;
> long sleeptime = 10000; /* 10 s */
> ...
> /*
> * If wal_sender_timeout is active, sleep in smaller increments
> * to not go over the timeout too much. XXX: Why not just sleep
> * until the timeout has elapsed?
> */
> if (wal_sender_timeout > 0)
> sleeptime = 1 + (wal_sender_timeout / 10);
>
> /* Sleep until something happens or we time out */
> ...
> WaitLatchOrSocket(&MyWalSnd->latch, wakeEvents,
> MyProcPort->sock, sleeptime);
> ...
> /*
> * Check for replication timeout. Note we ignore the corner case
> * possibility that the client replied just as we reached the
> * timeout ... he's supposed to reply *before* that.
> */
> timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
> wal_sender_timeout);
> if (wal_sender_timeout > 0 && GetCurrentTimestamp() >= timeout)
> {
> ...
> }

The logic was the same before the patch, but I added the XXX comment
above. Why do we sleep in increments of 1/10 of wal_sender_timeout?
Originally, the code calculated when the next wakeup should happen, by
adding wal_sender_timeout (or replication_timeout, as it was called back
then) to the time of the last reply. Why don't we do that?

The code seems to have just slowly devolved into that. It was changed
from sleep-until-timeout to sleep 1/10th of wal_sender_timeout by a
patch that made walsender send a keep-alive message to the client every
time it wakes up, and I guess the idea was to send a message at an
interval that's 1/10th of the timeout. But that idea is long gone by
now; first, commit da4efa13d801ccc179f1d2c24d8a60c4a2f8ede9 turned off
sending the keep-alive messages by default, and then
6f60fdd7015b032bf49273c99f80913d57eac284 turned them back on, but so
that it's only sent once when 1/2 of wal_sender_timeout has elapsed.

I think that should be changed back to sleep until the next deadline of
when something needs to happen, instead of polling. But that can be a
separate patch.

- Heikki

Attachment Content-Type Size
walsender-always-send-keepalives-2.patch text/x-diff 2.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2014-03-05 16:28:00 Re: jsonb and nested hstore
Previous Message Merlin Moncure 2014-03-05 16:24:27 Re: jsonb and nested hstore