Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

From: Amit kapila <amit(dot)kapila(at)huawei(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown
Date: 2012-10-19 11:42:16
Message-ID: 6C0B27F7206C9E4CA54AE035729E9C382853BBED@szxeml509-mbs
Lists: pgsql-bugs pgsql-hackers

On Thursday, October 18, 2012 8:49 PM Fujii Masao wrote:
On Wed, Oct 17, 2012 at 8:46 PM, Amit Kapila <amit(dot)kapila(at)huawei(dot)com> wrote:
>> On Monday, October 15, 2012 3:43 PM Heikki Linnakangas wrote:
>> On 13.10.2012 19:35, Fujii Masao wrote:
>> > On Thu, Oct 11, 2012 at 11:52 PM, Heikki Linnakangas
>> > <hlinnakangas(at)vmware(dot)com> wrote:
>> >> Ok, thanks. Committed.
>> >
>> > I found one typo. The attached patch fixes that typo.
>>
>> Thanks, fixed.
>>
>> > ISTM you need to update the protocol.sgml because you added
>> > the field 'replyRequested' to WalSndrMessage and StandbyReplyMessage.
>
>
>>
>> > Is it worth adding the same mechanism (send back the reply immediately
>> > if walsender request a reply) into pg_basebackup and pg_receivexlog?
>>
>> Good catch. Yes, they should be taught about this too. I'll look into
>> doing that too.
>
> If you have not started and you don't have objection, I can pickup this to
> complete it.
>
> For both (pg_basebackup and pg_receivexlog), we need to get a timeout
> parameter from user in command line, as
> there is no conf file here. New Option can be -t (parameter name can be
> recvtimeout).
>
> The main changes will be in function ReceiveXlogStream(), it is a common
> function for both
> Pg_basebackup and pg_receivexlog. Handling will be done in same way as we
> have done in walreceiver.
>
> Suggestions/Comments?

> Before implementing the timeout parameter, I think that it's better to change
> both pg_basebackup background process and pg_receivexlog so that they
> send back the reply message immediately when they receive the keepalive
> message requesting the reply. Currently, they always ignore such keepalive
> message, so status interval parameter (-s) in them always must be set to
> the value less than replication timeout. We can avoid this troublesome
> parameter setting by introducing the same logic of walreceiver into both
> pg_basebackup background process and pg_receivexlog.

Please find attached a patch that makes the change you suggested (send an immediate reply to a keepalive message that requests one).
Both pg_basebackup and pg_receivexlog use the same function, ReceiveXlogStream(), so a single change addresses the issue for both.
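
To illustrate the idea (this is only a sketch, not code from the attached patch), the decision of when to send a status reply could look roughly like the following. The function name and its parameters are hypothetical; the only things taken from the discussion are the replyRequested flag and the status interval (-s).

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not taken from the attached patch: decide whether a
 * streaming client should send a status reply right now.
 *
 * reply_requested    - flag carried by the keepalive message just received
 * now_ms             - current time, in milliseconds
 * last_status_ms     - time the previous status reply was sent
 * status_interval_ms - status interval (-s) in milliseconds, 0 = disabled
 */
static bool
should_send_reply(bool reply_requested, int64_t now_ms,
                  int64_t last_status_ms, int64_t status_interval_ms)
{
    /* The sender explicitly asked for a reply: answer immediately. */
    if (reply_requested)
        return true;

    /* Otherwise fall back to the periodic status interval, if enabled. */
    if (status_interval_ms > 0 &&
        now_ms - last_status_ms >= status_interval_ms)
        return true;

    return false;
}
```

With this kind of check, the status interval no longer has to be smaller than the replication timeout, because a keepalive that requests a reply is answered regardless of the interval.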

Now, regarding the introduction of a timeout in pg_basebackup and pg_receivexlog:
We can use a mechanism similar to the walreceiver timeout while streaming WAL from the server, but the same logic cannot be used if the network goes down while the other database files are being fetched from the server.
The reason is that, to receive the data files, PQgetCopyData() is called in synchronous mode, so it waits indefinitely until it gets some data.
To solve this issue, I can think of the following options:
1. Make this call asynchronous as well (but I am not sure about the impact of this).
2. In function pqWait, instead of passing the hard-coded value -1 (i.e. infinite wait), pass a finite time. This time can be received as a command-line argument
by the respective utility and stored in the PGconn structure.
To hold the timeout value in PGconn, we can either:
a. Add a new parameter to PGconn to indicate the receive timeout.
b. Reuse the existing parameter connect_timeout for the receive timeout as well, but this may lead to confusion.
3. Any other better option?
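
As a rough sketch of what option 1 needs on the client side: when PQgetCopyData() is called in asynchronous mode and returns 0 (no complete row yet), the caller can wait on the connection's socket (PQsocket()) for a bounded time and then call PQconsumeInput(). The helper below shows only the bounded wait, using plain poll() so that it is self-contained. The name wait_readable and its return convention are my assumptions, not existing libpq code.

```c
#include <errno.h>
#include <poll.h>

/*
 * Hypothetical helper: wait until fd becomes readable or timeout_ms elapses.
 *
 * Returns 1 if the descriptor has data (or a hangup/error condition that the
 * next read would report), 0 if the timeout expired, -1 if poll() failed.
 * A negative timeout_ms waits forever, which matches the current behaviour.
 */
static int
wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int           rc;

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    /*
     * Retry if a signal interrupts the wait. For simplicity the full timeout
     * is restarted; a real implementation would subtract the elapsed time.
     */
    do
    {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return -1;
    if (rc == 0)
        return 0;               /* nothing arrived in time: treat as timeout */
    return 1;
}
```

A caller would treat a return value of 0 as "receive timeout expired" and give up on the connection instead of blocking indefinitely.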

Apart from the above issue, there is a possibility that if the network goes down at connect time, the utility might hang, because connect_timeout is NULL by default and connectDBComplete will then wait indefinitely for the connection to succeed.
So shall we have a separate command-line argument for this as well, or is there some other way that you would suggest?
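
One possible shape for this, purely as a sketch: libpq already accepts a connect_timeout connection parameter (in seconds), so a utility could append it to the connection string when the user supplies a value. The helper name build_conninfo and the idea of a dedicated command-line option are assumptions here, not existing code.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: build a libpq connection string, appending
 * connect_timeout (seconds) only when the user asked for one.
 *
 * Returns 0 on success, -1 if the result did not fit into dst.
 */
static int
build_conninfo(char *dst, size_t dstlen, const char *base,
               int connect_timeout_sec)
{
    int n;

    if (connect_timeout_sec > 0)
        n = snprintf(dst, dstlen, "%s connect_timeout=%d",
                     base, connect_timeout_sec);
    else
        n = snprintf(dst, dstlen, "%s", base);  /* keep libpq's default */

    return (n < 0 || (size_t) n >= dstlen) ? -1 : 0;
}
```

Without a new option, users can already set the PGCONNECT_TIMEOUT environment variable, which libpq reads as the default for connect_timeout.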

Suggestions/Comments?

With Regards,
Amit Kapila.

Attachment Content-Type Size
pg_basebackup_keepalive_reply.patch application/octet-stream 3.1 KB
