Streaming replication and unfit messages

Lists: pgsql-hackers
From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Streaming replication and unfit messages
Date: 2010-02-18 05:17:06
Message-ID: 3f0b79eb1002172117r261f5037t3224e24c24983ef9@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

When the replication connection is closed unexpectedly, walsender might emit
the following unfit messages. IOW, the loss of the connection might be wrongly
regarded as an arrival of invalid message by the walsender. This looks messy.
We should get rid of that unfit FATAL message, emit a COMMERROR message and
just call proc_exit() when the loss of the connection is found?

> [2460]: LOG: could not receive data from client: No connection could be made because the target machine actively refused it.
> [2460]: FATAL: invalid standby closing message type 4
> [2460]: LOG: could not send data to client: No connection could be made because the target machine actively refused it.

Also the walsender wrongly tries to send the FATAL message to the standby even
though the connection has already been closed, and then gets the following LOG
message after the FATAL one. This FATAL message is suitable, but output of the
LOG message looks messy, too. We should use COMMERROR instead of FATAL and then
just call proc_exit() in order to prevent a message from being sent?

> [12586] FATAL: unexpected EOF on standby connection
> [12586] LOG: could not send data to client: Broken pipe

The attached patch fixes those unfit messages.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment Content-Type Size
fix_unfit_walsnd_msgs_0218.patch text/x-patch 2.3 KB

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: Streaming replication and unfit messages
Date: 2010-02-18 08:23:17
Message-ID: 201002180923.20434.andres@anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thursday 18 February 2010 06:17:06 Fujii Masao wrote:
> > [2460]: LOG: could not receive data from client: No connection could be
> > made because the target machine actively refused it. [2460]: FATAL:
> > invalid standby closing message type 4
> > [2460]: LOG: could not send data to client: No connection could be made
> > because the target machine actively refused it.
>
> Also the walsender wrongly tries to send the FATAL message to the standby
> even though the connection has already been closed, and then gets the
> following LOG message after the FATAL one. This FATAL message is suitable,
> but output of the LOG message looks messy, too. We should use COMMERROR
> instead of FATAL and then just call proc_exit() in order to prevent a
> message from being sent?
Or hope for my idle query cancellation prelim. patch to get applied so you can
do ereport(FATAL | LOG_NO_CLIENT, ...) ;-)

Andres


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming replication and unfit messages
Date: 2010-02-18 09:43:41
Message-ID: 4B7D0BCD.3070101@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Fujii Masao wrote:
> Hi,
>
> When the replication connection is closed unexpectedly, walsender might emit
> the following unfit messages. IOW, the loss of the connection might be wrongly
> regarded as an arrival of invalid message by the walsender. This looks messy.
> We should get rid of that unfit FATAL message, emit a COMMERROR message and
> just call proc_exit() when the loss of the connection is found?
>
>> [2460]: LOG: could not receive data from client: No connection could be made because the target machine actively refused it.
>> [2460]: FATAL: invalid standby closing message type 4
>> [2460]: LOG: could not send data to client: No connection could be made because the target machine actively refused it.
>
> Also the walsender wrongly tries to send the FATAL message to the standby even
> though the connection has already been closed, and then gets the following LOG
> message after the FATAL one. This FATAL message is suitable, but output of the
> LOG message looks messy, too. We should use COMMERROR instead of FATAL and then
> just call proc_exit() in order to prevent a message from being sent?
>
>> [12586] FATAL: unexpected EOF on standby connection
>> [12586] LOG: could not send data to client: Broken pipe
>
> The attached patch fixes those unfit messages.

Actually the pg_getbyte_if_available() function is a bit confused. The
comment above it claims that it returns 0 if no data is available
without blocking, but it actually returns -1 with errno==EWOULDBLOCK.
That stems from confusion in secure_read(); with SSL it returns 0 if it
would block, but without SSL it returns -1 and EWOULDBLOCK. We should
fix that so that secure_read() behaves consistently and so that
pq_getbyte_if_available() behaves like e.g pq_getbytes().

Patch attached for that. pq_getbyte_if_available() now reports any
errors with COMMERROR level, and returns EOF if the connection is closed
cleanly. If no data is available without blocking it now really returns
0 as the comment said. Walsender reports any unexpected EOF to the log
at COMMERROR level, similar to what normal backends do.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
fix-pq_getbyte_if_available-1.patch text/x-diff 7.9 KB

From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming replication and unfit messages
Date: 2010-02-18 09:58:31
Message-ID: 9837222c1002180158y47e5ed63wb2ad73f45357b8f4@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2010/2/18 Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>:
> Fujii Masao wrote:
>> Hi,
>>
>> When the replication connection is closed unexpectedly, walsender might emit
>> the following unfit messages. IOW, the loss of the connection might be wrongly
>> regarded as an arrival of invalid message by the walsender. This looks messy.
>> We should get rid of that unfit FATAL message, emit a COMMERROR message and
>> just call proc_exit() when the loss of the connection is found?
>>
>>> [2460]: LOG:  could not receive data from client: No connection could be made because the target machine actively refused it.
>>> [2460]: FATAL:  invalid standby closing message type 4
>>> [2460]: LOG:  could not send data to client: No connection could be made because the target machine actively refused it.
>>
>> Also the walsender wrongly tries to send the FATAL message to the standby even
>> though the connection has already been closed, and then gets the following LOG
>> message after the FATAL one. This FATAL message is suitable, but output of the
>> LOG message looks messy, too. We should use COMMERROR instead of FATAL and then
>> just call proc_exit() in order to prevent a message from being sent?
>>
>>> [12586] FATAL:  unexpected EOF on standby connection
>>> [12586] LOG:  could not send data to client: Broken pipe
>>
>> The attached patch fixes those unfit messages.
>
> Actually the pg_getbyte_if_available() function is a bit confused. The
> comment above it claims that it returns 0 if no data is available
> without blocking, but it actually returns -1 with errno==EWOULDBLOCK.
> That stems from confusion in secure_read(); with SSL it returns 0 if it
> would block, but without SSL it returns -1 and EWOULDBLOCK. We should
> fix that so that secure_read() behaves consistently and so that
> pq_getbyte_if_available() behaves like e.g pq_getbytes().
>
> Patch attached for that. pq_getbyte_if_available() now reports any
> errors with COMMERROR level, and returns EOF if the connection is closed
> cleanly. If no data is available without blocking it now really returns
> 0 as the comment said. Walsender reports any unexpected EOF to the log
> at COMMERROR level, similar to what normal backends do.

This cannot possibly be correct:
+ if (errno == EAGAIN || EWOULDBLOCK || errno == EINTR)

The middle argument is missing the errno== part.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming replication and unfit messages
Date: 2010-02-18 10:05:45
Message-ID: 4B7D10F9.70505@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Magnus Hagander wrote:
> This cannot possibly be correct:
> + if (errno == EAGAIN || EWOULDBLOCK || errno == EINTR)
>
>
> The middle argument is missing the errno== part.

Ahh, rats. Yeah it clearly is. Thanks.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming replication and unfit messages
Date: 2010-02-18 10:31:01
Message-ID: 3f0b79eb1002180231xcf6ea2bta9b793bb0e1b6627@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Feb 18, 2010 at 7:05 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> Magnus Hagander wrote:
>> This cannot possibly be correct:
>> +                     if (errno == EAGAIN || EWOULDBLOCK || errno == EINTR)
>>
>>
>> The middle argument is missing the errno== part.
>
> Ahh, rats. Yeah it clearly is. Thanks.

Thanks for the patch! This seems nicer than mine.

* The received byte is stored in *c. Returns 1 if a byte was read, 0 if
! * if no data was available, or EOF if trouble.

Typo. 'if' is repeated.

+ ereport(COMMERROR,
+ (errcode_for_socket_access(),
+ errmsg("could not receive data from client: %m")));
+ return EOF;

We should use "r = EOF" instead of "return EOF" as well as other cases?

In WalSndHandshake(), when pq_getbyte() returns EOF, the COMMERROR message
"unexpected EOF on standby connection" is emitted doubly. How about removing
first COMMERROR message?

r = pq_getbyte_if_available(&firstchar);
if (r < 0)
{
! /* unexpected error */
ereport(COMMERROR,
! (errcode(ERRCODE_PROTOCOL_VIOLATION),
! errmsg("unexpected EOF on standby connection")));
! proc_exit(0);
}

Since pq_getbyte_if_available() returns EOF if trouble, "r == EOF" should
be used instead of "r < 0"?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Streaming replication and unfit messages
Date: 2010-02-18 11:14:15
Message-ID: 4B7D2107.5090500@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Fujii Masao wrote:
> * The received byte is stored in *c. Returns 1 if a byte was read, 0 if
> ! * if no data was available, or EOF if trouble.
>
> Typo. 'if' is repeated.
>
>
> + ereport(COMMERROR,
> + (errcode_for_socket_access(),
> + errmsg("could not receive data from client: %m")));
> + return EOF;
>
> We should use "r = EOF" instead of "return EOF" as well as other cases?

Yep.

> In WalSndHandshake(), when pq_getbyte() returns EOF, the COMMERROR message
> "unexpected EOF on standby connection" is emitted doubly. How about removing
> first COMMERROR message?

Yep.

> r = pq_getbyte_if_available(&firstchar);
> if (r < 0)
> {
> ! /* unexpected error */
> ereport(COMMERROR,
> ! (errcode(ERRCODE_PROTOCOL_VIOLATION),
> ! errmsg("unexpected EOF on standby connection")));
> ! proc_exit(0);
> }
>
> Since pq_getbyte_if_available() returns EOF if trouble, "r == EOF" should
> be used instead of "r < 0"?

Yep.

Committed with all those fixes.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com