Re: BUG #2576: tcp_keepalive doesn't work

Lists: pgsql-bugs
From: "Fujii Masao" <fujii(dot)masao(at)oss(dot)ntt(dot)co(dot)jp>
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #2576: tcp_keepalive doesn't work
Date: 2006-08-15 08:12:33
Message-ID: 200608150812.k7F8CXTq048890@wwwmaster.postgresql.org


The following bug has been logged online:

Bug reference: 2576
Logged by: Fujii Masao
Email address: fujii(dot)masao(at)oss(dot)ntt(dot)co(dot)jp
PostgreSQL version: 8.1.4
Operating system: Fedora Core 5
Description: tcp_keepalive doesn't work
Details:

Hi.

I found a problem: tcp_keepalive doesn't work.
The steps to reproduce the problem are as follows.

----------
[terminal 1]
$ postmaster -p xxxx

NOTE:
- The settings in postgresql.conf that I changed from the defaults are as follows.

listen_addresses = '*'

tcp_keepalives_idle = 10
tcp_keepalives_interval = 5
tcp_keepalives_count = 2

- I added the following client authentication record to pg_hba.conf.

host all all xx.xx.xx.xx/xx trust

[terminal 2]
$ pgbench testdb -c 50 -t 100 -h xx.xx.xx.xx -p xxxx

[terminal 3]
# /sbin/ifdown eth0    # run ifdown while pgbench is running

After that, the 50 postgres processes stay alive, although tcp_keepalive
should kill them within 30 seconds at the latest.

[terminal 4]
$ sleep 30
$ ps -ef | grep postgres
...
postgres 16815 16782 0 17:06 pts/1 00:00:00 postgres: postgres sampledb xx.xx.xx.xx(38738) idle in transaction
postgres 16816 16782 0 17:06 pts/1 00:00:00 postgres: postgres sampledb xx.xx.xx.xx(38739) UPDATE waiting
postgres 16817 16782 0 17:06 pts/1 00:00:00 postgres: postgres sampledb xx.xx.xx.xx(38740) UPDATE waiting
postgres 16818 16782 0 17:06 pts/1 00:00:00 postgres: postgres sampledb xx.xx.xx.xx(38741) UPDATE waiting
postgres 16819 16782 0 17:06 pts/1 00:00:00 postgres: postgres sampledb xx.xx.xx.xx(38742) UPDATE waiting
postgres 16820 16782 0 17:06 pts/1 00:00:00 postgres: postgres sampledb xx.xx.xx.xx(38743) idle in transaction
postgres 16821 16782 0 17:06 pts/1 00:00:00 postgres: postgres sampledb xx.xx.xx.xx(38744) UPDATE waiting
postgres 16822 16782 0 17:06 pts/1 00:00:00 postgres: postgres sampledb xx.xx.xx.xx(38745) UPDATE waiting
postgres 16823 16782 0 17:06 pts/1 00:00:00 postgres: postgres sampledb xx.xx.xx.xx(38746) idle
...

----------
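
Editorial sketch, not part of the original report: the three settings in the
reproduction above map onto per-socket TCP options, and the usual rule of thumb
for when an idle connection is declared dead is idle + interval * count. The
Python below works that arithmetic for the reported values and shows how such
options are typically set on a socket. The helper names keepalive_deadline and
enable_keepalive are hypothetical, and the TCP_KEEP* constants are assumed to
be the Linux ones; this is not PostgreSQL's own code.

```python
import socket


def keepalive_deadline(idle, interval, count):
    """Seconds of silence after which an idle connection is given up on.

    Assumes the common model: the first probe goes out after `idle` seconds,
    then `count` probes spaced `interval` seconds apart must all go unanswered.
    """
    return idle + interval * count


def enable_keepalive(sock, idle, interval, count):
    """Turn on keepalive and apply per-socket timings where the platform has them."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These constants exist on Linux; skip any that this platform lacks.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", count)):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)


# Values from the postgresql.conf fragment in the report.
print(keepalive_deadline(10, 5, 2))  # 20, inside the 30-second bound quoted above
```

Under that model the backends in the report would be expected to go away about
20 seconds after the interface is taken down, provided keepalive probing
actually starts.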

I would appreciate it if you could look into this problem.

Thanks in advance.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Fujii Masao" <fujii(dot)masao(at)oss(dot)ntt(dot)co(dot)jp>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #2576: tcp_keepalive doesn't work
Date: 2006-08-15 13:30:16
Message-ID: 2407.1155648616@sss.pgh.pa.us

"Fujii Masao" <fujii(dot)masao(at)oss(dot)ntt(dot)co(dot)jp> writes:
> I found an error that tcp_keepalive doesn't work.

You seem to have a misunderstanding of what tcp_keepalive is for. It
does not kill a backend that is in the midst of a query. A backend will
terminate when it is waiting for a client command and it sees that the
connection has been lost --- which is what a TCP timeout will cause to
happen. But your example backends are not waiting for client commands.
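
Editorial sketch, not part of the original message: the point above is that a
server process only notices a dead connection at the moment it reads from the
socket. The toy Python function below (wait_for_command is a hypothetical name,
and this is not PostgreSQL's actual C code) models a process waiting for the
next client command: a keepalive timeout or reset surfaces as an error from
recv(), and an orderly close surfaces as an empty read.

```python
import socket


def wait_for_command(conn, bufsize=4096):
    """Block until the client sends data; return None if the connection is gone."""
    try:
        data = conn.recv(bufsize)
    except OSError:
        # For example ETIMEDOUT once keepalive probes go unanswered, or ECONNRESET.
        return None
    if not data:
        # The peer closed the connection in an orderly way.
        return None
    return data


# Demonstration with a local socket pair standing in for client and server.
server_side, client_side = socket.socketpair()
client_side.sendall(b"Q")
print(wait_for_command(server_side))   # the command bytes
client_side.close()
print(wait_for_command(server_side))   # None: connection loss seen on read
server_side.close()
```

A process that is busy with a query or blocked on a lock is not inside such a
read, so in this model it does not see the loss until it returns to waiting
for input.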

If you want backends to abort live queries before they finish, use
statement_timeout.

regards, tom lane


From: Fujii Masao <fujii(dot)masao(at)oss(dot)ntt(dot)co(dot)jp>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #2576: tcp_keepalive doesn't work
Date: 2006-08-16 00:52:25
Message-ID: 44E26C49.9060108@oss.ntt.co.jp

Hi.

> You seem to have a misunderstanding of what tcp_keepalive is for. It
> does not kill a backend that is in the midst of a query. A backend will
> terminate when it is waiting for a client command and it sees that the
> connection has been lost --- which is what a TCP timeout will cause to
> happen. But your example backends are not waiting for client commands.

No. In my example, there are backends that are waiting for client commands.

Please pay attention to the result of 'ps -ef | grep postgres'.

>> [terminal 4]
>> $ sleep 30
>> $ ps -ef | grep postgres
>> ...
>> postgres 16815 16782 0 17:06 pts/1 00:00:00 postgres: postgres sampledb xx.xx.xx.xx(38738) idle in transaction
>> postgres 16816 16782 0 17:06 pts/1 00:00:00 postgres: postgres sampledb xx.xx.xx.xx(38739) UPDATE waiting
>> postgres 16817 16782 0 17:06 pts/1 00:00:00 postgres: postgres sampledb xx.xx.xx.xx(38740) UPDATE waiting
>> postgres 16818 16782 0 17:06 pts/1 00:00:00 postgres: postgres sampledb xx.xx.xx.xx(38741) UPDATE waiting
>> postgres 16819 16782 0 17:06 pts/1 00:00:00 postgres: postgres sampledb xx.xx.xx.xx(38742) UPDATE waiting
>> postgres 16820 16782 0 17:06 pts/1 00:00:00 postgres: postgres sampledb xx.xx.xx.xx(38743) idle in transaction
>> postgres 16821 16782 0 17:06 pts/1 00:00:00 postgres: postgres sampledb xx.xx.xx.xx(38744) UPDATE waiting
>> postgres 16822 16782 0 17:06 pts/1 00:00:00 postgres: postgres sampledb xx.xx.xx.xx(38745) UPDATE waiting
>> postgres 16823 16782 0 17:06 pts/1 00:00:00 postgres: postgres sampledb xx.xx.xx.xx(38746) idle
>> ...

The 'idle in transaction' backends don't terminate even though they are waiting
for a client command and the connection has been lost.

By the way, I ran strace on an 'idle in transaction' backend.
-----
$ strace -p xxx
Process xxx attached - interrupt to quit
recv(7,
-----

best regards;


From: Fujii Masao <fujii(dot)masao(at)oss(dot)ntt(dot)co(dot)jp>
To: pgsql-bugs(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: BUG #2576: tcp_keepalive doesn't work
Date: 2006-08-18 09:32:05
Message-ID: 44E58915.5090202@oss.ntt.co.jp

Hi.

I found the cause of the tcp_keepalive problem.
The cause is the behavior of the Linux kernel.

In the Linux kernel, tcp_keepalive does not take effect
if the network outage occurs before the ACK for data sent by send() is received.
This behavior has also been reported on LKML.

Linux-Kernel Archive: Re: 2.6.12.5 bug? per-socket TCP keepalive settings
http://www.ussg.iu.edu/hypermail/linux/kernel/0508.2/0757.html
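
Editorial sketch, not part of the original message: the case described above
is one where sent data is still waiting for its ACK when the link goes down.
Later Linux kernels provide a per-socket TCP_USER_TIMEOUT option (value in
milliseconds) that limits how long transmitted data may stay unacknowledged
before the connection is dropped. It postdates this thread and is shown here
only as an assumption about one possible kernel-level remedy. The helper name
set_user_timeout and the 20-second value are hypothetical.

```python
import socket


def set_user_timeout(sock, timeout_ms):
    """Try to cap how long sent data may stay unacknowledged, in milliseconds.

    Returns True if the option was applied, False if this platform lacks it.
    """
    opt = getattr(socket, "TCP_USER_TIMEOUT", None)  # Linux-only constant
    if opt is None:
        return False
    sock.setsockopt(socket.IPPROTO_TCP, opt, timeout_ms)
    return True


with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    applied = set_user_timeout(s, 20_000)
    print("TCP_USER_TIMEOUT applied:", applied)
```

Whether such a setting would have changed the outcome of the reproduction in
this thread is not established here.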

I'm not sure whether this tcp_keepalive problem should be solved
at the DB level.

regards;


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Fujii Masao <fujii(dot)masao(at)oss(dot)ntt(dot)co(dot)jp>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #2576: tcp_keepalive doesn't work
Date: 2006-08-18 13:00:53
Message-ID: 7342.1155906053@sss.pgh.pa.us

Fujii Masao <fujii(dot)masao(at)oss(dot)ntt(dot)co(dot)jp> writes:
> Linux-Kernel Archive: Re: 2.6.12.5 bug? per-socket TCP keepalive settings
> http://www.ussg.iu.edu/hypermail/linux/kernel/0508.2/0757.html

> I'm confused whether tcp_keepalive problem should be solved
> at the DB level.

According to that, Linux keepalive starts working once you have either
sent or received at least one byte over the connection. Therefore it's
not possible to get past the authentication stage without keepalive
being ready to go. And we do have a pretty short timeout on the auth
stage (1 minute if memory serves). So I'm not seeing what problem we
need to solve.

In any case, if you don't like that behavior methinks you need to be
lobbying some kernel hackers, not database weenies. Postgres is not
in the business of second-guessing the TCP stack.

regards, tom lane


From: Fujii Masao <fujii(dot)masao(at)oss(dot)ntt(dot)co(dot)jp>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #2576: tcp_keepalive doesn't work
Date: 2006-08-21 07:27:15
Message-ID: 44E96053.10200@oss.ntt.co.jp

Tom Lane wrote:
> According to that, Linux keepalive starts working once you have either
> sent or received at least one byte over the connection. Therefore it's
> not possible to get past the authentication stage without keepalive
> being ready to go. And we do have a pretty short timeout on the auth
> stage (1 minute if memory serves).

Do you mean authentication_timeout?
It's useless in my example because 'idle in transaction' backends
have already gotten past the auth stage.

> So I'm not seeing what problem we
> need to solve.

Please consider a system in which two AP (application) servers put load on
one DB server.
In that case, processing on the whole DB server might stop
when the LAN cable of just one AP server is unplugged.
This is because 'idle in transaction' backends that are not terminated by
tcp_keepalive keep holding their locks.

>
> In any case, if you don't like that behavior methinks you need to be
> lobbying some kernel hackers, not database weenies. Postgres is not
> in the business of second-guessing the TCP stack.

OK.
I also think that solving this problem at the kernel level is best.

best regards;


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Fujii Masao <fujii(dot)masao(at)oss(dot)ntt(dot)co(dot)jp>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #2576: tcp_keepalive doesn't work
Date: 2006-08-21 13:41:21
Message-ID: 11047.1156167681@sss.pgh.pa.us

Fujii Masao <fujii(dot)masao(at)oss(dot)ntt(dot)co(dot)jp> writes:
> Because 'idle in transaction' backends that don't terminate by tcp_keepalive
> keep holding the lock.

If a backend has gotten to the point of being 'idle in transaction',
then it's certainly exchanged data with the client, so TCP keepalive
should work. I suggest you need to file a kernel bug report if it
doesn't.

regards, tom lane