BUG #3266: SSL broken pipes kill the machine and fill the disk

Lists: pgsql-bugs
From: "Peter Koczan" <pjkoczan(at)gmail(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #3266: SSL broken pipes kill the machine and fill the disk
Date: 2007-05-10 20:14:39
Message-ID: 200705102014.l4AKEdbf033611@wwwmaster.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs


The following bug has been logged online:

Bug reference: 3266
Logged by: Peter Koczan
Email address: pjkoczan(at)gmail(dot)com
PostgreSQL version: 8.2.4
Operating system: CentOS Linux 4.4 (RHEL 4) running on Pentium 4
Description: SSL broken pipes kill the machine and fill the disk
Details:

If a connection using SSL is terminated on the client side before a query
completes, postgres keeps trying to write to the broken connection, shooting
CPU and load very high and filling the postgres syslog (I have that pointed
to /var/log/pglog) with ~2000 of the following messages per second.

May 10 14:45:01 mitchell postgres[10340]: [15729-1] LOG: SSL SYSCALL error:
Broken pipe

This quickly fills up the /var partition on the server.

To replicate the problem:
1. Connect to an running server using an SSL connection. Using psql is
fine.
2. Begin a query on any table. For full effect the query should be expensive
and large.
3. Kill psql *on the client side* BEFORE the query finishes (don't do
anything to the server side connection).
4. 'tail -f' wherever the postgres server output and error is going to.
5. Wait a few seconds while the server gets all of its data.
6. See thousands of error messages fill up your terminal on the server.

This has also happened when people stop web browsers in the middle of
serving up a postgresql-driven web page, but this is harder to replicate.

This usually terminates, but after 3 hours for a query that usually takes 20
seconds. During this time, the server is slow to the point of unusable.


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Peter Koczan <pjkoczan(at)gmail(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #3266: SSL broken pipes kill the machine and fill the disk
Date: 2007-05-17 16:14:08
Message-ID: 200705171614.l4HGE8B17487@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs


I didn't see any comment on this. Seems like a problem.

---------------------------------------------------------------------------

Peter Koczan wrote:
>
> The following bug has been logged online:
>
> Bug reference: 3266
> Logged by: Peter Koczan
> Email address: pjkoczan(at)gmail(dot)com
> PostgreSQL version: 8.2.4
> Operating system: CentOS Linux 4.4 (RHEL 4) running on Pentium 4
> Description: SSL broken pipes kill the machine and fill the disk
> Details:
>
> If a connection using SSL is terminated on the client side before a query
> completes, postgres keeps trying to write to the broken connection, shooting
> CPU and load very high and filling the postgres syslog (I have that pointed
> to /var/log/pglog) with ~2000 of the following messages per second.
>
> May 10 14:45:01 mitchell postgres[10340]: [15729-1] LOG: SSL SYSCALL error:
> Broken pipe
>
> This quickly fills up the /var partition on the server.
>
> To replicate the problem:
> 1. Connect to an running server using an SSL connection. Using psql is
> fine.
> 2. Begin a query on any table. For full effect the query should be expensive
> and large.
> 3. Kill psql *on the client side* BEFORE the query finishes (don't do
> anything to the server side connection).
> 4. 'tail -f' wherever the postgres server output and error is going to.
> 5. Wait a few seconds while the server gets all of its data.
> 6. See thousands of error messages fill up your terminal on the server.
>
> This has also happened when people stop web browsers in the middle of
> serving up a postgresql-driven web page, but this is harder to replicate.
>
> This usually terminates, but after 3 hours for a query that usually takes 20
> seconds. During this time, the server is slow to the point of unusable.
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Peter Koczan <pjkoczan(at)gmail(dot)com>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #3266: SSL broken pipes kill the machine and fill the disk
Date: 2007-05-17 20:42:41
Message-ID: 464CBE41.9020208@hagander.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

This looks a lot like bug #2829 (excep that one is Windows), as I
mentioned here:
http://archives.postgresql.org/pgsql-hackers/2007-05/msg00461.php

Haven't looked into the actual code, though, but Tom had a suggestion in
the original bug, but AFAIK nobody has done that yet (at least not me.:)

//Magnus

Bruce Momjian wrote:
> I didn't see any comment on this. Seems like a problem.
>
> ---------------------------------------------------------------------------
>
> Peter Koczan wrote:
>> The following bug has been logged online:
>>
>> Bug reference: 3266
>> Logged by: Peter Koczan
>> Email address: pjkoczan(at)gmail(dot)com
>> PostgreSQL version: 8.2.4
>> Operating system: CentOS Linux 4.4 (RHEL 4) running on Pentium 4
>> Description: SSL broken pipes kill the machine and fill the disk
>> Details:
>>
>> If a connection using SSL is terminated on the client side before a query
>> completes, postgres keeps trying to write to the broken connection, shooting
>> CPU and load very high and filling the postgres syslog (I have that pointed
>> to /var/log/pglog) with ~2000 of the following messages per second.
>>
>> May 10 14:45:01 mitchell postgres[10340]: [15729-1] LOG: SSL SYSCALL error:
>> Broken pipe
>>
>> This quickly fills up the /var partition on the server.
>>
>> To replicate the problem:
>> 1. Connect to an running server using an SSL connection. Using psql is
>> fine.
>> 2. Begin a query on any table. For full effect the query should be expensive
>> and large.
>> 3. Kill psql *on the client side* BEFORE the query finishes (don't do
>> anything to the server side connection).
>> 4. 'tail -f' wherever the postgres server output and error is going to.
>> 5. Wait a few seconds while the server gets all of its data.
>> 6. See thousands of error messages fill up your terminal on the server.
>>
>> This has also happened when people stop web browsers in the middle of
>> serving up a postgresql-driven web page, but this is harder to replicate.
>>
>> This usually terminates, but after 3 hours for a query that usually takes 20
>> seconds. During this time, the server is slow to the point of unusable.
>>
>> ---------------------------(end of broadcast)---------------------------
>> TIP 6: explain analyze is your friend
>


From: Peter Koczan <pjkoczan(at)gmail(dot)com>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #3266: SSL broken pipes kill the machine and fill the disk
Date: 2007-05-17 22:22:44
Message-ID: 464CD5B4.7020500@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Yes, #2829 seems quite similar to my plight. I did take a look through
the code tree and there appear to be checks for an EINTR status within
loops in src/backend/libpq/pqcomm.c (line 725 in function pq_recvbuf and
line 1057 in function internal_flush), that could point to the problem.
I don't know enough about OpenSSL and it took me a long time to find out
as much as I did.

FYI, I compiled against OpenSSL 0.9.8d, if that makes any difference.

Peter

Magnus Hagander wrote:
> This looks a lot like bug #2829 (excep that one is Windows), as I
> mentioned here:
> http://archives.postgresql.org/pgsql-hackers/2007-05/msg00461.php
>
> Haven't looked into the actual code, though, but Tom had a suggestion in
> the original bug, but AFAIK nobody has done that yet (at least not me.:)
>
> //Magnus
>
> Bruce Momjian wrote:
>
>> I didn't see any comment on this. Seems like a problem.
>>
>> ---------------------------------------------------------------------------
>>
>> Peter Koczan wrote:
>>
>>> The following bug has been logged online:
>>>
>>> Bug reference: 3266
>>> Logged by: Peter Koczan
>>> Email address: pjkoczan(at)gmail(dot)com
>>> PostgreSQL version: 8.2.4
>>> Operating system: CentOS Linux 4.4 (RHEL 4) running on Pentium 4
>>> Description: SSL broken pipes kill the machine and fill the disk
>>> Details:
>>>
>>> If a connection using SSL is terminated on the client side before a query
>>> completes, postgres keeps trying to write to the broken connection, shooting
>>> CPU and load very high and filling the postgres syslog (I have that pointed
>>> to /var/log/pglog) with ~2000 of the following messages per second.
>>>
>>> May 10 14:45:01 mitchell postgres[10340]: [15729-1] LOG: SSL SYSCALL error:
>>> Broken pipe
>>>
>>> This quickly fills up the /var partition on the server.
>>>
>>> To replicate the problem:
>>> 1. Connect to an running server using an SSL connection. Using psql is
>>> fine.
>>> 2. Begin a query on any table. For full effect the query should be expensive
>>> and large.
>>> 3. Kill psql *on the client side* BEFORE the query finishes (don't do
>>> anything to the server side connection).
>>> 4. 'tail -f' wherever the postgres server output and error is going to.
>>> 5. Wait a few seconds while the server gets all of its data.
>>> 6. See thousands of error messages fill up your terminal on the server.
>>>
>>> This has also happened when people stop web browsers in the middle of
>>> serving up a postgresql-driven web page, but this is harder to replicate.
>>>
>>> This usually terminates, but after 3 hours for a query that usually takes 20
>>> seconds. During this time, the server is slow to the point of unusable.
>>>
>>> ---------------------------(end of broadcast)---------------------------
>>> TIP 6: explain analyze is your friend
>>>
>
>
>


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Peter Koczan <pjkoczan(at)gmail(dot)com>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #3266: SSL broken pipes kill the machine and fill the disk
Date: 2007-05-17 22:43:03
Message-ID: 200705172243.l4HMh3o29850@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs


Added to TODO:

* Fix problem with excessive logging during SSL disconnection

http://archives.postgresql.org/pgsql-bugs/2006-12/msg00122.php
http://archives.postgresql.org/pgsql-bugs/2007-05/msg00065.php

---------------------------------------------------------------------------

Peter Koczan wrote:
> Yes, #2829 seems quite similar to my plight. I did take a look through
> the code tree and there appear to be checks for an EINTR status within
> loops in src/backend/libpq/pqcomm.c (line 725 in function pq_recvbuf and
> line 1057 in function internal_flush), that could point to the problem.
> I don't know enough about OpenSSL and it took me a long time to find out
> as much as I did.
>
> FYI, I compiled against OpenSSL 0.9.8d, if that makes any difference.
>
> Peter
>
> Magnus Hagander wrote:
> > This looks a lot like bug #2829 (excep that one is Windows), as I
> > mentioned here:
> > http://archives.postgresql.org/pgsql-hackers/2007-05/msg00461.php
> >
> > Haven't looked into the actual code, though, but Tom had a suggestion in
> > the original bug, but AFAIK nobody has done that yet (at least not me.:)
> >
> > //Magnus
> >
> > Bruce Momjian wrote:
> >
> >> I didn't see any comment on this. Seems like a problem.
> >>
> >> ---------------------------------------------------------------------------
> >>
> >> Peter Koczan wrote:
> >>
> >>> The following bug has been logged online:
> >>>
> >>> Bug reference: 3266
> >>> Logged by: Peter Koczan
> >>> Email address: pjkoczan(at)gmail(dot)com
> >>> PostgreSQL version: 8.2.4
> >>> Operating system: CentOS Linux 4.4 (RHEL 4) running on Pentium 4
> >>> Description: SSL broken pipes kill the machine and fill the disk
> >>> Details:
> >>>
> >>> If a connection using SSL is terminated on the client side before a query
> >>> completes, postgres keeps trying to write to the broken connection, shooting
> >>> CPU and load very high and filling the postgres syslog (I have that pointed
> >>> to /var/log/pglog) with ~2000 of the following messages per second.
> >>>
> >>> May 10 14:45:01 mitchell postgres[10340]: [15729-1] LOG: SSL SYSCALL error:
> >>> Broken pipe
> >>>
> >>> This quickly fills up the /var partition on the server.
> >>>
> >>> To replicate the problem:
> >>> 1. Connect to an running server using an SSL connection. Using psql is
> >>> fine.
> >>> 2. Begin a query on any table. For full effect the query should be expensive
> >>> and large.
> >>> 3. Kill psql *on the client side* BEFORE the query finishes (don't do
> >>> anything to the server side connection).
> >>> 4. 'tail -f' wherever the postgres server output and error is going to.
> >>> 5. Wait a few seconds while the server gets all of its data.
> >>> 6. See thousands of error messages fill up your terminal on the server.
> >>>
> >>> This has also happened when people stop web browsers in the middle of
> >>> serving up a postgresql-driven web page, but this is harder to replicate.
> >>>
> >>> This usually terminates, but after 3 hours for a query that usually takes 20
> >>> seconds. During this time, the server is slow to the point of unusable.
> >>>
> >>> ---------------------------(end of broadcast)---------------------------
> >>> TIP 6: explain analyze is your friend
> >>>
> >
> >
> >

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Peter Koczan <pjkoczan(at)gmail(dot)com>
To: Peter Koczan <pjkoczan(at)gmail(dot)com>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #3266: SSL broken pipes kill the machine and fill the disk
Date: 2007-05-17 23:35:21
Message-ID: 464CE6B9.8020304@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

One more quick addendum...I tried this with non-SSL connections, and
this problem did *not* arise when using non-SSL connections.

Peter Koczan wrote:
> Yes, #2829 seems quite similar to my plight. I did take a look through
> the code tree and there appear to be checks for an EINTR status within
> loops in src/backend/libpq/pqcomm.c (line 725 in function pq_recvbuf
> and line 1057 in function internal_flush), that could point to the
> problem. I don't know enough about OpenSSL and it took me a long time
> to find out as much as I did.
>
> FYI, I compiled against OpenSSL 0.9.8d, if that makes any difference.
>
> Peter
>
> Magnus Hagander wrote:
>> This looks a lot like bug #2829 (excep that one is Windows), as I
>> mentioned here:
>> http://archives.postgresql.org/pgsql-hackers/2007-05/msg00461.php
>>
>> Haven't looked into the actual code, though, but Tom had a suggestion in
>> the original bug, but AFAIK nobody has done that yet (at least not me.:)
>>
>> //Magnus
>>
>> Bruce Momjian wrote:
>>
>>> I didn't see any comment on this. Seems like a problem.
>>>
>>> ---------------------------------------------------------------------------
>>>
>>>
>>> Peter Koczan wrote:
>>>
>>>> The following bug has been logged online:
>>>>
>>>> Bug reference: 3266
>>>> Logged by: Peter Koczan
>>>> Email address: pjkoczan(at)gmail(dot)com
>>>> PostgreSQL version: 8.2.4
>>>> Operating system: CentOS Linux 4.4 (RHEL 4) running on Pentium 4
>>>> Description: SSL broken pipes kill the machine and fill the
>>>> disk
>>>> Details:
>>>> If a connection using SSL is terminated on the client side before a
>>>> query
>>>> completes, postgres keeps trying to write to the broken connection,
>>>> shooting
>>>> CPU and load very high and filling the postgres syslog (I have that
>>>> pointed
>>>> to /var/log/pglog) with ~2000 of the following messages per second.
>>>>
>>>> May 10 14:45:01 mitchell postgres[10340]: [15729-1] LOG: SSL
>>>> SYSCALL error:
>>>> Broken pipe
>>>>
>>>> This quickly fills up the /var partition on the server.
>>>>
>>>> To replicate the problem:
>>>> 1. Connect to an running server using an SSL connection. Using psql is
>>>> fine.
>>>> 2. Begin a query on any table. For full effect the query should be
>>>> expensive
>>>> and large.
>>>> 3. Kill psql *on the client side* BEFORE the query finishes (don't do
>>>> anything to the server side connection).
>>>> 4. 'tail -f' wherever the postgres server output and error is going
>>>> to.
>>>> 5. Wait a few seconds while the server gets all of its data.
>>>> 6. See thousands of error messages fill up your terminal on the
>>>> server.
>>>>
>>>> This has also happened when people stop web browsers in the middle of
>>>> serving up a postgresql-driven web page, but this is harder to
>>>> replicate.
>>>>
>>>> This usually terminates, but after 3 hours for a query that usually
>>>> takes 20
>>>> seconds. During this time, the server is slow to the point of
>>>> unusable.
>>>>
>>>> ---------------------------(end of
>>>> broadcast)---------------------------
>>>> TIP 6: explain analyze is your friend
>>>>
>>
>>
>>
>
>


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Peter Koczan <pjkoczan(at)gmail(dot)com>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #3266: SSL broken pipes kill the machine and fill the disk
Date: 2007-05-18 00:47:13
Message-ID: 16441.1179449233@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Magnus Hagander <magnus(at)hagander(dot)net> writes:
> This looks a lot like bug #2829 (excep that one is Windows), as I
> mentioned here:
> http://archives.postgresql.org/pgsql-hackers/2007-05/msg00461.php
> Haven't looked into the actual code, though, but Tom had a suggestion in
> the original bug, but AFAIK nobody has done that yet (at least not me.:)

I reproduced this on my own machine, and basically the problem seems to
be that secure_write() has been coded to bleat on every failure. This
behavior overrides the intelligence that was put into pqcomm.c's
internal_flush() a long time ago to not report consecutive write
failures ... which worked fine at the time it was written, because it
was just calling send() not secure_write(). secure_write is obviously
inconsistent anyway, since it doesn't elog anything in the non-SSL path.

Proposed fix:

1. For cases where errno conveys all the useful info (ie,
SSL_ERROR_SYSCALL), secure_write should elog nothing and just let
its caller do it, same as the plain send() path.

2. For SSL protocol errors (SSL_ERROR_SSL), we do want to print the
error at least once. It is not clear whether repeated calls would
be likely to produce the same failure, and we don't have any cheap
way to tell whether the messages are duplicate. I'm inclined to leave
that path alone until/unless we get reports of many duplicate messages
from it.

regards, tom lane