Quick Links

Re: libpq SSL with non-blocking sockets

Lists:	pgsql-hackers

From:	Martin Pihlak <martin(dot)pihlak(at)gmail(dot)com>
To:	PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	libpq SSL with non-blocking sockets
Date:	2011-06-09 14:39:33
Message-ID:	4DF0DB25.9050105@gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

I believe I have found a bug in libpq COPY handling with non-blocking
SSL connections. The bug manifests itself by dropping the connection
in PGputCopyData() with "server closed the connection unexpectedly"
message. The connection drop only occurs with nonblocking connections -
blocking connections work as expected.

I'll skip a lot of debugging details, but what I found out was that
the connection drop was caused by a SSL "bad write retry" from
SSL_write(). Which in turn appears to be caused by bad data passed to
SSL_write() retry. So, what happened was that at some point SSL_write()
returned SSL_ERROR_WANT_WRITE meaning that the write needs to be
retried. Now, according to OpenSSL manual the retry SSL_write needs to
be passed the same arguments as the previous call:
http://www.openssl.org/docs/ssl/SSL_write.html#WARNING

However the next SSL_write was passed another buffer with a different
length and it failed with SSL_ERROR_SSL "bad write retry". Which in
turn caused the pqSendSome to drop the connection and failed the
PQputCopyData. Actually, now that I think of it, this is probably not
only related to COPY handling ...

The connection drop can be reproduced by introducing some "network
problems" to the connection, so that the SSL_write needs to be retried.
This can be most easily accomplished by passing a huge buffer to
PQputCopyData. For instance on my laptop I can do about 3 PQputCopyData
with a 100MB buffer before the connection is dropped. Others results may
vary. Also, it doesn't seem to matter what the libpq version is
used - I initially started on 8.3, but later confirmed that the same
problem exists on 8.4, 9.0 and 9.1. Tested with OpenSSL 0.9.8g and
0.9.8o, no idea about other versions.

The following is a log excerpt from my test program (attached) with some
extra logging added to libpq:

PQputCopyData #0 buf:100000000 total:0
pqFlush()
pqFlush() = 0
pqPutMsgEnd()
pqSendSome(99999744)
SSL_write(0x7e59d60, 0x7ff5a7eac010, 99999744) = 99999744:
SSL_error: 0, errno: 0
pqPutMsgEnd() = 0
PQputCopyData #1 buf:100000000 total:100000000
pqPutMsgEnd()
pqSendSome(99999744)
SSL_write(0x7e59d60, 0x7ff5a7eac010, 99999744) = -1: SSL_error: 3,
errno: 11
pqPutMsgEnd() = 0
PQputCopyData #2 buf:100000000 total:200000000
pqFlush()
pqSendSome(100000266)
SSL_write(0x7e59d60, 0x7ff5a7eac010, 100000266) = -1: SSL_error:
3, errno: 11
pqFlush() = 1
pqPutMsgEnd()
pqSendSome(199999488)
SSL_write(0x7e59d60, 0x7ff597eab010, 199999488) = -1: SSL_error:
1, errno: 0
SSL error: bad write retry
pqPutMsgEnd() = -1
PQputCopyData() = -1: server closed the connection unexpectedly

In PQputCopyData #2 it is visible that the first SSL_write called from
pqFlush failed with SSL_ERROR_WANT_WRITE. The next SSL_write should
have been a retry with the same parameters, but instead was passed a
buffer with a different address and length. Hence the "bad write
retry". Some googling turned out similar issues for other projects
using SSL with non-blocking sockets.

The possible workarounds are to disable SSL or to disable non-blocking
libpq connections. Both are not always possible - security reasons, 3rd
party applications, drivers, etc. So I think this should be fixed in
libpq. Not sure exactly how though. It would seem that for the
PQputCopyData the best would be to return 0 to indicate that the
operation should be retried. No idea for the other possible cases of
SSL_write() retry though. What do you think?

regards,
Martin

Attachment	Content-Type	Size
copy_test.c	text/x-csrc	2.2 KB

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Martin Pihlak <martin(dot)pihlak(at)gmail(dot)com>
Cc:	PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: libpq SSL with non-blocking sockets
Date:	2011-06-12 01:22:10
Message-ID:	BANLkTikUVp+DnKj=7hhi=FCWEwCDuRcohw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Jun 9, 2011 at 10:39 AM, Martin Pihlak <martin(dot)pihlak(at)gmail(dot)com> wrote:
> In PQputCopyData #2 it is visible that the first SSL_write called from
> pqFlush failed with SSL_ERROR_WANT_WRITE. The next SSL_write should
> have been a retry with the same parameters, but instead was passed a
> buffer with a different address and length. Hence the "bad write
> retry". Some googling turned out similar issues for other projects
> using SSL with non-blocking sockets.

I'm not surprised. Setting the API so that it requires the same
buffer (and not just a buffer pointing to the same data, or with the
same initial contents) was maybe not the greatest design decision.

> The possible workarounds are to disable SSL or to disable non-blocking
> libpq connections. Both are not always possible - security reasons, 3rd
> party applications, drivers, etc. So I think this should be fixed in
> libpq. Not sure exactly how though. It would seem that for the
> PQputCopyData the best would be to return 0 to indicate that the
> operation should be retried. No idea for the other possible cases of
> SSL_write() retry though. What do you think?

It seems to me we should try to paper over the problem within libpq
rather than allowing it to percolate any higher up the call stack.
One idea is that we could add outBuffer2/outBufSize2 to struct
pg_conn, or something along those lines with less obviously stupid
naming. Normally those would be unused, but in the special case where
SSL indicates that we must retry the call with the same arguments, we
set a flag that "freezes" the out buffer and forces any further data
to be stuffed into outBuffer2. If or when the operation finally
succeeds, we then move the data from outBuffer2 into outBuffer.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Martin Pihlak <martin(dot)pihlak(at)gmail(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: libpq SSL with non-blocking sockets
Date:	2011-06-15 19:20:34
Message-ID:	4DF90602.7060201@gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 06/12/2011 04:22 AM, Robert Haas wrote:
> One idea is that we could add outBuffer2/outBufSize2 to struct
> pg_conn, or something along those lines with less obviously stupid
> naming. Normally those would be unused, but in the special case where
> SSL indicates that we must retry the call with the same arguments, we
> set a flag that "freezes" the out buffer and forces any further data
> to be stuffed into outBuffer2. If or when the operation finally
> succeeds, we then move the data from outBuffer2 into outBuffer.
>

Yes, that sounds like a good idea -- especially considering that COPY
is not the only operation that can cause SSL_write retries.

Attached is a first attempt at a patch to implement the described two
buffer approach. This modifies pqSendSome so that whenever a SSL
write retry is needed it saves the current outBuffer with its length
and attempted write size to connection's sslRetry* variables. A new
outBuffer is then allocated and used for any further data pushing.

After the SSL write retry buffer is set up, any further calls to
pqSendSome will first attempt to send the contents of the retry buffer,
returning 1 to indicate that not all of the data could be sent. If the
retry buffer is finally emptied it is freed and pqSendSome starts
sending from the regular outBuffer.

This is of course still "work in progress", needs cleaning up and
definitely more testing. But at this point before going any further,
I'd really appreciate a quick review from resident libpq gurus.

regards,
Martin

Attachment	Content-Type	Size
ssl-write-retry.patch	text/x-patch	5.3 KB

From:	Steve Singer <ssinger_pg(at)sympatico(dot)ca>
To:	Martin Pihlak <martin(dot)pihlak(at)gmail(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: libpq SSL with non-blocking sockets
Date:	2011-06-24 21:14:39
Message-ID:	BLU0-SMTP96A82532519FCF367B52CF8E520@phx.gbl
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 11-06-15 03:20 PM, Martin Pihlak wrote:
>
> Yes, that sounds like a good idea -- especially considering that COPY
> is not the only operation that can cause SSL_write retries.
>
>
> This is of course still "work in progress", needs cleaning up and
> definitely more testing. But at this point before going any further,
> I'd really appreciate a quick review from resident libpq gurus.
>

Martin,

I'm not a libpq guru but I've given your patch a look over.

The idea of storing the original buffer in new members of connection
structure for later re-use seems like a good way to approach this.

A few things I noticed (that you might be aware of since you mentioned
it needs cleanup)

-The patch doesn't compile with non-ssl builds, the debug at the bottom
of PQSendSome isn't in an #ifdef

-I don't think your handling the return code properly. Consider this case.

pqSendSome(some data)
sslRetryBuf = some Data
return 1
pqSendSome(more data)
it sends all of 'some data'
returns 0

I think 1 should be returned because all of 'more data' still needs to
be sent. I think returning a 0 will break PQsetnonblocking if you call
it when there is data in both sslRetryBuf and outputBuffer.
We might even want to try sending the data in outputBuffer after we've
sent all the data sitting in sslRetryBuf.

If you close the connection with an outstanding sslRetryBuf you need to
free it.

Other than running your test program I didn't do any testing with this
patch.

Steve

> regards,
> Martin
>
>
>

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Steve Singer <ssinger_pg(at)sympatico(dot)ca>
Cc:	Martin Pihlak <martin(dot)pihlak(at)gmail(dot)com>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: libpq SSL with non-blocking sockets
Date:	2011-06-27 14:25:09
Message-ID:	BANLkTikbfkGCJuyiUdr=GOfjTc7=3q3x1w@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Jun 24, 2011 at 5:14 PM, Steve Singer <ssinger_pg(at)sympatico(dot)ca> wrote:
> A few things I noticed (that you might be aware of since you mentioned it
> needs cleanup)
>
> -The patch doesn't compile with non-ssl builds, the debug at the bottom of
> PQSendSome isn't in an #ifdef
>
> -I don't think your handling the return code properly. Consider this case.
>
> pqSendSome(some data)
> sslRetryBuf = some Data
> return 1
> pqSendSome(more data)
> it sends all of 'some data'
> returns 0
>
> I think 1 should be returned because all of 'more data' still needs to be
> sent. I think returning a 0 will break PQsetnonblocking if you call it when
> there is data in both sslRetryBuf and outputBuffer.
> We might even want to try sending the data in outputBuffer after we've sent
> all the data sitting in sslRetryBuf.
>
>
> If you close the connection with an outstanding sslRetryBuf you need to free
> it.

Based on these comments, I have updated the status of the patch to
"Waiting on Author".

https://commitfest.postgresql.org/action/patch_view?id=594

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Martin Pihlak <martin(dot)pihlak(at)gmail(dot)com>
To:	Steve Singer <ssinger_pg(at)sympatico(dot)ca>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: libpq SSL with non-blocking sockets
Date:	2011-06-28 18:14:44
Message-ID:	4E0A1A14.2070709@gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 06/25/2011 12:14 AM, Steve Singer wrote:
> I'm not a libpq guru but I've given your patch a look over.
>

Thanks for the review!

I have since simplified the patch to assume that partial SSL writes are
disabled -- according to SSL_write(3) this is the default behaviour.
Now the SSL retry buffer only holds the data to be retried, the
remainder is moved to the new outBuffer.

> -The patch doesn't compile with non-ssl builds, the debug at the bottom
> of PQSendSome isn't in an #ifdef
>

Ack, there was another one in pqFlush. Fixed that.

> -I don't think your handling the return code properly. Consider this case.
>
> pqSendSome(some data)
> sslRetryBuf = some Data
> return 1
> pqSendSome(more data)
> it sends all of 'some data'
> returns 0
>
> I think 1 should be returned because all of 'more data' still needs to
> be sent. I think returning a 0 will break PQsetnonblocking if you call
> it when there is data in both sslRetryBuf and outputBuffer.

Hmm, I thought I thought about that. There was a check in the original
patch: "if (conn->sslRetryBytes || (conn->outCount - remaining) > 0)"
So if the SSL retry buffer was emptied it would return 1 if there was
something left in the regular output buffer. Or did I miss something
there?

> We might even want to try sending the data in outputBuffer after we've
> sent all the data sitting in sslRetryBuf.
>

IMHO that'd add unnecessary complexity to the pqSendSome. As this only
happens in non-blocking mode the caller should be well prepared to
handle the retry.

> If you close the connection with an outstanding sslRetryBuf you need to
> free it.

Fixed.

New version of the patch attached.

regards,
Martin

Attachment	Content-Type	Size
ssl-write-retry-2.patch	text/x-patch	4.8 KB

From:	Steve Singer <ssinger_pg(at)sympatico(dot)ca>
To:	Martin Pihlak <martin(dot)pihlak(at)gmail(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: libpq SSL with non-blocking sockets
Date:	2011-06-30 13:25:14
Message-ID:	BLU0-SMTP93E0029461ABF3EFF6DDF98E580@phx.gbl
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 11-06-28 02:14 PM, Martin Pihlak wrote:
>
> Hmm, I thought I thought about that. There was a check in the original
> patch: "if (conn->sslRetryBytes || (conn->outCount - remaining)> 0)"
> So if the SSL retry buffer was emptied it would return 1 if there was
> something left in the regular output buffer. Or did I miss something
> there?
>

The issue I saw in the original patch was that at that point in the
function, sslRetryBytes could be zero (if the data was sent) but
conn->outCount - remaining would be an incorrect value since "remaining"
is the number of bytes left to send from sslRetryBuf NOT
conn->outBuffer. This is no longer an issue in the updated patch.
I will try to take a closer look at your updated patch in the next few days.

From:	Steve Singer <ssinger_pg(at)sympatico(dot)ca>
To:	Martin Pihlak <martin(dot)pihlak(at)gmail(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: libpq SSL with non-blocking sockets
Date:	2011-07-03 02:08:28
Message-ID:	BLU0-SMTP6343F49046496C3DB41A4B8E5D0@phx.gbl
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 11-06-28 02:14 PM, Martin Pihlak wrote:
> Thanks for the review!
> I have since simplified the patch to assume that partial SSL writes are
> disabled -- according to SSL_write(3) this is the default behaviour.
> Now the SSL retry buffer only holds the data to be retried, the
> remainder is moved to the new outBuffer.
>

That sounds okay. Does it make sense to add in a check to verify that
SSL didn't send a partial write? I don't know how bad openssl is about
changing default behaviours or if we are concerned about protecting
against someone changing the SSL parameters. My inclination is that
this isn't needed but I'll raise the issue.
Fixed.
> New version of the patch attached.
>

Otherwise this version of the patch looks good to me.

The only testing I have done is running the test program you sent
earlier on in the thread and verified that the regression tests all
pass. Other than something like your test program I'm not sure how else
this bug can be induced.

Since the original patch was submitted as a WIP patch and this version
wasn't sent until well into the commit fest I am not sure if it
qualifies for a committer during this commitfest or if it needs to wait
until the next one.

> regards,
> Martin
>
>
>

From:	Martin Pihlak <martin(dot)pihlak(at)gmail(dot)com>
To:	Steve Singer <ssinger_pg(at)sympatico(dot)ca>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: libpq SSL with non-blocking sockets
Date:	2011-07-08 13:36:00
Message-ID:	4E1707C0.9010601@gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 07/03/2011 05:08 AM, Steve Singer wrote:
> Since the original patch was submitted as a WIP patch and this version
> wasn't sent until well into the commit fest I am not sure if it
> qualifies for a committer during this commitfest or if it needs to wait
> until the next one.
>

If possible I would like the fix to be backported as well. This is
quite a nasty bug and difficult to track down. Especially as the
actual SSL error messages are masked by the "server closed the
connection unexpectedly" message.

regards,
Martin

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Martin Pihlak <martin(dot)pihlak(at)gmail(dot)com>
Cc:	Steve Singer <ssinger_pg(at)sympatico(dot)ca>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: libpq SSL with non-blocking sockets
Date:	2011-07-09 03:33:54
Message-ID:	CA+TgmobVFzT2O9M9PgYRnQ0tz5SrLNg0Ths0b-tfz69LvWMxiw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Jul 8, 2011 at 9:36 AM, Martin Pihlak <martin(dot)pihlak(at)gmail(dot)com> wrote:
> On 07/03/2011 05:08 AM, Steve Singer wrote:
>> Since the original patch was submitted as a WIP patch and this version
>> wasn't sent until well into the commit fest I am not sure if it
>> qualifies for a committer during this commitfest or if it needs to wait
>> until the next one.
>
> If possible I would like the fix to be backported as well. This is
> quite a nasty bug and difficult to track down. Especially as the
> actual SSL error messages are masked by the "server closed the
> connection unexpectedly" message.

I would not be inclined to back-patch this straight away. I agree
it's an important fix, but I am a little worried about the chances of
breaking something else... then again, I don't have the only vote
here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Martin Pihlak <martin(dot)pihlak(at)gmail(dot)com>, Steve Singer <ssinger_pg(at)sympatico(dot)ca>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: libpq SSL with non-blocking sockets
Date:	2011-07-15 21:46:49
Message-ID:	10265.1310766409@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Fri, Jul 8, 2011 at 9:36 AM, Martin Pihlak <martin(dot)pihlak(at)gmail(dot)com> wrote:
>> On 07/03/2011 05:08 AM, Steve Singer wrote:
>>> Since the original patch was submitted as a WIP patch and this version
>>> wasn't sent until well into the commit fest I am not sure if it
>>> qualifies for a committer during this commitfest or if it needs to wait
>>> until the next one.

>> If possible I would like the fix to be backported as well. This is
>> quite a nasty bug and difficult to track down. Especially as the
>> actual SSL error messages are masked by the "server closed the
>> connection unexpectedly" message.

> I would not be inclined to back-patch this straight away. I agree
> it's an important fix, but I am a little worried about the chances of
> breaking something else... then again, I don't have the only vote
> here.

I reviewed this patch a bit. I agree that we have a bug here that
should be fixed and backported, but I don't think this patch is ready.
Some problems:

1. The behavior under low-memory conditions is pretty intolerable.
As coded, a malloc failure here:

+ conn->outBufSize = Max(16 * 1024, remaining);
+ conn->outBuffer = malloc(conn->outBufSize);
+ if (conn->outBuffer == NULL)
+ {
+ printfPQExpBuffer(&conn->errorMessage,
+ "cannot allocate memory for output buffer\n");
+ return -1;
+ }

leaves libpq's internal state completely corrupt --- outBuffer is null,
which will result in instant coredump on any future access, but that's
just the tip of the iceberg because we've also lost buffered data and
messed up the none-too-clearly-defined-anyway state of which pending
data is in which buffer.

In general, I don't think it's a smart idea to proceed by duplicating
the buffer contents in this situation, particularly not if the most
obvious way to cause the problem is to have a very large buffer :-(.
I think the direction to move in ought to be to use the existing buffer
as-is, and have pqCheckOutBufferSpace refuse to enlarge the buffer while
we are in "write frozen" state. It should be OK to append data to the
buffer, though, so long as we remember how much we're allowed to pass to
SSL_write when we next try to write.

2. According to the OpenSSL man pages, *both* SSL_read and SSL_write are
defined to need to be re-called with the identical arguments after a
WANT_READ or WANT_WRITE return. This means that pqReadData() is also at
risk for this type of bug. I believe it could only happen in code paths
where we call pqReadData when there is already some data in the buffer:
if the caller then consumes some of that data, the next call to
pqReadData would try to compact the buffer contents before making the
pqsecure_read call. I think we probably need a "read frozen" state in
which we won't enlarge or compact the inBuffer until SSL_read succeeds.

3. As of 9.0, somebody's decided that the backend needs nonblock read
semantics in some cases. I probably should have complained harder about
that before it went in, but since it's in, the backend is at risk for
this same type of issue in secure_read().

I will mark the patch Returned With Feedback. I think that we need to
address all these issues before we consider applying it. If we weren't
hard up against the end of the CommitFest I might have a go at it
myself, but I can't justify the time right now.

regards, tom lane

From:	Martin Pihlak <martin(dot)pihlak(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <ssinger_pg(at)sympatico(dot)ca>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: libpq SSL with non-blocking sockets
Date:	2011-07-24 12:55:03
Message-ID:	4E2C1627.2030908@gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 07/16/2011 12:46 AM, Tom Lane wrote:
> I think the direction to move in ought to be to use the existing buffer
> as-is, and have pqCheckOutBufferSpace refuse to enlarge the buffer while
> we are in "write frozen" state. It should be OK to append data to the
> buffer, though, so long as we remember how much we're allowed to pass to
> SSL_write when we next try to write.

Alternative to freezing the outBuffer would be to set
SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER mode during SSL initialisation.
That would enable the buffer address to be changed in-between the
SSL_write calls, so long as the content remains the same. Attached
is a patch that uses the single buffer approach described by Tom, but
with a moving SSL write buffer enabled.

Modifying pqCheckOutBufferSpace is also an option, but it'd break some
(arguably already broken) client applications that don't have proper
retry handling. Notably some versions of psycopg2 have problems with
handling zero return values from PQputCopyData. So ISTM that from
backporting perspective the moving write buffer is a bit safer.

I'll see if I can come up with a test case for the SSL_read retry before
attempting to fix that.

regards,
Martin

Attachment	Content-Type	Size
ssl-write-retry-3.patch	text/x-patch	2.7 KB

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Martin Pihlak <martin(dot)pihlak(at)gmail(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <ssinger_pg(at)sympatico(dot)ca>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: libpq SSL with non-blocking sockets
Date:	2011-07-24 16:22:25
Message-ID:	13077.1311524545@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Martin Pihlak <martin(dot)pihlak(at)gmail(dot)com> writes:
> On 07/16/2011 12:46 AM, Tom Lane wrote:
>> I think the direction to move in ought to be to use the existing buffer
>> as-is, and have pqCheckOutBufferSpace refuse to enlarge the buffer while
>> we are in "write frozen" state. It should be OK to append data to the
>> buffer, though, so long as we remember how much we're allowed to pass to
>> SSL_write when we next try to write.

> Alternative to freezing the outBuffer would be to set
> SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER mode during SSL initialisation.

Hmmm ... that is a really interesting flag. The documentation is
inadequate, but I googled a bit and found this thread about it:
http://www.mail-archive.com/openssl-users(at)openssl(dot)org/msg45440.html
in which it's stated that (1) the prohibition on moving the buffer is
actually just a sort of sanity check, and can be turned off using this
flag; (2) it is okay to increase, but not reduce, the length parameter
to SSL_write in the next call after a WANT_READ/WANT_WRITE return.

So this does look like it'd fix the issue for SSL_write, without needing
to introduce a concept of a "write frozen" buffer state. I am wondering
though how far back support for this flag exists in OpenSSL, and whether
the situation is the same for SSL_read or not. I don't see any
SSL_MODE_ACCEPT_MOVING_READ_BUFFER macro ...

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Martin Pihlak <martin(dot)pihlak(at)gmail(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <ssinger_pg(at)sympatico(dot)ca>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: libpq SSL with non-blocking sockets
Date:	2011-07-24 16:45:35
Message-ID:	13539.1311525935@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

I wrote:
> So this does look like it'd fix the issue for SSL_write, without needing
> to introduce a concept of a "write frozen" buffer state. I am wondering
> though how far back support for this flag exists in OpenSSL,

A bit of archaeology reveals that the flag was introduced in OpenSSL
0.9.4, released in 1999. So it's probably Old Enough. (BTW, the 0.9.4
changelog credits this change to one Bodo Moeller ... so the comments
from him in the other thread I linked to can be considered authoritative
...)

Still wondering about the SSL_read end of it, though.

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Martin Pihlak <martin(dot)pihlak(at)gmail(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <ssinger_pg(at)sympatico(dot)ca>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: libpq SSL with non-blocking sockets
Date:	2011-07-24 17:10:36
Message-ID:	13973.1311527436@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

I wrote:
> Still wondering about the SSL_read end of it, though.

And on that front, some digging around in the OpenSSL source code
indicates that they do all their work in internal buffers, and transfer
data into SSL_read's result buffer only when ready to return it.
So the claim in the documentation that SSL_read has a restriction
comparable to SSL_write is a lie: there is no case where they'll copy
some data into the buffer and then return -1.

So the SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER solution looks like a good
fix. I'll see about applying it.

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Martin Pihlak <martin(dot)pihlak(at)gmail(dot)com>
Cc:	Steve Singer <ssinger_pg(at)sympatico(dot)ca>, Robert Haas <robertmhaas(at)gmail(dot)com>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: libpq SSL with non-blocking sockets
Date:	2011-07-24 20:33:40
Message-ID:	5219.1311539620@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Martin Pihlak <martin(dot)pihlak(at)gmail(dot)com> writes:
> If possible I would like the fix to be backported as well. This is
> quite a nasty bug and difficult to track down. Especially as the
> actual SSL error messages are masked by the "server closed the
> connection unexpectedly" message.

I've applied the simplified fix (just set SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER)
as well as a patch to improve the error reporting situation.

regards, tom lane

From:	Peter Geoghegan <peter(at)2ndquadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Martin Pihlak <martin(dot)pihlak(at)gmail(dot)com>, Steve Singer <ssinger_pg(at)sympatico(dot)ca>, Robert Haas <robertmhaas(at)gmail(dot)com>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: libpq SSL with non-blocking sockets
Date:	2011-07-24 23:49:53
Message-ID:	CAEYLb_XP4tjTr8EvFwraAwbBVTjPWtLvkgoxMf1Dixktm38_8g@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 24 July 2011 21:33, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I've applied the simplified fix (just set SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER)
> as well as a patch to improve the error reporting situation.

I'm not exactly sure why, and don't have time to investigate right
now, but this commit (fee476da952a1f02f7ccf6e233fb4824c2bf6af4)
appears to have broken the build for me:

gcc -O0 -g -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels -Wformat-security
-fno-strict-aliasing -fwrapv -g -pthread -D_REENTRANT -D_THREAD_SAFE
-D_POSIX_PTHREAD_SEMANTICS -fpic -DFRONTEND -DUNSAFE_STAT_OK -I.
-I../../../src/include -D_GNU_SOURCE -I../../../src/port
-I../../../src/port -DSO_MAJOR_VERSION=5 -c -o fe-misc.o fe-misc.c
-MMD -MP -MF .deps/fe-misc.Po
fe-misc.c: In function ‘pqReadData’:
fe-misc.c:651:11: error: ‘PGconn’ has no member named ‘ssl’
fe-misc.c:743:11: error: ‘PGconn’ has no member named ‘ssl’
fe-misc.c:761:10: error: ‘PGconn’ has no member named ‘ssl’
fe-misc.c: In function ‘pqSendSome’:
fe-misc.c:841:14: error: ‘PGconn’ has no member named ‘ssl’
fe-misc.c:861:14: error: ‘PGconn’ has no member named ‘ssl’

The problem goes away if I check out the commit made immediately prior
to this one - d0c23026b2499ba9d6797359241ade076a5a677d. I'm building
with my usual, rather generic settings for hacking.

--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	Peter Geoghegan <peter(at)2ndquadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Martin Pihlak <martin(dot)pihlak(at)gmail(dot)com>, Steve Singer <ssinger_pg(at)sympatico(dot)ca>, Robert Haas <robertmhaas(at)gmail(dot)com>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: libpq SSL with non-blocking sockets
Date:	2011-07-25 00:55:51
Message-ID:	4E2CBF17.4040105@dunslane.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 07/24/2011 07:49 PM, Peter Geoghegan wrote:
> On 24 July 2011 21:33, Tom Lane<tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> I've applied the simplified fix (just set SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER)
>> as well as a patch to improve the error reporting situation.
> I'm not exactly sure why, and don't have time to investigate right
> now, but this commit (fee476da952a1f02f7ccf6e233fb4824c2bf6af4)
> appears to have broken the build for me:
>
>

Yeah, looks like it fails unless --with-openssl is set.

cheers

andrew

From:	Martin Pihlak <martin(dot)pihlak(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Steve Singer <ssinger_pg(at)sympatico(dot)ca>, Robert Haas <robertmhaas(at)gmail(dot)com>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: libpq SSL with non-blocking sockets
Date:	2011-07-25 07:33:29
Message-ID:	4E2D1C49.2090302@gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 07/24/2011 11:33 PM, Tom Lane wrote:
> I've applied the simplified fix (just set SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER)
> as well as a patch to improve the error reporting situation.
>

Cool that this turned out to be a one-line fix. Thanks!

regards,
Martin