Re: Replication to Postgres 10 on Windows is broken

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: "Augustine, Jobin" <jobin(dot)augustine(at)openscg(dot)com>, pgsql-bugs(at)postgresql(dot)org, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Replication to Postgres 10 on Windows is broken
Date: 2017-08-06 16:29:07
Message-ID: 6525.1502036947@sss.pgh.pa.us
Lists: pgsql-bugs pgsql-hackers

Noah Misch <noah(at)leadboat(dot)com> writes:
> On Sun, Aug 06, 2017 at 11:17:57AM -0400, Tom Lane wrote:
>> Gut instinct says that the reason this case fails when other tools
>> can connect successfully is that libpqwalreceiver is the only tool
>> that uses PQconnectStart/PQconnectPoll rather than a plain
>> PQconnectdb, and that there is some behavioral difference between
>> connectDBComplete's wait loop and libpqrcv_connect's wait loop that

> That would fit. Until v10 (commit 1e8a850), PQconnectStart() had no in-tree
> callers outside of libpq itself.

Yeah. After some digging around I think I see exactly what is happening.
The error message would be better read as "Socket is not connected *yet*",
that is, the problem is that we're trying to write data before the
nonblocking connection request has completed. (This fits with the OP's
observation that local loopback connections work fine --- they probably
complete immediately.) PQconnectPoll believes that it just has to wait
for write-ready when waiting for a connection to complete. When using
connectDBComplete's wait loop, that reduces to a call to Windows' version
of select(2), in pqSocketPoll, and according to

https://msdn.microsoft.com/en-us/library/windows/desktop/ms740141(v=vs.85).aspx

"The parameter writefds identifies the sockets that are to be checked for
writability. If a socket is processing a connect call (nonblocking), a
socket is writeable if the connection establishment successfully
completes."
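
For illustration only, here is a minimal sketch of that select()-style wait, written against POSIX sockets rather than Winsock so that it is self-contained. It is not libpq's code: the function names wait_for_connect and connect_nonblocking are hypothetical, and the SO_ERROR check reflects POSIX behavior, where writability alone does not say whether the connect succeeded. Windows' select() differs in detail (it reports a failed connect through exceptfds), so treat this as an analogy to the quoted text, not a port of it.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Wait for a nonblocking connect() on "sock" to finish.
 * Returns 0 on success, otherwise an errno value (ETIMEDOUT on timeout).
 */
int
wait_for_connect(int sock, int timeout_sec)
{
    fd_set      wfds;
    struct timeval tv;
    int         soerr = 0;
    socklen_t   len = sizeof(soerr);
    int         rc;

    FD_ZERO(&wfds);
    FD_SET(sock, &wfds);
    tv.tv_sec = timeout_sec;
    tv.tv_usec = 0;

    /* Only writefds is watched: this is the "wait for write-ready" step. */
    rc = select(sock + 1, NULL, &wfds, NULL, &tv);
    if (rc < 0)
        return errno;
    if (rc == 0)
        return ETIMEDOUT;

    /* POSIX: the socket is writable on failure too, so ask for the result. */
    if (getsockopt(sock, SOL_SOCKET, SO_ERROR, &soerr, &len) < 0)
        return errno;
    return soerr;
}

/*
 * Start a nonblocking IPv4 TCP connect and wait for it to complete.
 * On success returns 0 and stores the connected socket in *sock_out.
 */
int
connect_nonblocking(const char *ip, unsigned short port,
                    int timeout_sec, int *sock_out)
{
    struct sockaddr_in addr;
    int         sock;
    int         flags;
    int         err;

    assert(sock_out != NULL);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1)
        return EINVAL;

    sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0)
        return errno;

    flags = fcntl(sock, F_GETFL, 0);
    if (flags < 0 || fcntl(sock, F_SETFL, flags | O_NONBLOCK) < 0)
    {
        err = errno;
        close(sock);
        return err;
    }

    if (connect(sock, (struct sockaddr *) &addr, sizeof(addr)) == 0)
        err = 0;                /* finished at once, as loopback often does */
    else if (errno == EINPROGRESS)
        err = wait_for_connect(sock, timeout_sec);  /* must wait before writing */
    else
        err = errno;

    if (err != 0)
    {
        close(sock);
        return err;
    }
    *sock_out = sock;
    return 0;
}
```

A caller would invoke connect_nonblocking("127.0.0.1", port, 5, &sock) and write to sock only after it returns 0. The point of the sketch is the ordering: no data is sent until the wait reports that connection establishment has completed.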

On the other hand, in libpqwalreceiver, we're depending on latch.c's
implementation, and it uses WSAEventSelect's FD_WRITE event:

https://msdn.microsoft.com/en-us/library/windows/desktop/ms741576(v=vs.85).aspx

If I'm reading that correctly, FD_WRITE is set instantly by the connect
request, probably even in the nonblocking case, and it only gets cleared
by a failed write request. It looks to me like we would have to
specifically look for FD_CONNECT, *not* FD_WRITE, to make this work.

This is problematic, because the APIs in between don't provide a way
to report that we're still waiting for connect rather than for
data-write-ready. Anybody have the stomach for extending PQconnectPoll's
API with an extra PGRES_POLLING_CONNECTING state? If not, can we tell in
WaitEventAdjustWin32 that the socket is still connecting and we must
substitute FD_CONNECT for FD_WRITE?
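
To make the second option concrete, here is a rough sketch of the mask substitution, under stated assumptions. Every name in it is hypothetical: the SK_* constants are stand-ins defined locally, not the real WL_* flags from latch.h or the FD_* values from winsock2.h, and sketch_event_mask is not an existing function. How the caller would learn that the socket is still connecting is exactly the open question above, so it is simply passed in as a flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the caller's "what am I waiting for" flags. */
enum
{
    SK_WANT_READ = 1 << 0,
    SK_WANT_WRITE = 1 << 1
};

/* Hypothetical stand-ins for the network events requested from the OS. */
enum
{
    SK_FD_READ = 1 << 0,
    SK_FD_WRITE = 1 << 1,
    SK_FD_CONNECT = 1 << 2
};

/*
 * Translate the caller's wait flags into an event mask.  While the
 * socket is still connecting, a request for write-readiness is mapped
 * to the connect-completed event instead of the write-ready event.
 */
int
sketch_event_mask(int wanted, bool still_connecting)
{
    int         mask = 0;

    /* Reject flags this sketch does not know about. */
    assert((wanted & ~(SK_WANT_READ | SK_WANT_WRITE)) == 0);

    if (wanted & SK_WANT_READ)
        mask |= SK_FD_READ;
    if (wanted & SK_WANT_WRITE)
        mask |= still_connecting ? SK_FD_CONNECT : SK_FD_WRITE;

    return mask;
}
```

With this shape, sketch_event_mask(SK_WANT_WRITE, true) yields only SK_FD_CONNECT, while the same request on an established socket yields SK_FD_WRITE. The alternative of adding a PGRES_POLLING_CONNECTING state would move the still_connecting knowledge into PQconnectPoll's return value instead of requiring the wait code to infer it.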

regards, tom lane
