Streaming Replication on win32

From: Magnus Hagander <magnus(at)hagander(dot)net>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Streaming Replication on win32
Date: 2010-01-17 13:58:14
Message-ID: 9837222c1001170558r338847b4h460a98115ab98d5b@mail.gmail.com
Lists: pgsql-hackers

I'm trying to figure out why streaming replication doesn't work on
win32. Here is what I have so far. The standby starts up fine and logs:
LOG: starting archive recovery
LOG: standby_mode = 'on'
LOG: primary_conninfo = 'host=localhost port=5432'
LOG: starting streaming recovery at 0/2000000

After this, *nothing* happens, and the standby never reaches a
consistent state.

Looking at stack traces, I notice two things. First, the walreceiver
process is in:
ntdll!ZwWaitForSingleObject+0xa
mswsock+0x4f65
WS2_32!select+0x105
LIBPQ!pqSocketPoll(int sock = 4936, int forRead = 1, int forWrite = 0,
int64 end_time = -1)+0x2bb
LIBPQ!pqSocketCheck(struct pg_conn * conn = 0x00000000`00830160, int
forRead = 1, int forWrite = 0, int64 end_time = -1)+0xa1
LIBPQ!pqWaitTimed(int forRead = 1, int forWrite = 0, struct pg_conn *
conn = 0x00000000`00830160, int64 finish_time = -1)+0x2e
LIBPQ!pqWait(int forRead = 1, int forWrite = 0, struct pg_conn * conn
= 0x00000000`00830160)+0x2a
LIBPQ!PQgetResult(struct pg_conn * conn = 0x00000000`00830160)+0x82
LIBPQ!PQexecFinish(struct pg_conn * conn = 0x00000000`00830160)+0x1c
LIBPQ!PQexec(struct pg_conn * conn = 0x00000000`00830160, char * query
= 0x00000000`0042f600 "START_REPLICATION 0/2000000")+0x44
walreceiver!WalRcvConnect(void)+0x457
walreceiver!WalReceiverMain(struct FunctionCallInfoData * fcinfo =
0x00000000`00000000)+0x20e
postgres!AuxiliaryProcessMain(int argc = 2, char ** argv =
0x00000000`0081f080)+0x600
postgres!SubPostmasterMain(int argc = 4, char ** argv =
0x00000000`0081f070)+0x2d7
postgres!main(int argc = 4, char ** argv = 0x00000000`0081f070)+0x1e4
postgres!__tmainCRTStartup(void)+0x192
postgres!mainCRTStartup(void)+0xe
kernel32!BaseProcessStart+0x2c

This shows one potentially big problem: since select() is called from
inside libpq, it is not our signal-emulation-compatible select(). This
means that at this point walreceiver is not interruptible. It also
shows when I shut down the system: the walreceiver stays around and
won't terminate properly. Do we need to invent a way for libpq to call
back into backend code to do this select()? We certainly can't have
libpq use our version directly, since that would break every libpq
client that is not a postmaster/postgres process.
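
To make the callback idea concrete, here is a minimal sketch of a wait hook, written against POSIX select() rather than win32 so that it stays self-contained. Everything named here (socket_wait_fn, set_socket_wait_hook, library_wait, interruptible_wait, interrupt_pending) is hypothetical and is not part of libpq or the backend. The client library keeps its plain select() as the default, and a host process may install a replacement that checks for pending interrupts between short waits.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/select.h>

/* Hypothetical hook type: wait until sock is ready or timeout_ms elapses
 * (-1 means wait forever). Returns >0 if ready, 0 on timeout, -1 on error. */
typedef int (*socket_wait_fn)(int sock, int for_read, int for_write,
                              long timeout_ms);

/* Default behaviour: a plain select(), as a standalone client would use. */
static int default_socket_wait(int sock, int for_read, int for_write,
                               long timeout_ms)
{
    fd_set rfds, wfds;
    struct timeval tv, *tvp = NULL;

    assert(sock >= 0 && sock < FD_SETSIZE);   /* select() limit */
    FD_ZERO(&rfds);
    FD_ZERO(&wfds);
    if (for_read)
        FD_SET(sock, &rfds);
    if (for_write)
        FD_SET(sock, &wfds);
    if (timeout_ms >= 0) {
        tv.tv_sec = timeout_ms / 1000;
        tv.tv_usec = (timeout_ms % 1000) * 1000;
        tvp = &tv;
    }
    return select(sock + 1, &rfds, &wfds, NULL, tvp);
}

static socket_wait_fn current_wait = default_socket_wait;

/* Install a replacement wait function; NULL restores the default.
 * Returns the previously installed function. */
socket_wait_fn set_socket_wait_hook(socket_wait_fn fn)
{
    socket_wait_fn prev = current_wait;
    current_wait = fn ? fn : default_socket_wait;
    return prev;
}

/* What the library would call wherever it calls select() today. */
int library_wait(int sock, int for_read, int for_write, long timeout_ms)
{
    return current_wait(sock, for_read, for_write, timeout_ms);
}

/* Example host-side hook: wait in 100 ms slices and give up with EINTR
 * as soon as the host has flagged a pending interrupt. */
volatile sig_atomic_t interrupt_pending = 0;

int interruptible_wait(int sock, int for_read, int for_write, long timeout_ms)
{
    long waited = 0;

    for (;;) {
        long slice = 100;
        int rc;

        if (interrupt_pending) {
            errno = EINTR;
            return -1;
        }
        if (timeout_ms >= 0) {
            long remaining = timeout_ms - waited;
            if (remaining <= 0)
                return 0;
            if (remaining < slice)
                slice = remaining;
        }
        rc = default_socket_wait(sock, for_read, for_write, slice);
        if (rc != 0)
            return rc;
        waited += slice;
    }
}
```

With this shape, standalone clients never call set_socket_wait_hook() and keep the default behaviour, while a backend process could install its own function once at startup. Whether a hook like this is the right interface for libpq is exactly the open question above.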

The second thing I note is that the walsender is in:
ntdll!ZwWaitForMultipleObjects+0xa
kernel32!ReleaseSemaphore+0x6b
postgres!pgwin32_waitforsinglesocket(unsigned int64 s = 0x13fc, int
what = 41, int timeout = -1)+0x275
postgres!pgwin32_recv(unsigned int64 s = 0x13fc, char * buf =
0x00000000`0042f990 "???", int len = 1, int f = 0)+0xf5
postgres!secure_read(struct Port * port = 0x00000000`0042fcf0, void *
ptr = 0x00000000`0042f990, unsigned int64 len = 1)+0x32
postgres!pq_getbyte_if_available(unsigned char * c =
0x00000000`0042f990 "???")+0x106
postgres!CheckClosedConnection(void)+0x10
postgres!WalSndLoop(void)+0xdf
postgres!WalSenderMain(void)+0xb9
postgres!PostgresMain(int argc = 2, char ** argv =
0x00000000`0084d520, char * username = 0x00000000`0082e218
"Administrator")+0x3b5
postgres!BackendRun(struct Port * port = 0x00000000`0042fcf0)+0x235
postgres!SubPostmasterMain(int argc = 3, char ** argv =
0x00000000`0081f080)+0x278
postgres!main(int argc = 3, char ** argv = 0x00000000`0081f080)+0x1e4
postgres!__tmainCRTStartup(void)+0x192
postgres!mainCRTStartup(void)+0xe
kernel32!BaseProcessStart+0x2c

From what I can tell, this indicates that pq_getbyte_if_available() is
not working: it's supposed to never block, right? Yet here it is
waiting in pgwin32_waitforsinglesocket() with timeout = -1.
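
For reference, the contract expected from a "read one byte if available" call can be sketched as follows. This is a POSIX illustration, not the PostgreSQL implementation and not win32 code; the function name getbyte_if_available and its return convention are assumptions made for this example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Try to read one byte without ever blocking.
 * Returns 1 if a byte was stored in *c, 0 if no data is available right
 * now, and -1 if the peer closed the connection or an error occurred. */
int getbyte_if_available(int sock, unsigned char *c)
{
    int flags, saved_errno;
    ssize_t n;

    assert(c != NULL);

    /* Temporarily switch the socket to non-blocking mode if needed. */
    flags = fcntl(sock, F_GETFL, 0);
    if (flags < 0)
        return -1;
    if (!(flags & O_NONBLOCK) &&
        fcntl(sock, F_SETFL, flags | O_NONBLOCK) < 0)
        return -1;

    /* Retry only if a signal interrupted the call. */
    do {
        n = recv(sock, c, 1, 0);
    } while (n < 0 && errno == EINTR);
    saved_errno = errno;

    /* Restore the original mode before returning. */
    if (!(flags & O_NONBLOCK))
        (void) fcntl(sock, F_SETFL, flags);

    if (n == 1)
        return 1;
    if (n == 0)
        return -1;              /* orderly shutdown by the peer */
    if (saved_errno == EAGAIN || saved_errno == EWOULDBLOCK)
        return 0;               /* nothing to read, and we did not wait */
    errno = saved_errno;
    return -1;
}
```

The important property is the middle case: when no data is pending, the call returns immediately instead of waiting for the socket to become readable, which is what the stack trace above shows is not happening.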

This could be because the win32 socket emulation layer simply wasn't
designed to deal with non-blocking sockets. Specifically, it actually
*always* sets the socket to non-blocking mode, and then uses that to
properly emulate how sockets work under unix.
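
The general pattern described here, an emulation layer that keeps the OS socket non-blocking and builds blocking semantics on top by waiting and retrying, can be sketched as below. This is a POSIX illustration using poll(), not the actual win32 code; emu_make_nonblocking, emu_recv, and the emu_noblock flag are hypothetical names. The flag shows one way such a layer could let a caller opt out of the wait, which is an assumption of this sketch and not something the existing layer is said to provide.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical switch: when nonzero, emu_recv() reports "would block"
 * instead of waiting for data. */
int emu_noblock = 0;

/* The layer puts every socket into non-blocking mode up front. */
int emu_make_nonblocking(int sock)
{
    int flags = fcntl(sock, F_GETFL, 0);

    if (flags < 0)
        return -1;
    return fcntl(sock, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* recv() with emulated blocking semantics on a non-blocking socket. */
ssize_t emu_recv(int sock, void *buf, size_t len)
{
    assert(buf != NULL);

    for (;;) {
        struct pollfd pfd;
        ssize_t n = recv(sock, buf, len, 0);

        if (n >= 0)
            return n;                   /* data, or 0 on EOF */
        if (errno == EINTR)
            continue;                   /* interrupted, just retry */
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;                  /* real error */
        if (emu_noblock)
            return -1;                  /* caller asked not to wait;
                                         * errno is EAGAIN/EWOULDBLOCK */

        /* Emulate blocking: wait until readable, then retry. A real
         * layer would also wait on its signal event at this point. */
        pfd.fd = sock;
        pfd.events = POLLIN;
        pfd.revents = 0;
        if (poll(&pfd, 1, -1) < 0 && errno != EINTR)
            return -1;
    }
}
```

Without something like the opt-out branch, every "would block" result is swallowed by the wait loop, so a caller that expects an immediate return never sees one.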

Oh, and the walsender process says:
\Sessions\1\BaseNamedObjects\pgident(2196): postgres: wal sender
process Administrator 127.0.0.1(1398) startup
the walreceiver says:
\Sessions\1\BaseNamedObjects\pgident(2264): postgres: wal receiver process
and the startup process says:
\Sessions\1\BaseNamedObjects\pgident(2764): postgres: startup process
waiting for 000000010000000000000002

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
