Re: An unkillable connection caused replication delay on my replica

Lists: pgsql-hackers
From: Shawn <shamccoy(at)amazon(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: An unkillable connection caused replication delay on my replica
Date: 2016-06-28 19:56:43
Message-ID: 1467143803715-5909767.post@n5.nabble.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Postgres 9.4.5
Streaming replica
hot_standby_feedback=1

I received an alarm that replication was lagging many hours. When I took a
look at what was happening on the replica I found the recovery process was
"stuck" on one WAL.

8719 8717 1 Apr30 ? 15:24:49 postgres: startup process
recovering 0000000100001E9700000096 waiting

An strace of the recovery pid showed a loop trying to kill the blocking pid:

select(0, NULL, NULL, NULL, {0, 5000}) = 0 (Timeout)
gettimeofday({1465523436, 442899}, NULL) = 0
kill(6819, SIGUSR1) = 0
select(0, NULL, NULL, NULL, {0, 5000}) = 0 (Timeout)
gettimeofday({1465523436, 448134}, NULL) = 0
kill(6819, SIGUSR1)

strace of long-running query pid 6819 loops like this:

sendto(10, "<query stuff removed>" NULL, 0) = ? ERESTARTSYS (To be
restarted if SA_RESTART is set)
--- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=8719,
si_uid=3001} ---
rt_sigreturn()

Where pid 6819 is a long (running for 2 days) running query. In
pg_stat_activity, it was still listed as "active". The query had a horrible
execution plan and it was being executed via a python script. I couldn't
pg_terminate_backend the connection. I didn't try to "kill -9" it due to
all the warnings about that and I felt I had something special here. I
attached the debugger.

Program received signal SIGINT, Interrupt.
0x00002b89d9c62903 in __select_nocancel () from /lib64/libc.so.6
#0 0x00002b89d9c62903 in __select_nocancel () from /lib64/libc.so.6
#1 0x000000000076e6da in pg_usleep (microsec=microsec(at)entry=5000) at
pgsleep.c:53
#2 0x0000000000655f8e in ResolveRecoveryConflictWithVirtualXIDs
(waitlist=0x2b89dc238000,
reason=reason(at)entry=PROCSIG_RECOVERY_CONFLICT_SNAPSHOT) at standby.c:247
#3 0x000000000065609e in ResolveRecoveryConflictWithVirtualXIDs
(reason=PROCSIG_RECOVERY_CONFLICT_SNAPSHOT, waitlist=<optimized out>) at
standby.c:283
#4 ResolveRecoveryConflictWithSnapshot (latestRemovedXid=<optimized out>,
node=...) at standby.c:281
#5 0x000000000048ff7d in heap_xlog_clean (record=0x2b89dc1295c0,
lsn=33636416003112) at heapam.c:7329
#6 heap2_redo (lsn=33636416003112, record=0x2b89dc1295c0) at heapam.c:8487
#7 0x00000000004c5fad in StartupXLOG () at xlog.c:6948
#8 0x000000000061f2bf in StartupProcessMain () at startup.c:224
#9 0x00000000004d0188 in AuxiliaryProcessMain (argc=argc(at)entry=2,
argv=argv(at)entry=0x7fff98c4cba0) at bootstrap.c:422
#10 0x000000000061c8de in StartChildProcess (type=StartupProcess) at
postmaster.c:5256
#11 0x000000000061edd2 in PostmasterMain (argc=argc(at)entry=7,
argv=argv(at)entry=0x2b89dc024040) at postmaster.c:1292
#12 0x00000000004622e5 in main (argc=7, argv=0x2b89dc024040) at main.c:228
Continuing.

Which I believe puts me in this section of code:


https://github.com/postgres/postgres/blob/REL9_4_STABLE/src/backend/storage/ipc/standby.c#L232

The customer was able to restart recovery but only after they "kill -9" the
python script on their application host.

My question: Is there something special about this connection that prevents
recovery from being able to kill it like it wanted to? I am trying to repro
it, but, so far my connections get terminated properly. Can/should Postgres
be more forceful when its determined WaitExceedsMaxStandbyDelay is true and
a session shouldn't be blocking recovery?

Shawn

--
View this message in context: http://postgresql.nabble.com/An-unkillable-connection-caused-replication-delay-on-my-replica-tp5909767.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Shawn <shamccoy(at)amazon(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: An unkillable connection caused replication delay on my replica
Date: 2016-06-28 20:05:11
Message-ID: 20160628200511.GA52566@alvherre.pgsql
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Shawn wrote:

> strace of long-running query pid 6819 loops like this:
>
> sendto(10, "<query stuff removed>" NULL, 0) = ? ERESTARTSYS (To be
> restarted if SA_RESTART is set)
> --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=8719,
> si_uid=3001} ---
> rt_sigreturn()
>
> Where pid 6819 is a long (running for 2 days) running query. In
> pg_stat_activity, it was still listed as "active". The query had a horrible
> execution plan and it was being executed via a python script. I couldn't
> pg_terminate_backend the connection. I didn't try to "kill -9" it due to
> all the warnings about that and I felt I had something special here. I
> attached the debugger.

Did you happen to grab a stack trace of PID 6819?

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Shawn <shamccoy(at)amazon(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: An unkillable connection caused replication delay on my replica
Date: 2016-06-29 14:56:35
Message-ID: 1467212195613-5909883.post@n5.nabble.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Unfortunately...no. I have been trying to repro this scenario. Is there a
specific way to make a Python connection where this is possible?

My end game, if this is not something that can be fixed on the Postgres
side, is to come up with a way to automatically cause the connection to drop
(iptables, something else...etc).

--
View this message in context: http://postgresql.nabble.com/An-unkillable-connection-caused-replication-delay-on-my-replica-tp5909767p5909883.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Shawn <shamccoy(at)amazon(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: An unkillable connection caused replication delay on my replica
Date: 2016-06-29 18:39:26
Message-ID: 20160629183926.GA125026@alvherre.pgsql
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Shawn wrote:
> Unfortunately...no. I have been trying to repro this scenario. Is there a
> specific way to make a Python connection where this is possible?
>
> My end game, if this is not something that can be fixed on the Postgres
> side, is to come up with a way to automatically cause the connection to drop
> (iptables, something else...etc).

Well. I was toying with a server a few days ago that had two processes
with queries "running" for eight days, but the client had long since
disconnected -- but the TCP keepalives didn't fire, because the kernel
was stuck trying to get free memory for something but the outgoing
network buffer was full so nothing was happening. That looked like a
kernel bug to me, and I don't think iptables would help you there.
Signalling the Postgres process in the normal way didn't do anything, as
they were stuck in the kernel.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services