Re: BUG #9635: Wal sender process is using 100% CPU

From: Jamie Koceniak <jkoceniak(at)mediamath(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: "pgsql-bugs(at)postgresql(dot)org" <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #9635: Wal sender process is using 100% CPU
Date: 2014-05-06 16:44:42
Message-ID: 84CE67AB46ACDA41B18094177DD0853D212A5A@ORD2MBX04C.mex05.mlsrvr.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Hi Heikki,

Attached is a trace route for one of the pids that was running at 100% cpu.
Note: when the wal_senders hit 100% cpu, our slave servers tend to lag the master for up to around 3-4 minutes until the wal_sender cpu utilization drops. This is very problematic for us since we are trying to use the slaves for reads in our application.

Here is perf top results:
39.49% libz.so.1.2.7 [.] 0x000000000000301e
32.97% postgres [.] 0x00000000000cf150
6.50% postgres [.] nocachegetattr
2.63% postgres [.] btint4cmp
1.64% postgres [.] _start
1.40% [unknown] [.] 0x00007f9dfca6a34b
1.06% libc-2.13.so [.] 0x000000000012934b
0.94% postgres [.] slot_getattr
0.85% libcrypto.so.1.0.0 [.] 0x000000000009f48f
0.69% postgres [.] ExecProject
0.67% libz.so.1.2.7 [.] adler32
0.66% postgres [.] GetMemoryChunkSpace
0.58% [kernel] [k] copy_user_generic_string
0.45% postgres [.] heap_fill_tuple
0.38% postgres [.] execTuplesMatch
0.27% postgres [.] FunctionCall2Coll
0.27% postgres [.] ExecProcNode
0.25% postgres [.] heap_tuple_untoast_attr
0.22% postgres [.] MemoryContextAlloc
0.21% postgres [.] heap_form_minimal_tuple
0.21% postgres [.] pfree
0.19% postgres [.] heap_compute_data_size

Here is some data using perf record and perf report:

+ 33.93% 193019 postgres postgres [.] 0x0000000000369e27
+ 30.06% 173825 postgres libz.so.1.2.7 [.] 0x00000000000036c0
+ 10.21% 60618 postgres [unknown] [.] 0x00007f9dfc2fd529
+ 4.81% 27275 postgres postgres [.] nocachegetattr
+ 2.55% 14443 postgres postgres [.] btint4cmp
+ 1.22% 6916 postgres postgres [.] ExecProject
+ 1.20% 6837 postgres postgres [.] _start
+ 1.10% 6495 postgres libc-2.13.so [.] 0x000000000012b8e9
+ 0.84% 4760 postgres postgres [.] heap_fill_tuple
+ 0.74% 4265 postgres [kernel.kallsyms] [k] copy_user_generic_string
+ 0.71% 4001 postgres postgres [.] GetMemoryChunkSpace
+ 0.66% 3881 postgres libcrypto.so.1.0.0 [.] 0x0000000000081ec7
+ 0.63% 3673 postgres libz.so.1.2.7 [.] adler32
+ 0.43% 2448 postgres postgres [.] slot_getattr
+ 0.40% 2277 postgres postgres [.] ExecProcNode
+ 0.40% 2250 postgres postgres [.] heap_form_minimal_tuple
+ 0.37% 2078 postgres postgres [.] heap_compute_data_size
+ 0.31% 1739 postgres postgres [.] ExecScan
+ 0.27% 1779 postgres [kernel.kallsyms] [k] page_fault
+ 0.24% 1374 postgres postgres [.] LogicalTapeWrite
+ 0.24% 4206 swapper [kernel.kallsyms] [k] intel_idle
+ 0.21% 1206 postgres postgres [.] slot_getsomeattrs

Also, here are our wal settings on the master.
Do these settings appear ok?

wal_block_size | 8192 |
wal_buffers | 16MB |
wal_keep_segments | 5000 |
wal_level | hot_standby |
wal_receiver_status_interval | 10s |
wal_segment_size | 16MB |
wal_sender_delay | 1s |
wal_sync_method | fdatasync |
wal_writer_delay | 200ms |

Thanks!

-----Original Message-----
From: Heikki Linnakangas [mailto:hlinnakangas(at)vmware(dot)com]
Sent: Tuesday, May 06, 2014 4:01 AM
To: Jamie Koceniak
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: [BUGS] BUG #9635: Wal sender process is using 100% CPU

On 05/05/2014 10:37 PM, Jamie Koceniak wrote:
> Hi Heikki,
>
> I performed an strace on one of the pids that was consuming 100% of the cpu.
> This is just a portion of the log.
> I also trimmed out some data provided in the quotes in the read and sends, not sure if you need it.
> A majority of the calls in the trace log were recvfrom() and sendto().
> EAGAIN (Resource temporarily unavailable) came up time and time again.
>
> # strace log during 100% cpu utilization on the master

Hmm, that trace looks normal to me. It doesn't look like it's busy-looping, as I would expect if it's using 100% of the CPU.

Can you post the full log, or at least a larger chunk? I don't need the data in the quotes from read() and sends() that you removed, but I would like to see the trace from a longer period of time to see what the pattern is.

- Heikki

Attachment Content-Type Size
wal_sender_trace.txt text/plain 920.5 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Heikki Linnakangas 2014-05-06 17:32:35 Re: BUG #9635: Wal sender process is using 100% CPU
Previous Message Alvaro Herrera 2014-05-06 16:35:17 Re: Problem with PostgreSQL 9.2.7 and make check on AIX 7.1