Occasional giant spikes in CPU load

From: Craig James <craig_james(at)emolecules(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Occasional giant spikes in CPU load
Date: 2010-04-07 21:37:22
Message-ID: 4BBCFB12.1010306@emolecules.com
Lists: pgsql-performance

Most of the time Postgres runs nicely, but two or three times a day we get a huge spike in CPU load that lasts only a short time: the load average jumps to 10-20. Today it hit 100. Sometimes days go by with no spike events. During these spikes the system is completely unresponsive (you can't even log in via ssh).

I managed to capture one such event by running top(1) in batch mode as a background process. See the output below. It shows 19 active postgres processes, but I think it missed the bulk of the spike.
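One way to avoid missing the peak is to poll top in batch mode frequently and keep only the snapshots taken while the load is high. The sketch below is a hypothetical helper, not something from this thread. It assumes procps top's batch output format as shown in the snapshot in this post, and its threshold, polling interval, and log file name (`spikes.log`) are arbitrary assumptions. `parse_snapshot` pulls out the 1-minute load average and counts postmaster rows in state R.

```python
import re
import subprocess
import time

# Matches "load average: 0.98, 0.83, 0.92" on the first line of top's output.
LOAD_RE = re.compile(r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)")


def parse_snapshot(text):
    """Return (1-minute load average, count of running postmaster rows)."""
    match = LOAD_RE.search(text)
    load1 = float(match.group(1)) if match else 0.0
    running = 0
    for line in text.splitlines():
        fields = line.split()
        # Process rows: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
        if len(fields) >= 12 and fields[0].isdigit():
            if fields[7] == "R" and fields[-1] == "postmaster":
                running += 1
    return load1, running


def watch(threshold=4.0, interval=5.0, logfile="spikes.log"):
    """Hypothetical poller: append full top snapshots only above `threshold`."""
    while True:
        out = subprocess.run(["top", "-b", "-n", "1"],
                             capture_output=True, text=True).stdout
        load1, running = parse_snapshot(out)
        if load1 >= threshold:
            with open(logfile, "a") as fh:
                fh.write(f"# load={load1} running_postmasters={running}\n")
                fh.write(out)
        time.sleep(interval)
```

During an event that makes ssh unusable, the poller may itself be starved of CPU. Because of that, a short interval started well before the spike is likely to work better than trying to start it reactively.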

For some reason, every postgres backend suddenly decides (or is told?) to do something. When this happens, the system becomes unusable for anywhere from ten seconds to a minute or so, depending on how much web traffic stacks up behind the event. We have two servers, one offline and one public, and both do this, so it isn't caused by actual web traffic (and the Apache logs show no HTTP activity correlated with the spikes).

Based on other posts, I thought this might be a background-writer problem, but it isn't I/O. As far as I can tell, it's all CPU.
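The CPU-versus-I/O judgement can be made from the "Cpu(s):" summary line that top prints in each snapshot. Compare user plus system time against I/O wait. The sketch below is a rough heuristic under stated assumptions. The "busy is more than twice iowait" cutoff is an arbitrary choice, not an established rule, and the regular expression assumes the "30.6%us" or "30.6 us" field styles that procps top emits.

```python
import re

# Matches fields such as "30.6%us" (older top) or "30.6 us" (newer top).
FIELD_RE = re.compile(r"([\d.]+)\s*%?\s*(us|sy|ni|id|wa|hi|si|st)")


def cpu_breakdown(line):
    """Parse top's 'Cpu(s):' line into a dict such as {'us': 30.6, ...}."""
    return {name: float(value) for value, name in FIELD_RE.findall(line)}


def looks_cpu_bound(line):
    """Assumed heuristic: user+system time more than twice the I/O wait."""
    cpu = cpu_breakdown(line)
    busy = cpu.get("us", 0.0) + cpu.get("sy", 0.0)
    return busy > 2 * cpu.get("wa", 0.0)
```

Applied to the snapshot in this post ("30.6%us, 1.5%sy, ... 1.5%wa"), user plus system time is about 32% against 1.5% iowait. That supports the CPU-bound reading for that moment, though one snapshot with 66% idle says little about the peak itself.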

Any ideas where I can look to find what's triggering this?

8 CPUs, 8 GB memory
8-disk RAID10 (10k SATA)
Postgres 8.3.0
Fedora 8, kernel is 2.6.24.4-64.fc8
Changes from the default postgresql.conf:

max_connections = 1000
shared_buffers = 2000MB
work_mem = 256MB
max_fsm_pages = 16000000
max_fsm_relations = 625000
synchronous_commit = off
wal_buffers = 256kB
checkpoint_segments = 30
effective_cache_size = 4GB
escape_string_warning = off
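As a sanity check on these settings, it can help to compute a crude memory ceiling: shared_buffers plus one work_mem allocation for every permitted connection. The helper names below are hypothetical. The model is a simplification: work_mem is granted per sort or hash operation, so one query may use several multiples of it, while in practice most connections use far less at any given moment. The figure is a rough bound for comparison against the 8 GB of RAM, not a prediction of actual usage.

```python
import re

# Multipliers to megabytes for the units used in this config.
UNITS_MB = {"kB": 1 / 1024, "MB": 1, "GB": 1024}


def to_mb(value):
    """Convert a postgresql.conf size such as '256MB' or '4GB' to megabytes."""
    m = re.fullmatch(r"(\d+)\s*(kB|MB|GB)", value.strip())
    if not m:
        raise ValueError(f"unrecognised size: {value!r}")
    return int(m.group(1)) * UNITS_MB[m.group(2)]


def worst_case_mb(settings):
    """Crude ceiling: shared_buffers + max_connections * one work_mem each."""
    return (to_mb(settings["shared_buffers"])
            + int(settings["max_connections"]) * to_mb(settings["work_mem"]))


conf = {"max_connections": "1000", "shared_buffers": "2000MB", "work_mem": "256MB"}
print(worst_case_mb(conf), "MB vs", to_mb("8GB"), "MB of RAM")
```

With the values above, this gives 2000 + 1000 * 256 = 258000 MB, roughly 250 GB, against 8192 MB of physical memory.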

Thanks,
Craig

top - 11:24:59 up 81 days, 20:27, 4 users, load average: 0.98, 0.83, 0.92
Tasks: 366 total, 20 running, 346 sleeping, 0 stopped, 0 zombie
Cpu(s): 30.6%us, 1.5%sy, 0.0%ni, 66.3%id, 1.5%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 8194800k total, 8118688k used, 76112k free, 36k buffers
Swap: 2031608k total, 169348k used, 1862260k free, 7313232k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18972 postgres 20 0 2514m 11m 8752 R 11 0.1 0:00.35 postmaster
10618 postgres 20 0 2514m 12m 9456 R 9 0.2 0:00.54 postmaster
10636 postgres 20 0 2514m 11m 9192 R 9 0.1 0:00.45 postmaster
25903 postgres 20 0 2514m 11m 8784 R 9 0.1 0:00.21 postmaster
10626 postgres 20 0 2514m 11m 8716 R 6 0.1 0:00.45 postmaster
10645 postgres 20 0 2514m 12m 9352 R 6 0.2 0:00.42 postmaster
10647 postgres 20 0 2514m 11m 9172 R 6 0.1 0:00.51 postmaster
18502 postgres 20 0 2514m 11m 9016 R 6 0.1 0:00.23 postmaster
10641 postgres 20 0 2514m 12m 9296 R 5 0.2 0:00.36 postmaster
10051 postgres 20 0 2514m 13m 10m R 4 0.2 0:00.70 postmaster
10622 postgres 20 0 2514m 12m 9216 R 4 0.2 0:00.39 postmaster
10640 postgres 20 0 2514m 11m 8592 R 4 0.1 0:00.52 postmaster
18497 postgres 20 0 2514m 11m 8804 R 4 0.1 0:00.25 postmaster
18498 postgres 20 0 2514m 11m 8804 R 4 0.1 0:00.22 postmaster
10341 postgres 20 0 2514m 13m 9m R 2 0.2 0:00.57 postmaster
10619 postgres 20 0 2514m 12m 9336 R 1 0.2 0:00.38 postmaster
15687 postgres 20 0 2321m 35m 35m R 0 0.4 8:36.12 postmaster
