High SYS CPU - need advise

From: Vlad <marchenko(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: High SYS CPU - need advise
Date: 2012-11-14 21:13:34
Message-ID: CAKeSUqW1h1KeBL-4xOLDX_B=K+v71Sf5mze_A8E__TEuqzpN3A@mail.gmail.com
Lists: pgsql-general

Hello everyone,

I'm seeking help in diagnosing / figuring out the issue that we have with
our DB server:

Under relatively light load (300-400 TPS), every 10-30 seconds the server
drops into high system CPU usage (90%+ SYS across all CPUs; it's pure SYS
CPU, i.e. not I/O wait, not IRQ, not user). PostgreSQL is taking 10-15% at
the same time. These periods last from a few seconds to minutes, or until
PostgreSQL is restarted. Needless to say, the system is barely responsive,
with the load average hitting over 100. We have mostly SELECT statements
(joins across a few tables) that use indexes and return a small number of
records. If the number of incoming requests per second drops a bit, the
server does not fall into these high-SYS-CPU periods. It all looks as if
Postgres runs out of some resource or is fighting over some locks, and that
causes the kernel to go into la-la land trying to manage it.
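One way to confirm the breakdown described above (system time versus user, I/O wait and IRQ time) independently of top is to sample the aggregate "cpu" line of Linux's /proc/stat twice and compare the deltas. The sketch below is illustrative only: the helper names (parse_cpu_line, cpu_breakdown, watch) are hypothetical, and the 90% threshold simply mirrors the figure quoted above.

```python
import time

# Field order of the aggregate "cpu" line in /proc/stat (Linux). Older kernels
# may omit trailing fields; missing ones are treated as zero.
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest", "guest_nice")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu ...' line of /proc/stat into a dict of jiffies."""
    parts = line.split()
    if parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:]]
    values += [0] * (len(FIELDS) - len(values))
    return dict(zip(FIELDS, values))


def cpu_breakdown(before, after):
    """Percentage of CPU time spent in each mode between two samples."""
    # guest and guest_nice are already included in user and nice, so they are
    # left out of the total to avoid double counting.
    keys = FIELDS[:8]
    deltas = {k: after[k] - before[k] for k in keys}
    total = sum(deltas.values())
    if total == 0:
        return {k: 0.0 for k in keys}
    return {k: 100.0 * d / total for k, d in deltas.items()}


def read_cpu_line(path="/proc/stat"):
    """Return the first line of /proc/stat (the all-CPU aggregate)."""
    with open(path) as f:
        return f.readline()


def watch(interval=1.0, sys_threshold=90.0):
    """Print a line whenever system-mode CPU reaches sys_threshold percent."""
    prev = parse_cpu_line(read_cpu_line())
    while True:
        time.sleep(interval)
        cur = parse_cpu_line(read_cpu_line())
        pct = cpu_breakdown(prev, cur)
        if pct["system"] >= sys_threshold:
            print(time.strftime("%H:%M:%S"),
                  "sys=%.1f%% user=%.1f%% iowait=%.1f%% irq=%.1f%%"
                  % (pct["system"], pct["user"], pct["iowait"],
                     pct["irq"] + pct["softirq"]))
        prev = cur
```

Calling watch() on the database host during an episode prints one line per interval in which system time reaches the threshold, which makes it easy to line the spikes up against timestamps in the PostgreSQL log.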

So far we've checked:
- disk and NIC delays / errors / utilization
- WAL files (created rarely)
- tables are vacuumed OK; periods of high SYS CPU are not tied to the vacuum process
- kernel resource utilization (sufficient FS handles, shared memory/semaphores, VM)
- increased the log level, but nothing suspicious or different (to me) is reported
there during periods of high SYS CPU
- ran pgbench (could not reproduce the issue, even though it was producing
over 40,000 TPS for a prolonged period of time)

Basically, our symptoms are exactly as reported here over a year ago
(though that was on Postgres 8.3; we run 9.1):
http://archives.postgresql.org/pgsql-general/2011-10/msg00998.php

I will be grateful for any ideas helping to resolve or diagnose this
problem.

Environment background:

PostgreSQL 9.1.6.
Postgres usually has 400-500 connected clients, most of them idle.
The database has over 1000 tables (across 5 namespaces), taking ~150GB on disk.

Linux 3.5.5 (Fedora 17 x64) on 32GB RAM / 8 cores
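For connection counts like the ones above, a breakdown of idle versus active backends can be taken from pg_stat_activity. On 9.1 that view has no state column (it arrived in 9.2); instead, idle sessions show placeholder strings such as '<IDLE>' or '<IDLE> in transaction' in current_query. The sketch below only classifies those values; the function names are hypothetical, and fetching the rows (with psql or any client library) is left out.

```python
from collections import Counter

# PostgreSQL 9.1 reports idle backends through placeholder strings in
# pg_stat_activity.current_query rather than a separate state column.
IDLE = "<IDLE>"
IDLE_IN_XACT = "<IDLE> in transaction"
# Shown for other roles' sessions when the querying user is not a superuser.
NO_PRIVILEGE = "<insufficient privilege>"

# Query to run against the 9.1 server.
ACTIVITY_SQL = "SELECT current_query FROM pg_stat_activity"


def classify(current_query):
    """Map one current_query value to a coarse backend state."""
    if current_query == IDLE:
        return "idle"
    if current_query == IDLE_IN_XACT:
        return "idle in transaction"
    if current_query == NO_PRIVILEGE:
        return "unknown"
    return "active"


def summarize(rows):
    """Count backends by state from an iterable of current_query values."""
    return Counter(classify(q) for q in rows)
```

Sampling this once a second alongside the CPU numbers would show whether the number of simultaneously active backends rises just before a high-SYS period.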

Changes from the default configuration:
max_connections = 1200
shared_buffers = 3200MB
temp_buffers = 18MB
max_prepared_transactions = 500
work_mem = 16MB
maintenance_work_mem = 64MB
max_files_per_process = 3000
wal_level = hot_standby
fsync = off
checkpoint_segments = 64
checkpoint_timeout = 15min
effective_cache_size = 8GB
default_statistics_target = 500
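As a rough point of reference for the memory settings above against 32GB of RAM, the sketch below multiplies the per-session settings by a connection count. It rests on a loudly simplified assumption, one work_mem allocation and a fully used temp_buffers per backend, which is neither a floor nor a ceiling: a single query can use several work_mem allocations (one per sort or hash step), while idle sessions use far less. The function names are hypothetical.

```python
import re

# PostgreSQL memory units are case-sensitive; values are converted to MB.
UNITS_MB = {"kB": 1 / 1024, "MB": 1, "GB": 1024, "TB": 1024 * 1024}

# A subset of the settings listed above.
CONFIG = """\
max_connections = 1200
shared_buffers = 3200MB
temp_buffers = 18MB
work_mem = 16MB
"""


def parse_settings(text):
    """Parse 'name = value' lines, ignoring blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        name, _, value = (part.strip() for part in line.partition("="))
        settings[name] = value
    return settings


def to_mb(value):
    """Convert a size with an explicit unit (e.g. '16MB') to megabytes.

    Bare numbers are rejected because their implied unit differs per setting.
    """
    m = re.fullmatch(r"(\d+)\s*(kB|MB|GB|TB)", value)
    if not m:
        raise ValueError("not a memory size with a unit: %r" % value)
    return int(m.group(1)) * UNITS_MB[m.group(2)]


def rough_memory_mb(settings, backends):
    """shared_buffers plus one work_mem and a full temp_buffers per backend."""
    per_backend = to_mb(settings["work_mem"]) + to_mb(settings["temp_buffers"])
    return to_mb(settings["shared_buffers"]) + backends * per_backend
```

With these settings this gives 3200 + 1200 * (16 + 18) = 44000MB at max_connections, or 3200 + 500 * 34 = 20200MB at 500 connected clients; both figures are only as meaningful as the per-backend assumption behind them.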

-- Vlad
