High SYS CPU - need advise

From: Vlad <marchenko(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: High SYS CPU - need advise
Date: 2012-11-14 21:13:34
Message-ID: CAKeSUqW1h1KeBL-4xOLDX_B=K+v71Sf5mze_A8E__TEuqzpN3A@mail.gmail.com

Hello everyone,

I'm seeking help in diagnosing / figuring out the issue that we have with
our DB server:

Under a relatively light load (300-400 TPS), every 10-30 seconds the
server drops into high system CPU usage (90%+ SYS across all CPUs - it's
pure SYS CPU, i.e. not I/O wait, not IRQ, not user). PostgreSQL itself is
taking 10-15% at the same time. These periods last anywhere from a few
seconds to minutes, or until PostgreSQL is restarted. Needless to say, the
system is barely responsive, with load average climbing over 100. We run
mostly select statements (joins across a few tables), using indexes and
returning a small number of records. If the incoming request rate drops a
bit, the server does not fall into these HIGH-SYS-CPU periods. It all looks
as if Postgres runs out of some resource or fights over some locks, and
that causes the kernel to go into la-la land trying to manage it.

So far we've checked:
- disk and NIC delays / errors / utilization
- WAL files (created rarely)
- tables are vacuumed OK; periods of high SYS are not tied to the vacuum process
- kernel resource utilization (sufficient FS handles, shared MEM/SEM, VM)
- increased log level, but nothing suspicious/different (to me) is reported
there during periods of high SYS CPU
- ran pgbench (could not reproduce the issue, even though it produced
over 40,000 TPS for a prolonged period of time)

Basically, our symptoms are exactly as reported here over a year ago
(though that was for Postgres 8.3; we run 9.1):
http://archives.postgresql.org/pgsql-general/2011-10/msg00998.php

I will be grateful for any ideas helping to resolve or diagnose this
problem.

Environment background:

Postgresql 9.1.6.
Postgres usually has 400-500 connected clients, most of them are idle.
Database is over 1000 tables (across 5 namespaces), taking ~150Gb on disk.

Linux 3.5.5 (Fedora 17 x64) on 32Gb RAM / 8 cores

Default configuration changed:
max_connections = 1200
shared_buffers = 3200MB
temp_buffers = 18MB
max_prepared_transactions = 500
work_mem = 16MB
maintenance_work_mem = 64MB
max_files_per_process = 3000
wal_level = hot_standby
fsync = off
checkpoint_segments = 64
checkpoint_timeout = 15min
effective_cache_size = 8GB
default_statistics_target = 500

-- Vlad


From: John R Pierce <pierce(at)hogranch(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-14 21:23:46
Message-ID: 50A40BE2.3010504@hogranch.com

On 11/14/12 1:13 PM, Vlad wrote:
> Postgresql 9.1.6.
> Postgres usually has 400-500 connected clients, most of them are idle.
> Database is over 1000 tables (across 5 namespaces), taking ~150Gb on disk.

that's a really high client connection count for an 8-core system.

I'd consider implementing a connection pool (like pgbouncer), and
rewriting your client applications to connect, do a transaction,
disconnect, so the actual number of postgres connections is much lower,
say in the 16-48 range.

--
john r pierce N 37, W 122
santa cruz ca mid-left coast


From: Vlad <marchenko(at)gmail(dot)com>
To:
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-14 21:34:15
Message-ID: CAKeSUqXV9gssaAi27uBVAYj1Ckhv7C8eBEW8vv-SDfWXUD+MTA@mail.gmail.com

John,

thanks for your feedback. While implementing connection pooling would make
resource utilization more efficient, I don't think it's the root of my
problem. Most of the connected clients are IDLE. When I do

select * from pg_stat_activity where current_query not like '%IDLE%';

I only see several active queries at any given time.

-- Vlad

On Wed, Nov 14, 2012 at 3:23 PM, John R Pierce <pierce(at)hogranch(dot)com> wrote:

> On 11/14/12 1:13 PM, Vlad wrote:
>
>> Postgresql 9.1.6.
>> Postgres usually has 400-500 connected clients, most of them are idle.
>> Database is over 1000 tables (across 5 namespaces), taking ~150Gb on disk.
>>
>
> thats a really high client connection count for a 8 core system.
>
> I'd consider implementing a connection pool (like pgbouncer), and
> rewriting your client applications to connect, do a transaction,
> disconnect, so the actual number of postgres connections is much lower, say
> in the 16-48 range.
>
>


From: John R Pierce <pierce(at)hogranch(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-14 22:08:18
Message-ID: 50A41652.3020509@hogranch.com

On 11/14/12 1:34 PM, Vlad wrote:
> thanks for your feedback. While implementing connection pooling would
> make resources utilization more efficient, I don't think it's the root
> of my problem. Most of the connected clients are at IDLE. When I do
>
> select * from pg_stat_activity where current_query not like '%IDLE%';
>
> I only see several active queries at any given time.

what about during these spikes?
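
For example, something along these lines (just a sketch against the 9.1
pg_stat_activity columns) would show how many backends are actually running
vs. waiting while the system is stalled:

    -- sketch: backends by waiting state during a stall (9.1 column names)
    SELECT waiting, count(*) AS backends
    FROM pg_stat_activity
    WHERE current_query NOT LIKE '<IDLE>%'
    GROUP BY waiting;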

--
john r pierce N 37, W 122
santa cruz ca mid-left coast


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: John R Pierce <pierce(at)hogranch(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-15 17:28:36
Message-ID: CAHyXU0yPwv72j9pBxKjKccyTWA3wVzN-yM7cZ14pgFK7ARQXPA@mail.gmail.com

On Wed, Nov 14, 2012 at 4:08 PM, John R Pierce <pierce(at)hogranch(dot)com> wrote:
> On 11/14/12 1:34 PM, Vlad wrote:
>>
>> thanks for your feedback. While implementing connection pooling would make
>> resources utilization more efficient, I don't think it's the root of my
>> problem. Most of the connected clients are at IDLE. When I do
>>
>> select * from pg_stat_activity where current_query not like '%IDLE%';
>>
>> I only see several active queries at any given time.
>
>
> what about during these spikes?

Yeah -- if you are seeing a lot of queries pile up during these times,
it's time to explore a connection pooler, because it will keep system
load manageable in these situations.

things to check:
*) blocked queries (pg_locks/pg_stat_activity) -- see the sketch query
after this list
*) i/o wait. in particular, is the linux page cache flushing?
*) query stampede: we will want to measure TPS leading into the stall
and out of it.
*) anything else running on the box?
*) you have a large number of tables -- maybe this is statistics-file related?
*) let's log checkpoints to see if there is correlation with the stall
*) nice to have vmstat/iostat during/before/after the stall. for example, a
massive spike of context switches during the stall could point to an o/s
scheduler issue.
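
For the blocked-queries check, the usual pg_locks / pg_stat_activity join
will do; here is a sketch using the 9.1 column names (procpid / current_query):

    -- sketch: which backend is blocked by which (9.1 column names)
    SELECT w.pid            AS blocked_pid,
           wa.current_query AS blocked_query,
           h.pid            AS blocking_pid,
           ha.current_query AS blocking_query
    FROM pg_locks w
    JOIN pg_stat_activity wa ON wa.procpid = w.pid
    JOIN pg_locks h ON h.granted
                   AND h.pid <> w.pid
                   AND h.locktype = w.locktype
                   AND h.database      IS NOT DISTINCT FROM w.database
                   AND h.relation      IS NOT DISTINCT FROM w.relation
                   AND h.page          IS NOT DISTINCT FROM w.page
                   AND h.tuple         IS NOT DISTINCT FROM w.tuple
                   AND h.virtualxid    IS NOT DISTINCT FROM w.virtualxid
                   AND h.transactionid IS NOT DISTINCT FROM w.transactionid
                   AND h.classid       IS NOT DISTINCT FROM w.classid
                   AND h.objid         IS NOT DISTINCT FROM w.objid
                   AND h.objsubid      IS NOT DISTINCT FROM w.objsubid
    JOIN pg_stat_activity ha ON ha.procpid = h.pid
    WHERE NOT w.granted;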

merlin


From: Vlad <marchenko(at)gmail(dot)com>
To:
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-15 17:50:53
Message-ID: CAKeSUqXYhYNi1WrJ1EShcChXobMPSez8fN9L4nPC4dDJzv+_HA@mail.gmail.com

There is no big spike of queries that causes this; queries come in at a
relatively stable pace. It's just that the higher the rate of incoming
queries, the more likely this is to happen. Yes, when a stall happens, the
active queries pile up - but that's the result of the stall (the server
reacts slowly even to a keypress, not to mention query execution), not the
cause.

things to check:
> *) blocked queries (pg_locks/pg_stat_activity)
>

nada

> *) i/o wait. in particular, is linux page cache flushing.
>

no i/o wait, no IRQ (see below)

> *) query stampede: we will want to measure TPS leading into the stall
> and out of it.
> *) anything else running on the box?
>

just bare linux + postgresql.

> *) you have a large amount of table? maybe this statistics file related?
>

over 1000 tables across 4 or 5 schemas in a single database.

> *) let's log checkpoints to see if there is correlation with stall
>

checked, checkpoints happen much more rarely and without any relation to the
high-sys periods
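
(for reference, a quick way to double-check that is to compare the
checkpoint counters right before and after a stall -- a sketch:)

    -- sketch: if these counters don't move across a stall,
    -- checkpoints are not the trigger
    SELECT checkpoints_timed, checkpoints_req, buffers_checkpoint
    FROM pg_stat_bgwriter;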

> *) nice to have vmstat/iostat during/before/after stall. for example,
> massive spike of context switches during stall could point to o/s
> scheduler issue.
>

checked that as well - nothing. CS even lower.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          16.94    0.00    9.28    0.38    0.00   73.40

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               6.00        48.00         0.00         48          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          18.06    0.00   18.43    0.25    0.00   63.26

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              10.00       104.00         0.00        104          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           9.12    0.00  *86.74*   0.12    0.00    4.01

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.97         7.77         0.00          8          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.44    0.00  *96.58*   0.00    0.00    1.98

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               3.28        78.69         0.00        144          0

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b  swpd    free   buff    cache    si   so    bi    bo     in    cs us  sy id wa st
 1  0     0  279240  12016 14431964     0    0    32     0 197852  4299 15   9 76  0  0
 4  0     0  225984  12024 14419696     0    0     0    64 197711  5158 11   9 79  1  0
 0  0     0  260112  12024 14413636     0    0    48     0 196708  4618 17  10 73  0  0
 6  0     0  233936  12024 14375784     0    0   104     0 179861  4884 19  17 64  0  0
30  0     0  224904  12024 14354812     0    0     8     0  51088  1205  9 *86*  5  0  0
72  0     0  239144  12024 14333852     0    0   144     0  45601   542  2 *98*  0  0  0
78  0     0  224840  12024 14328536     0    0     0     0  38732   481  2 *94*  5  0  0
22  1     0  219072  12032 14250652     0    0   136   100  47323  1231  9 *90*  1  0  0


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Vlad <marchenko(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-15 20:11:55
Message-ID: CAHyXU0wD3P3rfNWYo3c-kTZpmMWceK7XQh1hLkaz4MEMtfcvrQ@mail.gmail.com

On Thu, Nov 15, 2012 at 11:50 AM, Vlad <marchenko(at)gmail(dot)com> wrote:
> there is no big spike of queries that cause that, queries come in relatively
> stable pace. It's just when the higher rate of queries coming, the more
> likely this to happen. yes, when stall happens , the active queries pile up
> - but that's the result of a stall (the server reacts slow on a keypress,
> not to mention queries execution), not the cause.
>
> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
>  r  b  swpd    free   buff    cache    si   so    bi    bo     in    cs us sy id wa st
>  1  0     0  279240  12016 14431964     0    0    32     0 197852  4299 15  9 76  0  0
>  4  0     0  225984  12024 14419696     0    0     0    64 197711  5158 11  9 79  1  0
>  0  0     0  260112  12024 14413636     0    0    48     0 196708  4618 17 10 73  0  0
>  6  0     0  233936  12024 14375784     0    0   104     0 179861  4884 19 17 64  0  0
> 30  0     0  224904  12024 14354812     0    0     8     0  51088  1205  9 86  5  0  0
> 72  0     0  239144  12024 14333852     0    0   144     0  45601   542  2 98  0  0  0
> 78  0     0  224840  12024 14328536     0    0     0     0  38732   481  2 94  5  0  0
> 22  1     0  219072  12032 14250652     0    0   136   100  47323  1231  9 90  1  0  0

hm. well, we can definitely rule out i/o. I reviewed your last
posting, and you said:
"Out of the top 50 processes in top, 48 of them are postmasters, one
is syslog, and one is psql. Each of the postmasters have a high %CPU,
the top ones being 80% and higher, the rest being anywhere between 30%
- 60%. Would postmaster 'queries' that are running attribute to the
sys CPU usage, or should they be under the 'us' CPU usage?"

Is this still true? Can we capture strace output from one of the high-%
postmasters to see if there are any clues there? Maybe we've uncovered
some type of weird spinlock contention issue. How large is your
database (or at least the portion of it that's commonly used)? Would
you characterize your queries as mostly small lookups, scans, or a
mix?
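
If it helps to get the raw numbers, a quick sketch:

    -- sketch: overall size of the current database and its largest relations
    SELECT pg_size_pretty(pg_database_size(current_database())) AS db_size;
    SELECT relname, pg_size_pretty(pg_total_relation_size(oid)) AS size
    FROM pg_class
    WHERE relkind = 'r'
    ORDER BY pg_total_relation_size(oid) DESC
    LIMIT 10;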

merlin


From: Vlad <marchenko(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-15 20:20:31
Message-ID: CAKeSUqV2VeYkJCv7_LtYGAHNBBBgm252Y1BctAzXbV-aGgzSuw@mail.gmail.com

Merlin,

this is not my report -- it's probably from the thread that I referenced as
having common symptoms. Here is info about my db:

Postgresql 9.1.6.
Postgres usually has 400-500 connected clients, most of them are idle.
Database is over 1000 tables (across 5 namespaces), taking ~150Gb on disk.
We have mostly select statements (joins across few tables), using indexes
and resulting in a small number of records returned.

So, mostly small lookups across a few tables joined by a primary key / index.
Also, I just performed an experiment - I switched our app over to using the
hot standby (PITR) instead of the master, and it experienced the same
problem. Since the standby can only serve read-only queries, there are no
write locks (except maybe when it is replaying WAL records from the
master?), yet SYS CPU still jumped sky-high.

-- Vlad

>
> Is this still true? Can we capture strace from one of the high %
> postmasters to see if there's any clues there. Maybe we've uncovered
> some type of weird spinlock contention issue. How large is your
> database (or at least the portion of it that's commonly used)? Would
> you characterize your queries as mostly small lookups, scans, or a
> mix?
>
> merlin
>


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Vlad <marchenko(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-15 20:31:07
Message-ID: CAHyXU0zNoCre4oNdpPacjz0nOQQASuwJzQWTGNcJEWrxUd_BDQ@mail.gmail.com

On Thu, Nov 15, 2012 at 2:20 PM, Vlad <marchenko(at)gmail(dot)com> wrote:
> Merlin,
>
> this is not my report, probably from a thread that I've referenced as having
> a common symptoms. Here is info about my db:
>
>
> Postgresql 9.1.6.
> Postgres usually has 400-500 connected clients, most of them are idle.
> Database is over 1000 tables (across 5 namespaces), taking ~150Gb on disk.
> We have mostly select statements (joins across few tables), using indexes
> and resulting in a small number of records returned.
>
> So mostly small lookups across few tables joining by a primary key / index.
> Also, I just perform an experiment - I switched our app over to using hot
> PITR instead of master and it experienced the same problem. So since PITR db
> can only perform read-only queries, there is no write-locks (except maybe
> when pitr is playing wal records from the master?), nevertheless SYS CPU
> jumped sky.

yeah. ok, next steps:
*) can you confirm that a postgres process is using high cpu (according
to top) during stall time
*) if so, please strace that process and save some of the log
*) you're using a 'bleeding edge' kernel, so we must be suspicious of
a regression there, particularly in the scheduler.
*) I am suspicious of a spinlock issue. so, if we can't isolate the
problem, is running a hand-compiled postgres a possibility (for lock
stats)?
*) what is the output of this:
echo /proc/sys/vm/zone_reclaim_mode

merlin


From: Vlad <marchenko(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-15 20:44:23
Message-ID: CAKeSUqV9gOTKU8g5XPhDimWa64+sqZUozgcd_JtJjkap-P50BA@mail.gmail.com

>
> yeah. ok, nest steps:
> *) can you confirm that postgres process is using high cpu (according
> to top) during stall time
>

yes, CPU is spread across a lot of postmasters

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
29863 pgsql 20 0 3636m 102m 36m R 19.1 0.3 0:01.33 postmaster
30277 pgsql 20 0 3645m 111m 37m R 16.8 0.3 0:01.27 postmaster
11966 pgsql 20 0 3568m 22m 15m R 15.1 0.1 0:00.66 postmaster
8073 pgsql 20 0 3602m 60m 26m S 13.6 0.2 0:00.77 postmaster
29780 pgsql 20 0 3646m 115m 43m R 13.6 0.4 0:01.13 postmaster
11865 pgsql 20 0 3606m 61m 23m S 12.8 0.2 0:01.87 postmaster
29379 pgsql 20 0 3603m 70m 30m R 12.8 0.2 0:00.80 postmaster
29727 pgsql 20 0 3616m 77m 31m R 12.5 0.2 0:00.81 postmaster

> *) if, so, please strace that process and save some of the log
>

https://dl.dropbox.com/u/109778/stall_postmaster.log

> *) you're using a 'bleeding edge' kernel. so we must be suspicious of
> a regression there, particularly in the scheduler.
>

this has been observed for a while, during which period the system went from
3.4.* kernels to 3.5.*... but I don't rule out that possibility.

> *) I am suspicious of spinlock issue. so, if we can't isolate the
> problem, is running a hand complied postgres a possibility (for lock
> stats)?
>

Yes, definitely possible - we run a manually compiled postgresql anyway.
Please provide instructions.

> *) what is the output of this:
> echo /proc/sys/vm/zone_reclaim_mode
>
>
I presume you wanted cat instead of echo, and it shows 0.

-- vlad


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Vlad <marchenko(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-15 21:49:42
Message-ID: CAHyXU0zfAER2S62z11kUnQ+TFtzg4PvBW1Ptcm_vVVpXgCqdMw@mail.gmail.com

On Thu, Nov 15, 2012 at 2:44 PM, Vlad <marchenko(at)gmail(dot)com> wrote:
>
>>
>> yeah. ok, nest steps:
>> *) can you confirm that postgres process is using high cpu (according
>> to top) during stall time
>
>
> yes, CPU is spread across a lot of postmasters
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 29863 pgsql 20 0 3636m 102m 36m R 19.1 0.3 0:01.33 postmaster
> 30277 pgsql 20 0 3645m 111m 37m R 16.8 0.3 0:01.27 postmaster
> 11966 pgsql 20 0 3568m 22m 15m R 15.1 0.1 0:00.66 postmaster
> 8073 pgsql 20 0 3602m 60m 26m S 13.6 0.2 0:00.77 postmaster
> 29780 pgsql 20 0 3646m 115m 43m R 13.6 0.4 0:01.13 postmaster
> 11865 pgsql 20 0 3606m 61m 23m S 12.8 0.2 0:01.87 postmaster
> 29379 pgsql 20 0 3603m 70m 30m R 12.8 0.2 0:00.80 postmaster
> 29727 pgsql 20 0 3616m 77m 31m R 12.5 0.2 0:00.81 postmaster
>
>>
>> *) if, so, please strace that process and save some of the log
>
>
> https://dl.dropbox.com/u/109778/stall_postmaster.log

ok, excellent. reviewing the log, this immediately caught my eye:

recvfrom(8, "\27\3\1\0@", 5, 0, NULL, NULL) = 5
recvfrom(8, "\327\327\nl\231LD\211\346\243(at)WW\254\244\363C\326\247\341\177\255\263~\327HDv-\3466\353"...,
64, 0, NULL, NULL) = 64
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 3000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 4000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 6000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 7000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 8000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 9000}) = 0 (Timeout)
semop(41713721, {{2, 1, 0}}, 1) = 0
lseek(295, 0, SEEK_END) = 0
lseek(296, 0, SEEK_END) = 8192

this is definitely pointing to a spinlock issue (see s_lock.c). what you
are seeing here is a backend getting caught in a spin loop via a stuck
spinlock. 0-9 = 10 times. did you by any chance check the logs? did
any backends restart? we are looking for this:

    elog(PANIC, "stuck spinlock (%p) detected at %s:%d",
         lock, file, line);

Anything else going on in the log? The way to enable lock statistics is
via the LWLOCK_STATS macro. But before doing anything, please check the
logs for errors/panics.

merlin


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Vlad <marchenko(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-15 22:16:06
Message-ID: CAHyXU0y4LCV3uTBZp9+5iX-WL5fSQVmrKeQtaGqTPQtVJ7V_DA@mail.gmail.com

On Thu, Nov 15, 2012 at 3:49 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Thu, Nov 15, 2012 at 2:44 PM, Vlad <marchenko(at)gmail(dot)com> wrote:
>>
>>>
>>> yeah. ok, nest steps:
>>> *) can you confirm that postgres process is using high cpu (according
>>> to top) during stall time
>>
>>
>> yes, CPU is spread across a lot of postmasters
>>
>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>> 29863 pgsql 20 0 3636m 102m 36m R 19.1 0.3 0:01.33 postmaster
>> 30277 pgsql 20 0 3645m 111m 37m R 16.8 0.3 0:01.27 postmaster
>> 11966 pgsql 20 0 3568m 22m 15m R 15.1 0.1 0:00.66 postmaster
>> 8073 pgsql 20 0 3602m 60m 26m S 13.6 0.2 0:00.77 postmaster
>> 29780 pgsql 20 0 3646m 115m 43m R 13.6 0.4 0:01.13 postmaster
>> 11865 pgsql 20 0 3606m 61m 23m S 12.8 0.2 0:01.87 postmaster
>> 29379 pgsql 20 0 3603m 70m 30m R 12.8 0.2 0:00.80 postmaster
>> 29727 pgsql 20 0 3616m 77m 31m R 12.5 0.2 0:00.81 postmaster
>>
>>>
>>> *) if, so, please strace that process and save some of the log
>>
>>
>> https://dl.dropbox.com/u/109778/stall_postmaster.log
>
> ok, excellent. reviewing the log, this immediately caught my eye:
>
> recvfrom(8, "\27\3\1\0@", 5, 0, NULL, NULL) = 5
> recvfrom(8, "\327\327\nl\231LD\211\346\243(at)WW\254\244\363C\326\247\341\177\255\263~\327HDv-\3466\353"...,
> 64, 0, NULL, NULL) = 64
> select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 3000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 4000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 6000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 7000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 8000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 9000}) = 0 (Timeout)
> semop(41713721, {{2, 1, 0}}, 1) = 0
> lseek(295, 0, SEEK_END) = 0
> lseek(296, 0, SEEK_END) = 8192
>
> this is definitely pointing to spinlock issue. see: slock.c what you
> are seeing here is a backend getting caught in a spin loop via a stuck
> spinlock. 0-9 = 10 times. did you by any chance check the logs? did
> any backends restart? we are looking for this:
>
> elog(PANIC, "stuck spinlock (%p) detected at %s:%d",
> lock, file, line);
>
> Anything else going on in the log? The way to enable locks status is
> via LWLOCK_STATS macro. But before doing any thing plz check logs for
> error/panic.

ah, scratch that. reading the code again, s_lock.c allows for up
to 1 second per delay (that is, you can delay up to 1000 times before a
panic). so the panic is unlikely to be in the log. also, spinlock waits
shouldn't increase by exactly 1000 us. investigating...

merlin


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Vlad <marchenko(at)gmail(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-15 22:29:50
Message-ID: 20121115222950.GC14789@alvh.no-ip.org

Merlin Moncure wrote:

> ok, excellent. reviewing the log, this immediately caught my eye:
>
> recvfrom(8, "\27\3\1\0@", 5, 0, NULL, NULL) = 5
> recvfrom(8, "\327\327\nl\231LD\211\346\243(at)WW\254\244\363C\326\247\341\177\255\263~\327HDv-\3466\353"...,
> 64, 0, NULL, NULL) = 64
> select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 3000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 4000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 6000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 7000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 8000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 9000}) = 0 (Timeout)
> semop(41713721, {{2, 1, 0}}, 1) = 0
> lseek(295, 0, SEEK_END) = 0
> lseek(296, 0, SEEK_END) = 8192
>
> this is definitely pointing to spinlock issue.

I met Rik van Riel (Linux kernel hacker) recently and we chatted about
this briefly. He strongly suggested that we should consider using
futexes on Linux instead of spinlocks; the big advantage being that
futexes sleep instead of spinning when contention is high. That would
reduce the system load in this scenario.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Vlad <marchenko(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-15 22:31:29
Message-ID: CAKeSUqWeo6pGYW+5eXZt=trFTnfDyg-3pKqh5qsDPFGyTCFZ=A@mail.gmail.com

sorry - no panics / errors in the log...

-- Vlad


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Vlad <marchenko(at)gmail(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-15 22:44:46
Message-ID: CAHyXU0yshER2j=U5ahC443btRWLcvOFiB=R9iWO0Y=dKRiveNA@mail.gmail.com

On Thu, Nov 15, 2012 at 4:29 PM, Alvaro Herrera
<alvherre(at)2ndquadrant(dot)com> wrote:
> Merlin Moncure escribió:
>
>> ok, excellent. reviewing the log, this immediately caught my eye:
>>
>> recvfrom(8, "\27\3\1\0@", 5, 0, NULL, NULL) = 5
>> recvfrom(8, "\327\327\nl\231LD\211\346\243(at)WW\254\244\363C\326\247\341\177\255\263~\327HDv-\3466\353"...,
>> 64, 0, NULL, NULL) = 64
>> select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {0, 3000}) = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {0, 4000}) = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {0, 6000}) = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {0, 7000}) = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {0, 8000}) = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {0, 9000}) = 0 (Timeout)
>> semop(41713721, {{2, 1, 0}}, 1) = 0
>> lseek(295, 0, SEEK_END) = 0
>> lseek(296, 0, SEEK_END) = 8192
>>
>> this is definitely pointing to spinlock issue.
>
> I met Rik van Riel (Linux kernel hacker) recently and we chatted about
> this briefly. He strongly suggested that we should consider using
> futexes on Linux instead of spinlocks; the big advantage being that
> futexes sleep instead of spinning when contention is high. That would
> reduce the system load in this scenario.

Well, so do postgres spinlocks, right? When we overflow
spins_per_delay we go to pg_usleep, which proxies to select() --
postgres spinlocks are a hybrid implementation. Moving to futexes is a
possible improvement (that's another discussion) in degenerate cases,
but I'm not sure that I've exactly zeroed in on the problem. Or am I
missing something?

What I've been scratching my head over is what code exactly would
cause an iterative sleep like the above. The code is here:

    pg_usleep(cur_delay * 1000L);

    /* increase delay by a random fraction between 1X and 2X */
    cur_delay += (int) (cur_delay *
                        ((double) random() / (double) MAX_RANDOM_VALUE) + 0.5);
    /* wrap back to minimum delay when max is exceeded */
    if (cur_delay > MAX_DELAY_MSEC)
        cur_delay = MIN_DELAY_MSEC;

...so cur_delay is supposed to increase in a non-linear fashion. I've
looked around the sleep, usleep, and latch calls and as of yet haven't
found anything that advances the timeout just like that (need to do
another pass). And we don't know for sure that this is directly related
to the OP's problem.

merlin


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Vlad <marchenko(at)gmail(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-15 23:52:51
Message-ID: 27746.1353023571@sss.pgh.pa.us

Merlin Moncure <mmoncure(at)gmail(dot)com> writes:
> What I've been scratching my head over is what code exactly would
> cause an iterative sleep like the above. The code is here:

> pg_usleep(cur_delay * 1000L);

> /* increase delay by a random fraction between 1X and 2X */
> cur_delay += (int) (cur_delay *
> ((double) random() / (double) MAX_RANDOM_VALUE) + 0.5);
> /* wrap back to minimum delay when max is exceeded */
> if (cur_delay > MAX_DELAY_MSEC)
> cur_delay = MIN_DELAY_MSEC;

IIRC that coding is of relatively recent vintage. The OP is probably
running some older version that increased cur_delay differently.

regards, tom lane


From: Vlad Marchenko <marchenko(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-16 00:06:24
Message-ID: 50A58380.1050003@gmail.com

Tom,

I just checked the version I'm running (9.1.6), and the code is quite
similar (src/backend/storage/lmgr/s_lock.c)

    pg_usleep(cur_delay * 1000L);

#if defined(S_LOCK_TEST)
    fprintf(stdout, "*");
    fflush(stdout);
#endif

    /* increase delay by a random fraction between 1X and 2X */
    cur_delay += (int) (cur_delay *
                        ((double) random() / (double) MAX_RANDOM_VALUE) +
                        0.5);
    /* wrap back to minimum delay when max is exceeded */
    if (cur_delay > MAX_DELAY_MSEC)
        cur_delay = MIN_DELAY_MSEC;

-- vlad


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Vlad <marchenko(at)gmail(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-16 00:07:14
Message-ID: CAMkU=1w3fq_oEDPGd17BEK2bKFvsG3+TuEB7Vb7Gc8ikmK9ipA@mail.gmail.com

On Thu, Nov 15, 2012 at 2:44 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:

>>> select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
>>> select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
>>> select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
>>> select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout)
>>> select(0, NULL, NULL, NULL, {0, 3000}) = 0 (Timeout)
>>> select(0, NULL, NULL, NULL, {0, 4000}) = 0 (Timeout)
>>> select(0, NULL, NULL, NULL, {0, 6000}) = 0 (Timeout)
>>> select(0, NULL, NULL, NULL, {0, 7000}) = 0 (Timeout)
>>> select(0, NULL, NULL, NULL, {0, 8000}) = 0 (Timeout)
>>> select(0, NULL, NULL, NULL, {0, 9000}) = 0 (Timeout)

This is not entirely inconsistent with the spinlock. Note that 1000
is repeated 3 times, and 5000 is missing.

This might just be a misleading random sample we got here. I've seen
similar close spacing in some simulations I've run.

It is not clear to me why we use a resolution of 1 msec here. If the
OS's implementation of select() eventually rounds to the nearest msec,
that is its business. But why do we have to lose intermediate
precision due to its decision?

Cheers,

Jeff


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Vlad <marchenko(at)gmail(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-16 15:28:32
Message-ID: CAHyXU0wJzGCdg9gKdRdnWvXaATia42BZa=DCLYhQ=u-qLtG++w@mail.gmail.com

On Thu, Nov 15, 2012 at 6:07 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> On Thu, Nov 15, 2012 at 2:44 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>
>>>> select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
>>>> select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
>>>> select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
>>>> select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout)
>>>> select(0, NULL, NULL, NULL, {0, 3000}) = 0 (Timeout)
>>>> select(0, NULL, NULL, NULL, {0, 4000}) = 0 (Timeout)
>>>> select(0, NULL, NULL, NULL, {0, 6000}) = 0 (Timeout)
>>>> select(0, NULL, NULL, NULL, {0, 7000}) = 0 (Timeout)
>>>> select(0, NULL, NULL, NULL, {0, 8000}) = 0 (Timeout)
>>>> select(0, NULL, NULL, NULL, {0, 9000}) = 0 (Timeout)
>
> This is not entirely inconsistent with the spinlock. Note that 1000
> is repeated 3 times, and 5000 is missing.
>
> This might just be a misleading random sample we got here. I've seen
> similar close spacing in some simulations I've run.
>
> It is not clear to me why we use a resolution of 1 msec here. If the
> OS's implementation of select() eventually rounds to the nearest msec,
> that is its business. But why do we have to lose intermediate
> precision due to its decision?

Yeah -- you're right, this is definitely a spinlock issue. Next steps:

*) in mostly read workloads, we have a couple of known frequent
offenders, in particular the 'BufFreelistLock'. One way we can
influence that guy is to try and significantly lower/raise shared
buffers. So this is one thing to try (a quick way to see how full
shared buffers actually are is sketched after this list).

*) failing that, the LWLOCK_STATS macro can be compiled in to give us some
information about the particular lock(s) we're binding on. Hopefully
it's an lwlock -- this will make diagnosing the problem easier.

*) if we're not blocking on an lwlock, it's possibly a buffer pin related
issue? I've seen this before, for example on an index scan that is
dependent on a seq scan. This long thread:
"http://postgresql.1045698.n5.nabble.com/9-2beta1-parallel-queries-ReleasePredicateLocks-CheckForSerializableConflictIn-in-the-oprofile-td5709812i100.html"
has a lot of information about that case and deserves a review.

*) we can consider experimenting with futexes
(http://archives.postgresql.org/pgsql-hackers/2012-06/msg01588.php)
to see if things improve. This is dangerous, and could crash your
server/eat your data, so fair warning.
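
Coming back to the first point: a quick way to see how much of
shared_buffers is actually populated (a sketch, assuming the
contrib/pg_buffercache module is available on your build):

    -- sketch: how full shared_buffers actually is
    -- (assumes contrib/pg_buffercache is installed)
    CREATE EXTENSION IF NOT EXISTS pg_buffercache;
    SELECT count(*)                      AS buffers_total,
           count(relfilenode)            AS buffers_used,
           count(*) - count(relfilenode) AS buffers_free
    FROM pg_buffercache;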

merlin


From: Vlad <marchenko(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-16 15:52:48
Message-ID: 50A66150.2080805@gmail.com

Merlin,

> Yeah -- you're right, this is definitely spinlock issue. Next steps:
>
> *) in mostly read workloads, we have a couple of known frequent
> offenders. In particular the 'BufFreelistLock'. One way we can
> influence that guy is to try and significantly lower/raise shared
> buffers. So this is one thing to try.

the server has 32GB, with 3.2GB dedicated to shared buffers. I've increased
it to 13.2GB; the stall still happened. (I have a way of controlling the
number of queries hitting postgresql by shutting down our own memcache-based
app cache, so to test for a stall I temporarily shut down a few memcached
servers.)

> *) failing that, LWLOCK_STATS macro can be compiled in to give us some
> information about the particular lock(s) we're binding on. Hopefully
> it's a lwlock -- this will make diagnosing the problem easier.

I've enabled that macro and am seeing lwlock messages flying by in the log
(see below), even when no high-sys-cpu stall is observed at the moment.
Should I be looking for something in particular?

PID 17293 lwlock 5906: shacq 1 exacq 0 blk 0
PID 17293 lwlock 5932: shacq 1 exacq 0 blk 0
PID 17293 lwlock 5934: shacq 1 exacq 0 blk 0
PID 17293 lwlock 10854: shacq 1 exacq 0 blk 0
PID 17293 lwlock 10856: shacq 4 exacq 0 blk 0
PID 17293 lwlock 10858: shacq 15 exacq 0 blk 0
PID 17293 lwlock 16442: shacq 4 exacq 0 blk 0
PID 17293 lwlock 16596: shacq 1 exacq 0 blk 0
PID 17293 lwlock 20458: shacq 3 exacq 0 blk 0
PID 17293 lwlock 20460: shacq 10 exacq 0 blk 0
PID 17293 lwlock 20464: shacq 2 exacq 0 blk 0
PID 17293 lwlock 20466: shacq 1 exacq 0 blk 0
PID 17293 lwlock 20480: shacq 2 exacq 0 blk 0
PID 17293 lwlock 20482: shacq 1 exacq 0 blk 0
PID 17293 lwlock 20484: shacq 1 exacq 0 blk 0
PID 17293 lwlock 20508: shacq 12 exacq 0 blk 0
PID 17293 lwlock 20510: shacq 6 exacq 0 blk 0
PID 17293 lwlock 20938: shacq 11 exacq 0 blk 0
PID 17293 lwlock 20940: shacq 22 exacq 0 blk 0
PID 17293 lwlock 20942: shacq 28 exacq 0 blk 0
PID 17293 lwlock 20944: shacq 56 exacq 0 blk 0
PID 17293 lwlock 20946: shacq 1 exacq 0 blk 0
PID 17293 lwlock 20948: shacq 2 exacq 0 blk 0
PID 17293 lwlock 20950: shacq 11 exacq 0 blk 0
PID 17293 lwlock 20952: shacq 3 exacq 0 blk 0
PID 17293 lwlock 20954: shacq 7 exacq 0 blk 0
PID 17293 lwlock 20956: shacq 2 exacq 0 blk 0
PID 17293 lwlock 20958: shacq 1 exacq 0 blk 0
PID 17293 lwlock 20960: shacq 1 exacq 0 blk 0
PID 17293 lwlock 20962: shacq 2 exacq 0 blk 0
PID 17293 lwlock 20964: shacq 2 exacq 0 blk 0
PID 17293 lwlock 46624: shacq 24 exacq 0 blk 0
PID 17293 lwlock 68126: shacq 6 exacq 0 blk 0

> *) if we're not blocking on lwlock, it's possibly a buffer pin related
> issue? I've seen this before, for example on an index scan that is
> dependent on an seq scan. This long thread:
> "http://postgresql.1045698.n5.nabble.com/9-2beta1-parallel-queries-ReleasePredicateLocks-CheckForSerializableConflictIn-in-the-oprofile-td5709812i100.html"
> has a lot information about that case and deserves a review.
>
> *) we can consider experimenting with futex
> (http://archives.postgresql.org/pgsql-hackers/2012-06/msg01588.php)
> to see if things improve. This is dangerous, and could crash your
> server/eat your data, so fair warning.
>
> merlin

-- vlad


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Vlad <marchenko(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-16 16:21:16
Message-ID: CAHyXU0zcPprBPVooYuwAELWHn5WdNVT8Egb0B1T=F0_deSWnSg@mail.gmail.com

On Fri, Nov 16, 2012 at 9:52 AM, Vlad <marchenko(at)gmail(dot)com> wrote:
> Merlin,
>
>
>> Yeah -- you're right, this is definitely spinlock issue. Next steps:
>>
>> *) in mostly read workloads, we have a couple of known frequent
>> offenders. In particular the 'BufFreelistLock'. One way we can
>> influence that guy is to try and significantly lower/raise shared
>> buffers. So this is one thing to try.
>
>
> server has 32gb with 3.2gb dedicated for share buffers. I've increased it to
> 13.2gb, the stall still happened (I have a way of controlling number of
> queries hitting postgresql by shutting down own memcache-based app cache, so
> to test for stall I temporary shut down few memcached servers).
>
>
>> *) failing that, LWLOCK_STATS macro can be compiled in to give us some
>> information about the particular lock(s) we're binding on. Hopefully
>> it's a lwlock -- this will make diagnosing the problem easier.
>
>
> I've enabled that macro, seeing flying lwlock messages in the log (see
> below), even when there is no high-sys-cpu stall observed at the moment.
> Should I be looking for something in particular?

We're looking for spikes in 'blk', which represents when lwlocks bump.
If you're not seeing any, then this suggests a buffer pin related
issue -- which is also supported by the fact that raising shared
buffers didn't help. If you're not seeing 'blk's, go ahead and
disable the stats macro.

So, what we need to know now is:
*) What happens when you drastically *lower* shared buffers? Say, to
64mb? Note, you may experience higher load for unrelated reasons and
have to scuttle the test. Also, if you have to crank it higher to handle
internal server structures, do that. This is a hail mary, but maybe
something interesting spits out.

*) How many specific query plans are needed to introduce the
condition? Hopefully, it's not too many. If so, let's start
gathering the plans. If you have a lot of plans to sift through, one
thing we can attempt in order to eliminate noise is to tweak
log_min_duration_statement so that during stall times (only) it logs
offending queries that are unexpectedly blocking.

*) Approximately how big is your 'working set' -- the data your
queries are routinely hitting? (A rough way to approximate this is
sketched after this list.)

*) Is the distribution of the *types* of queries uniform? Or do you
have special processes that occur on intervals?
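
For the working-set question, a rough approximation from the stats views
(just a sketch):

    -- sketch: most frequently scanned tables and their on-disk sizes,
    -- as a rough proxy for the working set
    SELECT relname,
           coalesce(seq_scan, 0) + coalesce(idx_scan, 0) AS total_scans,
           pg_size_pretty(pg_total_relation_size(relid)) AS total_size
    FROM pg_stat_user_tables
    ORDER BY coalesce(seq_scan, 0) + coalesce(idx_scan, 0) DESC
    LIMIT 20;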

Thanks for your patience.

merlin


From: Vlad <marchenko(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-16 17:19:05
Message-ID: CAKeSUqVuAOTs7vzEkdjDui-=BQPdPv9SD3NigJ4q+BD4C4m_DA@mail.gmail.com

> We're looking for spikes in 'blk' which represents when lwlocks bump.
> If you're not seeing any then this is suggesting a buffer pin related
> issue -- this is also supported by the fact that raising shared
> buffers didn't help. If you're not seeing 'bk's, go ahead and
> disable the stats macro.
>

most blk values come in at 0, some at 1, a few hitting 100. I can't say that
during stall times the ratio of blk 0 vs. blk non-0 is very different.

> So, what we need to know now is:
> *) What happens when you drastically *lower* shared buffers? Say, to
> 64mb? Note, you may experience higher load for unrelated reasons and
> have to scuttle the test. Also, if you have to crank higher to handle
> internal server structures, do that. This is a hail mary, but maybe
> something interesting spits out.
>

lowering shared_buffers didn't help.

> *) How many specific query plans are needed to introduce the
> condition, Hopefully, it's not too many. If so, let's start
> gathering the plans. If you have a lot of plans to sift through, one
> thing we can attempt to eliminate noise is to tweak
> log_min_duration_statement so that during stall times (only) it logs
> offending queries that are unexpectedly blocking.
>

unfortunately, there are quite a few query plans... also, I don't think
setting log_min_duration_statement will help us, because when the server is
hitting a high load average, it reacts slowly even to a key press. So even
non-offending queries will take long to execute. I see all sorts of
queries being executed slowly during a stall, spanning from the simple
LOG: duration: 1131.041 ms statement: SELECT 'DBD::Pg ping test'
to complex ones joining multiple tables.
We are still looking through all the logged queries in an attempt to find
the ones that are causing the problem; I'll report if we find any clues.

>
> *) Approximately how big is your 'working set' -- the data your
> queries are routinely hitting?
>

I *think* it's within the few-hundred-MB range.

>
> *) Is the distribution of the *types* of queries uniform? Or do you
> have special processes that occur on intervals?
>

it's pretty uniform.

>
> Thanks for your patience.
>
>
oh no, thank you for trying to help me to resolve this issue.

-- vlad


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Vlad <marchenko(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-16 17:39:10
Message-ID: CAHyXU0wH+nN8g2+K2D0gyZ0tskepMpPjw3kQtXeZzd4yVOj4+A@mail.gmail.com

On Fri, Nov 16, 2012 at 11:19 AM, Vlad <marchenko(at)gmail(dot)com> wrote:
>
>> We're looking for spikes in 'blk' which represents when lwlocks bump.
>> If you're not seeing any then this is suggesting a buffer pin related
>> issue -- this is also supported by the fact that raising shared
>> buffers didn't help. If you're not seeing 'bk's, go ahead and
>> disable the stats macro.
>
>
> most blk comes with 0, some with 1, few hitting 100. I can't say that during
> stall times the number of blk 0 vs blk non-0 are very different.

right. this is feeling more and more like a buffer pin issue. but
even then we can't be certain -- it could be a symptom, not the cause.
to prove it we need to demonstrate that everyone is spinning and
waiting, which we haven't done. classic spinlock contention manifests
as high user cpu. we are bogging down in the kernel, so I wonder if it's
all the select() calls. we haven't yet ruled out a kernel regression.

If I were you, I'd be investigating pgbouncer to see if your app is
compliant with transaction-mode pooling, if for no other reason
than that it will mitigate high load issues.

>> *) How many specific query plans are needed to introduce the
>> condition, Hopefully, it's not too many. If so, let's start
>> gathering the plans. If you have a lot of plans to sift through, one
>> thing we can attempt to eliminate noise is to tweak
>> log_min_duration_statement so that during stall times (only) it logs
>> offending queries that are unexpectedly blocking.
>
>
> unfortunately, there are quite a few query plans... also, I don't think
> setting log_min_duration_statement will help us, cause when server is
> hitting high load average, it reacts slowly even on a key press. So even
> non-offending queries will be taking long to execute. I see all sorts of
> queries a being executed long during stall: spanning from simple
> LOG: duration: 1131.041 ms statement: SELECT 'DBD::Pg ping test'
> to complex ones, joining multiple tables.
> We are still looking into all the logged queries in attempt to find the ones
> that are causing the problem, I'll report if we find any clues.

right.

merlin


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Vlad <marchenko(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-16 18:26:43
Message-ID: CAMkU=1x3jPg9fDW8Ng_vdafYRKne0Y723HLTGa25wPRiyfvJWA@mail.gmail.com

On Fri, Nov 16, 2012 at 8:21 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Fri, Nov 16, 2012 at 9:52 AM, Vlad <marchenko(at)gmail(dot)com> wrote:
>>
>>> *) failing that, LWLOCK_STATS macro can be compiled in to give us some
>>> information about the particular lock(s) we're binding on. Hopefully
>>> it's a lwlock -- this will make diagnosing the problem easier.
>>
>>
>> I've enabled that macro, seeing flying lwlock messages in the log (see
>> below), even when there is no high-sys-cpu stall observed at the moment.
>> Should I be looking for something in particular?
>
> We're looking for spikes in 'blk' which represents when lwlocks bump.

Unfortunately, LWLock contention is only a weak indication of spinlock
contention. For example, if backends are furiously acquiring and
releasing the same LWLock in shared mode and no one is taking it in
exclusive mode, then there can be high spinlock contention but there
will never be a blk.

In 9.3 there is a new field that tells how many spin delays there were
on the mutex that is behind each lock. That was commit
b79ab00144e64217d41, maybe he can port that back to his version.

But that only tells you about LWLock mutexes, not about all the other
ones in PG.

The attached patch logs every spin delay with where in the source it comes from.

I don't think it will have too much of a performance impact as the
logs only happen when a sleep of at least 1ms is going to happen
anyway. But it could generate massive amounts of logs.

Cheers,

Jeff

Attachment: spin_log.patch (application/octet-stream, 558 bytes)

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Vlad <marchenko(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-16 19:19:18
Message-ID: CAHyXU0xE0VGN7Vvgch5Uq9AiyfhSt8sVKEH8zaq20Z-0pe1h4g@mail.gmail.com

On Fri, Nov 16, 2012 at 12:26 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> On Fri, Nov 16, 2012 at 8:21 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>> On Fri, Nov 16, 2012 at 9:52 AM, Vlad <marchenko(at)gmail(dot)com> wrote:
>>>
>>>> *) failing that, LWLOCK_STATS macro can be compiled in to give us some
>>>> information about the particular lock(s) we're binding on. Hopefully
>>>> it's a lwlock -- this will make diagnosing the problem easier.
>>>
>>>
>>> I've enabled that macro, seeing flying lwlock messages in the log (see
>>> below), even when there is no high-sys-cpu stall observed at the moment.
>>> Should I be looking for something in particular?
>>
>> We're looking for spikes in 'blk' which represents when lwlocks bump.
>
> Unfortunately LWLock contention is only a weak indication of spinlock
> contention. For example, if backends are furiously acquiring and
> releasing the same LWLock in shared mode and no one is getting an
> exclusive mode, then there can be high spinlock contention but there
> will never be a blk.

yup. this problem doesn't smell like lwlock issues. typically there
the problem manifests as low cpu usage -- everybody waits.
classic spinlock contention (at least from what i've seen) is very
high *userspace* cpu utilization and low work output. this time it's
different -- the OP is bogging down in the kernel, so it's not impossible
we're troubleshooting the symptom, not the cause.

> In 9.3 there is a new field that tells how many spin delays there were
> on the mutex that is behind each lock. That was commit
> b79ab00144e64217d41, maybe he can port that back to his version.
>
> But that only tells you about LWLock mutexes, not about all the other
> ones in PG.
>
> The attached patch logs every spin delay with where in the source it comes from.

yeah, OP should fire this off. good stuff. I'll bet lunch (if we ever
happen to meet) it's on buffer pin.

merlin


From: Vlad <marchenko(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-16 20:13:33
Message-ID: CAKeSUqW3R-ObCyW4a_W8nhUXPVUBZP36asUuw=aU5Kzds=yofg@mail.gmail.com

ok, I've applied that patch and ran it. The stall started around
13:50:45...50 and lasted until the end.

https://dl.dropbox.com/u/109778/postgresql-2012-11-16_134904-stripped.log

the actual log has more data (including the statement following each 'spin
delay' record), but there is some sensitive info inside which I can't share
publicly.

-- Vlad

> yup. this problem doesn't smell like lwlock issues. typically there
> the problem manifests as low cpu performance, everybody waits.
> classic spinlock contention (at least from what i've seen) is very
> high *userspace* cpu utilization and low work output. this time it's
> different -- OP is bogging in the kernel so it's not impossible we're
> troubleshooting the symptom, not the cause.
>
> > In 9.3 there is a new field that tells how many spin delays there were
> > on the mutex that is behind each lock. That was commit
> > b79ab00144e64217d41, maybe he can port that back to his version.
> >
> > But that only tells you about LWLock mutexes, not about all the other
> > ones in PG.
> >
> > The attached patch logs every spin delay with where in the source it
> comes from.
>
> yeah, OP should fire this off. good stuff. I'll bet lunch (if we ever
> happen to meet) it's on buffer pin.
>
> merlin
>


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Vlad <marchenko(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-16 20:31:34
Message-ID: CAHyXU0yus50=7Bz81Dm7syuU1qohfTVcv=2fxZDBXj7M3UT-jw@mail.gmail.com

On Fri, Nov 16, 2012 at 2:13 PM, Vlad <marchenko(at)gmail(dot)com> wrote:
> ok, I've applied that patch and ran. The stall started around 13:50:45...50
> and lasted until the end
>
> https://dl.dropbox.com/u/109778/postgresql-2012-11-16_134904-stripped.log
>
> the actual log has more data (including statement following each 'spin
> delay' record), but there is some sensitive info inside which I can't share
> with public.

first thoughts:
no single thing really stands out -- contention is all over the place.
lwlock, pinbuffer, dynahash (especially). I am again suspicious of
bad scheduler interaction. any chance we can fire up pgbouncer?

merlin


From: Vlad <marchenko(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-16 21:21:09
Message-ID: CAKeSUqUsx69dNxQZWSpT+XHm=oRWafKhmgrz+D3cTNhSP46rGQ@mail.gmail.com

what would pgbouncer do in my case? The number of connections will decrease,
but the number of active clients won't be smaller. As I understand it, the
latter is what matters.

-- Vlad

On Fri, Nov 16, 2012 at 2:31 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:

>
> first thoughts:
> no single thing really stands out -- contention is all over the place.
> lwlock, pinbuffer, dynahash (especially). I am again suspicious of
> bad scheduler interaction. any chance we can fire up pgbouncer?
>
> merlin
>


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Vlad <marchenko(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-16 21:32:35
Message-ID: CAHyXU0wo7z7oj5ZaTGukWKz8-h8fM9kdkauYt3AWmb5u19aSFg@mail.gmail.com

On Fri, Nov 16, 2012 at 3:21 PM, Vlad <marchenko(at)gmail(dot)com> wrote:
> what would pgbouncer do in my case? Number of connections will decrease, but
> number of active clients won't be smaller. As I understand the latter ones
> are that important.

Well, one thing that struck me was how little spinlock contention
there actually was. Yeah, backends are delaying here and there,
which isn't great, but a few dozen delays per second across
several hundred backends doesn't seem like it should be pegging sys
cpu%. This is all pointing to the problem not being in postgres, but
in linux.

pgbouncer would do two things:
1) perhaps guard you against some o/s issue
2) keep the system more responsive during a stall (since by controlling the
pool you control the number of queries piling up).

of course, this comes at the expense of some overhead.

But, before going through all that, how about timing strace-recorded
calls (strace -T) both in stall and non-stall conditions? I was
assuming select(), but maybe it's something else... for example the
recently fixed lseek.

merlin


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Vlad <marchenko(at)gmail(dot)com>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-18 22:24:36
Message-ID: CAMkU=1zvGGJmrO=Q-Fuup-v7qbw56OR0wCkLaEQLmjqdY2JomA@mail.gmail.com

On Fri, Nov 16, 2012 at 12:13 PM, Vlad <marchenko(at)gmail(dot)com> wrote:
> ok, I've applied that patch and ran. The stall started around 13:50:45...50
> and lasted until the end
>
> https://dl.dropbox.com/u/109778/postgresql-2012-11-16_134904-stripped.log

That isn't as much log as I expected. But I guess only the tip of the
iceberg gets logged; it could be that most spinlocks are contended but
then get acquired just before the sleep (and log) would have happened.

I'm not sure how to figure out just how much spinning is going on
below the surface, but my gut feeling is there isn't enough of it to
explain the amount of slowdown you are seeing (and it would probably
be accounted for as user time rather than system time)

Maybe we could whip up a dummy standalone program that uses PG's
spinlock code in an intentionally contended way and see how your
system reports resource usage for that. Maybe your system reports
fighting for cache lines or whatever is going on behind the scenes as
system time rather than user time, or something.

It would be a little easier to figure out what was going on there if
the log included the pid (%p) so we can see if it is the same process
doing a bunch of consecutive sleeps, or a bunch of different
processes. Also, logging the millisecond (%m rather than %t) would be
helpful.
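
For example, a prefix along these lines in postgresql.conf would capture
both (the exact formatting is just a suggestion):

    log_line_prefix = '%p [%m] '   # %p = process ID, %m = timestamp with milliseconds

log_line_prefix only needs a reload (pg_ctl reload or SELECT pg_reload_conf()).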

But the way that the point of contention jumps around a lot points to
a scheduler issue rather than a postgres-specific issue. A backend
gets pre-empted while holding a spinlock, and then the kernel is
resistant to rescheduling it even though all other processes are piled
up behind it. But again, that theory would lead to more log lines
than we see, I would think.

Anyway, pgbouncer might help by ensuring that a connection gets
pre-empted by pgbouncer at a point where it has no locks, rather than
by the kernel at a point where it does hold locks.

Any chance you could try an older kernel and see if that reduces the problem?

One thing that does stand out is that it looks like partitioned hash
tables in dynahash.c could benefit by having their freelist
partitioned as well, assuming we can give up the hash_get_num_entries
function.

Cheers,

Jeff


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Vlad <marchenko(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-19 14:33:22
Message-ID: CAHyXU0wAZgGKM0LTRuW0iaXmpUvFzrYxxf5JugEey6_wpFvNSA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Sun, Nov 18, 2012 at 4:24 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> On Fri, Nov 16, 2012 at 12:13 PM, Vlad <marchenko(at)gmail(dot)com> wrote:
>> ok, I've applied that patch and ran. The stall started around 13:50:45...50
>> and lasted until the end
>>
>> https://dl.dropbox.com/u/109778/postgresql-2012-11-16_134904-stripped.log
>
> That isn't as much log as I expected. But I guess only the tip of the
> iceberg gets logged, it could be that most spinlocks are contended but
> then get acquired just before the sleep (and log) would have happened.
>
> I'm not sure how to figure out just how much spinning is going on
> below the surface, but my gut feeling is there isn't enough of it to
> explain the amount of slowdown you are seeing (and it would probably
> be accounted for as user time rather than system time)

yes. classic spinlock contention symptoms are high user cpu and almost
no work getting done. something else must be going on here.

> Maybe we could whip up a dummy standalone program that uses PG's
> spinlock code in a intentionally contended way and see how your
> system reports resource usage for that. Maybe your system reports
> fighting for cache lines or whatever is going on behind the scenes as
> system time rather than user time, or something.

I don't think this is the case. It's looking more and more like
scheduler issues, although it's unclear if that is due to some kind of
regression or just symptoms of backends piling up. I'm also starting
to wonder if the spinlock contention we are seeing (which according to
your patch really isn't all that much) is just symptomatic of
scheduler issues.

> It would be a little easier to figure out what was going on there if
> the log included the pid (%p) so we can see if it is the same process
> doing a bunch of consecutive sleeps, or a bunch of different
> processes. Also, logging the millisecond (%m rather than %t) would be
> helpful.
>
> But the way that the point of contention jumps around a lot points to
> a scheduler issue rather than a postgres-specific issue. A backend
> gets pre-empted while holding a spinlock, and then the kernel is
> resistant to rescheduling it even though all other processes are piled
> up behind it. But again, that theory would lead to more log lines
> than we see, I would think.

yeah. I asked for some longer strace logs but then everyone went home
for the weekend.

> Anyway, pgbouncer might help by ensuring that a connection gets
> pre-empted by pgbouncer at a point where it has no locks, rather than
> by the kernel at a point where it does hold locks.

yes -- I think pgbouncer is the key here because it's going to tell us
if the scheduler is the culprit. If you look carefully at the
strace logs, you'll see that system call times become unpredictable as
the backends are getting scheduled out...particularly yielding calls like
select(). So we need to determine if this is normal high-load
behavior or if there is an underlying mechanism.

> Any chance you could try an older kernel and see if that reduces the problem?

This is one thing to try. Personally I think pgbouncer/transaction
mode is low-hanging fruit if it 'drops in' to OP's app. If the problem
goes away this will pretty much point the finger at Linux AFAICT (in
which case we can start messing around with the kernel).

merlin


From: Vlad <marchenko(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-19 16:50:25
Message-ID: CAKeSUqXc_2A4tPrAVJ-uSw1wdbt9_ZQprboLGiWL-xodt5=aZA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

I just did a little experiment: I extracted the top four queries that
executed the longest during stall times and launched a pgbench test with 240
clients. Yet I wasn't able to put the server into a stall with that. Although
load average was hitting 120+, it was all user cpu, single-digit % system.
The system remained pretty responsive (to a keypress), in contrast to when a
high-sys-cpu stall happens with similar LA numbers.

This makes me think that there is probably a different query or condition
(other than the ones I tried) that is responsible for creating the high sys
cpu. Also, from my limited knowledge of postgresql and kernel internals,
this experiment casts doubt on the idea that the linux scheduler is the
root cause....

Here is a longer log, first column is PID. High-sys-cpu stall happened
around 10:42:37
https://dl.dropbox.com/u/109778/postgresql-2012-11-19_103425-stripped.log

I'll try to put pgbouncer in place and repeat, though this will take
longer to set up.

-- Vlad


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Vlad <marchenko(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-19 17:19:21
Message-ID: CAHyXU0y8-XeBY9h1snUvnfuASYMa37md5kpOwSj+=g1t6XQOMw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Mon, Nov 19, 2012 at 10:50 AM, Vlad <marchenko(at)gmail(dot)com> wrote:
> I just did a little experiment: extracted top four queries that were
> executed the longest during stall times and launched pgbench test with 240
> clients. Yet I wasn't able to put the server into a stall with that. Also
> load average was hitting 120+, it was all user cpu, single digit % system.
> The system remained pretty responsive (on a keypress), in contrary to when
> high-sys-cpu stall happens, showing similar LA numbers.
>
> This makes me think that there is probably a different (than ones I tried)
> query or condition that is responsible for creating high-sys-cpu. Also, from
> my limited knowledge of postgresql and kernel internals, this experiment
> puts under question idea that linux scheduler might be the corner stone....
>
> Here is a longer log, first column is PID. High-sys-cpu stall happened
> around 10:42:37
> https://dl.dropbox.com/u/109778/postgresql-2012-11-19_103425-stripped.log
>
> I'll try to put pgbouncer in place and repeat, also though this will take
> longer time to setup.

yeah. interesting -- contention was much higher this time and that
changes things. strange how it was missed earlier.

you're getting bounced around a lot in lwlock especially
(unfortunately we don't know which one). I'm going to hazard another
guess: maybe the trigger here is that when the number of contending
backends exceeds some critical number (probably based on the number of
cores), you see a quick cpu spike (causing more backends to lock and
pile up) as cache line bouncing sets in. That spike doesn't last
long, because the spinlocks quickly accumulate delay counts then punt
to the scheduler which is unable to cope. The exact reason why this
is happening to you in exactly this way (I've never seen it) is
unclear. Also the line between symptom and cause is difficult to
draw.

unfortunately, in your case spinlock re-scheduling isn't helping. log
entries like this one:
18764 [2012-11-19 10:43:50.124 CST] LOG: JJ spin delay from file
sinvaladt.c line 512 delay 212, pointer 0x7f514959a394 at character 29

are suggesting major problems. you're dangerously close to a stuck
spinlock which is lights out for the database.

merlin


From: Vlad <marchenko(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-19 18:02:23
Message-ID: CAKeSUqWRSCE3exypfKJj4hBJUB1vSWUdrX_n1aAcveZK3SLJvg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Some additional observations and food for thought. Our app uses connection
caching (Apache::DBI). By disabling Apache::DBI and forcing
client re-connection for every (http) request processed, I eliminated the
stall. The user cpu usage jumped (mostly because prepared sql queries are no
longer available, plus some additional overhead on re-connection), but not a
single case of the high-sys-cpu stall occurred.

I cannot completely rule out the possibility that some leftovers
(an unfinished transaction?) remain after serving an http request, which, in
the absence of connection caching, are discarded for sure....

-- Vlad

On Mon, Nov 19, 2012 at 11:19 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:

>
> yeah. interesting -- contention was much higher this time and that
> changes things. strange how it was missed earlier.
>
> you're getting bounced around a lot in lwlock especially
> (unfortunately we don't know which one). I'm going to hazard another
> guess: maybe the trigger here is when the number of contending
> backends exceeds some critical number (probably based on the number of
> cores) you see a quick cpu spike (causing more backends to lock and
> pile up) as cache line bouncing sets in. That spike doesn't last
> long, because the spinlocks quickly accumulate delay counts then punt
> to the scheduler which is unable to cope. The exact reason why this
> is happening to you in exactly this way (I've never seen it) is
> unclear. Also the line between symptom and cause is difficult to
> draw.
>
> unfortunately, in your case spinlock re-scheduling isn't helping. log
> entries like this one:
> 18764 [2012-11-19 10:43:50.124 CST] LOG: JJ spin delay from file
> sinvaladt.c line 512 delay 212, pointer 0x7f514959a394 at character 29
>
> are suggesting major problems. you're dangerously close to a stuck
> spinlock which is lights out for the database.
>
> merlin
>


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Vlad <marchenko(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-19 20:04:32
Message-ID: CAHyXU0ysMqA9T6oprOcGOC-Ojaq78_A7Wec9KTTot=htQJ8V_Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Mon, Nov 19, 2012 at 12:02 PM, Vlad <marchenko(at)gmail(dot)com> wrote:
>
> Some additional observation and food for thoughts. Our app uses connection
> caching (Apache::DBI). By disabling Apache::DBI and forcing client
> re-connection for every (http) request processed I eliminated the stall. The
> user cpu usage jumped (mostly cause prepared sql queries are no longer
> available, and some additional overhead on re-connection), but no single
> case of high-sys-cpu stall.
>
> I can not completely rule out the possibility of some left-overs (unfinished
> transaction?) remain after serving http request, which, in the absence of
> connection caching, are discarded for sure....

well, that is good. note that for pgbouncer transaction mode you would
have had to quit using prepared statements anyway (see the list of
features here: http://wiki.postgresql.org/wiki/PgBouncer). you can
use session mode though, and that would be a good test. the faster speed
could be because of application issues, or maybe just spreading out the
queries reduced contention. hard to say.

well, what happens next is up to you -- we can spend more time chasing
this down if you want, or you can hold back and see how things run for
a while. what is your user load up to in the worst case?

merlin


From: Vlad Marchenko <marchenko(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-20 01:23:24
Message-ID: 50AADB8C.10607@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

OK, so far I settled on excluding connection caching on the app side
(Apache::DBI and prepare_cached) from the equation and adding pgbouncer as a
counter-measure. This seems to stabilize the situation - at least I'm
not able to push the server into a high-sys-cpu stall the way I used to.

I'm still interested in getting to the bottom of the issue vs
alleviating it with a connection pooler, so I might set up a server on
Centos 6.3 (kernel ver 2.6.32) and give it a try.

Alternatively, if anyone has ideas on playing with sysctl parameters
related to the scheduler - please let me know.
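
For reference, the scheduler sysctls most often mentioned in this context
are along these lines (values are illustrative only, not recommendations,
and the exact names vary by kernel version):

    kernel.sched_migration_cost = 5000000
    kernel.sched_autogroup_enabled = 0

set via sysctl -w or /etc/sysctl.conf.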

-- vlad


From: Shaun Thomas <sthomas(at)optionshouse(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Vlad <marchenko(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, <pgsql-general(at)postgresql(dot)org>
Subject: Re: High SYS CPU - need advise
Date: 2012-11-20 15:02:23
Message-ID: 50AB9B7F.1010306@optionshouse.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On 11/16/2012 02:31 PM, Merlin Moncure wrote:

> no single thing really stands out -- contention is all over the place.
> lwlock, pinbuffer, dynahash (especially). I am again suspicious of
> bad scheduler interaction. any chance we can fire up pgbouncer?

Just want to throw it out there, but we've been having really bad luck
with the scheduler recently. But only when we use 8GB (on our 72GB
system) for shared_buffers. Cut that down to 4GB, and everything is fine
and dandy.

I think the kernel devs have added in some overzealous scheduler code on us.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas(at)optionshouse(dot)com



From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: sthomas(at)optionshouse(dot)com
Cc: Vlad <marchenko(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-20 16:03:18
Message-ID: CAHyXU0yYpfMQFNSO9-4qAZUL=AH=m+xZ8Ukk+NZ_KcsDvpXM7Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Tue, Nov 20, 2012 at 9:02 AM, Shaun Thomas <sthomas(at)optionshouse(dot)com> wrote:
> On 11/16/2012 02:31 PM, Merlin Moncure wrote:
>
>> no single thing really stands out -- contention is all over the place.
>> lwlock, pinbuffer, dynahash (especially). I am again suspicious of
>> bad scheduler interaction. any chance we can fire up pgbouncer?
>
>
> Just want to throw it out there, but we've been having really bad luck with
> the scheduler recently. But only when we use 8GB (on our 72GB system) for
> shared_buffers. Cut that down to 4GB, and everything is fine and dandy.
>
> I think the kernel devs have added in some overzealous scheduler code on us.

Shared buffer manipulation changing contention is suggesting you're
running into free list lock issues. How many active backends/cores?

merlin


From: Shaun Thomas <sthomas(at)optionshouse(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Vlad <marchenko(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, <pgsql-general(at)postgresql(dot)org>
Subject: Re: High SYS CPU - need advise
Date: 2012-11-20 16:12:42
Message-ID: 50ABABFA.9070509@optionshouse.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On 11/20/2012 10:03 AM, Merlin Moncure wrote:

> Shared buffer manipulation changing contention is suggesting you're
> running into free list lock issues. How many active backends/cores?

Oh, the reason I wanted to point it out was that we see a lot more than
CPU contention with higher shared_buffer settings. The scheduler starts
aggressively clearing out the file cache *and* freeing memory, driving
load way higher than it should be. That doesn't result in a very high
iowait, but all that page swapping sure isn't free.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas(at)optionshouse(dot)com



From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: sthomas(at)optionshouse(dot)com
Cc: Vlad <marchenko(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-20 16:13:57
Message-ID: CAHyXU0zBGbUKnnE7KPSmNXinZZ47aL1yogOe1uCUduWjkzB06g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Tue, Nov 20, 2012 at 10:12 AM, Shaun Thomas <sthomas(at)optionshouse(dot)com> wrote:
> On 11/20/2012 10:03 AM, Merlin Moncure wrote:
>
>> Shared buffer manipulation changing contention is suggesting you're
>> running into free list lock issues. How many active backends/cores?
>
>
> Oh, the reason I wanted to point it out was that we see a lot more than CPU
> contention with higher shared_buffer settings. The scheduler starts
> aggressively clearing out the file cache *and* freeing memory, driving load
> way higher than it should be. That doesn't result in a very high iowait, but
> all that page swapping sure isn't free.

have you ruled out numa issues?
(http://frosty-postgres.blogspot.com/2012/08/postgresql-numa-and-zone-reclaim-mode.html)

merlin


From: Shaun Thomas <sthomas(at)optionshouse(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Vlad <marchenko(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, <pgsql-general(at)postgresql(dot)org>
Subject: Re: High SYS CPU - need advise
Date: 2012-11-20 16:21:45
Message-ID: 50ABAE19.4050308@optionshouse.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On 11/20/2012 10:13 AM, Merlin Moncure wrote:

> have you ruled out numa issues?
> (http://frosty-postgres.blogspot.com/2012/08/postgresql-numa-and-zone-reclaim-mode.html)

Haha. Yeah. Our zone reclaim mode is off, and node distance is 10 or 20.
Zone reclaim is only enabled by default if the distance is > 20, unless
there's some kernel bug triggering it even when zone reclaim reports as off.
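
(For anyone following along, that's the sort of thing you can confirm with,
e.g.:

    cat /proc/sys/vm/zone_reclaim_mode    # 0 = disabled
    numactl --hardware                    # prints node distances

shown here purely for illustration.)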

I'll also note that my tests with interleave made no difference at all.
At least not with pgbench. There was a small amount of flux with TPS,
but the page swap storms came regardless of NUMA tweaks. The formula
worked like this:

High connection count + high shared_buffers = page swap storm. I'll note
that 8GB -> 4GB immediately stopped the paging everywhere, and the OS
went from using 13GB for active file cache, to 45GB. I can't see how PG
would cause something like that by itself.

I only piped in because it's very adversely affecting our CPU load in a
similarly inexplicable-but-seemingly-the-scheduler way.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas(at)optionshouse(dot)com



From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: sthomas(at)optionshouse(dot)com, Vlad <marchenko(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-20 16:50:48
Message-ID: CAMkU=1z78o4+vXH-dFKb3GZ+1mfEeqQrWBJh-Jpuk03Ha7nOHQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Tue, Nov 20, 2012 at 8:03 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Tue, Nov 20, 2012 at 9:02 AM, Shaun Thomas <sthomas(at)optionshouse(dot)com> wrote:
>> On 11/16/2012 02:31 PM, Merlin Moncure wrote:
>>
>>> no single thing really stands out -- contention is all over the place.
>>> lwlock, pinbuffer, dynahash (especially). I am again suspicious of
>>> bad scheduler interaction. any chance we can fire up pgbouncer?
>>
>>
>> Just want to throw it out there, but we've been having really bad luck with
>> the scheduler recently. But only when we use 8GB (on our 72GB system) for
>> shared_buffers. Cut that down to 4GB, and everything is fine and dandy.
>>
>> I think the kernel devs have added in some overzealous scheduler code on us.
>
> Shared buffer manipulation changing contention is suggesting you're
> running into free list lock issues.

I wouldn't expect so. Increasing shared_buffers should either fix
free list lock contention, or leave it unchanged, not make it worse.
Sounds more like that NUMA/interleave/zone reclaim stuff I could never
wrap my head around.

Cheers,

Jeff


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Vlad <marchenko(at)gmail(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-20 16:53:08
Message-ID: CAHyXU0x3La+g_TP9q6Bs6MDT9as0eYQT9+7fVY22oWWWbxjWdw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Thu, Nov 15, 2012 at 4:29 PM, Alvaro Herrera
<alvherre(at)2ndquadrant(dot)com> wrote:
> Merlin Moncure escribió:
>
>> ok, excellent. reviewing the log, this immediately caught my eye:
>>
>> recvfrom(8, "\27\3\1\0@", 5, 0, NULL, NULL) = 5
>> recvfrom(8, "\327\327\nl\231LD\211\346\243(at)WW\254\244\363C\326\247\341\177\255\263~\327HDv-\3466\353"...,
>> 64, 0, NULL, NULL) = 64
>> select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {0, 3000}) = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {0, 4000}) = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {0, 6000}) = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {0, 7000}) = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {0, 8000}) = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {0, 9000}) = 0 (Timeout)
>> semop(41713721, {{2, 1, 0}}, 1) = 0
>> lseek(295, 0, SEEK_END) = 0
>> lseek(296, 0, SEEK_END) = 8192
>>
>> this is definitely pointing to spinlock issue.
>
> I met Rik van Riel (Linux kernel hacker) recently and we chatted about
> this briefly. He strongly suggested that we should consider using
> futexes on Linux instead of spinlocks; the big advantage being that
> futexes sleep instead of spinning when contention is high. That would
> reduce the system load in this scenario.

By the way, I think you (and Rik) are 100% right. We're slamming select()
where futex has a much smarter wait mechanism. There are reports from
users who have switched to pthreads and report better scalability.
(https://groups.google.com/forum/?fromgroups=#!topic/pgsql.hackers/v8V82gznxs8)
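
As a rough illustration of the difference (Linux-only sketch, not
PostgreSQL code): a futex-based lock parks the waiter in the kernel on the
lock word itself and wakes it explicitly, instead of backing off with the
timed select() calls we see in the straces:

/* minimal, illustrative futex-based lock -- this is NOT how PostgreSQL's
   s_lock.c works today (there the backend spins, then sleeps via timed
   select()/pg_usleep) */
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

static long sys_futex(uint32_t *uaddr, int op, uint32_t val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

void lock_acquire(uint32_t *word)           /* 0 = free, 1 = held */
{
    while (__sync_val_compare_and_swap(word, 0, 1) != 0)
        sys_futex(word, FUTEX_WAIT, 1);     /* sleep until *word changes */
}

void lock_release(uint32_t *word)
{
    __sync_lock_release(word);              /* *word = 0, release barrier */
    sys_futex(word, FUTEX_WAKE, 1);         /* wake one waiter */
}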

merlin


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: sthomas(at)optionshouse(dot)com, Vlad <marchenko(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-20 17:05:13
Message-ID: CAHyXU0zm3sF3xRguH5Rh=r1YnAQD6uaY+aCfi8HFx7C0in8guQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Tue, Nov 20, 2012 at 10:50 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> On Tue, Nov 20, 2012 at 8:03 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>> On Tue, Nov 20, 2012 at 9:02 AM, Shaun Thomas <sthomas(at)optionshouse(dot)com> wrote:
>>> On 11/16/2012 02:31 PM, Merlin Moncure wrote:
>>>
>>>> no single thing really stands out -- contention is all over the place.
>>>> lwlock, pinbuffer, dynahash (especially). I am again suspicious of
>>>> bad scheduler interaction. any chance we can fire up pgbouncer?
>>>
>>>
>>> Just want to throw it out there, but we've been having really bad luck with
>>> the scheduler recently. But only when we use 8GB (on our 72GB system) for
>>> shared_buffers. Cut that down to 4GB, and everything is fine and dandy.
>>>
>>> I think the kernel devs have added in some overzealous scheduler code on us.
>>
>> Shared buffer manipulation changing contention is suggesting you're
>> running into free list lock issues.
>
> I wouldn't expect so. Increasing shared_buffers should either fix
> free list lock contention, or leave it unchanged, not make it worse.

AIUI, that is simply not true (unless you raise it to the point you're
not churning them). I'm looking at StrategyGetBuffer() for non-scan
cases. It locks "BufFreelistLock" then loops the free list, and, if
it finds nothing, engages a clock sweep. Both of those operations are
dependent on the number of buffers being managed and so it's
reasonable to expect some workloads to increase contention with more
buffers.

merlin


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: sthomas(at)optionshouse(dot)com, Vlad <marchenko(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-20 18:16:48
Message-ID: CAMkU=1zn-fW=8pOuMC8CSbtTaOmoB8+FZ2LjtoMcaDuRpBwS1w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Tue, Nov 20, 2012 at 9:05 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Tue, Nov 20, 2012 at 10:50 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>>
>> I wouldn't expect so. Increasing shared_buffers should either fix
>> free list lock contention, or leave it unchanged, not make it worse.
>
> AIUI, that is simply not true (unless you raise it to the point you're
> not churning them). I'm looking at StrategyGetBuffer() for non-scan
> cases. It locks "BufFreelistLock" then loops the free list, and, if
> it finds nothing, engages a clock sweep.

The freelist should never loop. It is written as a loop, but I think
there is currently no code path which ends up with valid buffers being
on the freelist, so that loop will never, or at least rarely, execute
more than once.

> Both of those operations are
> dependent on the number of buffers being managed and so it's
> reasonable to expect some workloads to increase contention with more
> buffers.

The clock sweep can depend on the number of buffers being managed in a
worst-case sense, but I've never seen any evidence (nor analysis) that
this worst case can be obtained in reality on an ongoing basis. By
constructing two pathological workloads which get switched between, I
can get the worst-case to happen, but when it does happen the
consequences are mild compared to the amount of time needed to set up
the necessary transition. In other words, the worst-case can't be
triggered often enough to make a meaningful difference.

Cheers,

Jeff


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: sthomas(at)optionshouse(dot)com, Vlad <marchenko(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-20 20:00:59
Message-ID: CAHyXU0xx+ej155ZbdgRR8hahcd5PQOc5OxUex9=x3tTuRPqvbg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Tue, Nov 20, 2012 at 12:16 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> On Tue, Nov 20, 2012 at 9:05 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>> On Tue, Nov 20, 2012 at 10:50 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>>>
>>> I wouldn't expect so. Increasing shared_buffers should either fix
>>> free list lock contention, or leave it unchanged, not make it worse.
>>
>> AIUI, that is simply not true (unless you raise it to the point you're
>> not churning them). I'm looking at StrategyGetBuffer() for non-scan
>> cases. It locks "BufFreelistLock" then loops the free list, and, if
>> it finds nothing, engages a clock sweep.
>
> The freelist should never loop. It is written as a loop, but I think
> there is currently no code path which ends up with valid buffers being
> on the freelist, so that loop will never, or at least rarely, execute
> more than once.
>
>> Both of those operations are
>> dependent on the number of buffers being managed and so it's
>> reasonable to expect some workloads to increase contention with more
>> buffers.
>
> The clock sweep can depend on the number of buffers begin managed in a
> worst-case sense, but I've never seen any evidence (nor analysis) that
> this worst case can be obtained in reality on an ongoing basis. By
> constructing two pathological workloads which get switched between, I
> can get the worst-case to happen, but when it does happen the
> consequences are mild compared to the amount of time needed to set up
> the necessary transition. In other words, the worse-case can't be
> triggered often enough to make a meaningful difference.

Yeah, good points; but (getting off topic here): there have been
several documented cases of lowering shared buffers improving
performance under contention...the 'worst case' might be happening
more than expected. In particular, what happens when a substantial
percentage of the buffer pool is set with a non-zero usage count?
This seems unlikely, but possible? Take note:

if (buf->refcount == 0)
{
    if (buf->usage_count > 0)
    {
        buf->usage_count--;
        trycounter = NBuffers; /* emphasis */
    }

ISTM time spent here isn't bounded except that as more time is spent
sweeping (more backends are thus waiting and not marking pages) the
usage counts decrease faster until you hit steady state. A smaller
buffer pool would naturally help in that scenario as your usage counts
would drop faster. It strikes me as cavalier to be resetting
trycounter while sitting under the #1 known contention point for
read-only workloads. Shouldn't SBF() work on an advisory basis and try
to force a buffer after N failed usage count attempts?
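
For readers following along, a stripped-down sketch of the loop in question
(illustrative only; names are simplified and this is not the actual
freelist.c code):

/* returns the index of an evictable buffer, or -1 for "no unpinned buffers
   available"; every usage_count decrement resets trycounter, so the sweep
   is bounded only by how quickly usage counts fall to zero */
typedef struct { int refcount; int usage_count; } BufSketch;

static int next_victim = 0;

int sweep_for_victim(BufSketch *buf, int NBuffers)
{
    int trycounter = NBuffers;

    for (;;)
    {
        BufSketch *b = &buf[next_victim];

        next_victim = (next_victim + 1) % NBuffers;

        if (b->refcount == 0)
        {
            if (b->usage_count > 0)
            {
                b->usage_count--;
                trycounter = NBuffers;      /* the reset in question */
            }
            else
                return (int) (b - buf);     /* usable victim found */
        }
        else if (--trycounter == 0)
            return -1;
    }
}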

merlin


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: sthomas(at)optionshouse(dot)com, Vlad <marchenko(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-20 22:08:17
Message-ID: CAMkU=1zvjECKSa18rHUQ1RJ98zRwk_xeAtTuwO8Ak0YaeB1B=Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Tue, Nov 20, 2012 at 12:00 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Tue, Nov 20, 2012 at 12:16 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>>
>> The freelist should never loop. It is written as a loop, but I think
>> there is currently no code path which ends up with valid buffers being
>> on the freelist, so that loop will never, or at least rarely, execute
>> more than once.
>>
>>> Both of those operations are
>>> dependent on the number of buffers being managed and so it's
>>> reasonable to expect some workloads to increase contention with more
>>> buffers.
>>
>> The clock sweep can depend on the number of buffers begin managed in a
>> worst-case sense, but I've never seen any evidence (nor analysis) that
>> this worst case can be obtained in reality on an ongoing basis. By
>> constructing two pathological workloads which get switched between, I
>> can get the worst-case to happen, but when it does happen the
>> consequences are mild compared to the amount of time needed to set up
>> the necessary transition. In other words, the worse-case can't be
>> triggered often enough to make a meaningful difference.
>
> Yeah, good points; but (getting off topic here) : there have been
> several documented cases of lowering shared buffers improving
> performance under contention...the 'worst case' might be happening
> more than expected.

The ones that I am aware of (mostly Greg Smith's case studies) have
been for write-intensive workloads and are related to
writes/fsyncs getting gummed up.

Shaun Thomas reports one that is (I assume) not read intensive, but
his diagnosis is that this is a kernel bug where a larger
shared_buffers for no good reason causes the kernel to kill off its
page cache. From the kernel's perspective, the freelist lock doesn't
look any different from any other lwlock, so I doubt that issue is
related to the freelist lock.

> In particular, what happens when a substantial
> percentage of the buffer pool is set with a non-zero usage count?

The current clock sweep algorithm is an extraordinary usagecount
decrementing machine. From what I know, the only way to get much more
than half of the buffers to have a non-zero usage count is for the
clock-sweep to rarely run (in which case, it is hard for it to be the
bottleneck if it rarely runs), or for most of the buffer-cache to be
pinned simultaneously.

> This seems unlikely, but possible? Take note:
>
> if (buf->refcount == 0)
> {
> if (buf->usage_count > 0)
> {
> buf->usage_count--;
> trycounter = NBuffers; /* emphasis *./
> }
>
> ISTM time spent here isn't bounded except that as more time is spent
> sweeping (more backends are thus waiting and not marking pages) the
> usage counts decrease faster until you hit steady state.

But that is a one time thing. Once you hit the steady state, how do
you get away from it again, such that a large amount of work is needed
again?

> Smaller
> buffer pool naturally would help in that scenario as your usage counts
> would drop faster.

They would drop at the same rate in absolute numbers, barring the
smaller buffer_cache fitting entirely in the on-board CPU cache.

They would drop faster in percentage terms, but they would also
increase faster in percentage terms once a candidate is found and a
new page read into it.

> It strikes me as cavalier to be resetting
> trycounter while sitting under the #1 known contention point for read
> only workloads.

The only use for the trycounter is to know when to ERROR out with "no
unpinned buffers available", so not resetting that seems entirely
wrong.

I would contest "the #1 known contention point" claim. We know that
the freelist lock is a point of contention under certain conditions,
but we (or at least I) also know that it is the mere acquisition of
this lock, and not the work done while it is held, that is important.

If I add a spurious
"LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE); LWLockRelease(BufFreelistLock);"
to each execution of StrategyGetBuffer, then contention kicks in twice as fast.

But if I instead hack the clock sweep to run twice as far (ignore the
first eligible buffer it finds, and go find another one) but all under
the cover of a single BufFreelistLock acquisition, there was no
meaningful increase in contention.

This was all on a 4 socket x 2 core/socket opteron machine which I no
longer have access to. Using a more modern 8 core on a single socket,
I can't get it to collapse on BufFreelistLock at all, presumably
because the cache coherence mechanisms are so much faster.

> Shouldn't SBF() work on advisory basis and try to
> force a buffer after N failed usage count attempts?

I believe Simon tried that a couple commit-fests ago, and no one could
show that it made a difference.

Cheers,

Jeff


From: Shaun Thomas <sthomas(at)optionshouse(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, Vlad <marchenko(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, <pgsql-general(at)postgresql(dot)org>
Subject: Re: High SYS CPU - need advise
Date: 2012-11-20 22:26:49
Message-ID: 50AC03A9.70604@optionshouse.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On 11/20/2012 04:08 PM, Jeff Janes wrote:

> Shaun Thomas reports one that is (I assume) not read intensive, but
> his diagnosis is that this is a kernel bug where a larger
> shared_buffers for no good reason causes the kernel to kill off its
> page cache.

We're actually very read intensive. According to pg_stat_statements, we
regularly top out at 42k queries per second, and pg_stat_database says
we're pushing 7k TPS.

But I'm still sure this is a kernel bug. Moving from 4GB to 6GB or 8GB
causes the kernel to cut the active page cache in half, in addition to
freeing 1/4 of RAM to just sit around doing nothing. That in turn causes
kswapd to work constantly, while our IO drivers work to undo the damage.
It's a positive feedback loop in which I can reliably drive the load up to
800+ on an 800-client pgbench with two threads, all while having 0% CPU
free.

Make that 4GB, and not only does the problem completely disappear, but
the load settles down to around 9, and the machine becomes about 60%
idle. Something in there is fantastically broken, but I can't point a
finger at what.

I was just piping in because, in the absence of an obvious PG-related
culprit, the problem could be the OS itself. It certainly was in our case.

That, or PG has a memory leak that only appears at > 4GB of shared buffers.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas(at)optionshouse(dot)com



From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: sthomas(at)optionshouse(dot)com
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, Vlad <marchenko(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-20 22:35:46
Message-ID: CAMkU=1xzTDgaswki6Ymp=8KXhhzcj7y1vk4OPew3bhALdieQNg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Tue, Nov 20, 2012 at 2:26 PM, Shaun Thomas <sthomas(at)optionshouse(dot)com> wrote:
> On 11/20/2012 04:08 PM, Jeff Janes wrote:
>
>> Shaun Thomas reports one that is (I assume) not read intensive, but
>> his diagnosis is that this is a kernel bug where a larger
>> shared_buffers for no good reason causes the kernel to kill off its
>> page cache.
>
>
> We're actually very read intensive.

Sorry, that is what I meant. I wrote "not write intensive", then
decided it would be clearer to change "not write" to "read" and then
forgot to remove the "not". I hate it when that happens.

Atomic update commit failure in the meatware :)

Cheers,

Jeff


From: Shaun Thomas <sthomas(at)optionshouse(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, Vlad <marchenko(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, <pgsql-general(at)postgresql(dot)org>
Subject: Re: High SYS CPU - need advise
Date: 2012-11-20 22:41:21
Message-ID: 50AC0711.6090201@optionshouse.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On 11/20/2012 04:35 PM, Jeff Janes wrote:

> Atomic update commit failure in the meatware :)

Ha.

What's actually funny is that one of the affected machines started
*swapping* earlier today. With 15GB free, and 12GB of inactive cache,
and vm.swappiness set to 0, it somehow decided there was enough memory
pressure to push 1.5GB of *something* into swap.

Frankly, the kernel we're using is clearly deranged. There's really no
other way to explain this. But we were blaming various elements of PG
for several weeks following our upgrade. After further examination, our
query profile isn't any different, and the load spikes were effectively
random.

All I can say is... be wary of the 3.2 kernel. It's... unique.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas(at)optionshouse(dot)com



From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: sthomas(at)optionshouse(dot)com, Vlad <marchenko(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-20 22:52:15
Message-ID: CAHyXU0yBshNe1OzVEDVNsMS93fsFQzAautirQ-wE_omtB0RH5g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Tue, Nov 20, 2012 at 4:08 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>> It strikes me as cavalier to be resetting
>> trycounter while sitting under the #1 known contention point for read
>> only workloads.
>
> The only use for the trycounter is to know when to ERROR out with "no
> unpinned buffers available", so not resetting that seems entirely
> wrong.
>
> I would contest "the #1 known contention point" claim. We know that
> the freelist lock is a point of contention under certain conditions,
> but we (or at least I) also know that it is the mere acquisition of
> this lock, and not the work done while it is held, that is important.

Well, I'm speculating (without evidence) that it might not in fact
always be true. It certainly is hard to demonstrate with synthetic
testing though. I guess the first step would be to whip up a
diagnostic patch to confirm malfeasance in suspicious cases.

merlin


From: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
To: sthomas(at)optionshouse(dot)com
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Merlin Moncure <mmoncure(at)gmail(dot)com>, Vlad <marchenko(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-20 23:06:48
Message-ID: 50AC0D08.4030801@archidevsys.co.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On 21/11/12 11:41, Shaun Thomas wrote:
> On 11/20/2012 04:35 PM, Jeff Janes wrote:
>
>> Atomic update commit failure in the meatware :)
>
> Ha.
>
> What's actually funny is that one of the affected machines started
> *swapping* earlier today. With 15GB free, and 12GB of inactive cache,
> and vm.swappiness set to 0, it somehow decided there was enough memory
> pressure to push 1.5GB of *something* into swap.
>
> Frankly, the kernel we're using is clearly deranged. There's really no
> other way to explain this. But we were blaming various elements of PG
> for several weeks following our upgrade. After further examination,
> our query profile isn't any different, and the load spikes were
> effectively random.
>
> All I can say is... be wary of the 3.2 kernel. It's... unique.
>
You'll just have to upgrade to Microsoft!
<Ducks, and runs away _VERY_ quickly!>

Could you not install a later kernel?

Cheers,
Gavin


From: Vlad Marchenko <marchenko(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Subject: Re: High SYS CPU - need advise
Date: 2012-11-21 15:29:01
Message-ID: 50ACF33D.20609@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

update on my problem: despite pgbouncer, the problem still occurs on my
end.

Also, an interesting observation - I ran several tests with pgbench, using
queries that I think are prone to trigger the high-sys-cpu stall. What I
noticed is that when pgbench is started in prepared mode, the system
behaves fine during the stress test (high user cpu - 85-90%, low sys cpu -
5-7%) with high TPS. Though when I used non-prepared modes, the sys cpu
portion jumps to 40% (and tps drops dramatically as well, but this is
understandable). The test queries are pretty long (10kb+), with a couple
of outer joins across 1000-record tables with indexes.

Maybe we are looking in the wrong place and the issue is somewhere within
the planner/parser? Is there some extensive locking used in there?

Another observation - it's harder to trigger a high-sys-cpu stall on a
freshly restarted postgresql. Though if it has been running for a while,
then it's much easier to do.

-- vlad

On 11/19/12 8:33 AM, Merlin Moncure wrote:
> On Sun, Nov 18, 2012 at 4:24 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>> On Fri, Nov 16, 2012 at 12:13 PM, Vlad <marchenko(at)gmail(dot)com> wrote:
>>> ok, I've applied that patch and ran. The stall started around 13:50:45...50
>>> and lasted until the end
>>>
>>> https://dl.dropbox.com/u/109778/postgresql-2012-11-16_134904-stripped.log
>> That isn't as much log as I expected. But I guess only the tip of the
>> iceberg gets logged, it could be that most spinlocks are contended but
>> then get acquired just before the sleep (and log) would have happened.
>>
>> I'm not sure how to figure out just how much spinning is going on
>> below the surface, but my gut feeling is there isn't enough of it to
>> explain the amount of slowdown you are seeing (and it would probably
>> be accounted for as user time rather than system time)
>
> yes. classic spinlock contention symptoms are high user cpu and almost
> work getting done. something else is going on here?
>
>> Maybe we could whip up a dummy standalone program that uses PG's
>> spinlock code in a intentionally contended way and see how your
>> system reports resource usage for that. Maybe your system reports
>> fighting for cache lines or whatever is going on behind the scenes as
>> system time rather than user time, or something.
> I don't think this is the case. It's looking more and more like
> scheduler issues, although it's unclear if that is due to some kind of
> regression or just symptoms of backends piling up. I'm also starting
> to wonder if the spinlock contention we are seeing (which according to
> your patch really isn't all that much really) is just symptomatic of
> scheduler issues.
>
>> It would be a little easier to figure out what was going on there if
>> the log included the pid (%p) so we can see if it is the same process
>> doing a bunch of consecutive sleeps, or a bunch of different
>> processes. Also, logging the millisecond (%m rather than %t) would be
>> helpful.
>>
>> But the way that the point of contention jumps around a lot points to
>> a scheduler issue rather than a postgres-specific issue. A backend
>> gets pre-empted while holding a spinlock, and then the kernel is
>> resistant to rescheduling it even though all other processes are piled
>> up behind it. But again, that theory would lead to more log lines
>> than we see, I would think.
> yeah. I asked for some longer strace logs but then everyone went home
> for the weekend.
>
>> Anyway, pgbouncer might help by ensuring that a connection gets
>> pre-empted by pgbouncer at a point where it has no locks, rather than
>> by the kernel at a point where it does hold locks.
> yes -- I think pgbouncer is the key here because it's going to tell us
> if the scheduler is the culprit here. If you look carefully at the
> strace logs, you'll see that system call times become unpredictable as
> they are getting scheduled out...particularly yielding calls like
> select(). So we need to determine if this is normal high load
> behavior or if there is an underlying mechanic.
>
>> Any chance you could try an older kernel and see if that reduces the problem?
> This is one thing to try. Personally I think pgbouncer/transaction
> mode is low hanging fruit if it 'drops in' to OP's app. If problem
> goes away this will pretty much point the finger at Linux AFAICT (in
> which case we can start messing around with the kernel).
>
> merlin


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Vlad Marchenko <marchenko(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Subject: Re: High SYS CPU - need advise
Date: 2012-11-21 15:49:12
Message-ID: CAHyXU0xxp_sKXOzVwWGiQYuatO0Dxp5PwE2ghVz6gLXLtf+g6w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Wed, Nov 21, 2012 at 9:29 AM, Vlad Marchenko <marchenko(at)gmail(dot)com> wrote:
> update on my problem: despite pgbouncer, the problem still occures on my
> end.
>
> Also, interesting observation - I ran several tests with pgbench, using
> queries that I think are prone to trigger high-sys-cpu-stall. What I noticed
> is when pgbench is started with prepared mode, the system behaves fine
> during stress-test (high user cpu - 85-90%, low sys cpu - 5-7%), high TPS.
> Though when I used non-prepared modes, the sys cpu portion jumps to 40% (and
> tps drops dramatically as well, but this is understandable). The test
> queries are pretty long (10kb+), with couple of outer joins across
> 1000-record tables with indexes.
>
> Maybe, we are looking in a wrong place and the issue is somewhere within
> planer/parser? Is there some extensive locking used in there?
>
> Another observation - it's harder to trigger high-sys-cpu stall on a freshly
> restarted postgresql. Though if it was running for a while, then it's much
> easier to do.

what pgbouncer mode, and how large is your pool.

merlin


From: Vlad <marchenko(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Subject: Re: High SYS CPU - need advise
Date: 2012-11-21 16:14:11
Message-ID: CAKeSUqVnRJ6bZoMqBSY=q3gax9dj4Zw7drRghTJFChm-NT6VJg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

> what pgbouncer mode, and how large is your pool.
>
>
'-M prepared' produces normal results, while '-M simple' results in 40% sys
cpu. '-M extended' is somewhere in between.
I'm running it with 60 clients, 2 threads.

-- Vlad


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Vlad <marchenko(at)gmail(dot)com>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-21 16:18:44
Message-ID: CAMkU=1x2JKb3ZW5JRn96uNECiqdW8CYCZA86UznXrSep__oMDQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Wed, Nov 21, 2012 at 8:14 AM, Vlad <marchenko(at)gmail(dot)com> wrote:
>
>
> '-M prepared' produces normal results, while '-M simple' results in 40% sys
> cpu. '-M extended' is somewhere in between.
> I'm running it as 60 clients, 2 threads.

2 threads is pretty low for 60 clients. What happens if you increase
-j to either 30 or 60? (What I'm wondering here is if perhaps in this
case it is pgbench which is triggering the scheduling issue)

Also, what happens if you channel pgbench through pgbouncer?
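
Something like this, for example (database name and script file are
placeholders; 6432 assumes pgbouncer's default listen port):

    pgbench -n -M simple -f stall_queries.sql -c 60 -j 60 -T 300 -p 5432 testdb   # direct
    pgbench -n -M simple -f stall_queries.sql -c 60 -j 60 -T 300 -p 6432 testdb   # via pgbouncer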

Cheers,

Jeff


From: Vlad <marchenko(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-21 16:34:56
Message-ID: CAKeSUqV4V5RXHz3B-EthcresrxLtQzhAyTPobOcCPdP6k6MEhw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

nothing changes if I increase the number of threads.
pgbouncer doesn't change much.

also, I think the nature of the high sys cpu during a stall and when I run
pgbench is different.
During pgbench it's constantly at 30-40%, while during a stall it sits at a
low 5-15% and then spikes to 90% after a while, for no apparent reason (i.e.
no increased load or anything), then goes down in a few secs, or stays
until postgres is restarted.

at this point the only thing that does help on the app side is forcing
frequent re-connection to postgresql (i.e. not letting a postmaster live too
long)

-- Vlad

On Wed, Nov 21, 2012 at 10:18 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:

> On Wed, Nov 21, 2012 at 8:14 AM, Vlad <marchenko(at)gmail(dot)com> wrote:
> >
> >
> > '-M prepared' produces normal results, while '-M simple' results in 40%
> sys
> > cpu. '-M extended' is somewhere in between.
> > I'm running it as 60 clients, 2 threads.
>
> 2 threads is pretty low for 60 clients. What happens if you increase
> -j to either 30 or 60? (What I'm wondering here is if perhaps in this
> case it is pgbench which is triggering the scheduling issue)
>
> Also, what happens if you channel pgbench through pgbouncer?
>
> Cheers,
>
> Jeff
>


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Vlad Marchenko <marchenko(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org, Merlin Moncure <mmoncure(at)gmail(dot)com>
Subject: Re: High SYS CPU - need advise
Date: 2012-11-21 16:43:55
Message-ID: CAMkU=1zso-4cZS4-ZJ+FVac6Z2Hr05Qp4uMWh-rQj1SL=TJFOQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Wed, Nov 21, 2012 at 7:29 AM, Vlad Marchenko <marchenko(at)gmail(dot)com> wrote:

> update on my problem: despite pgbouncer, the problem still occures on my
> end.

As Merlin asked, how big is the pool? Maybe you are using a large
enough pool so as to defeat the purpose of restricting the number of
connections.

> Also, interesting observation - I ran several tests with pgbench, using
> queries that I think are prone to trigger high-sys-cpu-stall. What I noticed
> is when pgbench is started with prepared mode, the system behaves fine
> during stress-test (high user cpu - 85-90%, low sys cpu - 5-7%), high TPS.
> Though when I used non-prepared modes, the sys cpu portion jumps to 40% (and
> tps drops dramatically as well, but this is understandable). The test
> queries are pretty long (10kb+), with couple of outer joins across
> 1000-record tables with indexes.

Could you sanitize the queries (and some statements to generate dummy
data) enough to share?

>
> Maybe, we are looking in a wrong place and the issue is somewhere within
> planer/parser? Is there some extensive locking used in there?

I don't think the locking is particularly extensive, but it could be
enough extra to drive something over the edge.

But it would be the same nature of locking as elsewhere (spinlocks and
lwlocks), so it doesn't really change the nature of the problem, which
is still "Why do these user-space locks turn into high SYS cpu?"

> Another observation - it's harder to trigger high-sys-cpu stall on a freshly
> restarted postgresql. Though if it was running for a while, then it's much
> easier to do.

Maybe the long running time has built up enough resource usage to
cause the kernel scheduler to get into a snit, so it decides to
preempt the process while it holds a spinlock, and then refuses to run
it again for a while.

Cheers,

Jeff


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Vlad Marchenko <marchenko(at)gmail(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-21 16:45:48
Message-ID: CAHyXU0xuuKsnBkQjKP7Qw4Kc8QzYsCnkwr3ER9Esp0GNgzaR=Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Wed, Nov 21, 2012 at 10:43 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> On Wed, Nov 21, 2012 at 7:29 AM, Vlad Marchenko <marchenko(at)gmail(dot)com> wrote:
>
>> update on my problem: despite pgbouncer, the problem still occurs on my
>> end.
>
> As Merlin asked, how big is the pool? Maybe you are using a large
> enough pool so as to defeat the purpose of restricting the number of
> connections.

also, what mode.

merlin


From: Vlad <marchenko(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-21 17:05:05
Message-ID: CAKeSUqWdscuKRwX-ZeK=LQQ-EudPdZLCGqAB6LV4A7DuQjG-UQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

It's session mode and the pool size is 1200 (because I need to guarantee that
in the worst case we have enough slots for all possible clients). However,
even at the times preceding a high-sys-cpu stall, the number of postmasters is
around 15-20. When a stall happens, it starts to rise, but that's the result of
the DB not answering in a timely manner.

As far as data goes - yes, I can probably clean up the names and provide some
data with the same queries. Note, though, that we are not able to cause a stall
with those queries & pgbench alone; we can only make the situation worse, and a
stall is much more likely when they overlap with regular queries coming from
our app.

-- Vlad

On Wed, Nov 21, 2012 at 10:45 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:

> On Wed, Nov 21, 2012 at 10:43 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> > On Wed, Nov 21, 2012 at 7:29 AM, Vlad Marchenko <marchenko(at)gmail(dot)com>
> wrote:
> >
> >> update on my problem: despite pgbouncer, the problem still occurs on my
> >> end.
> >
> > As Merlin asked, how big is the pool? Maybe you are using a large
> > enough pool so as to defeat the purpose of restricting the number of
> > connections.
>
> also, what mode.
>
> merlin
>


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Vlad <marchenko(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-21 17:09:32
Message-ID: CAHyXU0xxEd+N26pc+vYiV5sTk_56bF=m3xfLOF+KsJA4su2f-A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Wed, Nov 21, 2012 at 11:05 AM, Vlad <marchenko(at)gmail(dot)com> wrote:
> It's session mode and the pool size is 1200 (because I need to guarantee that
> in the worst case we have enough slots for all possible clients). However,
> even at the times preceding a high-sys-cpu stall, the number of postmasters is
> around 15-20. When a stall happens, it starts to rise, but that's the result of
> the DB not answering in a timely manner.

That is completely defeating the point. The mode should be transaction
mode, with a pool size somewhere between 1-2x the number of cores.
Note that the # of connected clients does not have to equal the size of
the pool -- that's the whole point of pgbouncer -- it forces the
queries through a fixed number of backend processes.
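
As an illustration only, a minimal pgbouncer.ini along those lines might look
like the sketch below (database name, ports and exact sizes are assumptions,
not values taken from this thread):

  [databases]
  ; clients connect to pgbouncer and share a small pool of real backends
  mydb = host=127.0.0.1 port=5432 dbname=mydb

  [pgbouncer]
  listen_port = 6432
  pool_mode = transaction
  ; 1-2x the number of cores on an 8-core box
  default_pool_size = 16
  ; many clients may still connect; they queue for the small server pool
  max_client_conn = 1200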

merlin


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Vlad <marchenko(at)gmail(dot)com>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-21 17:18:46
Message-ID: CAMkU=1yMz6Q5NASGg+trOy_r5yj27Qk2o46QjUvrKHw8k5G0KA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Wed, Nov 21, 2012 at 9:05 AM, Vlad <marchenko(at)gmail(dot)com> wrote:
> It's session mode and the pool size is 1200 (because I need to guarantee that
> in the worst case we have enough slots for all possible clients).

Wouldn't the clients prefer to wait 100ms to get a connection if that
means their query finishes in 100ms, rather than wait 0ms to get a
connection but have the query take 1000ms to finish?

> However, even
> at the times preceding a high-sys-cpu stall, the number of postmasters is
> around 15-20. When a stall happens, it starts to rise, but that's the result of
> the DB not answering in a timely manner.

That is like the "beatings will continue until morale improves" joke.
The theory about why pgbouncer would be beneficial is that it prevents
hundreds of connections from clogging up the server. By allowing
hundreds of connections to be made when the server gets into distress,
you are increasing that distress and making it harder to recover from
it. That is the theory, anyway.

Cheers,

Jeff


From: Vlad <marchenko(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-21 17:56:36
Message-ID: CAKeSUqX4yxbJ1NCK4DDR-G6mF1FvUWf8q742dP__osmB3uU7vg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Ok, understood.
I need to give some more thought to whether it's possible for us to switch to
transaction mode from the app's standpoint.

If yes, does setting the pool size to 20 (for 8 cores) sound OK?

-- Vlad


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Vlad <marchenko(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-21 18:04:06
Message-ID: CAHyXU0wqdp1q8WK5k4s6nxTLUBZaLYtcNUrM40hWrqSzNMfupQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Wed, Nov 21, 2012 at 11:56 AM, Vlad <marchenko(at)gmail(dot)com> wrote:
> Ok, understood.
> I need to give some more thought to whether it's possible for us to switch to
> transaction mode from the app's standpoint.
>
> If yes, does setting the pool size to 20 (for 8 cores) sound OK?

If it were me, I would start with exactly 8 and tune from there.

Don't forget: each database role gets its own pool.

merlin


From: Vlad <marchenko(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-21 18:17:08
Message-ID: CAKeSUqW6o5ZZfxh1C8qSvNKYQu4tNC3Zfvh_7LQpSdKCdihqvg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

It turned out we can't use transaction mode, because prepared statements
are used a lot within the code while processing a single HTTP request.

Also, I can't 100% rule out that long-running (statistical) queries will be
launched (even though such requests should not come to this database); those
would occupy a connection for a longer time without creating any race
condition... So having the pool size at 8 may be too slim.

It's a single database.

-- Vlad

On Wed, Nov 21, 2012 at 12:04 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:

> On Wed, Nov 21, 2012 at 11:56 AM, Vlad <marchenko(at)gmail(dot)com> wrote:
> > Ok, understood.
> > I need to give some more thought to whether it's possible for us to switch to
> > transaction mode from the app's standpoint.
> >
> > If yes, does setting the pool size to 20 (for 8 cores) sound OK?
>
> If it were me, I would start with exactly 8 and tune from there.
>
> Don't forget: each database role gets its own pool.
>
> merlin
>


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Vlad <marchenko(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-21 20:17:35
Message-ID: CAHyXU0wWu9Ku0_Cqnzef-AG3=_kWQqbPTSZQdgdaVgMYGiEHmg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Wed, Nov 21, 2012 at 12:17 PM, Vlad <marchenko(at)gmail(dot)com> wrote:
> It turned out we can't use transaction mode, because prepared statements
> are used a lot within the code while processing a single HTTP request.

prepared statements can be fudged within some constraints. if prepared
statements are explicitly named via PREPARE, you can simply prepare
them all on server connection via the connect_query setting and disable
the manual preparation. you then change the server_reset_query so
that they are not discarded. some basic experimentation might confirm
whether this is a viable strategy. automatic protocol-level statements can
be an issue though.
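
A rough sketch of how that could look in pgbouncer.ini (the statement name and
query below are made up for illustration; the application's real named
statements would go here):

  [pgbouncer]
  ; prepare the named statements once on every fresh server connection
  connect_query = PREPARE get_user (int) AS SELECT * FROM users WHERE id = $1
  ; the default reset query is DISCARD ALL, which would throw the prepared
  ; statements away between clients; leaving it empty keeps them around
  server_reset_query =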

> Also, I can't 100% rule out that long-running (statistical) queries will be
> launched (even though such requests should not come to this database); those
> would occupy a connection for a longer time without creating any race
> condition... So having the pool size at 8 may be too slim.

there are a number of simple tricks to deal with this:
1) move long running queries to their own pool (by changing the login user
or connection string) -- see the sketch below
2) bypass pgbouncer in those cases
3) increase pool size
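
For trick 1), a hypothetical [databases] section could expose the same database
under two names with separate pools (names and sizes below are illustrative
only):

  [databases]
  ; regular app traffic goes through a small, tightly sized pool
  appdb         = host=127.0.0.1 dbname=appdb pool_size=8
  ; long-running statistical queries connect to this alias instead
  appdb_reports = host=127.0.0.1 dbname=appdb pool_size=4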

merlin


From: Vlad <marchenko(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-21 20:41:36
Message-ID: CAKeSUqVZ4BPLqPuF3xYuAQaZPPC_X4VUCQpADsguq8+TYC_JwA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Merlin,

On Wed, Nov 21, 2012 at 2:17 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:

> On Wed, Nov 21, 2012 at 12:17 PM, Vlad <marchenko(at)gmail(dot)com> wrote:
> > It turned out we can't use transaction mode, because prepared statements
> > are used a lot within the code while processing a single HTTP request.
>
> prepared statements can be fudged within some constraints. if prepared
> statements are explicitly named via PREPARE, you can simply prepare
> them all on server connection via the connect_query setting and disable
> the manual preparation. you then change the server_reset_query so
> that they are not discarded. some basic experimentation might confirm
> whether this is a viable strategy. automatic protocol-level statements can
> be an issue though.
>

We have 350k+ lines of code in our app, so this is not quite as feasible
as I'd wish.

> > Also, I can't 100% rule out that long-running (statistical) queries will be
> > launched (even though such requests should not come to this database); those
> > would occupy a connection for a longer time without creating any race
> > condition... So having the pool size at 8 may be too slim.
>
> there are a number of simple tricks to deal with this:
> 1) move long running queries to their own pool (by changing login user
> or connection string)
> 2) bypass pgbouncer in those cases
> 3) increase pool size
>
>
It's pretty much already set up so that long running queries should not hit
the same DB cluster as the queries with (potentially) high connection/query
rates, but I still can't rule out that some long-running queries will be
issued via pgbouncer.

Either way - it seems that the combination of pool size = 200 / pool mode
= session / server_lifetime = 30 keeps things stable for now.
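
Expressed as pgbouncer.ini settings, that combination would be roughly the
sketch below (paraphrased from the values above, not copied from an actual
config file):

  [pgbouncer]
  pool_mode = session
  default_pool_size = 200
  ; recycle server connections after 30 seconds, similar in spirit to the
  ; forced frequent re-connection that helped earlier in the thread
  server_lifetime = 30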

I'm planning to reproduce my case on a 2.6.x kernel, but it will be a while
before I have a chance to do that.

Thanks.

-- vlad