Re: spinlocks storm bug

Lists: pgsql-hackers
From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: spinlocks storm bug
Date: 2013-12-06 06:22:27
Message-ID: CAFj8pRDDa40eiP4UTrCm=+Bdt0xbWF7qC8T_3y0dFqYuZk2YAg@mail.gmail.com

Hello

I have a report of a critical bug (the database becomes temporarily
unavailable; a restart is necessary).

The customer uses:

PostgreSQL 9.2.4
24 CPUs
140G RAM
SSD storage for everything

The database server is under high load. There are a few databases with a very
high number of similar simple statements. When the application produces higher
load, the number of active connections increases to about 300-600.

At some point the following event starts: there is minimal IO and all CPUs
are at 100%.

Perf result shows:

354246.00  93.0%  s_lock                    /usr/lib/postgresql/9.2/bin/postgres
 10503.00   2.8%  LWLockRelease             /usr/lib/postgresql/9.2/bin/postgres
  8802.00   2.3%  LWLockAcquire             /usr/lib/postgresql/9.2/bin/postgres
   828.00   0.2%  _raw_spin_lock            [kernel.kallsyms]
   559.00   0.1%  _raw_spin_lock_irqsave    [kernel.kallsyms]
   340.00   0.1%  switch_mm                 [kernel.kallsyms]
   305.00   0.1%  poll_schedule_timeout     [kernel.kallsyms]
   274.00   0.1%  native_write_msr_safe     [kernel.kallsyms]
   257.00   0.1%  _raw_spin_lock_irq        [kernel.kallsyms]
   238.00   0.1%  apic_timer_interrupt      [kernel.kallsyms]
   236.00   0.1%  __schedule                [kernel.kallsyms]
   213.00   0.1%  HeapTupleSatisfiesMVCC
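
The listing above can be totalled per binary to show how lopsided the profile
is. This is only a minimal Python sketch: it assumes each symbol and its
binary are joined onto one line (samples, percent, symbol, binary), it uses
just the first five rows, and the name share_by_binary is made up for
illustration.

```python
import re
from collections import defaultdict

# First rows of the perf listing, with symbol and binary joined on one line.
PERF_LINES = """\
354246.00 93.0% s_lock /usr/lib/postgresql/9.2/bin/postgres
10503.00 2.8% LWLockRelease /usr/lib/postgresql/9.2/bin/postgres
8802.00 2.3% LWLockAcquire /usr/lib/postgresql/9.2/bin/postgres
828.00 0.2% _raw_spin_lock [kernel.kallsyms]
559.00 0.1% _raw_spin_lock_irqsave [kernel.kallsyms]
"""

# samples, percent, symbol, binary
ROW = re.compile(r"^\s*([\d.]+)\s+([\d.]+)%\s+(\S+)\s+(\S+)\s*$")


def share_by_binary(text):
    """Sum the percent column per binary; rows that do not match are skipped."""
    totals = defaultdict(float)
    for line in text.splitlines():
        m = ROW.match(line)
        if m is None:
            continue
        _samples, pct, _symbol, binary = m.groups()
        totals[binary] += float(pct)
    # Round to one decimal, the precision of the input.
    return {binary: round(pct, 1) for binary, pct in totals.items()}


if __name__ == "__main__":
    for binary, pct in share_by_binary(PERF_LINES).items():
        print(f"{pct:5.1f}%  {binary}")
```

For these rows, the three lock-related symbols in the postgres binary account
for nearly all samples, while the kernel spinlock entries stay well below 1%.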

We are trying to limit the number of connections to 300, but I am not sure
whether this issue is related to some Postgres bug.

Regards

Pavel


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: spinlocks storm bug
Date: 2013-12-06 09:56:29
Message-ID: 20131206095629.GI7814@awork2.anarazel.de

On 2013-12-06 07:22:27 +0100, Pavel Stehule wrote:
> I have a report of critical bug (database is temporary unavailability ..
> restart is necessary).

> PostgreSQL 9.2.4,
> 24 CPU
> 140G RAM
> SSD disc for all
>
>
> Database is under high load. There is a few databases with very high number
> of similar simple statements. When application produce higher load, then
> number of active connection is increased to 300-600 about.
>
> In some moment starts described event - there is a minimal IO, all CPU are
> on 100%.
>
> Perf result shows:
> 354246.00 93.0% s_lock
> /usr/lib/postgresql/9.2/bin/postgres
> 10503.00 2.8% LWLockRelease
> /usr/lib/postgresql/9.2/bin/postgres
> 8802.00 2.3% LWLockAcquire

> We try to limit a connection to 300, but I am not sure if this issue is not
> related to some Postgres bug.

We've seen this issue repeatedly now. It never turned out to be a bug,
just limitations in postgres' scalability. If you can, I'd
strongly suggest trying to get postgres binaries compiled with
-fno-omit-frame-pointer installed to check which locks are actually
contended.
My bet is BufMappingLock.
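
A rough way to see why a partitioned lock like this could get hot at 300-600
connections is the Python sketch below. It is not PostgreSQL code. It assumes
16 buffer-mapping partitions (the value of NUM_BUFFER_PARTITIONS in 9.2, as
far as is known here) and replaces the real buffer-tag hash with a uniform
random choice; busiest_partition is a hypothetical helper.

```python
import random
from collections import Counter

# Assumption: number of buffer-mapping lock partitions in PostgreSQL 9.2.
NUM_PARTITIONS = 16


def busiest_partition(backends, partitions=NUM_PARTITIONS, seed=0):
    """Return how many of `backends` simultaneous lookups hit the fullest partition.

    Each backend picks a partition uniformly at random, a stand-in for
    hashing a buffer tag. Real workloads with hot pages would be more skewed.
    """
    rng = random.Random(seed)
    hits = Counter(rng.randrange(partitions) for _ in range(backends))
    return max(hits.values())


if __name__ == "__main__":
    for backends in (24, 300, 600):
        print(backends, "backends ->", busiest_partition(backends),
              "on the busiest partition")
```

Even with a perfectly even spread, 600 backends over 16 partitions put at
least 38 backends on one partition lock at a time, far more than the 24 CPUs
can service without spinning.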

There's a CF entry about changing our lwlock implementation to scale
better...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: spinlocks storm bug
Date: 2013-12-06 10:18:15
Message-ID: CAFj8pRD1ybs=UOW50CxZ_=ZDZFs2aXSE5r7ZZUSFkgR4k0nYrw@mail.gmail.com

2013/12/6 Andres Freund <andres(at)2ndquadrant(dot)com>

> On 2013-12-06 07:22:27 +0100, Pavel Stehule wrote:
> > I have a report of critical bug (database is temporary unavailability ..
> > restart is necessary).
>
> > PostgreSQL 9.2.4,
> > 24 CPU
> > 140G RAM
> > SSD disc for all
> >
> >
> > Database is under high load. There is a few databases with very high number
> > of similar simple statements. When application produce higher load, then
> > number of active connection is increased to 300-600 about.
> >
> > In some moment starts described event - there is a minimal IO, all CPU are
> > on 100%.
> >
> > Perf result shows:
> > 354246.00 93.0% s_lock
> > /usr/lib/postgresql/9.2/bin/postgres
> > 10503.00 2.8% LWLockRelease
> > /usr/lib/postgresql/9.2/bin/postgres
> > 8802.00 2.3% LWLockAcquire
>
> > We try to limit a connection to 300, but I am not sure if this issue is not
> > related to some Postgres bug.
>
> We've seen this issue repeatedly now. None of the times it turned out to
> be a bug, but just limitations in postgres' scalability. If you can I'd
> strongly suggest trying to get postgres binaries compiled with
> -fno-omit-frame-pointer installed to check which locks are actually
> conteded.
> My bet is BufMappingLock.
>
> There's a CF entry about changing our lwlock implementation to scale
> better...
>
>
One missing piece of information: the customer's staff reduced shared_buffers
from 30G to 5G without success. The database is about 20G.

Regards

Pavel

> Greetings,
>
> Andres Freund
>
> --
> Andres Freund http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
>