Re: max_connections reached in postgres 9.3.3

From: "Vasudevan, Ramya" <ramya(dot)vasudevan(at)classmates(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: max_connections reached in postgres 9.3.3
Date: 2014-06-12 18:23:12
Message-ID: 20EE50F73664E744AF948F0106FE6DFA2AD0C9B2@SEAMBX01.sea.corp.int.untd.com
Lists: pgsql-general

Thank you for the response.

On further investigation, we found that SELECT statements were completing normally, but DMLs (writes to the DB) were hung for minutes at a time, though some of them went through. We also had two checkpoints during this period. Yesterday when we had the issue, we had 759 connections that were not idle (116 COMMIT, 238 INSERT, 176 UPDATE, 57 authentication, 133 BIND). So it looked like writes (simple inserts and updates) were not completing quickly and caused connections to back up in the DB; it didn't look like any one bad query.
It almost seems like postgres could not write to the WAL logs.
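For what it's worth, a snapshot like the following is one way to capture that breakdown the next time it happens (a sketch only; it assumes PostgreSQL 9.2+, where pg_stat_activity exposes the `state` and `query` columns — backends still in authentication show up in ps but not here):

```sql
-- Summarize non-idle backends by the first keyword of the running query
SELECT upper(split_part(query, ' ', 1)) AS verb,
       count(*) AS backends
FROM pg_stat_activity
WHERE state <> 'idle'
GROUP BY 1
ORDER BY 2 DESC;

-- Backends blocked waiting on a lock (9.3 uses the boolean "waiting" column)
SELECT count(*) FROM pg_stat_activity WHERE waiting;
```

A high count of blocked writers alongside normal SELECTs would point at lock contention or WAL/checkpoint stalls rather than one bad query.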

We normally have 600-700 connections in the database. Since we migrated a lot more applications to this postgres database from oracle, we increased max_connections as a test, to see whether we legitimately need to allow more connections or whether there is an underlying issue. We quickly realized that we already had a high number (1500).

Thank You
Ramya

-----Original Message-----
From: Merlin Moncure [mailto:mmoncure(at)gmail(dot)com]
Sent: Wednesday, June 11, 2014 4:24 PM
To: Vasudevan, Ramya
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: [GENERAL] max_connections reached in postgres 9.3.3

On Wed, Jun 11, 2014 at 1:24 PM, Vasudevan, Ramya <ramya(dot)vasudevan(at)classmates(dot)com> wrote:
> Our set up:
>
> · Db version: postgres 9.3.3
>
> · OS: CentOS 6.5
>
> · kernel Version - Linux 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Jan 3
> 21:39:27 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
>
> · cpu - 24 proc
>
> · memory - 768 GB
>
> · The disks are SAN fiber.
>
> · We have streaming replication set up from this server to the
> secondary server.
>
> · Some of the DB parameters:
>
> max_connections - 1500
>
> shared_buffers - 4GB
>
> work_mem = 130MB
>
> maintenance_work_mem = 1GB
>
>
>
> Issue this morning:
>
> At 9:04 - FATAL: remaining connection slots are reserved for
> non-replication superuser connections
>
> These were the things we noticed and have question on each one of them:
>
> pg_stat_activity showed a total of 850 connections, but 'ps -ef | grep
> postgres' showed 1500 connections, and the DB wasn't accepting new connections.
> How is it possible to have such a discrepancy in count between
> pg_stat_activity and ps?
>
> 200 were in waiting status, 50 were in authentication status, 748 were
> in IDLE status, and the rest were all doing some DML operation. Why
> would 50 connections be in authentication status in the database and stay there?
>
> Some existing sessions were completing the sqls successfully (though
> not as fast). So, the database was not completely hung or unresponsive.
>
> Attempts to kill sessions at the DB level (pg_terminate_backend) just
> hung. Why would this be the case?
>
> During this time, server had a very high load average of 130 (normally
> we have < 5).
>
> We had to restart the database to clear it all up.

There are two problems here:
1. There was some type of 'high load' event. Perhaps it was spurious, like one of the various O/S mishaps (THP compaction, for example), or something on your end like a bad query plan or a database-impacting application bug. When this happens, queries accumulate faster than they clear and load escalates.

2. Your database is configured with a very high max connection count, and your application servers open connections without bound.

This is a very common problem: there is a trigger that causes some type of hiccup, then queries storm in and overload the database.
Raising max_connections is not a good solution beyond a certain threshold, as it will simply make your load issues more severe when you are trying to rescue the database and are facing an emergency restart.
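To make the trade-off concrete, a more conservative setup might look something like this (the values are illustrative assumptions, not recommendations from this thread; size them to what your hardware can actually serve):

```
# postgresql.conf -- illustrative sketch only
max_connections = 300                  # bounded by real CPU/IO capacity, not app demand
superuser_reserved_connections = 5     # keeps emergency psql access when slots fill up
log_lock_waits = on                    # log sessions waiting longer than deadlock_timeout
log_min_duration_statement = 1000      # log statements slower than 1s (milliseconds)
```

Note that the FATAL you saw ("remaining connection slots are reserved for non-replication superuser connections") fires when non-superuser slots are exhausted; the reserved slots are what let a superuser still get in to diagnose.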

Obviously fixing #1 is ideal. The problem is, all too often by the time it comes to your attention you are in emergency fix mode because the server is unresponsive (or barely responsive), even from psql.

One way of managing problem #2 is to install something like pgbouncer, a 'man in the middle' connection pooler that virtualizes the connection count so that only a well-defined number of physical connections is actually in use by the database. Installing it is not to be taken lightly, but what it *does* do is protect the database when something like this occurs -- load will be capped, and you can still get emergency diagnostic information out of the server (like perf or sar), pull query logs out of pg_stat_activity, issue cancels, and things like that.
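A minimal pgbouncer configuration for this shape of problem might look like the sketch below (database name, paths, and sizes are placeholders, not values from this thread):

```ini
; pgbouncer.ini -- minimal transaction-pooling sketch
[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr = *
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction   ; caution: breaks session features (prepared statements, advisory locks)
default_pool_size = 40    ; physical server connections per database/user pair
max_client_conn = 2000    ; excess app connections queue at the pooler instead of hitting postgres
```

The key effect: during a load spike, application connections queue at the pooler while the database itself never sees more than the pool size, so it stays responsive enough to diagnose.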

merlin
