PG quitting sporadically!!

Lists: pgsql-generalpgsql-patches
From: "Phoenix Kiula" <phoenix(dot)kiula(at)gmail(dot)com>
To: "Postgres General" <pgsql-general(at)postgresql(dot)org>
Subject: PG quitting sporadically!!
Date: 2008-02-14 12:31:26
Message-ID: e373d31e0802140431p23e097ceu5ed793cc2919605b@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-patches

Hi,

I have been running PG for quite a while now.

Suddenly, the postmaster either hogs up memory or just croaks and
doesn't respond. The write process has become horribly slow.

Could it be that my data has grown so large that it's becoming an
issue? Even connecting to PGSQL as postgres user in SSH is way too
slow.

I am on a 4GB server with Apache 2.2.6 and PgSql 8.2.3. The same
server runs Apache, PgSQL, and MySQL. The last is not used for much,
just some simple logging and such. It seldom shows up in "TOP" output.

So the main heavy lifters are Apache and PgSQL.

In my PG_LOG files, I have many messages like these:

LOG: could not send data to client: Broken pipe
LOG: could not receive data from client: Connection reset by peer
LOG: unexpected EOF on client connection

What are these things? My conf file settings (which were working fine
until recently!) are as follows:

max_connections = 150
shared_buffers = 330MB
effective_cache_size = 512000
max_fsm_relations = 100
max_fsm_pages = 300000
work_mem = 100MB
temp_buffers = 4096
authentication_timeout = 10s
ssl = off

autovacuum = on
autovacuum_vacuum_cost_delay = 20
vacuum_cost_delay = 20
autovacuum_naptime = 10
stats_start_collector = on
stats_row_level = on
autovacuum_vacuum_threshold = 75
autovacuum_analyze_threshold = 25
autovacuum_analyze_scale_factor = 0.02
autovacuum_vacuum_scale_factor = 0.01

wal_buffers = 128
checkpoint_segments = 64
checkpoint_timeout = 900
fsync = on
maintenance_work_mem = 512MB

------------------------------------------------

1. VMSTAT -S

~ > vmstat -s
4148844 total memory
3771392 used memory
3232328 active memory
265652 inactive memory
377452 free memory
75044 buffer memory
2890740 swap cache
2096440 total swap
6436 used swap
2090004 free swap
24821214 non-nice user cpu ticks
229660 nice user cpu ticks
14153129 system cpu ticks
566198940 idle cpu ticks
22918768 IO-wait cpu ticks
75387 IRQ cpu ticks
0 softirq cpu ticks
153625372 pages paged in
287656751 pages paged out
175855 pages swapped in
859894 pages swapped out
1844498357 interrupts
889200515 CPU context switches
1201419806 boot time
15735402 forks
------------------------------------------------

2. FREE -M

total used free shared buffers cached
Mem: 4051 4025 26 0 74 2823
-/+ buffers/cache: 1127 2924
Swap: 2047 6 2041

------------------------------------------------

3. VACUUM ANALYE OUTPUT

This is the last few lines of this output:

0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: "clients": found 1 removable, 3031359 nonremovable row versions
in 199194 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 2624366 unused item pointers.
97747 pages contain useful free space.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 235.21 sec.
INFO: vacuuming "pg_toast.pg_toast_50002"
INFO: index "pg_toast_50002_index" now contains 20 row versions in 2 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: "pg_toast_50002": found 0 removable, 20 nonremovable row
versions in 7 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 12 unused item pointers.
6 pages contain useful free space.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.02 sec.
INFO: analyzing "public.clients"
INFO: "clients": scanned 7500 of 199194 pages, containing 114197 live
rows and 1 dead rows; 7500 rows in sample, 3032981 estimated total
rows
INFO: free space map contains 98498 pages in 25 relations
DETAIL: A total of 98720 page slots are in use (including overhead).
98720 page slots are required to track all free space.
Current limits are: 300000 page slots, 100 relations, using 1766 kB.

------------------------------------------------


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Phoenix Kiula <phoenix(dot)kiula(at)gmail(dot)com>
Cc: Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: PG quitting sporadically!!
Date: 2008-02-14 14:41:39
Message-ID: Pine.GSO.4.64.0802140908380.19471@westnet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-patches

On Thu, 14 Feb 2008, Phoenix Kiula wrote:

> Suddenly, the postmaster either hogs up memory or just croaks and
> doesn't respond. The write process has become horribly slow.

You should consider whether that's because a checkpoint is happening at
that point. You didn't mention anything about your disk+controller
information to have an idea how likely that is. Consider increasing
checkpoint_warning=3600 so that you'll always get a note in the logs when
a checkpoint happens; if those line up with your client disconnects that
will be telling you something.

On the logging size, you may want to also enable
log_min_duration_statement ; around 500 (milliseconds) would be a
reasonable starting value. That will show you what queries are taking a
long time to handle.

> I am on a 4GB server with Apache 2.2.6 and PgSql 8.2.3.

Be aware that when 8.2.3 was released, 8.2 had only been out for two
months. There's another 11 months worth of accumulated bug fixes in
8.2.6, including some that can cause the server to slow or crash. It's
not a difficult upgrade (no changes to the database) and you should
consider it. There are plenty of known and already fixed problems in
8.2.3 you could be running into.

> max_connections = 150
> maintenance_work_mem = 512MB
> shared_buffers = 330MB
> work_mem = 100MB

That's a really high setting for work_mem with this many connections; are
you aware that combination can easily use 15GB of RAM? You should
decrease that to around 10MB with the size of your server and greatly
reduce one possible source for running out of memory. It's possible to
increase that value just for some individual queries if there's some known
set of ones that really need more memory to work efficiently (looking at
the minimum duration logs should give you guidance here).

I'd normally suggest increasing shared_buffers instead as well, but you
should rule out checkpoints before doing that (increasing shared_buffes
can make checkpoint issues worse). You could also decrease
maintenance_work_mem quite a bit from where you've got it at now to reduce
another possible source for large memory allocations.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: "Phoenix Kiula" <phoenix(dot)kiula(at)gmail(dot)com>
To: "Postgres General" <pgsql-general(at)postgresql(dot)org>, gsmith(at)gregsmith(dot)com
Subject: Re: PG quitting sporadically!!
Date: 2008-02-15 00:10:36
Message-ID: e373d31e0802141610w4ff1cb60j2a0e3ae7a400e475@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-patches

Thanks. Comments below. (PS: I am still unable to connect to
postgresql even in SSH! I see this message:

psql: could not connect to server: Connection timed out
Is the server running on host "localhost" and accepting
TCP/IP connections on port 5432?

Yes of course the localhost is running the pgsql server and that port
is allowed!

> You should consider whether that's because a checkpoint is happening at
> that point. You didn't mention anything about your disk+controller
> information to have an idea how likely that is. Consider increasing
> checkpoint_warning=3600 so that you'll always get a note in the logs when
> a checkpoint happens; if those line up with your client disconnects that
> will be telling you something.
>

I am not sure what checkpoint stuff means. But I added that entry, and
now my log has ONLY this:

LOG: test message did not get through on socket for statistics collector
LOG: disabling statistics collector for lack of working socket
LOG: database system was shut down at 2008-02-14 17:53:13 CST
LOG: checkpoint record is at 8/E4BE7CF8
LOG: redo record is at 8/E4BE7CF8; undo record is at 0/0; shutdown TRUE
LOG: next transaction ID: 0/296662974; next OID: 89700
LOG: next MultiXactId: 1; next MultiXactOffset: 0
LOG: database system is ready

> On the logging size, you may want to also enable
> log_min_duration_statement ; around 500 (milliseconds) would be a
> reasonable starting value. That will show you what queries are taking a
> long time to handle.

I have this at 5000. I think I first want to track those that take
more than 5 seconds. But for now, the log is useless as mentioned
above.

> Be aware that when 8.2.3 was released, 8.2 had only been out for two
> months. There's another 11 months worth of accumulated bug fixes in
> 8.2.6, including some that can cause the server to slow or crash. It's
> not a difficult upgrade (no changes to the database) and you should
> consider it. There are plenty of known and already fixed problems in
> 8.2.3 you could be running into.

Thanks, I can do that, but the upgrade process is not very simple or
automated and will take backup of database and all that rigmarole.Last
time I remember, I had to seek help from this forum to do that simple
task, because something had gone remiss.

Is there an easy RPM method of upgrading postgresql without
backingup/restoring etc? I am on CentOS 4.

> > max_connections = 150
> > maintenance_work_mem = 512MB
> > shared_buffers = 330MB
> > work_mem = 100MB
>
> That's a really high setting for work_mem with this many connections; are
> you aware that combination can easily use 15GB of RAM?

I didn't know that, but it had been working for over a year. I have
4GB of RAM. I have changed this now to 10MB, the work_mem, but that
isn't helping.

I'd appreciate any other info. Thanks!


From: "Douglas McNaught" <doug(at)mcnaught(dot)org>
To: "Phoenix Kiula" <phoenix(dot)kiula(at)gmail(dot)com>
Cc: "Postgres General" <pgsql-general(at)postgresql(dot)org>, gsmith(at)gregsmith(dot)com
Subject: Re: PG quitting sporadically!!
Date: 2008-02-15 00:58:29
Message-ID: 5ded07e00802141658i5152437ewcb1fa0328c2253a6@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-patches

> > Be aware that when 8.2.3 was released, 8.2 had only been out for two
> > months. There's another 11 months worth of accumulated bug fixes in
> > 8.2.6, including some that can cause the server to slow or crash. It's
> > not a difficult upgrade (no changes to the database) and you should
> > consider it. There are plenty of known and already fixed problems in
> > 8.2.3 you could be running into.
>
>
>
> Thanks, I can do that, but the upgrade process is not very simple or
> automated and will take backup of database and all that rigmarole.Last
> time I remember, I had to seek help from this forum to do that simple
> task, because something had gone remiss.
>
> Is there an easy RPM method of upgrading postgresql without
> backingup/restoring etc? I am on CentOS 4.

You don't need a dump/restore to go between minor releases (e.g. 8.2.3
to 8.2.6). Just install the new binaries and fire it up. Taking a
backup beforehand isn't a bad idea, of course. :)

-Doug


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Phoenix Kiula <phoenix(dot)kiula(at)gmail(dot)com>
Cc: Postgres General <pgsql-general(at)postgresql(dot)org>, gsmith(at)gregsmith(dot)com
Subject: Re: PG quitting sporadically!!
Date: 2008-02-15 20:12:14
Message-ID: 20080215201214.GF22581@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-patches

Phoenix Kiula escribió:
> Thanks. Comments below. (PS: I am still unable to connect to
> postgresql even in SSH! I see this message:
>
> psql: could not connect to server: Connection timed out
> Is the server running on host "localhost" and accepting
> TCP/IP connections on port 5432?
>
> Yes of course the localhost is running the pgsql server and that port
> is allowed!

My educated guess is that your network is behaving funny, or your
firewall is crazy. Or your kernel has bugs. The fact that the
connections work sometimes is one clue; this is the other:

> LOG: test message did not get through on socket for statistics collector
> LOG: disabling statistics collector for lack of working socket

This is a bad sign. For one thing, it means autovacuum, if enabled, is
not really working at all (which can, in turn, explain slowness).

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Phoenix Kiula <phoenix(dot)kiula(at)gmail(dot)com>
Cc: Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: PG quitting sporadically!!
Date: 2008-02-15 21:36:57
Message-ID: Pine.GSO.4.64.0802151540590.28731@westnet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-patches

On Fri, 15 Feb 2008, Phoenix Kiula wrote:

> I am not sure what checkpoint stuff means. But I added that entry, and
> now my log has ONLY this:
>
> LOG: test message did not get through on socket for statistics collector
> LOG: disabling statistics collector for lack of working socket

If you're getting that, as already suggested you should be chasing down
whatever is going on there before you touch anything else. I was throwing
out a list of possible things that might cause your problems on a fully
working system, but you don't have one of those right now and that's
really strange.

Have you ever run top and hit the C key to see what all the processes were
doing? It labels the major PostgreSQL processes more usefully if you do
that. I'd be curious just what is gobbling up resources on your machine,
this socket error is somewhat disturbing.

Since it sounds like a fairly critical machine you've got going here, if I
were you I'd be thinking a bit about whether it might make sense to
purchase an hour or two of consulting time from someone who really knows
this area. You're doing the right thing asking for help here, but I
wonder whether there's something else going on that would be obvious to an
expert if they logged into your system and poked around a bit. (This is
certainly not an ad for me--I'm not doing consulting right now).

> I can do that, but the upgrade process is not very simple or automated
> and will take backup of database and all that rigmarole...Is there an
> easy RPM method of upgrading postgresql without backingup/restoring etc?
> I am on CentOS 4.

First off: I wouldn't want to introduce another variable here until
there's a better understanding of what's wrong with your existing system.
You certainly should do a true backup here oriented at disaster recovery
before upgrading given the weirdness involved. But the upgrade itself
doesn't require one, just installing new packages.

If you've already installed the PGDG RPMs on your system (I don't know how
else you'd have gotten 8.2.3 onto Centos 4 via RPM) you should be able to
download the new ones for the latest 8.2, put them all into a directory,
and do "rpm -Uvh *.rpm" to get the new ones replacing the old (with the
server shutdown!). *Should* only take a few minutes. I wrote a little
guide to sorting through more complicated upgrades if it comes to that you
can find at http://www.westnet.com/~gsmith/content/postgresql/pgrpm.htm ,
but if the packages are from the same underlying source it's unlikely
you'll have that large of a mess.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: Phoenix Kiula <phoenix(dot)kiula(at)gmail(dot)com>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: [GENERAL] PG quitting sporadically!!
Date: 2008-02-16 21:12:18
Message-ID: 200802162112.m1GLCIf07638@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-patches

Greg Smith wrote:
> On Thu, 14 Feb 2008, Phoenix Kiula wrote:
>
> > Suddenly, the postmaster either hogs up memory or just croaks and
> > doesn't respond. The write process has become horribly slow.
>
> You should consider whether that's because a checkpoint is happening at
> that point. You didn't mention anything about your disk+controller
> information to have an idea how likely that is. Consider increasing
> checkpoint_warning=3600 so that you'll always get a note in the logs when
> a checkpoint happens; if those line up with your client disconnects that
> will be telling you something.

Oh, that is a creative use to get checkpoint_warning to log all
checkpoints. I have added a documentation mention about that.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +