Re: autovacuum launcher process eating up 17G+ of ram?

Lists: pgsql-admin
From: Nick <t32(at)2thebatcave(dot)com>
To: pgsql-admin(at)postgresql(dot)org
Subject: autovacuum launcher process eating up 17G+ of ram?
Date: 2010-10-21 21:14:25
Message-ID: AANLkTimGvWugpjwN1Ki=qmd56nsc4gXH6yKT5765nFPK@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin

I have a production server running postgres 8.3.11. I did a dump all
and loaded up postgres 9.0.1 on another server. On the new server,
the postgres autovacuum launcher process eats up an insane amount of
ram (I have seen 17G virt with 6.5G res). On the older version, it's
at a reasonable 9MB res after running for a month straight.

If I change the autovacuum_naptime parameter to 1, I can actually see
the memory count up in MB with top. It looks like after a while it
grows faster, with the resident memory going up about 1MB every 6-8
seconds. After about 35 minutes, the virt was 455MB and the res was
296MB.

The only change I made in the postgresql.conf (on both versions) is
setting the maximum number of connections to 1000. So all other
options are default to their particular postgres versions. Both have
the same number of databases and data. The number of databases is
134. There are no queries running on the test box with the new
version. Both are running RHEL 5 64-bit for an operating system.

Any idea how to fix this? I would really like to upgrade but this is
going to be a deal breaker.


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Nick" <t32(at)2thebatcave(dot)com>,<pgsql-admin(at)postgresql(dot)org>
Subject: Re: autovacuum launcher process eating up 17G+ of ram?
Date: 2010-10-21 21:59:19
Message-ID: 4CC071670200002500036C8A@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin

Nick <t32(at)2thebatcave(dot)com> wrote:

> I have a production server running postgres 8.3.11. I did a dump
> all and loaded up postgres 9.0.1 on another server. On the new
> server, the postgres autovacuum launcher process eats up an insane
> amount of ram (I have seen 17G virt with 6.5G res).

You're not looking at this in top and adding up the RAM per process,
are you? That is notoriously unreliable; in particular it tends to
count the shared memory over and over. What does free say about
total usage? How do things look with vmstat 1?

> The only change I made in the postgresql.conf (on both versions)
> is setting the maximum number of connections to 1000.

Unless you have four or five hundred cores in the box, that's
probably a bad idea. You should seriously look at using a
connection pool. Search the archives for previous discussions on
that topic.

> So all other options are default to their particular postgres
> versions.

Since the defaults are designed to allow the server to come up on
even a very small machine, you should probably be setting several
more. You might want to start here:

http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

-Kevin


From: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
To: Nick <t32(at)2thebatcave(dot)com>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: autovacuum launcher process eating up 17G+ of ram?
Date: 2010-10-22 02:22:16
Message-ID: AANLkTimRUjJta-3kyyXWg2FiFcza1f2RdwWu6hF4ciye@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin

On Thu, Oct 21, 2010 at 3:14 PM, Nick <t32(at)2thebatcave(dot)com> wrote:
> I have a production server running postgres 8.3.11.  I did a dump all
> and loaded up postgres 9.0.1 on another server.  On the new server,
> the postgres autovacuum launcher process eats up an insane amount of
> ram (I have seen 17G virt with 6.5G res).  On the older version, it's
> at a reasonable 9MB res after running for a month straight.

What's you're shared_buffers set to?

What does VIRT, RES and SHR? Is SHR really big? If RES is close to
SHR, then don't worry too much.


From: Nick <t32(at)2thebatcave(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: autovacuum launcher process eating up 17G+ of ram?
Date: 2010-10-22 17:28:03
Message-ID: AANLkTimKs23hytr6oj1Kco53rbaPF2mKTZZYofr0Cr0+@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin

Yes I was using top, but I am not adding up anything. I'm only ever
looking at the one process called "postgres: autovacuum launcher
process", not any of the other postgres processes.

Here is a vmstat 1 right after postgres has been started:

procs -----------memory---------- ---swap-- -----io---- --system--
-----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 55480 760172 1709424 3758492 0 12 4 294 23 26 3
2 94 0 0
0 0 55480 756080 1709460 3762308 0 0 8 6836 1139 442 10
8 79 3 0
0 0 55480 755088 1709492 3763328 0 0 0 0 1044 409 10
8 83 0 0
0 0 55480 754096 1709544 3763992 0 0 0 64 1023 405 10
7 83 0 0
1 0 55480 749696 1709692 3766816 0 0 2028 0 1141 539 11
9 74 7 0
0 0 55480 746544 1709764 3769900 0 0 2164 0 1128 527 11
7 83 0 0
1 0 55480 743332 1709812 3773040 0 0 2216 7440 1314 572 10
8 80 2 0
3 0 55480 739808 1709864 3776044 0 0 2128 0 1170 541 11
8 81 0 0
0 0 55480 737356 1709928 3779260 0 0 2216 0 1149 552 10
8 82 0 0

Here is another chunk after it's been running for about 18 hours:

procs -----------memory---------- ---swap-- -----io---- --system--
-----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 1812168 20332 59452 102172 0 12 5 302 8 19 3
3 93 0 0
0 0 1812168 19960 59456 102164 0 0 0 3260 1051 540 8
7 85 1 0
0 0 1812168 19932 59456 102324 0 0 0 0 1022 585 8
8 83 0 0
1 0 1812168 19056 59456 102620 0 0 0 0 1035 541 7
7 86 0 0
0 0 1812168 19808 59456 102128 0 0 0 0 1024 572 8
9 84 0 0
0 1 1812168 20304 59460 102124 0 0 0 28 1035 553 8
8 85 0 0
1 0 1812168 19428 59476 102312 0 0 0 3152 1031 535 8
6 84 2 0
0 0 1812168 19808 59476 102168 0 0 0 8 1041 573 8
9 83 0 0
1 0 1812168 18924 59476 102316 0 0 0 0 1016 526 9
8 84 0 0
0 0 1812168 18940 59476 102316 0 0 0 0 1042 535 7
8 85 0 0
0 0 1812168 19560 59476 102168 0 0 0 0 1021 439 8
8 85 0 0
0 0 1812168 19312 59488 102156 0 0 0 3136 1045 454 7
8 84 1 0
0 0 1812168 18692 59496 102168 0 0 0 3236 1221 505 9
8 82 2 0
1 0 1812168 21432 59496 102184 0 0 0 0 1044 498 8
8 85 0 0
0 0 1812168 22184 59496 102128 0 0 0 0 1019 511 7
6 86 0 0
0 0 1812168 21936 59496 102168 0 0 0 0 1039 582 8
9 84 0 0
0 0 1812168 21316 59504 102168 0 0 0 3284 1030 450 8
7 85 1 0
0 0 1812168 21440 59504 102168 0 0 0 0 1034 466 8
7 85 0 0
1 0 1812168 20812 59512 102168 0 0 0 16 1021 546 9
7 84 1 0
0 0 1812168 21316 59512 102168 0 0 0 0 1039 532 7
8 85 0 0

Here is a snapshot of all the postgres processes as listed in top
(29582 = postgres: autovacuum launcher process):

29582 pgsql 15 0 8739m 6.7g 1816 S 8.9 86.1 91:10.44 postgres
29583 pgsql 15 0 108m 7648 440 S 13.3 0.1 143:04.91 postgres
29578 pgsql 15 0 157m 3704 2992 S 0.3 0.0 1:46.29 postgres
29580 pgsql 15 0 157m 1472 760 S 0.0 0.0 0:00.78 postgres
29581 pgsql 15 0 157m 1204 492 S 0.0 0.0 0:00.12 postgres

As you can see, after a while the system is using a lot of swap. The
first time I noticed the problem, the swap was up to 12G (there is 8GB
of ram in this system). I believe postgres was running for several
days to achieve that much memory utilization.

Thanks for the other thoughts on tuning. I will investigate those
once I figure out what is up with this memory utilization problem.

On Thu, Oct 21, 2010 at 2:59 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> Nick <t32(at)2thebatcave(dot)com> wrote:
>
>> I have a production server running postgres 8.3.11.  I did a dump
>> all and loaded up postgres 9.0.1 on another server.  On the new
>> server, the postgres autovacuum launcher process eats up an insane
>> amount of ram (I have seen 17G virt with 6.5G res).
>
> You're not looking at this in top and adding up the RAM per process,
> are you?  That is notoriously unreliable; in particular it tends to
> count the shared memory over and over.  What does free say about
> total usage?  How do things look with vmstat 1?
>
>> The only change I made in the postgresql.conf (on both versions)
>> is setting the maximum number of connections to 1000.
>
> Unless you have four or five hundred cores in the box, that's
> probably a bad idea.  You should seriously look at using a
> connection pool.  Search the archives for previous discussions on
> that topic.
>
>> So all other options are default to their particular postgres
>> versions.
>
> Since the defaults are designed to allow the server to come up on
> even a very small machine, you should probably be setting several
> more.  You might want to start here:
>
> http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
>
> -Kevin
>


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Nick" <t32(at)2thebatcave(dot)com>
Cc: <pgsql-admin(at)postgresql(dot)org>
Subject: Re: autovacuum launcher process eating up 17G+ of ram?
Date: 2010-10-22 18:17:42
Message-ID: 4CC18EF60200002500036CF3@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin

Nick <t32(at)2thebatcave(dot)com> wrote:

> Here is a vmstat 1 right after postgres has been started:
>
> procs -----------memory----------
> r b swpd free buff cache
> 0 0 55480 760172 1709424 3758492

> Here is another chunk after it's been running for about 18 hours:
>
> procs -----------memory----------
> r b swpd free buff cache
> 0 0 1812168 20332 59452 102172

> Here is a snapshot of all the postgres processes as listed in top
> (29582 = postgres: autovacuum launcher process):
>
> 29582 pgsql 15 0 8739m 6.7g 1816 S 8.9 86.1 91:10.44
> postgres

Ouch!

I've not seen anything remotely like that. I wonder what's
different....

> As you can see, after a while the system is using a lot of swap.
> The first time I noticed the problem, the swap was up to 12G
> (there is 8GB of ram in this system). I believe postgres was
> running for several days to achieve that much memory utilization.

Is there anything unusual in the logs?

-Kevin


From: Nick <t32(at)2thebatcave(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: autovacuum launcher process eating up 17G+ of ram?
Date: 2010-10-25 17:01:28
Message-ID: AANLkTi=Mh=pv7Ai+L17cbjgsExMbncHG5fMTvj5_wCAe@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin

On Fri, Oct 22, 2010 at 11:17 AM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:

> I've not seen anything remotely like that.  I wonder what's
> different....

I did a test with the same postgres build (and autovacuum_naptime = 1
as well) on the same OS with only the system databases (postgres,
template0, template1), and after the weekend the memory utilization
was acceptable:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
14879 pgsql 15 0 312m 177m 1108 S 0.0 8.8 0:24.43 postgres

However on the other machine which has 136 total databases (133 user +
the same 3 system ones listed above), the kernel had to kill off the
postgres process since the box ran out of memory (the box has 8GB ram
+ 30GB swap), and I see it climbing again.

It seems like the problem is a function of how many databases there are.

> Is there anything unusual in the logs?

I set log_autovacuum_min_duration = 0 but that doesn't even seem to
show that anything is being vacuumed. The only thing unusual in the
logs was a result of the box running out of memory and the kernel
killing off processes.


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Nick" <t32(at)2thebatcave(dot)com>
Cc: <pgsql-admin(at)postgresql(dot)org>
Subject: Re: autovacuum launcher process eating up 17G+ of ram?
Date: 2010-10-25 17:31:59
Message-ID: 4CC578BF0200002500036DC0@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin

Nick <t32(at)2thebatcave(dot)com> wrote:
> Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
>
>> I've not seen anything remotely like that. I wonder what's
>> different....
>
> I did a test with the same postgres build (and autovacuum_naptime
> = 1 as well) on the same OS with only the system databases
> (postgres, template0, template1), and after the weekend the memory
> utilization was acceptable:

That's probably it. Our shop normally has just one application
database in a cluster, and I can't think of any clusters with more
than three or four. It sounds like there might be some sort of
memory leak when a vacuum worker switches databases.

-Kevin


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Nick <t32(at)2thebatcave(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, pgsql-admin(at)postgresql(dot)org, Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Subject: Re: autovacuum launcher process eating up 17G+ of ram?
Date: 2010-10-25 19:57:10
Message-ID: 4232.1288036630@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin

Nick <t32(at)2thebatcave(dot)com> writes:
> It seems like the problem is a function of how many databases there are.

Yeah, I can reproduce this in HEAD. Create a hundred or so databases,
and make sure there's a stats table entry for each, eg

create database d0;
\c d0
create table t1 as select generate_series(1,100000) x;
create database d1;
\c d1
create table t1 as select generate_series(1,100000) x;
create database d2;
\c d2
create table t1 as select generate_series(1,100000) x;
...

Set autovacuum_naptime to 1sec, and watch it bloat. (It bloats faster
once it's run through the databases once and isn't mostly waiting for
children.)

It looks to me like the problem is that get_database_list()
intentionally pushes its result into a long-lived context, and then the
result is never cleaned up. It doesn't help any that it does a lot of
leaky things like heap_beginscan with CurrentMemoryContext totally
unnecessarily pointing at the long-lived context.

I'm not sure what the intent was here, but it's pretty broken as-is.
Alvaro, what did you have in mind? Should this be using a less
long-lived context, or do we need code to free an avw_dbase list?

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Nick <t32(at)2thebatcave(dot)com>, Kevin Grittner <kevin(dot)grittner(at)wicourts(dot)gov>, pgsql-admin <pgsql-admin(at)postgresql(dot)org>
Subject: Re: autovacuum launcher process eating up 17G+ of ram?
Date: 2010-11-08 15:01:51
Message-ID: 1289227352-sup-6579@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin

Excerpts from Tom Lane's message of lun oct 25 16:57:10 -0300 2010:

> It looks to me like the problem is that get_database_list()
> intentionally pushes its result into a long-lived context, and then the
> result is never cleaned up. It doesn't help any that it does a lot of
> leaky things like heap_beginscan with CurrentMemoryContext totally
> unnecessarily pointing at the long-lived context.
>
> I'm not sure what the intent was here, but it's pretty broken as-is.
> Alvaro, what did you have in mind? Should this be using a less
> long-lived context, or do we need code to free an avw_dbase list?

Sorry for not noticing this earlier. Seems my mail filters are broken
:-( Thanks Tom for the ping via private email.

Yes, get_database_list is clearly broken as is. I don't remember why
it's coded like this; this memcxt shuffling was introducing in the 9.0
rewrite to read pg_database directly instead of the flat file.

The right way for it to work seems to be to allocate the output list in
the caller's context; both current callers are in a throwaway context and
so the allocated list will go away automatically when they're done with
the list.

Patch for this is attached (applies cleanly to 9.0 and HEAD; 8.4 doesn't
need to be patched, but I'll run the test case now just to be sure.)

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Attachment Content-Type Size
0001-Fix-permanent-memory-leak-in-autovacuum-launcher.patch application/octet-stream 2.8 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Nick <t32(at)2thebatcave(dot)com>, Kevin Grittner <kevin(dot)grittner(at)wicourts(dot)gov>, pgsql-admin <pgsql-admin(at)postgresql(dot)org>
Subject: Re: autovacuum launcher process eating up 17G+ of ram?
Date: 2010-11-08 15:15:02
Message-ID: 24906.1289229302@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Patch for this is attached (applies cleanly to 9.0 and HEAD; 8.4 doesn't
> need to be patched, but I'll run the test case now just to be sure.)

Possibly s/cxt/resultcxt/, or some other less-generic name. Seems
pretty sane other than that cosmetic issue.

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Nick <t32(at)2thebatcave(dot)com>, Kevin Grittner <kevin(dot)grittner(at)wicourts(dot)gov>, pgsql-admin <pgsql-admin(at)postgresql(dot)org>
Subject: Re: autovacuum launcher process eating up 17G+ of ram?
Date: 2010-11-08 22:09:31
Message-ID: 1289253773-sup-7048@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin

Excerpts from Tom Lane's message of lun nov 08 12:15:02 -0300 2010:
> Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> > Patch for this is attached (applies cleanly to 9.0 and HEAD; 8.4 doesn't
> > need to be patched, but I'll run the test case now just to be sure.)
>
> Possibly s/cxt/resultcxt/, or some other less-generic name. Seems
> pretty sane other than that cosmetic issue.

Thanks, committed that way and pushed. I verified that 8.4 doesn't have
this problem.

Oops, I just noticed that I forgot to credit everyone for the report and
diagnosis in the commit message. Too late to fix :-(

So: thanks Nick for the report and Tom for the diagnosis.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support