out of memory woes

From: "Angva" <angvaw(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: out of memory woes
Date: 2006-12-13 21:49:08
Message-ID: 1166046548.628294.222790@n67g2000cwd.googlegroups.com
Lists: pgsql-general

Hi everyone,

First, this group has been good to me, and I thank you guys for the
valuable help I've found here. I come seeking help with another
problem. I am not even sure my problem lies in Postgres, but perhaps
someone here has had a similar problem and could point me in the right
direction.

As I've mentioned in a few other posts, I run a daily job that loads
large amounts of data into a Postgres database. It must run
efficiently, so one of the tricks I do is run table loads, and commands
such as cluster, in parallel. I am having a problem where my cluster
job consistently fails with an out of memory error. It did not always
do this, but at some point a memory leak or "something" built up to the
point where clustering would always fail on one of the two largest
tables. That is, four tables are clustered in parallel. The two smaller
ones finish successfully. The remaining two - the largest - run for
several minutes. Then one of the tables - not always the same one -
gets an out of memory error and fails.

So, suspecting a memory leak, I tried bouncing Postgres, and ran the
clusters again. No luck - failed in the same manner.

I don't know if it means anything, but swap never seems to be used by
the postgres processes (I stared at top and vmstat for a while), though
the VIRT column in top definitely shows something. Here are sample
lines from top while two processes are running:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
30692 postgres  25   0 1033m 854m 459m R  100 14.3  1:10.68 postgres: secmaster dev_stage [local] SELECT
30691 postgres  23   0 1752m 1.6g 459m R  100 27.5  2:55.60 postgres: secmaster dev_stage [local] SELECT

(You see SELECT instead of CLUSTER because I wrapped up my cluster
commands in functions. I call them from psql by selecting the
function.)
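
A simplified sketch of what the script does (not the actual script - the table
sets and percentage ranges differ; the arguments just pick which CLUSTER
statements each session runs):

psql -d dev_stage -U secmaster -c "SELECT cluster_load_tables(0, 14);" &
psql -d dev_stage -U secmaster -c "SELECT cluster_load_tables(15, 28);" &
wait    # block until the backgrounded cluster sessions finish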

Sample vmstat output:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free  buff    cache   si   so    bi    bo   in   cs us sy id wa
 3  2    724  18392  5052  5599624    0    0   915  2352    5    3 11  2 80  8

It seems that the postgres processes do not want to use swap -- swap
never increases as the processes run. Again I am not sure whether this
is significant.

If I run the clusters sequentially, there is no error - they just take
too long. I was out of ideas so I bounced the server entirely and ran
the clusters in parallel a few times - success! But I don't want to
have to bounce the server regularly.

So, I'm not sure what to do. I need to figure out why the server would
deteriorate in such a way. Any advice that can be offered would be much
appreciated. I can provide any additional information that might be
necessary.

I am running Postgres 8.1.2 on CentOS 4.4 64-bit.

Thanks a lot,
Mark


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Angva" <angvaw(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: out of memory woes
Date: 2006-12-14 16:27:10
Message-ID: 21351.1166113630@sss.pgh.pa.us
Lists: pgsql-general

"Angva" <angvaw(at)gmail(dot)com> writes:
> As I've mentioned in a few other posts, I run a daily job that loads
> large amounts of data into a Postgres database. It must run
> efficiently, so one of the tricks I do is run table loads, and commands
> such as cluster, in parallel. I am having a problem where my cluster
> job consistently fails with an out of memory error.

Could we see the actual error message? If it's a server-side error
then there should also be a memory context dump into the postmaster
log, which would be useful to see also. Lastly, please show the full
schema definition (including indexes) for the troublesome tables.

regards, tom lane


From: "Brandon Aiken" <BAiken(at)winemantech(dot)com>
To: "Angva" <angvaw(at)gmail(dot)com>, <pgsql-general(at)postgresql(dot)org>
Subject: Re: out of memory woes
Date: 2006-12-14 16:41:57
Message-ID: F8E84F0F56445B4CB39E019EF67DACBA4015F1@exchsrvr.winemantech.com
Lists: pgsql-general

I wonder if this is related to the Linux memory overcommit problem:

http://www.postgresql.org/docs/current/interactive/kernel-resources.html#AEN19361
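
For reference, the overcommit policy can be inspected and changed roughly like
this on Linux (run as root):

cat /proc/sys/vm/overcommit_memory      # 0 = heuristic, 1 = always allow, 2 = strict accounting
sysctl -w vm.overcommit_memory=2        # switch to strict accounting immediately
echo "vm.overcommit_memory = 2" >> /etc/sysctl.conf   # make it persist across reboots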

--
Brandon Aiken
CS/IT Systems Engineer

-----Original Message-----
From: pgsql-general-owner(at)postgresql(dot)org
[mailto:pgsql-general-owner(at)postgresql(dot)org] On Behalf Of Angva
Sent: Wednesday, December 13, 2006 4:49 PM
To: pgsql-general(at)postgresql(dot)org
Subject: [GENERAL] out of memory woes

[original message quoted in full - snipped]


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Angva <angvaw(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: out of memory woes
Date: 2006-12-14 16:49:55
Message-ID: 20061214164955.GE24370@svana.org
Lists: pgsql-general

On Wed, Dec 13, 2006 at 01:49:08PM -0800, Angva wrote:
> Hi everyone,
>
> First, this group has been good to me, and I thank you guys for the
> valuable help I've found here. I come seeking help with another
> problem. I am not even sure my problem lies in Postgres, but perhaps
> someone here has had a similar problem and could point me in the right
> direction.

As already suggested, we need to see actual error messages.

> I don't know if it means anything, but swap never seems to be used by
> the postgres processes (I stared at top and vmstat for a while), though
> the VIRT column in top definitely shows something. Here are sample
> lines from top while two processes are running:

The figures from top/ps should be taken with a grain of salt; they
include the shared memory, which means a lot is being double counted.

> If I run the clusters sequentially, there is no error - they just take
> too long. I was out of ideas so I bounced the server entirely and ran
> the clusters in parallel a few times - success! But I don't want to
> have to bounce the server regularly.

Someone also pointed out overcommit settings. With overcommit turned
off you will get out of memory errors long before it actually runs out.
Perhaps that is the issue? However, we need to see the actual messages
to be sure.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.


From: mwrynn(at)gmail(dot)com
To: pgsql-general(at)postgresql(dot)org
Subject: Re: out of memory woes
Date: 2006-12-14 17:50:07
Message-ID: 1166118607.055679.274360@t46g2000cwa.googlegroups.com
Lists: pgsql-general

Thank you all for the replies. Overcommit is indeed disabled - the
reason we disabled it is that this very same process caused the Linux
oom-killer to kill processes. That was perhaps two months ago. The
setting was changed to vm.overcommit_memory=2 and remains so today. All
had been well until recently. I will research overcommit further.

Tom, below is the information you requested.

Thank you,
Mark

/* SCHEMA DEFINITIONS */

dev_stage=> \d sm_mbs_geo_pool_distribution
                         Table "secmaster.sm_mbs_geo_pool_distribution"
       Column        |         Type          |                                  Modifiers
---------------------+-----------------------+--------------------------------------------------------------------------------------------
 geo_distribution_id | integer               | not null default nextval('sm_mbs_geo_pool_distribution_geo_distribution_id_seq'::regclass)
 issue_id            | integer               | not null
 pool_prefix         | character varying(2)  | not null
 pool_number         | character varying(7)  | not null
 distribution_type   | character varying(20) |
 state               | character varying(3)  |
 as_of_date          | date                  | not null
 loan_count          | integer               |
 loan_count_ratio    | numeric(5,2)          |
 percent_of_upb      | numeric(5,2)          |
 aggregate_upb       | numeric(16,3)         | not null
 cusip               | character(9)          |
 agency              | character(3)          |
 origination_year    | character(4)          |
Indexes:
    "sm_mbs_geo_pool_distribution_pkey" PRIMARY KEY, btree (geo_distribution_id)
    "sm_mbs_geo_pool_distribution_id_state_upb" btree (issue_id, state, percent_of_upb) CLUSTER
    "sm_mbs_geo_pool_distribution_state_lcnt" btree (issue_id, state, loan_count)
    "sm_mbs_geo_pool_distribution_state_pct_idx" btree (state, percent_of_upb, loan_count)
Foreign-key constraints:
    "sm_mbs_geo_pool_distribution_issue_id_fkey" FOREIGN KEY (issue_id) REFERENCES sm_mbs_pool_detail(issue_id) ON DELETE CASCADE

dev_stage=> \d sm_mbs_loan_distribution
                        Table "secmaster.sm_mbs_loan_distribution"
        Column        |     Type      |                                Modifiers
----------------------+---------------+-----------------------------------------------------------------------------------------
 rec_type             | character(1)  | not null
 agency               | character(3)  | not null
 pool_number          | character(7)  | not null
 cusip                | character(9)  | not null
 issue_id             | integer       | not null
 eff_date             | date          | not null
 value                | character(6)  | not null
 display_sort_order   | character(4)  | not null
 rpb                  | numeric(20,2) | not null
 pct_rpb              | numeric(5,2)  | not null
 loans                | integer       | not null
 loan_distribution_id | integer       | not null default nextval('sm_mbs_loan_distribution_loan_distribution_id_seq'::regclass)
Indexes:
    "sm_mbs_loan_distribution_pkey" PRIMARY KEY, btree (loan_distribution_id)
    "sm_mbs_loan_distribution_idx" UNIQUE, btree (issue_id, rec_type, value)
    "sm_mbs_loan_distribution_idx1" btree (rec_type, value, pct_rpb)
    "sm_mbs_loan_distribution_rec_type_loans" btree (issue_id, rec_type, value, loans)
    "sm_mbs_loan_distribution_rec_type_pct_rpb" btree (issue_id, rec_type, value, pct_rpb) CLUSTER
    "sm_mbs_loan_distribution_rec_type_rpb" btree (issue_id, rec_type, value, rpb)

/* MEMORY CONTEXT DUMP AND ERROR MESSAGE */

TopMemoryContext: 58800 total in 7 blocks; 6480 free (6 chunks); 52320
used
Operator class cache: 8192 total in 1 blocks; 1680 free (0 chunks);
6512 used
SPI Plan: 3072 total in 2 blocks; 1256 free (0 chunks); 1816 used
SPI Plan: 7168 total in 3 blocks; 3448 free (0 chunks); 3720 used
SPI Plan: 7168 total in 3 blocks; 3448 free (0 chunks); 3720 used
SPI Plan: 3072 total in 2 blocks; 1712 free (0 chunks); 1360 used
SPI Plan: 3072 total in 2 blocks; 1712 free (0 chunks); 1360 used
SPI Plan: 3072 total in 2 blocks; 1568 free (0 chunks); 1504 used
SPI Plan: 3072 total in 2 blocks; 1712 free (0 chunks); 1360 used
SPI Plan: 3072 total in 2 blocks; 1600 free (0 chunks); 1472 used
SPI Plan: 3072 total in 2 blocks; 1712 free (0 chunks); 1360 used
SPI Plan: 3072 total in 2 blocks; 1504 free (0 chunks); 1568 used
SPI Plan: 3072 total in 2 blocks; 1712 free (0 chunks); 1360 used
SPI Plan: 3072 total in 2 blocks; 1504 free (0 chunks); 1568 used
SPI Plan: 3072 total in 2 blocks; 1712 free (0 chunks); 1360 used
SPI Plan: 3072 total in 2 blocks; 1504 free (0 chunks); 1568 used
SPI Plan: 3072 total in 2 blocks; 1712 free (0 chunks); 1360 used
SPI Plan: 3072 total in 2 blocks; 1568 free (0 chunks); 1504 used
SPI Plan: 3072 total in 2 blocks; 1712 free (0 chunks); 1360 used
SPI Plan: 3072 total in 2 blocks; 1600 free (0 chunks); 1472 used
SPI Plan: 3072 total in 2 blocks; 1256 free (0 chunks); 1816 used
SPI Plan: 3072 total in 2 blocks; 424 free (0 chunks); 2648 used
SPI Plan: 3072 total in 2 blocks; 1696 free (0 chunks); 1376 used
PL/PgSQL function context: 24576 total in 2 blocks; 8464 free (7
chunks); 16112 used
PLpgSQL function cache: 22064 total in 2 blocks; 3744 free (0 chunks);
18320 used
CFuncHash: 8192 total in 1 blocks; 1680 free (0 chunks); 6512 used
TopTransactionContext: 8192 total in 1 blocks; 5984 free (0 chunks);
2208 used
CurTransactionContext: 24576 total in 2 blocks; 12400 free (1 chunks);
12176 used
ExecutorState: 24576 total in 2 blocks; 4496 free (1 chunks); 20080
used
ExprContext: 8192 total in 1 blocks; 8160 free (0 chunks); 32 used
SPI Exec: 528474160 total in 69 blocks; 309634880 free (9674592
chunks); 218839280 used
ExecutorState: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
SPI Proc: 24576 total in 2 blocks; 16728 free (16 chunks); 7848 used
MessageContext: 8192 total in 1 blocks; 5040 free (3 chunks); 3152 used

smgr relation table: 24576 total in 2 blocks; 13904 free (4 chunks);
10672 used
Portal hash: 8192 total in 1 blocks; 1680 free (0 chunks); 6512 used
PortalMemory: 8192 total in 1 blocks; 7888 free (0 chunks); 304 used
PortalHeapMemory: 1024 total in 1 blocks; 744 free (0 chunks); 280 used
ExecutorState: 8192 total in 1 blocks; 5256 free (6 chunks); 2936 used
ExprContext: 8192 total in 1 blocks; 8160 free (8 chunks); 32 used
Relcache by OID: 24576 total in 2 blocks; 14912 free (3 chunks); 9664
used
CacheMemoryContext: 1040384 total in 7 blocks; 457176 free (16 chunks);
583208 used
sm_mbs_geo_pool_distribution_id_state_upb: 1024 total in 1 blocks; 208
free (0 chunks); 816 used
pg_description_o_c_o_index: 1024 total in 1 blocks; 208 free (0
chunks); 816 used
pg_attrdef_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712
used
pg_shdepend_depender_index: 1024 total in 1 blocks; 208 free (0
chunks); 816 used
pg_shdepend_reference_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_depend_depender_index: 1024 total in 1 blocks; 208 free (0 chunks);
816 used
pg_depend_reference_index: 1024 total in 1 blocks; 208 free (0 chunks);
816 used
pg_index_indrelid_index: 1024 total in 1 blocks; 312 free (0 chunks);
712 used
pg_attrdef_adrelid_adnum_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_type_typname_nsp_index: 1024 total in 1 blocks; 216 free (0 chunks);
808 used
pg_type_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712
used
pg_trigger_tgrelid_tgname_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_statistic_relid_att_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_auth_members_member_role_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_auth_members_role_member_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_rewrite_rel_rulename_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_proc_proname_args_nsp_index: 1024 total in 1 blocks; 208 free (0
chunks); 816 used
pg_proc_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712
used
pg_operator_oprname_l_r_n_index: 1024 total in 1 blocks; 16 free (0
chunks); 1008 used
pg_operator_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712
used
pg_opclass_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712
used
pg_opclass_am_name_nsp_index: 1024 total in 1 blocks; 208 free (0
chunks); 816 used
pg_namespace_oid_index: 1024 total in 1 blocks; 312 free (0 chunks);
712 used
pg_namespace_nspname_index: 1024 total in 1 blocks; 312 free (0
chunks); 712 used
pg_language_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712
used
pg_language_name_index: 1024 total in 1 blocks; 312 free (0 chunks);
712 used
pg_inherits_relid_seqno_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_index_indexrelid_index: 1024 total in 1 blocks; 312 free (0 chunks);
712 used
pg_authid_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712
used
pg_authid_rolname_index: 1024 total in 1 blocks; 312 free (0 chunks);
712 used
pg_database_datname_index: 1024 total in 1 blocks; 312 free (0 chunks);
712 used
pg_conversion_oid_index: 1024 total in 1 blocks; 312 free (0 chunks);
712 used
pg_conversion_name_nsp_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_conversion_default_index: 1024 total in 1 blocks; 16 free (0
chunks); 1008 used
pg_class_relname_nsp_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_class_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712
used
pg_cast_source_target_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_attribute_relid_attnum_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_attribute_relid_attnam_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_amproc_opc_proc_index: 1024 total in 1 blocks; 208 free (0 chunks);
816 used
pg_amop_opr_opc_index: 1024 total in 1 blocks; 216 free (0 chunks); 808
used
pg_amop_opc_strat_index: 1024 total in 1 blocks; 208 free (0 chunks);
816 used
pg_aggregate_fnoid_index: 1024 total in 1 blocks; 312 free (0 chunks);
712 used
MdSmgr: 8192 total in 1 blocks; 7104 free (1 chunks); 1088 used
LockTable (locallock hash): 24576 total in 2 blocks; 15984 free (5
chunks); 8592 used
Timezones: 52560 total in 2 blocks; 3744 free (0 chunks); 48816 used
ErrorContext: 8192 total in 1 blocks; 8160 free (3 chunks); 32 used
TopMemoryContext: 58800 total in 7 blocks; 9120 free (27 chunks); 49680
used
Operator class cache: 8192 total in 1 blocks; 1680 free (0 chunks);
6512 used
SPI Plan: 3072 total in 2 blocks; 1256 free (0 chunks); 1816 used
SPI Plan: 7168 total in 3 blocks; 3448 free (0 chunks); 3720 used
SPI Plan: 7168 total in 3 blocks; 3448 free (0 chunks); 3720 used
SPI Plan: 3072 total in 2 blocks; 1712 free (0 chunks); 1360 used
SPI Plan: 3072 total in 2 blocks; 1712 free (0 chunks); 1360 used
SPI Plan: 3072 total in 2 blocks; 1568 free (0 chunks); 1504 used
SPI Plan: 3072 total in 2 blocks; 1712 free (0 chunks); 1360 used
SPI Plan: 3072 total in 2 blocks; 1600 free (0 chunks); 1472 used
SPI Plan: 3072 total in 2 blocks; 1712 free (0 chunks); 1360 used
SPI Plan: 3072 total in 2 blocks; 1504 free (0 chunks); 1568 used
SPI Plan: 3072 total in 2 blocks; 1712 free (0 chunks); 1360 used
SPI Plan: 3072 total in 2 blocks; 1504 free (0 chunks); 1568 used
SPI Plan: 3072 total in 2 blocks; 1712 free (0 chunks); 1360 used
SPI Plan: 3072 total in 2 blocks; 1504 free (0 chunks); 1568 used
SPI Plan: 3072 total in 2 blocks; 1712 free (0 chunks); 1360 used
SPI Plan: 3072 total in 2 blocks; 1568 free (0 chunks); 1504 used

SPI Plan: 3072 total in 2 blocks; 1712 free (0 chunks); 1360 used
SPI Plan: 3072 total in 2 blocks; 1600 free (0 chunks); 1472 used
SPI Plan: 3072 total in 2 blocks; 1256 free (0 chunks); 1816 used
SPI Plan: 3072 total in 2 blocks; 424 free (0 chunks); 2648 used
SPI Plan: 3072 total in 2 blocks; 1696 free (0 chunks); 1376 used
PL/PgSQL function context: 24576 total in 2 blocks; 8464 free (7
chunks); 16112 used
PLpgSQL function cache: 22064 total in 2 blocks; 3744 free (0 chunks);
18320 used
CFuncHash: 8192 total in 1 blocks; 1680 free (0 chunks); 6512 used
TopTransactionContext: 8192 total in 1 blocks; 6288 free (5 chunks);
1904 used
ExecutorState: 24576 total in 2 blocks; 4496 free (1 chunks); 20080
used
ExprContext: 8192 total in 1 blocks; 8160 free (0 chunks); 32 used
SPI Exec: 528474160 total in 69 blocks; 309634880 free (9674592
chunks); 218839280 used
ExecutorState: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
SPI Proc: 24576 total in 2 blocks; 15728 free (9 chunks); 8848 used
MessageContext: 8192 total in 1 blocks; 5040 free (3 chunks); 3152 used
smgr relation table: 24576 total in 2 blocks; 13904 free (4 chunks);
10672 used
Portal hash: 8192 total in 1 blocks; 1680 free (0 chunks); 6512 used
PortalMemory: 8192 total in 1 blocks; 7888 free (0 chunks); 304 used
PortalHeapMemory: 1024 total in 1 blocks; 744 free (0 chunks); 280 used
ExecutorState: 8192 total in 1 blocks; 5256 free (6 chunks); 2936 used
ExprContext: 8192 total in 1 blocks; 8160 free (8 chunks); 32 used
Relcache by OID: 24576 total in 2 blocks; 14912 free (3 chunks); 9664
used
CacheMemoryContext: 1040384 total in 7 blocks; 463544 free (49 chunks);
576840 used
pg_description_o_c_o_index: 1024 total in 1 blocks; 208 free (0
chunks); 816 used
pg_attrdef_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712
used
pg_shdepend_depender_index: 1024 total in 1 blocks; 208 free (0
chunks); 816 used
pg_shdepend_reference_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_depend_depender_index: 1024 total in 1 blocks; 208 free (0 chunks);
816 used
pg_depend_reference_index: 1024 total in 1 blocks; 208 free (0 chunks);
816 used
pg_index_indrelid_index: 1024 total in 1 blocks; 312 free (0 chunks);
712 used
pg_attrdef_adrelid_adnum_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_type_typname_nsp_index: 1024 total in 1 blocks; 216 free (0 chunks);
808 used
pg_type_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712
used
pg_trigger_tgrelid_tgname_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_statistic_relid_att_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_auth_members_member_role_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_auth_members_role_member_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_rewrite_rel_rulename_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_proc_proname_args_nsp_index: 1024 total in 1 blocks; 208 free (0
chunks); 816 used
pg_proc_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712
used
pg_operator_oprname_l_r_n_index: 1024 total in 1 blocks; 16 free (0
chunks); 1008 used
pg_operator_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712
used
pg_opclass_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712
used
pg_opclass_am_name_nsp_index: 1024 total in 1 blocks; 208 free (0
chunks); 816 used
pg_namespace_oid_index: 1024 total in 1 blocks; 312 free (0 chunks);
712 used
pg_namespace_nspname_index: 1024 total in 1 blocks; 312 free (0
chunks); 712 used
pg_language_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712
used
pg_language_name_index: 1024 total in 1 blocks; 312 free (0 chunks);
712 used
pg_inherits_relid_seqno_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_index_indexrelid_index: 1024 total in 1 blocks; 312 free (0 chunks);
712 used
pg_authid_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712
used
pg_authid_rolname_index: 1024 total in 1 blocks; 312 free (0 chunks);
712 used
pg_database_datname_index: 1024 total in 1 blocks; 312 free (0 chunks);
712 used
pg_conversion_oid_index: 1024 total in 1 blocks; 312 free (0 chunks);
712 used
pg_conversion_name_nsp_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_conversion_default_index: 1024 total in 1 blocks; 16 free (0
chunks); 1008 used
pg_class_relname_nsp_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_class_oid_index: 1024 total in 1 blocks; 312 free (0 chunks); 712
used
pg_cast_source_target_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_attribute_relid_attnum_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_attribute_relid_attnam_index: 1024 total in 1 blocks; 216 free (0
chunks); 808 used
pg_amproc_opc_proc_index: 1024 total in 1 blocks; 208 free (0 chunks);
816 used
pg_amop_opr_opc_index: 1024 total in 1 blocks; 216 free (0 chunks); 808
used
pg_amop_opc_strat_index: 1024 total in 1 blocks; 208 free (0 chunks);
816 used
pg_aggregate_fnoid_index: 1024 total in 1 blocks; 312 free (0 chunks);
712 used
MdSmgr: 8192 total in 1 blocks; 7200 free (4 chunks); 992 used
LockTable (locallock hash): 24576 total in 2 blocks; 15984 free (5
chunks); 8592 used
Timezones: 52560 total in 2 blocks; 3744 free (0 chunks); 48816 used
ErrorContext: 8192 total in 1 blocks; 8160 free (0 chunks); 32 used
<2006-12-13 14:41:46 EST>ERROR: out of memory
<2006-12-13 14:41:46 EST>DETAIL: Failed on request of size 20.

<2006-12-13 14:41:46 EST>CONTEXT: SQL statement "SELECT $1 [ $2 ]"
PL/pgSQL function "cluster_load_tables" line 31 at raise


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: mwrynn(at)gmail(dot)com
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: out of memory woes
Date: 2006-12-17 01:30:33
Message-ID: 9462.1166319033@sss.pgh.pa.us
Lists: pgsql-general

mwrynn(at)gmail(dot)com writes:
> Tom, below is the information you requested.

Well, the table definitions look ordinary enough, but this is odd:

> SPI Exec: 528474160 total in 69 blocks; 309634880 free (9674592
> chunks); 218839280 used

Something's leaking a lot of memory within a SPI call, which means
either a plpgsql-invoked query or a SPI_execute-or-sibling call
from a C-language function. I can see from the memory dump that
you're using some of both, but no details beyond that. Also,
it seems the actual failure is not coming from CLUSTER at all,
but during an attempt to report a message via RAISE:

> <2006-12-13 14:41:46 EST>ERROR: out of memory
> <2006-12-13 14:41:46 EST>DETAIL: Failed on request of size 20.
> <2006-12-13 14:41:46 EST>CONTEXT: SQL statement "SELECT $1 [ $2 ]"
> PL/pgSQL function "cluster_load_tables" line 31 at raise

So I think you're gonna have to show us the functions you're using ...

regards, tom lane


From: "Angva" <angvaw(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: out of memory woes
Date: 2006-12-17 05:48:03
Message-ID: 1166334483.155591.13470@80g2000cwy.googlegroups.com
Lists: pgsql-general

Tom,

Here is the sole plpgsql function that was called when the error
occurred. It is intended to be called from a shell script in order to
cluster tables in parallel processes. One calls it with from_perc and
to_perc - the percentage range of statements to run (e.g. 0% to 14%).
(This concept may seem a bit silly with only 7 statements, but it is a
convention I use for other functions too, such as creating indexes -
there are many indexes.) My shell script calls the function so that
each of 7 parallel processes runs exactly one cluster statement at a
time (a sketch of those invocations follows the function below).

Interesting that the leak is actually in the raise. Could this possibly
be related to the exception handling memory leak I read about? When
searching this newsgroup I found a post of yours about this leak, but
decided it probably is not the issue - I believe I read that the memory
leak is local to a transaction.

Thanks,
Mark

create or replace function cluster_load_tables(from_perc integer, to_perc integer)
returns void as
$$
declare
    cmdArr  text[7];
    max_val integer;
    enabled boolean;
begin
    raise notice 'cluster_load_tables called %', timeofday();

    select cluster_load_tables into enabled
      from secmaster_stage.data_load_config;

    if enabled = false then
        raise notice 'cluster_load_tables disabled - exiting out %', timeofday();
        return;
    end if;

    cmdArr[0] := 'CLUSTER sm_issue';
    cmdArr[1] := 'CLUSTER sm_mbs_pool_detail';
    cmdArr[2] := 'CLUSTER sm_mbs_quartile_distribution';
    cmdArr[3] := 'CLUSTER sm_mbs_loan_distribution';
    cmdArr[4] := 'CLUSTER sm_mbs_geo_pool_distribution';
    cmdArr[5] := 'CLUSTER sm_issue_id';
    cmdArr[6] := 'CLUSTER sm_pool_prefix';

    max_val := 6;

    for i in ceiling(from_perc*(max_val/100::numeric(20,1))) ..
             floor(to_perc*(max_val/100::numeric(20,1))) loop
    --for i in 0 .. 6 loop
        begin
            execute cmdArr[i];
        exception
            when others then
                raise notice 'failed to execute %; error is: %', cmdArr[i], sqlerrm;
        end;
    end loop;

    /*
    [snip - old commented-out code]
    */

    raise notice 'cluster_load_tables done %', timeofday();
end;
$$
language plpgsql;
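
To make the percentage convention concrete, here is a sketch of the seven
invocations (illustrative ranges, not the actual script): with max_val = 6,
each range below maps to exactly one statement index via
ceiling(from_perc*0.06) .. floor(to_perc*0.06).

# one background psql session per slice; each slice covers exactly one index
for range in "0 14" "15 28" "29 42" "43 57" "58 71" "72 85" "86 100"; do
    psql -d dev_stage -c "SELECT cluster_load_tables(${range// /, });" &
done
wait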


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Angva" <angvaw(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: out of memory woes
Date: 2006-12-17 20:09:25
Message-ID: 17733.1166386165@sss.pgh.pa.us
Lists: pgsql-general

"Angva" <angvaw(at)gmail(dot)com> writes:
> Here is the sole plpgsql function that was called when the error
> occurred. This function is intended to be called from a shell script in
> order to cluster tables in parallel processes.

OK, I played around with this for a bit, and what I find is that in 8.1,
that SPIExec context is where the sort operation run by CLUSTER's
reindexing step allocates memory. So the short answer is "you've got
maintenance_work_mem set too high". I don't know why it sometimes fails
and sometimes not --- maybe you are reaching the swap-space limits of
the machine when you do several of these in parallel? But there doesn't
seem to be any actual leak here.

BTW, it's also the case that the 8.1 sort code isn't great about
measuring its space usage. I had maintenance_work_mem set to 100MB
and saw actual process size exceeding 200MB ... IIRC, that's been
improved in 8.2.

regards, tom lane


From: "Angva" <angvaw(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: out of memory woes
Date: 2006-12-18 15:16:56
Message-ID: 1166455016.067279.221080@48g2000cwx.googlegroups.com
Lists: pgsql-general

Tom Lane wrote:
> OK, I played around with this for a bit, and what I find is that in 8.1,
> that SPIExec context is where the sort operation run by CLUSTER's
> reindexing step allocates memory.
Interesting. I wonder if dropping indexes could alleviate this problem.
Please see another recent post of mine for context -
http://groups.google.com/group/pgsql.general/browse_thread/thread/26333d69d7a311ed/39fcfa7ca91771ba?lnk=gst&q=index+angva&rnum=2
The indexes (except for the cluster indexes obviously) should be
dropped and recreated later anyway.

> I don't know why it sometimes fails and sometimes not ---
> maybe you are reaching the swap-space limits of
> the machine when you do several of these in parallel?
The funny thing is that once it does fail, it fails consistently until
the server is bounced - I must have run the cluster script 10 times
after the initial failure. The server's 6g of RAM is normally more than
enough (so normally, maintenance_work_mem is not too high) until this
strange wall is hit. Another funny thing is that according to top, no
more than about 700k of swap is ever used - even when the problem
occurs - and the limit is 2g.

Thanks a lot,
Mark


From: "hubert depesz lubaczewski" <depesz(at)gmail(dot)com>
To: Angva <angvaw(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: out of memory woes
Date: 2006-12-19 07:48:18
Message-ID: 9e4684ce0612182348p1ee1ad12p1414c877b77dd9b3@mail.gmail.com
Lists: pgsql-general

On 18 Dec 2006 07:16:56 -0800, Angva <angvaw(at)gmail(dot)com> wrote:
>
> The funny thing is that once it does fail, it fails consistently until
> the server is bounced - I must have run the cluster script 10 times
> after the initial failure. The server's 6g of RAM is normally more than
> enough (so normally, maintenance_work_mem is not too high) until this
> strange wall is hit. Another funny thing is that according to top, no
> more than about 700k of swap is ever used - even when the problem
> occurs - and the limit is 2g.
>

Could you please show the configuration settings (shared_buffers, work_mem, and
maintenance_work_mem)? Also, what OS are you running, and on what
architecture - 32-bit? 64-bit? Xeon?

best regards,

depesz

--
http://www.depesz.com/ - nowy, lepszy depesz


From: "Angva" <angvaw(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: out of memory woes
Date: 2006-12-19 15:01:41
Message-ID: 1166540501.084364.285760@t46g2000cwa.googlegroups.com
Lists: pgsql-general

"hubert depesz lubaczewski" wrote:
> Could you please show the configuration settings (shared_buffers, work_mem, and
> maintenance_work_mem)? Also, what OS are you running, and on what
> architecture - 32-bit? 64-bit? Xeon?
Thank you for your response, Hubert. Here is the info:

shared_buffers = 57344
work_mem = 200000
maintenance_work_mem = 524288

CentOS 4.4 64-bit (x86_64)
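
For reference, 8.1 settings are unitless - shared_buffers is in 8 kB pages and
the two *work_mem settings are in kB - so these work out to roughly:

shared_buffers        57344 x 8 kB   = 448 MB
work_mem             200000 kB      ~= 195 MB per sort/hash operation
maintenance_work_mem 524288 kB       = 512 MB per maintenance operation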

Thanks,
Mark


From: "hubert depesz lubaczewski" <depesz(at)gmail(dot)com>
To: Angva <angvaw(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: out of memory woes
Date: 2006-12-19 17:54:14
Message-ID: 9e4684ce0612190954v7cdfffa6y7c83be58bb941508@mail.gmail.com
Lists: pgsql-general

On 19 Dec 2006 07:01:41 -0800, Angva <angvaw(at)gmail(dot)com> wrote:
>
> shared_buffers = 57344
> work_mem = 200000
> maintenance_work_mem = 524288
>

work_mem seems to be high. What is your max_connections setting?

depesz

--
http://www.depesz.com/ - nowy, lepszy depesz


From: "Angva" <angvaw(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: out of memory woes
Date: 2006-12-19 18:52:50
Message-ID: 1166554370.782416.294100@f1g2000cwa.googlegroups.com
Lists: pgsql-general

"hubert depesz lubaczewski" wrote:
> On 19 Dec 2006 07:01:41 -0800, Angva <angvaw(at)gmail(dot)com> wrote:
> >
> > shared_buffers = 57344
> > work_mem = 200000
> > maintenance_work_mem = 524288
> >
>
> work_mem seems to be high. What is your max_connections setting?

max_connections = 100

However we never come even close to 100 connections at once on this
database. Rarely, if ever, do we even reach 10. But I am unsure why
work_mem is set to 200000. I personally tuned shared_buffers and
maintenance_work_mem for performance, but I just don't know why
work_mem is what it is. So, I'll gladly accept any suggestions about
this setting. :)

Thanks a lot,
Mark


From: "Angva" <angvaw(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: out of memory woes
Date: 2006-12-19 20:20:51
Message-ID: 1166559651.145157.12380@80g2000cwy.googlegroups.com
Lists: pgsql-general

Well, the problem is occurring again, but this time it is intermittent.
I think the crux of the issue is that Linux isn't giving swap to
Postgres - and possibly other - processes. Why this is the case I do
not know and will research. I may shrink work_mem or add more RAM, but
I'd rather use swap than completely blow up for those times when the
cup runneth over just a bit.

I'll post the resolution to the issue if I find it...

Mark


From: "Angva" <angvaw(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: out of memory woes
Date: 2006-12-19 21:28:16
Message-ID: 1166563696.532099.250200@80g2000cwy.googlegroups.com
Lists: pgsql-general

We found that the kernel setting SHMALL was set ridiculously high -
1024 GB! Someone noticed this when running "ipcs -lm" - it seemed just
a tad off. :)

------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 524288
max total shared memory (kbytes) = 1073741824
min seg size (bytes) = 1

Now it is set to ceil(SHMMAX/PAGE_SIZE) as recommended in the Postgres
documentation. Testing the heck out of the parallel clusters now...
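
A sketch of the commands involved (standard Linux paths; run as root):

ipcs -lm                                  # show the current shared memory limits
cat /proc/sys/kernel/shmmax               # SHMMAX in bytes
getconf PAGE_SIZE                         # usually 4096
# set SHMALL (measured in pages) to ceil(SHMMAX / PAGE_SIZE):
echo $(( ($(cat /proc/sys/kernel/shmmax) + $(getconf PAGE_SIZE) - 1) / $(getconf PAGE_SIZE) )) \
     > /proc/sys/kernel/shmall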

Mark


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Angva <angvaw(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: out of memory woes
Date: 2006-12-20 14:33:29
Message-ID: 20061220143329.GF30769@alvh.no-ip.org
Lists: pgsql-general

Angva wrote:
> We found that the kernel setting SHMALL was set ridiculously high -
> 1024 GB! Someone noticed this when running "ipcs -lm" - it seemed just
> a tad off. :)

That's not a problem because it's just a limit. It won't cause out of
memory or anything.

The problem with work_mem is that the system may request that much
memory for every Sort step. Each query may have more than one of those,
and each backend can be serving more than one query at a time. So those
200000 you have quickly turn into the gigabytes and consume the whole of
your RAM.

If a Sort step requires more than work_mem for its work, it will go to
disk -- but in a much faster way than what the kernel is able to do with
swap. So whenever you are trading work_mem for swap, you are losing big
time. You should decrease that figure.
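
To put rough numbers on that, using the work_mem quoted earlier in this thread:
200000 kB is about 195 MB per sort step, so even 10 concurrent backends running
two or three sorts each could ask for roughly 4-6 GB.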

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: "Angva" <angvaw(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: out of memory woes
Date: 2006-12-20 15:25:15
Message-ID: 1166628315.373662.284680@48g2000cwx.googlegroups.com
Lists: pgsql-general

Alvaro Herrera wrote:
> That's not a problem because it's just a limit. It won't cause out of
> memory or anything.

Ah, I see. Well, it's nice to have caught that anyway, I suppose.

> The problem with work_mem is that the system may request that much
> memory for every Sort step. Each query may have more than one of those,
> and each backend can be serving more than one query at a time. So those
> 200000 you have quickly turn into the gigabytes and consume the whole of
> your RAM.
>
> If a Sort step requires more than work_mem for its work, it will go to
> disk -- but in a much faster way than what the kernel is able to do with
> swap. So whenever you are trading work_mem for swap, you are losing big
> time. You should decrease that figure.

Thanks for that bit of advice, Alvaro. I did not know that work_mem is
allocated for each sort step. I will try lowering work_mem gradually
until it impacts performance.

Though even if this fixes my problem, I'm still bothered by swap never
being used. (Not that I expect you to come up with a magic answer for
me.)

Thank you,
Mark


From: "Angva" <angvaw(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: out of memory woes
Date: 2006-12-21 14:48:13
Message-ID: 1166712493.295725.174750@n67g2000cwd.googlegroups.com
Lists: pgsql-general

Well, I adjusted work_mem, ran pg_ctl reload, and verified that the
setting change took effect by running "show work_mem", but I am seeing
zero difference: no change in clustering performance, and the out of
memory errors still occur. First I halved work_mem, reducing it to
100000. I tested that, noticed no difference, then reduced work_mem all
the way down to 1024. Tested again - same results.
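
The verification step was roughly this (the data directory path here is
illustrative):

pg_ctl reload -D /var/lib/pgsql/data     # re-read postgresql.conf without a restart
psql -d dev_stage -c "SHOW work_mem;"    # new sessions should report the new value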

However halving maintenance_work_mem did make a difference. Normally,
when I'm watching top before the big bang (oom), I see one process
running at RES 1.6g...Then the other increases and increases till it
hits about 1g, then kaboom. After halving maintenance_work_mem (262144
now), I see one process running at RES 1.1g instead of the usual 1.6g.
The other process increases to about 900m, and lo and behold, nothing
blows up! I have tested several times now.

But now performance is taking a hit. My boss is eager to slap more RAM
into the server, and I can't blame him. So we will likely add a few
gigs of RAM, revert maintenance_work_mem, and tune work_mem a bit just
to get it at least a reasonable, non-arbitrary value.
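
A possible refinement (not something discussed in the thread) would be to set
maintenance_work_mem per session for just the parallel cluster jobs, leaving
the server-wide setting alone, e.g.:

psql -d dev_stage -c "SET maintenance_work_mem = 262144; SELECT cluster_load_tables(0, 14);" &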

Guess I'm about ready to wrap up this thread, but I was just wondering
if Alvaro might have confused work_mem with maintenance_work_mem. The
docs say that work_mem is used for internal sort operations, but they
also say maintenance_work_mem is used for create index. My tests seem
to show that clustering does not use work_mem at all (or my clusters
don't use very much of it for whatever reason). Can anyone confirm
this?

Thank you all for your help,
Mark


From: "Angva" <angvaw(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: out of memory woes
Date: 2006-12-27 15:15:48
Message-ID: 1167232548.609898.15400@42g2000cwt.googlegroups.com
Lists: pgsql-general

Just wanted to post an update. Not going too well. Each time the
scripts were run over this holiday weekend, more statements failed with
out of memory errors, including more and more create index statements
(it had only been clusters previously). Eventually, psql could not even
connect to run a very simple one-row update:

psql: could not fork new process for connection: Cannot allocate memory

The logs are nothing but this message repeated at this point:

<2006-12-26 00:00:01 EST>LOG: could not fork new process for
connection: Cannot allocate memory
<2006-12-26 00:00:01 EST>LOG: could not fork new process for
connection: Cannot allocate memory
<2006-12-26 00:00:01 EST>LOG: could not fork new process for
connection: Cannot allocate memory
<2006-12-26 00:00:01 EST>LOG: could not fork new process for
connection: Cannot allocate memory
<2006-12-26 02:00:01 EST>LOG: could not fork new process for
connection: Cannot allocate memory
<2006-12-26 02:00:01 EST>LOG: could not fork new process for
connection: Cannot allocate memory
<2006-12-26 02:00:01 EST>LOG: could not fork new process for
connection: Cannot allocate memory

I think I'm about ready to bounce the server every night unfortunately.

Thanks for listening,
Mark


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Angva <angvaw(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: out of memory woes
Date: 2006-12-29 10:23:57
Message-ID: 20061229102357.GA10785@svana.org
Lists: pgsql-general

On Wed, Dec 27, 2006 at 07:15:48AM -0800, Angva wrote:
> Just wanted to post an update. Not going too well. Each time the
> scripts were run over this holiday weekend, more statements failed with
> out of memory errors, including more and more create index statements
> (it had only been clusters previously). Eventually, psql could not even
> connect to run a very simple one-row update:
>
> psql: could not fork new process for connection: Cannot allocate memory

You have memory overcommit turned off, thus it can't fork a new
process. You must find out what is taking up the virtual memory.

Note that in Linux, overcommit off limits the total virtual memory to
swap space + half of real RAM. How much RAM have you got and how much
active swap? If you have a server with lots of RAM, you need at least
half that RAM as swap to be able to use it all...
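
A quick way to see the limit being described, assuming a standard /proc layout:

cat /proc/sys/vm/overcommit_ratio                   # percent of RAM counted toward the limit (default 50)
grep -E 'CommitLimit|Committed_AS' /proc/meminfo    # the limit vs. what is currently committed
free -m                                             # total RAM and swap for comparison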

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Angva <angvaw(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: out of memory woes
Date: 2006-12-29 23:22:27
Message-ID: 20061229232227.GA32000@alvh.no-ip.org
Lists: pgsql-general

Angva wrote:

> Guess I'm about ready to wrap up this thread, but I was just wondering
> if Alvaro might have confused work_mem with maintenance_work_mem. The
> docs say that work_mem is used for internal sort operations, but they
> also say maintenance_work_mem is used for create index. My tests seem
> to show that clustering does not use work_mem at all (or my clusters
> don't use very much of it for whatever reason). Can anyone confirm
> this?

Huh, sure, "system" operations like cluster, create index or vacuum use
maintenance_work_mem, not work_mem (the latter being used mainly by
regular queries, i.e. select, update, delete). I just hadn't read the
whole background of your problem so I didn't notice you were talking
about cluster. Sorry about being sloppy on that.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.