Improve BULK insertion

From: Grupos <grupos(at)carvalhaes(dot)net>
To: pgsql-performance(at)postgresql(dot)org
Subject: Improve BULK insertion
Date: 2004-12-04 13:39:39
Message-ID: 41B1BE1B.10301@carvalhaes.net
Lists: pgsql-performance

Hi !

I need to insert 500,000 records into a table frequently. It's a bulk
insertion from my application.
I am getting very poor performance. PostgreSQL inserts very fast until
around tuple 200,000, and after that the insertions become really slow.
Looking at the log, I see a lot of transaction log activity, something
like:

2004-12-04 11:08:59 LOG: recycled transaction log file "0000000600000012"
2004-12-04 11:08:59 LOG: recycled transaction log file "0000000600000013"
2004-12-04 11:08:59 LOG: recycled transaction log file "0000000600000011"
2004-12-04 11:14:04 LOG: recycled transaction log file "0000000600000015"
2004-12-04 11:14:04 LOG: recycled transaction log file "0000000600000014"
2004-12-04 11:19:08 LOG: recycled transaction log file "0000000600000016"
2004-12-04 11:19:08 LOG: recycled transaction log file "0000000600000017"
2004-12-04 11:24:10 LOG: recycled transaction log file "0000000600000018"

How can I configure PostgreSQL to get better performance on these bulk
insertions? I have already increased the memory settings.

My data:
Conectiva Linux, kernel 2.6.9
PostgreSQL 7.4.6 - 1.5 GB memory
max_connections = 30
shared_buffers = 30000
sort_mem = 32768
vacuum_mem = 32768
max_fsm_pages = 30000
max_fsm_relations = 1500

The other configurations are default.

Cheers,

Rodrigo Carvalhaes


From: Christopher Browne <cbbrowne(at)acm(dot)org>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Improve BULK insertion
Date: 2004-12-04 14:48:15
Message-ID: m38y8ep1ao.fsf@knuth.knuth.cbbrowne.com
Lists: pgsql-performance

In the last exciting episode, grupos(at)carvalhaes(dot)net (Grupos) wrote:
> Hi !
>
> I need to insert 500,000 records into a table frequently. It's a bulk
> insertion from my application.
> I am getting very poor performance. PostgreSQL inserts very fast until
> around tuple 200,000, and after that the insertions become really slow.
> Looking at the log, I see a lot of transaction log activity, something
> like:
>
> 2004-12-04 11:08:59 LOG: recycled transaction log file "0000000600000012"
> 2004-12-04 11:08:59 LOG: recycled transaction log file "0000000600000013"
> 2004-12-04 11:08:59 LOG: recycled transaction log file "0000000600000011"
> 2004-12-04 11:14:04 LOG: recycled transaction log file "0000000600000015"
> 2004-12-04 11:14:04 LOG: recycled transaction log file "0000000600000014"
> 2004-12-04 11:19:08 LOG: recycled transaction log file "0000000600000016"
> 2004-12-04 11:19:08 LOG: recycled transaction log file "0000000600000017"
> 2004-12-04 11:24:10 LOG: recycled transaction log file "0000000600000018"

It is entirely normal for there to be a lot of transaction log file
recycling when bulk inserts are taking place; that goes through a lot
of transaction logs.

> How can I configure PostgreSQL to get better performance on these
> bulk insertions? I have already increased the memory settings.

Memory is, as likely as not, NOT the issue.

Two questions:

1. How are you doing the inserts? Via INSERT statements? Or
via COPY statements? What sort of transaction grouping
is involved?

COPY is way faster than INSERT, and grouping plenty of updates
into a single transaction is generally a "win" (see the sketch
after these two questions).

2. What is the schema like? Does the table have a foreign key
constraint? Does it have a bunch of indices?

If there should eventually be lots of indices, it tends to be
faster to create the table with none/minimal indices, and add
indexes afterwards, as long as your "load" process can be trusted
to not break "unique" constraints...

If there is some secondary table with a foreign key constraint,
and _that_ table is growing, it is possible that a sequential
scan is being used to search the secondary table where, if you
did an ANALYZE on that table, an index scan would be preferred
once it grew to larger size...
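
For point 1, a minimal sketch of the difference, using a hypothetical
table items(id, descr) in place of the real one:

  -- Slow: one INSERT per row, each in its own implicit transaction,
  -- so every row pays the cost of its own commit.
  INSERT INTO items (id, descr) VALUES (1, 'first row');
  INSERT INTO items (id, descr) VALUES (2, 'second row');
  -- ... hundreds of thousands more ...

  -- Better: group many rows into one explicit transaction.
  BEGIN;
  INSERT INTO items (id, descr) VALUES (1, 'first row');
  INSERT INTO items (id, descr) VALUES (2, 'second row');
  -- ...
  COMMIT;

  -- Fastest: a single COPY, reading a data file on the server ...
  COPY items (id, descr) FROM '/path/to/items.dat';
  -- ... or streamed from the client via psql:
  -- \copy items (id, descr) FROM 'items.dat'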

There isn't a particular reason for PostgreSQL to "hit a wall" upon
seeing 200K records; I and coworkers routinely load database dumps
that have millions of (sometimes pretty fat) records, and they don't
"choke." That's true whether talking about loading things onto my
(somewhat wimpy) desktop PC, or an SMP Xeon system with a small RAID
array, or higher end stuff involving high end SMP and EMC disk arrays.
The latter obviously being orders of magnitude faster than desktop
equipment :-).
--
(format nil "~S(at)~S" "cbbrowne" "acm.org")
http://www3.sympatico.ca/cbbrowne/unix.html
Rules of the Evil Overlord #207. "Employees will have conjugal visit
trailers which they may use provided they call in a replacement and
sign out on the timesheet. Given this, anyone caught making out in a
closet while leaving their station unmonitored will be shot."
<http://www.eviloverlord.com/>


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Cc: Grupos <grupos(at)carvalhaes(dot)net>
Subject: Re: Improve BULK insertion
Date: 2004-12-04 19:14:18
Message-ID: 200412041114.18460.josh@agliodbs.com
Lists: pgsql-performance

Rodrigo,

> I need to insert 500,000 records into a table frequently. It's a bulk
> insertion from my application.
> I am getting very poor performance. PostgreSQL inserts very fast until
> around tuple 200,000, and after that the insertions become really slow.
> Looking at the log, I see a lot of transaction log activity, something

In addition to what Chris Browne asked:
What's your transaction log setup? Are your database transaction logs on a
separate disk resource? What are checkpoint_segments and checkpoint_timeout
set to?
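
For reference, the current values can be checked from psql with:

  SHOW checkpoint_segments;
  SHOW checkpoint_timeout;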

--
Josh Berkus
Aglio Database Solutions
San Francisco


From: Rodrigo Carvalhaes <grupos(at)carvalhaes(dot)net>
To: Christopher Browne <cbbrowne(at)acm(dot)org>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Improve BULK insertion
Date: 2004-12-05 19:52:03
Message-ID: 41B366E3.7040109@carvalhaes.net
Lists: pgsql-performance

Hi!

1. I am doing the inserts using pg_restore. The dump was created with
pg_dump in the standard format (COPY statements).
2. See the table schema below. There are only 7 indexes.
3. My transaction log configuration is checkpoint_segments = 3 and
checkpoint_timeout = 300, and my transaction logs are on the same disk.

I know that I can improve performance by separating the transaction
logs onto another disk and building a RAID 5 array, BUT I am really
curious about WHY this performance is so poor and HOW I can improve it
on this machine, because these inserts are currently taking around 90
minutes!!!

Cheers!

Rodrigo

dadosadv=# \d si2010
Table "public.si2010"
Column | Type | Modifiers
------------+------------------+---------------------------------------------------------------------
i2_filial | character(2) | not null default ' '::bpchar
i2_num | character(10) | not null default ' '::bpchar
i2_linha | character(2) | not null default ' '::bpchar
i2_data | character(8) | not null default ' '::bpchar
i2_dc | character(1) | not null default ' '::bpchar
i2_debito | character(20) | not null default '                    '::bpchar
i2_dcd | character(1) | not null default ' '::bpchar
i2_credito | character(20) | not null default '                    '::bpchar
i2_dcc | character(1) | not null default ' '::bpchar
i2_moedas | character(5) | not null default ' '::bpchar
i2_valor | double precision | not null default 0.0
i2_hp | character(3) | not null default ' '::bpchar
i2_hist | character(40) | not null default '                                        '::bpchar
i2_ccd | character(9) | not null default ' '::bpchar
i2_ccc | character(9) | not null default ' '::bpchar
i2_ativdeb | character(6) | not null default ' '::bpchar
i2_ativcrd | character(6) | not null default ' '::bpchar
i2_vlmoed2 | double precision | not null default 0.0
i2_vlmoed3 | double precision | not null default 0.0
i2_vlmoed4 | double precision | not null default 0.0
i2_vlmoed5 | double precision | not null default 0.0
i2_dtvenc | character(8) | not null default ' '::bpchar
i2_criter | character(4) | not null default ' '::bpchar
i2_rotina | character(8) | not null default ' '::bpchar
i2_periodo | character(6) | not null default ' '::bpchar
i2_listado | character(1) | not null default ' '::bpchar
i2_origem | character(40) | not null default '                                        '::bpchar
i2_permat | character(4) | not null default ' '::bpchar
i2_filorig | character(2) | not null default ' '::bpchar
i2_intercp | character(1) | not null default ' '::bpchar
i2_identcp | character(12) | not null default ' '::bpchar
i2_lote | character(4) | not null default ' '::bpchar
i2_doc | character(6) | not null default ' '::bpchar
i2_emporig | character(2) | not null default ' '::bpchar
i2_lp | character(3) | not null default ' '::bpchar
i2_itemd | character(9) | not null default ' '::bpchar
i2_itemc | character(9) | not null default ' '::bpchar
i2_prelan | character(1) | not null default ' '::bpchar
i2_tipo | character(2) | not null default ' '::bpchar
d_e_l_e_t_ | character(1) | not null default ' '::bpchar
r_e_c_n_o_ | double precision | not null default 0.0
Indexes:
    "si2010_pkey" primary key, btree (r_e_c_n_o_)
    "si20101" btree (i2_filial, i2_num, i2_linha, i2_periodo, r_e_c_n_o_, d_e_l_e_t_)
    "si20102" btree (i2_filial, i2_periodo, i2_num, i2_linha, r_e_c_n_o_, d_e_l_e_t_)
    "si20103" btree (i2_filial, i2_data, i2_num, i2_linha, r_e_c_n_o_, d_e_l_e_t_)
    "si20104" btree (i2_filial, i2_debito, i2_data, i2_num, i2_linha, r_e_c_n_o_, d_e_l_e_t_)
    "si20105" btree (i2_filial, i2_credito, i2_data, i2_num, i2_linha, r_e_c_n_o_, d_e_l_e_t_)
    "si20106" btree (i2_filial, i2_doc, i2_periodo, r_e_c_n_o_, d_e_l_e_t_)
    "si20107" btree (i2_filial, i2_origem, r_e_c_n_o_, d_e_l_e_t_)



From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-performance(at)postgresql(dot)org, grupos(at)carvalhaes(dot)net
Subject: Re: Improve BULK insertion
Date: 2004-12-05 23:19:45
Message-ID: 200412051519.45328.josh@agliodbs.com
Lists: pgsql-performance

Rodrigo,

> 3. My transaction log configuration is checkpoint_segments = 3 and
> checkpoint_timeout = 300, and my transaction logs are on the same disk.

Well, you need to move your transaction logs to another disk, and increase
checkpoint_segments to a large number ... like 128, which at 16 MB per
segment is about 2 GB (you'll need that much disk space). Also, increase
checkpoint_timeout to minimize checkpointing during the load; something
like 1500 (seconds).
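
As a sketch, the corresponding postgresql.conf changes (using the values
suggested above; 3 and 300 are the 7.4 defaults) would be:

  checkpoint_segments = 128    # WAL segments are 16 MB each
  checkpoint_timeout = 1500    # seconds

followed by a reload/restart of the server so they take effect.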

> I know that I can improve performance by separating the transaction
> logs onto another disk and building a RAID 5 array

Actually, RAID5, unless you're using > 5 disks, would make things slower.
Speeding writes up through RAID would require at least 6 drives, and probably
RAID 1+0.

> BUT I am really curious about WHY this
> performance is so poor and HOW I can improve it on this machine,
> because these inserts are currently taking around 90 minutes!!!

Are you doing INSERTS and not COPY? If so, are you batching them in
transactions?

--
Josh Berkus
Aglio Database Solutions
San Francisco