concurrent COPY performance

From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: concurrent COPY performance
Date: 2009-06-16 16:47:50
Message-ID: 4A37CCB6.4000702@kaltenbrunner.cc
Lists: pgsql-hackers

Hi!

I have been doing some bulk loading testing recently - mostly with a
focus on answering why we are "only" getting a speedup of at most
cores/2 (and even less beyond around 8 cores) using parallel restore.
What I found is that on a fast IO subsystem we are CPU bottlenecked on
concurrent COPY when it can make use of the WAL bypass (scaling up to
around cores/2), while performance without the WAL bypass is very bad.
In the WAL-logged case we only get a 50% speedup from adding the second
process, we never scale better than 3x (up to 8 cores), and performance
degrades beyond that point.

The profile (loading the lineitem table from the DBT3 benchmark) for
that case looks fairly similar to what I have seen in the past for
IO-intensive concurrent workloads:

Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a
unit mask of 0x00 (Unhalted core cycles) count 100000
samples % symbol name
47546 15.4562 XLogInsert
39598 12.8725 DoCopy
33798 10.9870 CopyReadLine
14191 4.6132 DecodeNumber
12986 4.2215 heap_fill_tuple
12092 3.9308 pg_verify_mbstr_len
9553 3.1055 DecodeDate
9289 3.0197 InputFunctionCall
7972 2.5915 ParseDateTime
7324 2.3809 DecodeDateTime
7290 2.3698 pg_next_dst_boundary
7218 2.3464 heap_form_tuple
5385 1.7505 AllocSetAlloc
4779 1.5536 heap_compute_data_size
4367 1.4196 float4in
3903 1.2688 DetermineTimeZoneOffset
3603 1.1713 pg_mblen
3494 1.1358 pg_atoi
3461 1.1251 .plt
3428 1.1144 date2j
3416 1.1105 pg_mbstrlen_with_len

This is for 8 connections on an 8-core/16-thread box. At higher
connection counts the server is >70% idle and shows even worse throughput.
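For reference, the profiles here were captured with oprofile; a rough
sketch of the setup, assuming the classic opcontrol interface (the
postgres binary path is just an example):

  opcontrol --init
  opcontrol --no-vmlinux                      # no kernel symbols needed
  # sample unhalted core cycles, one sample per 100000 events
  opcontrol --setup --event=CPU_CLK_UNHALTED:100000
  opcontrol --start
  # ... run the concurrent COPY workload ...
  opcontrol --dump
  opreport -l /usr/local/pgsql/bin/postgres   # per-symbol breakdown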

In the WAL bypass case I can actually get a performance improvement up
to 16 parallel connections, but maximum throughput still tops out at
around cores/2. We are at least able to max out the CPU on the server
in that case (no idle time, and iowait only in the single-digit range).
Profiling that workload (loading around 1,400,000 rows/s) I get:

samples % symbol name
2462358 15.3353 DoCopy
2319677 14.4467 CopyReadLine
806269 5.0214 pg_verify_mbstr_len
782594 4.8739 DecodeNumber
712426 4.4369 heap_fill_tuple
642893 4.0039 DecodeDate
609313 3.7947 InputFunctionCall
476887 2.9700 ParseDateTime
456569 2.8435 pg_next_dst_boundary
435812 2.7142 DecodeDateTime
429733 2.6763 heap_form_tuple
361750 2.2529 heap_compute_data_size
301193 1.8758 AllocSetAlloc
268830 1.6742 float4in
238119 1.4830 DetermineTimeZoneOffset
229598 1.4299 pg_atoi
225169 1.4023 .plt
217817 1.3565 pg_mbstrlen_with_len
207041 1.2894 PageAddItem
200024 1.2457 pg_mblen
192237 1.1972 bpchar_input
181552 1.1307 date2j

For those interested, I have some additional information up at:

http://www.kaltenbrunner.cc/blog/index.php?/archives/27-Benchmarking-8.4-Chapter-2bulk-loading.html

Stefan


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: concurrent COPY performance
Date: 2009-06-16 21:33:12
Message-ID: b42b73150906161433n5ce5d4f3r5bc9253bf8c545e2@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jun 16, 2009 at 12:47 PM, Stefan
Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> wrote:
> Hi!
>
> I have been doing some bulk loading testing recently - mostly with a
> focus on answering why we are "only" getting a speedup of at most
> cores/2 (and even less beyond around 8 cores) using parallel restore.
> What I found is that on a fast IO subsystem we are CPU bottlenecked on
> concurrent COPY when it can make use of the WAL bypass (scaling up to
> around cores/2), while performance without the WAL bypass is very bad.
> In the WAL-logged case we only get a 50% speedup from adding the second
> process, we never scale better than 3x (up to 8 cores), and performance
> degrades beyond that point.

How are you bypassing WAL? Do I read this correctly that on your
8-core system you are getting a 4x speedup with WAL bypass and a 3x
speedup without?

merlin


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: concurrent COPY performance
Date: 2009-06-16 21:49:41
Message-ID: 4A381375.2070906@dunslane.net
Lists: pgsql-hackers

Merlin Moncure wrote:
> On Tue, Jun 16, 2009 at 12:47 PM, Stefan
> Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> wrote:
>
>> Hi!
>>
>> I have been doing some bulk loading testing recently - mostly with a
>> focus on answering why we are "only" getting a speedup of at most
>> cores/2 (and even less beyond around 8 cores) using parallel restore.
>> What I found is that on a fast IO subsystem we are CPU bottlenecked on
>> concurrent COPY when it can make use of the WAL bypass (scaling up to
>> around cores/2), while performance without the WAL bypass is very bad.
>> In the WAL-logged case we only get a 50% speedup from adding the second
>> process, we never scale better than 3x (up to 8 cores), and performance
>> degrades beyond that point.
>>
>
> How are you bypassing WAL? Do I read this correctly that on your
> 8-core system you are getting a 4x speedup with WAL bypass and a 3x
> speedup without?
>

If a table is created or truncated in the same transaction that does the
load, and archiving is not on, the COPY is not WALed. That is why
parallel restore wraps the COPY in a transaction and precedes it with a
TRUNCATE if it created the table.
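
So a load along these lines skips WAL when archiving is off (table and
file names here are just examples):

  BEGIN;
  -- the TRUNCATE must be in the same transaction as the COPY
  TRUNCATE lineitem1;
  COPY lineitem1 FROM '/data/lineitem.tbl' WITH DELIMITER '|';
  COMMIT;

With archiving on, or with the TRUNCATE in a separate transaction, the
same COPY is WAL-logged as usual.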

cheers

andrew


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "Merlin Moncure" <mmoncure(at)gmail(dot)com>
Cc: "Stefan Kaltenbrunner" <stefan(at)kaltenbrunner(dot)cc>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: concurrent COPY performance
Date: 2009-06-16 22:14:33
Message-ID: 4A37D2F90200002500027C39@gw.wicourts.gov
Lists: pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:

> If a table is created or truncated in the same transaction that does
> the load, and archiving is not on, the COPY is not WALed.

Slightly off topic, but possibly relevant to the overall process:
those are the same conditions under which I would love to see the
rows inserted with the hint bits showing successful commit and the
transaction ID showing frozen. We currently do a VACUUM FREEZE
ANALYZE after such a load, to avoid burdening random users with the
writes. It would be nice not to have to write all the pages again
right after a load.
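
Concretely, the pattern I mean is something like this (names made up):

  BEGIN;
  TRUNCATE big_table;
  COPY big_table FROM '/data/big_table.dat';
  COMMIT;
  -- the FREEZE pass rewrites every page we just loaded, just to set
  -- hint bits and freeze the xids up front:
  VACUUM FREEZE ANALYZE big_table;

If the COPY could write the tuples already frozen and hinted, that
final pass over the table could be skipped.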

-Kevin


From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: concurrent COPY performance
Date: 2009-06-17 04:39:51
Message-ID: 4A387397.9090607@kaltenbrunner.cc
Lists: pgsql-hackers

Merlin Moncure wrote:
> On Tue, Jun 16, 2009 at 12:47 PM, Stefan
> Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> wrote:
>> Hi!
>>
>> I have been doing some bulk loading testing recently - mostly with a
>> focus on answering why we are "only" getting a speedup of at most
>> cores/2 (and even less beyond around 8 cores) using parallel restore.
>> What I found is that on a fast IO subsystem we are CPU bottlenecked on
>> concurrent COPY when it can make use of the WAL bypass (scaling up to
>> around cores/2), while performance without the WAL bypass is very bad.
>> In the WAL-logged case we only get a 50% speedup from adding the second
>> process, we never scale better than 3x (up to 8 cores), and performance
>> degrades beyond that point.
>
> How are you bypassing WAL? Do I read this correctly that on your
> 8-core system you are getting a 4x speedup with WAL bypass and a 3x
> speedup without?

The test is simply executing something like psql -c "BEGIN;TRUNCATE
lineitem1;COPY lineitem1 FROM ....;COMMIT;". in parallel with the source
file being hosted on a seperate array and primed into the OS buffercache.
The box has 8cores/16 threads actually - I get a 3x speedup up to using
8 processes without wal-bypass but on higher connection counts the
performances degraded.
Utilizing wal bypass I get near perfect scalability up to using 4
connections and a maximum speedup of ~8x by using 16 connections (ie all
threads)
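
Ie. the parallel run looks roughly like this - the source path is
elided above, so the one here is made up, and each connection loads
its own copy of the table:

  for i in $(seq 1 8); do
    # path and per-connection table names are examples
    psql -c "BEGIN;TRUNCATE lineitem$i;COPY lineitem$i FROM '/mnt/src/lineitem.tbl';COMMIT;" &
  done
  wait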

Stefan


From: Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Merlin Moncure <mmoncure(at)gmail(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: concurrent COPY performance
Date: 2009-06-19 20:50:55
Message-ID: 65937bea0906191350q74408e1dx5ad5884158cf8a04@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jun 17, 2009 at 3:44 AM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:

> Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:
>
> > If a table is created or truncated in the same transaction that does
> > the load, and archiving is not on, the COPY is not WALed.
>
> Slightly off topic, but possibly relevant to the overall process:
> those are the same conditions under which I would love to see the
> rows inserted with the hint bits showing successful commit and the
> transaction ID showing frozen. We currently do a VACUUM FREEZE
> ANALYZE after such a load, to avoid burdening random users with the
> writes. It would be nice not to have to write all the pages again
> right after a load.
>

+1 (if it is doable, that is).

Best regards,
--
Lets call it Postgres

EnterpriseDB http://www.enterprisedb.com

gurjeet[(dot)singh](at)EnterpriseDB(dot)com
singh(dot)gurjeet(at){ gmail | hotmail | indiatimes | yahoo }.com
Mail sent from my BlackLaptop device