concurrent COPY performance

From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: concurrent COPY performance
Date: 2009-06-16 16:47:50
Message-ID: 4A37CCB6.4000702@kaltenbrunner.cc
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi!

I have been doing some bulk loading testing recently - mostly with a
focus on answering why we are "only" getting a (max of) cores/2(up to
around 8 cores even less with more) speedup using parallel restore.
What I found is that on some fast IO-subsystem we are CPU bottlenecked
on concurrent copy which is able to utilize WAL bypass (and scale up to
around cores/2) and performance without wal bypass is very bad.
In the WAL logged case we are only able to get a 50% speedup using the
second process already and we are never able to scale better than 3x (up
to 8 cores) and performance degrades even after that point.

the profile(loading the lineitem table from the DBT3 benchmark) for that
case looks fairly similiar to what I have seen in the past for
io-intensive concurrent workloads:

Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a
unit mask of 0x00 (Unhalted core cycles) count 100000
samples % symbol name
47546 15.4562 XLogInsert
39598 12.8725 DoCopy
33798 10.9870 CopyReadLine
14191 4.6132 DecodeNumber
12986 4.2215 heap_fill_tuple
12092 3.9308 pg_verify_mbstr_len
9553 3.1055 DecodeDate
9289 3.0197 InputFunctionCall
7972 2.5915 ParseDateTime
7324 2.3809 DecodeDateTime
7290 2.3698 pg_next_dst_boundary
7218 2.3464 heap_form_tuple
5385 1.7505 AllocSetAlloc
4779 1.5536 heap_compute_data_size
4367 1.4196 float4in
3903 1.2688 DetermineTimeZoneOffset
3603 1.1713 pg_mblen
3494 1.1358 pg_atoi
3461 1.1251 .plt
3428 1.1144 date2j
3416 1.1105 pg_mbstrlen_with_len

this is for 8 connections on an 8core/16 thread box. on higher
connection counts the server is >70% idle and showing even worse througput.

in the WAL bypass case I can actually get a performance improvement up
to 16 parallel connections however I also only get cores/2 maximum
throughput here too. We are actually able to max out the CPU on the
server in that case though (no idle time and iowait only in the single
digit range). Profiling that workload(loading around 1400000 rows/s) I get:

samples % symbol name
2462358 15.3353 DoCopy
2319677 14.4467 CopyReadLine
806269 5.0214 pg_verify_mbstr_len
782594 4.8739 DecodeNumber
712426 4.4369 heap_fill_tuple
642893 4.0039 DecodeDate
609313 3.7947 InputFunctionCall
476887 2.9700 ParseDateTime
456569 2.8435 pg_next_dst_boundary
435812 2.7142 DecodeDateTime
429733 2.6763 heap_form_tuple
361750 2.2529 heap_compute_data_size
301193 1.8758 AllocSetAlloc
268830 1.6742 float4in
238119 1.4830 DetermineTimeZoneOffset
229598 1.4299 pg_atoi
225169 1.4023 .plt
217817 1.3565 pg_mbstrlen_with_len
207041 1.2894 PageAddItem
200024 1.2457 pg_mblen
192237 1.1972 bpchar_input
181552 1.1307 date2j

for those interested I have some additional information up on:

http://www.kaltenbrunner.cc/blog/index.php?/archives/27-Benchmarking-8.4-Chapter-2bulk-loading.html

Stefan

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2009-06-16 17:01:10 Re: machine-readable explain output
Previous Message Tom Lane 2009-06-16 16:44:36 Re: postmaster recovery and automatic restart suppression