Quick Links

8.4 open item: copy performance regression?

Lists:	pgsql-hackers

From:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	8.4 open item: copy performance regression?
Date:	2009-06-18 20:15:42
Message-ID:	4A3AA06E.3040109@kaltenbrunner.cc
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Any objections if I add:

http://archives.postgresql.org/pgsql-performance/2009-06/msg00215.php

to the (currently empty) list of open items for 8.4?

A 25-30% performance regression in our main bulk loading mechanism
should at least be explained before the release...

Stefan

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-18 21:20:08
Message-ID:	17281.1245360008@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
> Any objections if I add:
> http://archives.postgresql.org/pgsql-performance/2009-06/msg00215.php
> to the (currently empty) list of open items for 8.4?

I am unable to duplicate any slowdown on this test case. AFAICT
8.4 and 8.3 branch tip are about the same speed; if anything 8.4
is faster. Testing on x86_64 Fedora 10 ...

regards, tom lane

From:	"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To:	"Stefan Kaltenbrunner" <stefan(at)kaltenbrunner(dot)cc>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-18 21:20:08
Message-ID:	4A3A69380200002500027DBF@gw.wicourts.gov
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> wrote:

> A 25-30% performance regression in our main bulk loading mechanism
> should at least be explained before the release...

I think a performance regression of that magnitude merits holding up
a release to resolve.

Note that in a follow-up post showing the slow 8.4 copying on Linux,
the copy was 11.7% to 19.3% *faster* on 8.4 when the WAL writing was
suppressed:

http://archives.postgresql.org/pgsql-performance/2009-06/msg00219.php

Extracting from that post:

# I can reproduce that on Linux(CentoS 5.3/x86_64, Nehalem Xeon E5530)

# on 8.4 I get:
#
# 3m59/4m01/3m56s runtime and a profile of
#
# samples % symbol name
# 636302 19.6577 XLogInsert
# 415510 12.8366 CopyReadLine

# on 8.3.7 I get 2m58s,2m54s,2m55s
#
# and a profile of:
#
# samples % symbol name
# 460966 16.2924 XLogInsert
# 307386 10.8643 CopyReadLine

# If I do the same test utilizing WAL bypass the picture changes:
#
# 8.3 runtimes:2m16,2min14s,2min22s

# 8.4 runtime: 2m1s,2m,1m59s

Is there a reason to believe that the XLogInsert part of this *only*
affects bulk loads?

-Kevin

From:	"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To:	"Stefan Kaltenbrunner" <stefan(at)kaltenbrunner(dot)cc>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	"PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-18 22:33:24
Message-ID:	4A3A7A640200002500027DC6@gw.wicourts.gov
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
>> Any objections if I add:
>>
http://archives.postgresql.org/pgsql-performance/2009-06/msg00215.php
>> to the (currently empty) list of open items for 8.4?
>
> I am unable to duplicate any slowdown on this test case. AFAICT
> 8.4 and 8.3 branch tip are about the same speed; if anything 8.4
> is faster. Testing on x86_64 Fedora 10 ...

I just ran the specified test on:

Linux ATHENA 2.6.16.60-0.31-smp #1 SMP Tue Oct 7 16:16:29 UTC 2008
x86_64 x86_64 x86_64 GNU/Linux
Linux version 2.6.16.60-0.31-smp (geeko(at)buildhost) (gcc version 4.1.2
20070115 (SUSE Linux)) #1 SMP Tue Oct 7 16:16:29 UTC 2008
SUSE Linux Enterprise Server 10 (x86_64)
VERSION = 10
PATCHLEVEL = 2

I did configure with a prefix and these options:

--enable-integer-datetimes --enable-debug --disable-nls --with-libxml

I did initdb with --no-locale

8.3.7
real 0m24.249s
real 0m24.054s
real 0m24.361s

8.4rc1
real 0m33.503s
real 0m34.198s
real 0m33.931s

-Kevin

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc:	"Stefan Kaltenbrunner" <stefan(at)kaltenbrunner(dot)cc>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-18 22:39:13
Message-ID:	19295.1245364753@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> I am unable to duplicate any slowdown on this test case.

> [ Kevin can ]

It'd be useful first off to figure out if it's a CPU or I/O issue.
Is there any visible difference in vmstat output? Also, try turning
off autovacuum in both cases, just to see if that's related.

regards, tom lane

From:	"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	"Stefan Kaltenbrunner" <stefan(at)kaltenbrunner(dot)cc>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-18 23:06:40
Message-ID:	4A3A82300200002500027DCD@gw.wicourts.gov
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> It'd be useful first off to figure out if it's a CPU or I/O issue.
> Is there any visible difference in vmstat output? Also, try turning
> off autovacuum in both cases, just to see if that's related.

Both took slightly longer with autovacuum off, but probably just
within the noise.

It's hard to interpret the vmstat output, largely because I chose to
run this on one of our biggest servers, which is not currently serving
an application, per se, but as a replication target, and this being
"off hours" is busy running the sync process to the source machines.
This involves generating md5 sums on both sides for blocks of rows,
which is pretty CPU-intensive. There is very little disk output from
that right now, pretty light on the disk reads, but keeping a few CPUs
pretty busy generating those md5 sums.

I've got to go keep an appointment, but I'll come back and see if I
can do more. For now, here's the raw vmstat, in case someone can pick
out info I'm missing:

-Kevin

Attachment	Content-Type	Size
vmstat.txt	text/plain	8.4 KB

From:	Kenneth Marshall <ktm(at)rice(dot)edu>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-18 23:47:46
Message-ID:	20090618234746.GG23785@it.is.rice.edu
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Jun 18, 2009 at 05:20:08PM -0400, Tom Lane wrote:
> Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
> > Any objections if I add:
> > http://archives.postgresql.org/pgsql-performance/2009-06/msg00215.php
> > to the (currently empty) list of open items for 8.4?
>
> I am unable to duplicate any slowdown on this test case. AFAICT
> 8.4 and 8.3 branch tip are about the same speed; if anything 8.4
> is faster. Testing on x86_64 Fedora 10 ...
>
> regards, tom lane
>

What is not clear from Stefen's function listing is how the 8.4
server could issue 33% more XLogInsert() and CopyReadLine()
calls than the 8.3.7 server using the same input file. That would
account for the slow down but now why it is happening.

Cheers,
Ken

From:	"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	"Stefan Kaltenbrunner" <stefan(at)kaltenbrunner(dot)cc>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-18 23:59:01
Message-ID:	4A3A8E750200002500027DD3@gw.wicourts.gov
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:

> I've got to go keep an appointment

Sorry about that. Back now. Anything else I can do to help with
this?

-Kevin

From:	"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To:	"Kenneth Marshall" <ktm(at)rice(dot)edu>,"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	"Stefan Kaltenbrunner" <stefan(at)kaltenbrunner(dot)cc>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-19 00:03:34
Message-ID:	4A3A8F860200002500027DD9@gw.wicourts.gov
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Kenneth Marshall <ktm(at)rice(dot)edu> wrote:

> What is not clear from Stefen's function listing is how the 8.4
> server could issue 33% more XLogInsert() and CopyReadLine()
> calls than the 8.3.7 server using the same input file.

I thought those were profiling numbers -- the number of times a timer
checked what was executing and found it in that method. Which
suggests that those two methods are probably slower now than in 8.3.7,
at least in some environments.

-Kevin

From:	Alan Li <ali(at)truviso(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-19 09:10:16
Message-ID:	782056770906190210s44034a73x7c9affed85a8715@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

It doesn't look like it's related to autovacuum. I re-ran the test against
the two solaris boxes with autovacuum turned off and the results look about
the same.

8.3.7 - Solaris 10 11/06 s10x_u3wos_10 X86
real 0m43.662s
user 0m0.001s
sys 0m0.003s
real 0m43.565s
user 0m0.001s
sys 0m0.003s
real 0m43.742s
user 0m0.001s
sys 0m0.003s

8.4rc1 - Solaris 10 11/06 s10x_u3wos_10 X86
real 0m59.304s
user 0m0.001s
sys 0m0.003s
real 0m58.653s
user 0m0.001s
sys 0m0.003s
real 1m0.253s
user 0m0.001s
sys 0m0.003s

8.3.7 - Solaris 10 8/07 s10x_u4wos_12b X86
real 0m38.981s
user 0m0.002s
sys 0m0.004s
real 0m39.879s
user 0m0.002s
sys 0m0.004s
real 0m39.111s
user 0m0.002s
sys 0m0.004s

8.4rc1 - Solaris 10 8/07 s10x_u4wos_12b X86
real 0m50.647s
user 0m0.002s
sys 0m0.004s
real 0m49.453s
user 0m0.002s
sys 0m0.004s
real 0m49.725s
user 0m0.002s
sys 0m0.004s

Alan

On Thu, Jun 18, 2009 at 3:39 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
> > Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> I am unable to duplicate any slowdown on this test case.
>
> > [ Kevin can ]
>
> It'd be useful first off to figure out if it's a CPU or I/O issue.
> Is there any visible difference in vmstat output? Also, try turning
> off autovacuum in both cases, just to see if that's related.
>
> regards, tom lane
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>
>
>

From:	Kenneth Marshall <ktm(at)rice(dot)edu>
To:	Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-19 12:40:00
Message-ID:	20090619124000.GH23785@it.is.rice.edu
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Yes, you are right. I thought that they were absolute function
counts. The data makes more sense now.

Regards,
Ken

On Thu, Jun 18, 2009 at 07:03:34PM -0500, Kevin Grittner wrote:
> Kenneth Marshall <ktm(at)rice(dot)edu> wrote:
>
> > What is not clear from Stefen's function listing is how the 8.4
> > server could issue 33% more XLogInsert() and CopyReadLine()
> > calls than the 8.3.7 server using the same input file.
>
> I thought those were profiling numbers -- the number of times a timer
> checked what was executing and found it in that method. Which
> suggests that those two methods are probably slower now than in 8.3.7,
> at least in some environments.
>
> -Kevin
>

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-19 13:53:36
Message-ID:	4A3B9860.5030704@dunslane.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Kevin Grittner wrote:
>
> 8.3.7
> real 0m24.249s
> real 0m24.054s
> real 0m24.361s
>
> 8.4rc1
> real 0m33.503s
> real 0m34.198s
> real 0m33.931s
>
>
>

Ugh. This looks like a poster child case for a benchfarm ...

Is there any chance you guys could triangulate this a bit? Good initial
triangulation points might be the end of each commitfest. (I have a
vested interest in making sure COPY performance doesn't regress, since
it will affect parallel restore's speed in spades.)

cheers

andrew

From:	Marko Kreen <markokr(at)gmail(dot)com>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-19 13:59:02
Message-ID:	e51f66da0906190659k36bf08d1u225a83276299b470@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 6/19/09, Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:
> Kevin Grittner wrote:
> > 8.3.7
> > real 0m24.249s
> > real 0m24.054s
> > real 0m24.361s
> > 8.4rc1
> > real 0m33.503s
> > real 0m34.198s
> > real 0m33.931s
>
> Ugh. This looks like a poster child case for a benchfarm ...
>
> Is there any chance you guys could triangulate this a bit? Good initial
> triangulation points might be the end of each commitfest. (I have a vested
> interest in making sure COPY performance doesn't regress, since it will
> affect parallel restore's speed in spades.)

git bisect?

--
marko

From:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-19 14:52:39
Message-ID:	4A3BA637.8060600@kaltenbrunner.cc
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Andrew Dunstan wrote:
>
>
> Kevin Grittner wrote:
>>
>> 8.3.7
>> real 0m24.249s
>> real 0m24.054s
>> real 0m24.361s
>>
>> 8.4rc1
>> real 0m33.503s
>> real 0m34.198s
>> real 0m33.931s
>>
>>
>>
>
> Ugh. This looks like a poster child case for a benchfarm ...

indeed...

>
> Is there any chance you guys could triangulate this a bit? Good initial
> triangulation points might be the end of each commitfest. (I have a
> vested interest in making sure COPY performance doesn't regress, since
> it will affect parallel restore's speed in spades.)

Maybe parallel restore is the issue why we haven't noticed this earlier.
The case that regressed this way is WAL logged COPY, COPY that can
bypass WAL (which typically happens in parallel restore now) is actually
a bit faster in my testing in 8.4.

I will try and see if I can figure out what caused the regression...

Stefan

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-19 15:05:44
Message-ID:	7086.1245423944@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Just eyeing the code ... another thing we changed since 8.3 is to enable
posix_fadvise() calls for WAL. Any of the complaints want to try diking
out this bit of code (near line 2580 in src/backend/access/transam/xlog.c)?

#if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
if (!XLogArchivingActive() &&
(get_sync_bit(sync_method) & PG_O_DIRECT) == 0)
(void) posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);
#endif

regards, tom lane

From:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-19 15:51:01
Message-ID:	4A3BB3E5.6000001@kaltenbrunner.cc
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tom Lane wrote:
> Just eyeing the code ... another thing we changed since 8.3 is to enable
> posix_fadvise() calls for WAL. Any of the complaints want to try diking
> out this bit of code (near line 2580 in src/backend/access/transam/xlog.c)?
>
> #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
> if (!XLogArchivingActive() &&
> (get_sync_bit(sync_method) & PG_O_DIRECT) == 0)
> (void) posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);
> #endif

doesn't seem to cause any noticable difference for me...

Stefan

From:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-19 17:49:31
Message-ID:	4A3BCFAB.8010109@kaltenbrunner.cc
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

ok after a bit of bisecting I'm happy to announce the winner of the contest:

http://archives.postgresql.org/pgsql-committers/2008-11/msg00054.php

this patch causes a 25-30% performance regression for WAL logged copy,
however in the WAL bypass case (maybe that was what got tested?) it
results in a 20% performance increase.

the raw numbers using the upthread posted minimal postgresql.conf are:

post patch/wal logged: 4min10s/4min19/4min12
post patch/wal bypass: 1m55s/1m58s/2m00
prepatch/wal logged: 2m55s/3min00/2m59
prepatch/wal bypass: 2m22s/2m18s/2m20s

Stefan

From:	Kenneth Marshall <ktm(at)rice(dot)edu>
To:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-19 17:59:00
Message-ID:	20090619175900.GM23785@it.is.rice.edu
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Jun 19, 2009 at 07:49:31PM +0200, Stefan Kaltenbrunner wrote:
> Tom Lane wrote:
>> Just eyeing the code ... another thing we changed since 8.3 is to enable
>> posix_fadvise() calls for WAL. Any of the complaints want to try diking
>> out this bit of code (near line 2580 in
>> src/backend/access/transam/xlog.c)?
>> #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
>> if (!XLogArchivingActive() &&
>> (get_sync_bit(sync_method) & PG_O_DIRECT) == 0)
>> (void) posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);
>> #endif
>
> ok after a bit of bisecting I'm happy to announce the winner of the
> contest:
>
> http://archives.postgresql.org/pgsql-committers/2008-11/msg00054.php
>
> this patch causes a 25-30% performance regression for WAL logged copy,
> however in the WAL bypass case (maybe that was what got tested?) it results
> in a 20% performance increase.
>
> the raw numbers using the upthread posted minimal postgresql.conf are:
>
> post patch/wal logged: 4min10s/4min19/4min12
> post patch/wal bypass: 1m55s/1m58s/2m00
> prepatch/wal logged: 2m55s/3min00/2m59
> prepatch/wal bypass: 2m22s/2m18s/2m20s
>
>
> Stefan
>

Great! Maybe just increasing the size of the BULKWRITE ring,
possibly as a function of the shared_memory is all that is
needed. 256kB is the currently coded ring_size in storage/buffer/freelist.c

Ken

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-19 18:11:14
Message-ID:	12228.1245435074@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
> ok after a bit of bisecting I'm happy to announce the winner of the contest:
> http://archives.postgresql.org/pgsql-committers/2008-11/msg00054.php

> this patch causes a 25-30% performance regression for WAL logged copy,
> however in the WAL bypass case (maybe that was what got tested?) it
> results in a 20% performance increase.

Hmm. What that patch actually changes is that it prevents a bulk insert
(ie COPY in) from trashing the entire shared-buffers arena. I think the
reason for the WAL correlation is that once it's filled the ring buffer,
creating new pages requires writing out old ones, and the
WAL-before-data rule means that the copy process has to block waiting
for WAL to go down to disk before it can write. When it's allowed to
use the whole arena there is more chance for some of that writing to be
done by the walwriter or bgwriter. But the details are going to depend
on the platform's CPU vs I/O balance, which no doubt explains why some
of us don't see it.

I don't think we want to revert that patch --- not trashing the whole
buffer arena seems like a Good Thing from a system-wide point of view,
even if it makes individual COPY operations go slower. However, we
could maybe play around with the tradeoffs a bit. In particular it
seems like it would be useful to experiment with different ring buffer
sizes. Could you try increasing the ring size allowed in
src/backend/storage/buffer/freelist.c for the BULKWRITE case

***************
*** 384,389 ****
--- 384,392 ----
case BAS_BULKREAD:
ring_size = 256 * 1024 / BLCKSZ;
break;
+ case BAS_BULKWRITE:
+ ring_size = 256 * 1024 / BLCKSZ;
+ break;
case BAS_VACUUM:
ring_size = 256 * 1024 / BLCKSZ;
break;

and see if maybe we can buy back most of the loss with not too much
of a ring size increase?

regards, tom lane

From:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-19 19:06:27
Message-ID:	4A3BE1B3.3060107@kaltenbrunner.cc
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tom Lane wrote:
> Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
>> ok after a bit of bisecting I'm happy to announce the winner of the contest:
>> http://archives.postgresql.org/pgsql-committers/2008-11/msg00054.php
>
>> this patch causes a 25-30% performance regression for WAL logged copy,
>> however in the WAL bypass case (maybe that was what got tested?) it
>> results in a 20% performance increase.
>
> Hmm. What that patch actually changes is that it prevents a bulk insert
> (ie COPY in) from trashing the entire shared-buffers arena. I think the
> reason for the WAL correlation is that once it's filled the ring buffer,
> creating new pages requires writing out old ones, and the
> WAL-before-data rule means that the copy process has to block waiting
> for WAL to go down to disk before it can write. When it's allowed to
> use the whole arena there is more chance for some of that writing to be
> done by the walwriter or bgwriter. But the details are going to depend
> on the platform's CPU vs I/O balance, which no doubt explains why some
> of us don't see it.

hmm - In my case both the CPU (an Intel E5530 Nehalem) and the IO
subsystem (8GB Fiberchannel connected NetApp with 4GB cache) are pretty
fast. and even with say fsync=off 8.4RC1 is only slightly faster than
8.3 with the same config and fsync=on so maybe there is a secondary
effect at play too.

>
> I don't think we want to revert that patch --- not trashing the whole
> buffer arena seems like a Good Thing from a system-wide point of view,
> even if it makes individual COPY operations go slower. However, we
> could maybe play around with the tradeoffs a bit. In particular it
> seems like it would be useful to experiment with different ring buffer
> sizes. Could you try increasing the ring size allowed in
> src/backend/storage/buffer/freelist.c for the BULKWRITE case
>
> ***************
> *** 384,389 ****
> --- 384,392 ----
> case BAS_BULKREAD:
> ring_size = 256 * 1024 / BLCKSZ;
> break;
> + case BAS_BULKWRITE:
> + ring_size = 256 * 1024 / BLCKSZ;
> + break;
> case BAS_VACUUM:
> ring_size = 256 * 1024 / BLCKSZ;
> break;
>
>
> and see if maybe we can buy back most of the loss with not too much
> of a ring size increase?

already started testing that once I found the offending commit.

256 * 1024 / BLCKSZ
4min10s/4min19/4min12

512 * 1024 / BLCKSZ
3min27s/3min32s

1024 * 1024 / BLCKSZ
3min14s/3min12s

2048 * 1024 / BLCKSZ
3min02/3min02

4096 * 1024 / BLCKSZ
2m59/2m58s

8192 * 1024 / BLCKSZ

2m59/2m59s

so 4096 * 1024 / BLCKSZ seems to be the sweet spot and also results in
more or less the same performance that 8.3 had.

Stefan

From:	Josh Berkus <josh(at)agliodbs(dot)com>
To:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-19 20:39:03
Message-ID:	4A3BF767.6060207@agliodbs.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

> so 4096 * 1024 / BLCKSZ seems to be the sweet spot and also results in
> more or less the same performance that 8.3 had.

Can some folks test this with different size COPYs? That's both
larger/smaller tables, and larger/smaller rows. We should also test
copy with large blob data.

--
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-19 23:03:33
Message-ID:	1245452613.3895.483.camel@ebony.2ndQuadrant
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, 2009-06-19 at 14:11 -0400, Tom Lane wrote:
> Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
> > ok after a bit of bisecting I'm happy to announce the winner of the contest:
> > http://archives.postgresql.org/pgsql-committers/2008-11/msg00054.php
>
> > this patch causes a 25-30% performance regression for WAL logged copy,
> > however in the WAL bypass case (maybe that was what got tested?) it
> > results in a 20% performance increase.
>
> Hmm. What that patch actually changes is that it prevents a bulk insert
> (ie COPY in) from trashing the entire shared-buffers arena. I think the
> reason for the WAL correlation is that once it's filled the ring buffer,
> creating new pages requires writing out old ones, and the
> WAL-before-data rule means that the copy process has to block waiting
> for WAL to go down to disk before it can write. When it's allowed to
> use the whole arena there is more chance for some of that writing to be
> done by the walwriter or bgwriter. But the details are going to depend
> on the platform's CPU vs I/O balance, which no doubt explains why some
> of us don't see it.
>
> I don't think we want to revert that patch --- not trashing the whole
> buffer arena seems like a Good Thing from a system-wide point of view,
> even if it makes individual COPY operations go slower. However, we
> could maybe play around with the tradeoffs a bit. In particular it
> seems like it would be useful to experiment with different ring buffer
> sizes. Could you try increasing the ring size allowed in
> src/backend/storage/buffer/freelist.c for the BULKWRITE case

Yes, that's definitely the cause. The ring buffer technique was
prototyped in 8.3 and a ring of 32 blocks was found not to be
sufficient, which was one reason we didn't try to commit that then.
At the time, I also proposed a "filled buffer list" change to bufmgr to
allow bgwriter to preferentially target COPY's filled blocks, which
would also help with this effect.

You'll get better benefit from decreasing wal_writer_delay also.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

From:	Greg Smith <gsmith(at)gregsmith(dot)com>
To:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alan Li <ali(at)truviso(dot)com>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-20 02:03:56
Message-ID:	alpine.GSO.2.01.0906192124390.18922@westnet.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, 19 Jun 2009, Stefan Kaltenbrunner wrote:

> In my case both the CPU (an Intel E5530 Nehalem) and the IO subsystem
> (8GB Fiberchannel connected NetApp with 4GB cache) are pretty fast.

The server Alan identified as "Solaris 10 8/07 s10x_u4wos_12b X86" has a
Xeon E5320 (1.86GHz) and a single boring SAS drive in it (Sun X4150).
The filesystem involved in that particular case is UFS, which I am
suspicious of as being part of why the problem is so pronounced there--the
default UFS tuning is pretty lightweight in terms of how much caching it
does. Not sure if Alan ran any tests against the big ZFS volume on the
other sever, I think all the results he posted were from the UFS boot
drive there too.

> so 4096 * 1024 / BLCKSZ seems to be the sweet spot and also results in more
> or less the same performance that 8.3 had.

It looks like it's a touch higher on our 8/07 system, it levels out at
8192 * (haven't checked the other one yet). I'm seeing this, using Alan's
original test set size (to make sure I was running exactly the same test)
and just grabbing the low/high from a set of 3 runs:

8.3.7: 0m39.266s 0m43.269s (alan: 36.2 - 39.2)

256: 0m50.435s 0m51.944s (alan: 48.1 - 50.6)
1024: 0m47.299s 0m49.346s
4096: 0m43.725s 0m46.116s
8192: 0m40.715s 0m42.480s
16384: 0m41.318s 0m42.118s
65536: 0m41.675s 0m42.955s

I collected some iostat data here as well for some of the runs (the vmstat
data was harder to read, this being Solaris, and didn't seem to add
anything). I'm seeing lines like this with the default ring buffer of 256
*:

tty sd1 sd2 nfs1 cpu
tin tout kps tps serv kps tps serv kps tps serv us sy wt id
0 322 12 1 0 41371 2754 0 0 0 0 12 11 0 78
0 166 0 0 0 46246 3380 0 0 0 0 14 10 0 76
0 164 0 0 0 44874 3068 1 0 0 0 13 9 0 78

Obviously sd2 is where the database and source file are at. Basically,
about one core (out of four) tied up with a pretty even split of
user/system time. Using the highest ring size I tried, 65536 *, gives
lines that look like this:

tty sd1 sd2 nfs1 cpu
tin tout kps tps serv kps tps serv kps tps serv us sy wt id
0 163 0 0 0 56696 4291 0 0 0 0 20 12 0 68
0 166 0 0 0 58554 4542 0 0 0 0 21 12 0 67
0 168 0 0 0 56057 4308 0 0 0 0 21 12 0 67

So it seems like increasing the ring size helps saturate the disks better,
went from ~45MB/s to 57MB/s. What's kind of interesting is to compare
this against the 8.3.7 run, which is the fastest of them all, which I was
expecting to find had the highest write rate of them all;

tty sd1 sd2 nfs1 cpu
tin tout kps tps serv kps tps serv kps tps serv us sy wt id
0 83 0 0 0 47654 2121 0 0 0 0 23 8 0 69
0 240 0 0 0 44198 2150 1 0 0 0 19 8 0 73
0 83 0 0 0 37750 1110 1 0 0 0 21 6 0 72

That's actually doing less I/O per capita, which is why it's also got less
waiting for I/O%, but it's completing the most work. This makes me wonder
if in addition to the ring buffering issue, there isn't just plain more
writing per average completed transaction in 8.4 with this type of COPY.
This might explain why even with the expanded ring buffer, both Stephan
and my test runs still showed a bit of a regression against 8.3. I'm
guessing we have a second, smaller shooter here involved as well.

In any case, a bump of the ring multiplier to either 4096 or 8192
eliminates the worst of the regression here, good improvement so far.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD

From:	Greg Smith <gsmith(at)gregsmith(dot)com>
To:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-20 06:53:34
Message-ID:	alpine.GSO.2.01.0906200237340.15307@westnet.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, 20 Jun 2009, Simon Riggs wrote:

> At the time, I also proposed a "filled buffer list" change to bufmgr to
> allow bgwriter to preferentially target COPY's filled blocks, which
> would also help with this effect.

One of the things I keep meaning to investigate is whether there's any
benefit to pushing buffers the background writer cleans onto the free
list, to speed up their subsequent allocation to needy backends. Both
this and the sequential scan ring buffer implementation might both benefit
from an approach where buffers are similarly pushed onto the free list
when they're no longer needed by the process that pulled them in.
Buffers could move from allocated->used->filled buffer list->free list in
the COPY buffer case, and allocated->used->free list when executing a
sequential scan.

That would seem to me to be a more robust general approach for solving
this class of problem than the whole ring buffer idea, which is a great
start but bound to run into situations where the size of the buffer just
isn't right anymore a few hardware generations down the road.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Greg Smith <gsmith(at)gregsmith(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-20 08:22:20
Message-ID:	1245486140.3895.503.camel@ebony.2ndQuadrant
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, 2009-06-20 at 02:53 -0400, Greg Smith wrote:
> On Sat, 20 Jun 2009, Simon Riggs wrote:
>
> > At the time, I also proposed a "filled buffer list" change to bufmgr to
> > allow bgwriter to preferentially target COPY's filled blocks, which
> > would also help with this effect.
>
> One of the things I keep meaning to investigate is whether there's any
> benefit to pushing buffers the background writer cleans onto the free
> list, to speed up their subsequent allocation to needy backends. Both
> this and the sequential scan ring buffer implementation might both benefit
> from an approach where buffers are similarly pushed onto the free list
> when they're no longer needed by the process that pulled them in.
> Buffers could move from allocated->used->filled buffer list->free list in
> the COPY buffer case, and allocated->used->free list when executing a
> sequential scan.

The reason for not doing that would be that we don't know that the
blocks are free to use; we know very little about them. The longer we
leave them the more likely they are to be reused, so putting buffers
onto the freelist when they aren't actually free would likely make the
cache less effective, ISTM.

With filled buffers from COPY we have a very good clue that the buffers
will no longer be needed and can treat them differently from others.
Also, if we can get the bgwriter to do some of the work then we would
have the COPY process, bgwriter and WALwriter all working together on
the data loading. (We need to make WALwriter a little smarter also, so
that it can respond to changing WAL write workloads).

> That would seem to me to be a more robust general approach for solving
> this class of problem than the whole ring buffer idea, which is a great
> start but bound to run into situations where the size of the buffer just
> isn't right anymore a few hardware generations down the road.

The ring buffer optimises L2 cache, not the buffer cache in general. If
we put buffers back on the freelist that is the same as having one
global ring buffer, which would then spoil the benefit for L2 on
multi-CPU systems.

We don't see any L2 benefit with COPY yet, but it is potentially there
if we can overcome the stronger effect of the WAL costs. When the ring
buffer expands to minimise WAL overheads we also minimise benefit for
L2.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Greg Smith <gsmith(at)gregsmith(dot)com>
Cc:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alan Li <ali(at)truviso(dot)com>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-20 08:54:54
Message-ID:	1245488094.3895.506.camel@ebony.2ndQuadrant
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, 2009-06-19 at 22:03 -0400, Greg Smith wrote:

> This makes me wonder if in addition to the ring buffering issue, there
> isn't just plain more writing per average completed transaction in 8.4
> with this type of COPY.

I would suggest that we check how much WAL has been written. There may
be a secondary effect or a different regression hidden in these results.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

From:	Greg Stark <gsstark(at)mit(dot)edu>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Greg Smith <gsmith(at)gregsmith(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-20 11:10:37
Message-ID:	407d949e0906200410x5be9ab88u2c6d19307823f8f5@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, Jun 20, 2009 at 9:22 AM, Simon Riggs<simon(at)2ndquadrant(dot)com> wrote:
>> That would seem to me to be a more robust general approach for solving
>> this class of problem than the whole ring buffer idea, which is a great
>> start but bound to run into situations where the size of the buffer just
>> isn't right anymore a few hardware generations down the road.
>
> The ring buffer optimises L2 cache, not the buffer cache in general. If
> we put buffers back on the freelist that is the same as having one
> global ring buffer, which would then spoil the benefit for L2 on
> multi-CPU systems.

Incidentally a "ring buffer" is something else. We're talking about a
clock-sweep algorithm.

The whole point of the clock-sweep algorithm is that it's an
approximated LRU but with no single point of contention like a list
pointer. Doing this would be undermining that advantage.

I don't understand what you mean by "size of the buffer" either. The
only real parameter in the clock sweep algorithm is how many distinct
counter values there are. The more values the closer to an LRU the
result is. it isn't really tied to hardware in any way, just the usage
pattern.

--
greg
http://mit.edu/~gsstark/resume.pdf

From:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To:	Greg Smith <gsmith(at)gregsmith(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alan Li <ali(at)truviso(dot)com>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-20 11:15:48
Message-ID:	4A3CC4E4.7050705@kaltenbrunner.cc
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Greg Smith wrote:
> On Fri, 19 Jun 2009, Stefan Kaltenbrunner wrote:
>
>> In my case both the CPU (an Intel E5530 Nehalem) and the IO subsystem
>> (8GB Fiberchannel connected NetApp with 4GB cache) are pretty fast.
>
> The server Alan identified as "Solaris 10 8/07 s10x_u4wos_12b X86" has a
> Xeon E5320 (1.86GHz) and a single boring SAS drive in it (Sun X4150).
> The filesystem involved in that particular case is UFS, which I am
> suspicious of as being part of why the problem is so pronounced
> there--the default UFS tuning is pretty lightweight in terms of how much
> caching it does. Not sure if Alan ran any tests against the big ZFS
> volume on the other sever, I think all the results he posted were from
> the UFS boot drive there too.
>
>> so 4096 * 1024 / BLCKSZ seems to be the sweet spot and also results in
>> more or less the same performance that 8.3 had.
>
> It looks like it's a touch higher on our 8/07 system, it levels out at
> 8192 * (haven't checked the other one yet). I'm seeing this, using
> Alan's original test set size (to make sure I was running exactly the
> same test) and just grabbing the low/high from a set of 3 runs:
>
> 8.3.7: 0m39.266s 0m43.269s (alan: 36.2 - 39.2)
>
> 256: 0m50.435s 0m51.944s (alan: 48.1 - 50.6)
> 1024: 0m47.299s 0m49.346s
> 4096: 0m43.725s 0m46.116s
> 8192: 0m40.715s 0m42.480s
> 16384: 0m41.318s 0m42.118s
> 65536: 0m41.675s 0m42.955s

hmm interesting - I just did a bunch of runs using the lineitem table
from the DBT3 tests (loading 60M rows in each run) and the same config
Alan used.

8.4(postpatch - not RC1 but that one seems to behave exactly the same way)

lineitem1
256 9min38s
512 9min20s
1024 7m44.667s/7m45.342s
2048 7m15.500s/7m17.910s
4096 7m11.424s/7m13.276s
8192 6m43.203s/6m48.293s
16384 6m24.980s/6m24.116s
32768 6m20.753s/6m22.083s
65536 6m22.913s/6m22.449s
1048576 6m23.765s/6m24.645s

8.3

6m45.650s/6m44.781s

so on this workload the sweetspot seems to be much higher than on the
one with the narrower rows.

[...]
> That's actually doing less I/O per capita, which is why it's also got
> less waiting for I/O%, but it's completing the most work. This makes me
> wonder if in addition to the ring buffering issue, there isn't just
> plain more writing per average completed transaction in 8.4 with this
> type of COPY. This might explain why even with the expanded ring buffer,
> both Stephan and my test runs still showed a bit of a regression against
> 8.3. I'm guessing we have a second, smaller shooter here involved as well.

well yes I also suspect that there is some secondary effect at play here
and I believe I have seen the "more IO with 8.4" thing here too but I
have not actually paid enough attention yet to be sure.

>
> In any case, a bump of the ring multiplier to either 4096 or 8192
> eliminates the worst of the regression here, good improvement so far.

yeah with the above numbers I would say that 8192 should remove most if
not all of the regression. However it seems that we might have to make
this more dynamic in the future since the behaviour seems to depend on a
number of variables...

Stefan

From:	Greg Stark <gsstark(at)mit(dot)edu>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Greg Smith <gsmith(at)gregsmith(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-20 11:46:17
Message-ID:	407d949e0906200446p25119cfeh6dc0c2a0e3ffa88f@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, Jun 20, 2009 at 12:10 PM, Greg Stark<gsstark(at)mit(dot)edu> wrote:
>
> I don't understand what you mean by "size of the buffer" either.

Ok, having gone back and read the whole thread I understand the
context for that statement. Nevermind.

--
greg
http://mit.edu/~gsstark/resume.pdf

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc:	Greg Smith <gsmith(at)gregsmith(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alan Li <ali(at)truviso(dot)com>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-20 15:38:09
Message-ID:	4148.1245512289@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
> and I believe I have seen the "more IO with 8.4" thing here too but I
> have not actually paid enough attention yet to be sure.

FSM/VM overhead maybe? I think COPY IN is setting the SKIP_FSM bit,
but I wonder if there's some vestigial overhead anyway.

regards, tom lane

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc:	Greg Smith <gsmith(at)gregsmith(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alan Li <ali(at)truviso(dot)com>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-20 16:13:08
Message-ID:	1245514388.3895.519.camel@ebony.2ndQuadrant
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, 2009-06-20 at 13:15 +0200, Stefan Kaltenbrunner wrote:
> >
> > 8.3.7: 0m39.266s 0m43.269s (alan: 36.2 - 39.2)
> >
> > 8192: 0m40.715s 0m42.480s
> > 16384: 0m41.318s 0m42.118s
> > 65536: 0m41.675s 0m42.955s
>
> hmm interesting - I just did a bunch of runs using the lineitem table
> from the DBT3 tests (loading 60M rows in each run) and the same config
> Alan used.
>
> 8.4(postpatch - not RC1 but that one seems to behave exactly the same way)
>
> lineitem1
> 8192 6m43.203s/6m48.293s
> 16384 6m24.980s/6m24.116s
> 32768 6m20.753s/6m22.083s
> 65536 6m22.913s/6m22.449s
> 1048576 6m23.765s/6m24.645s

> 8.3
>
> 6m45.650s/6m44.781s

> so on this workload the sweetspot seems to be much higher than on the
> one with the narrower rows.

The rest of the patch should have had a greater effect on tables with
thinner rows. Your results match my expectations, though I read from
them that we should use 16384, since that provides some gain, not just a
cancellation of the regression.

I would suggest that we leave it as a tunable parameter in this release
and remove it again once we have clear evidence of how to set it. We are
unlikely to cover conclusively how to do this before we release 8.4.

> >
> > In any case, a bump of the ring multiplier to either 4096 or 8192
> > eliminates the worst of the regression here, good improvement so far.
>
> yeah with the above numbers I would say that 8192 should remove most if
> not all of the regression. However it seems that we might have to make
> this more dynamic in the future since the behaviour seems to depend on a
> number of variables...

I would be inclined to repeat this with multiple concurrent COPYs. We
may be able to improve on the numbers there, as well as get a
perspective on how to set parameter in real world.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

From:	Greg Smith <gsmith(at)gregsmith(dot)com>
To:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-20 16:54:52
Message-ID:	alpine.GSO.2.01.0906201239460.7905@westnet.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, 20 Jun 2009, Simon Riggs wrote:

> The reason for not doing that would be that we don't know that the
> blocks are free to use; we know very little about them. The longer we
> leave them the more likely they are to be reused, so putting buffers
> onto the freelist when they aren't actually free would likely make the
> cache less effective, ISTM.

This is the prevailing wisdom. The current design of the background
writer tries to make a reasonable guess as to how many buffers are going
to be allocated in the next bg_writer_delay period, and it
locates/generates clean buffers to fill that expected need. The idea
would be to put those buffers, which have a good chance of being allocated
by somebody else and therefore invalided shortly, onto the free list.

There's a certain amount of overhead the backends themselves wait for in
order to advance the clock sweep position to find the buffers they need,
and that adds some latency to them. I would guess there's some potential
for reducing that latency if the freelist was sometimes populated, which
right now it rarely is. While that would add contention onto the free
list, I wonder if it might still be less than the contention on advancing
the current buffer pool strategy point.

I don't want to go through the whole thing in this thread, just pointing
out some similarity with an item I'd already proposed onto the TODO
list--and that a good solution there might get rid of this whole "how big
do I make the ring buffer?" situation. Obviously actual measurements here
would trump any theorizing as to what works better, it's hard to get any
intuition about low-level optimizing given how complicated CPU caches are
nowadays.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Greg Smith <gsmith(at)gregsmith(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alan Li <ali(at)truviso(dot)com>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-20 19:40:58
Message-ID:	9422.1245526858@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> On Sat, 2009-06-20 at 13:15 +0200, Stefan Kaltenbrunner wrote:
>> 8192 6m43.203s/6m48.293s
>> 16384 6m24.980s/6m24.116s
>> 32768 6m20.753s/6m22.083s
>> 65536 6m22.913s/6m22.449s
>> 1048576 6m23.765s/6m24.645s

> The rest of the patch should have had a greater effect on tables with
> thinner rows. Your results match my expectations, though I read from
> them that we should use 16384, since that provides some gain, not just a
> cancellation of the regression.

+1 for using 16384 (ie, max ring buffer size 16MB). Maybe even more.
It seems likely that other cases might have an even bigger issue than
is exhibited in the couple of test cases we have here, so we should
leave some margin for error. Also, there's code in there to limit the
ring buffer to 1/8th of shared buffers, so we don't have to worry about
trashing the whole buffer arena in small configurations. Any limitation
at all is still a step forward over previous releases as far as not
trashing the arena is concerned.

> I would suggest that we leave it as a tunable parameter in this release
> and remove it again once we have clear evidence of how to set it.

It's not going to become an exposed tunable. There is not evidence to
justify that, and we are down to the point of the release cycle where
any more than one-liner changes have to be looked at with considerable
reluctance.

regards, tom lane

From:	Greg Smith <gsmith(at)gregsmith(dot)com>
To:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alan Li <ali(at)truviso(dot)com>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-21 06:45:04
Message-ID:	alpine.GSO.2.01.0906210241020.7754@westnet.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, 20 Jun 2009, Simon Riggs wrote:

> I would suggest that we check how much WAL has been written. There may
> be a secondary effect or a different regression hidden in these results.

What's the easiest way to do that? My first thought was to issue a
checkpoint before the test (which is a good idea to make each test
consistent anyway), save the output from pg_controldata, test, checkpoint,
and look at the control data again. This seems kind of clunky though, but
still better than trolling through the OS statistics for the data. Any
clever ideas for a better way to measure bytes of WAL written during a
particular chunk of code? We may need some sort of checkpoint/sync after
the test to get correct results, because I've noticed that the tests I run
sometimes continue writing out buffers for a few seconds after the test
time is finished.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndQuadrant(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Greg Smith <gsmith(at)gregsmith(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alan Li <ali(at)truviso(dot)com>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-21 07:42:50
Message-ID:	4A3DE47A.6000808@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tom Lane wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>> On Sat, 2009-06-20 at 13:15 +0200, Stefan Kaltenbrunner wrote:
>>> 8192 6m43.203s/6m48.293s
>>> 16384 6m24.980s/6m24.116s
>>> 32768 6m20.753s/6m22.083s
>>> 65536 6m22.913s/6m22.449s
>>> 1048576 6m23.765s/6m24.645s
>
>> The rest of the patch should have had a greater effect on tables with
>> thinner rows. Your results match my expectations, though I read from
>> them that we should use 16384, since that provides some gain, not just a
>> cancellation of the regression.
>
> +1 for using 16384 (ie, max ring buffer size 16MB). Maybe even more.
> It seems likely that other cases might have an even bigger issue than
> is exhibited in the couple of test cases we have here, so we should
> leave some margin for error. Also, there's code in there to limit the
> ring buffer to 1/8th of shared buffers, so we don't have to worry about
> trashing the whole buffer arena in small configurations. Any limitation
> at all is still a step forward over previous releases as far as not
> trashing the arena is concerned.

+1. You might get away with a smaller ring with narrow tables, where
writing 16MB of data produces more than 16MB of WAL, but I don't think
it can ever be the other way round. Leaving a little bit of room for
error doesn't seem like a bad idea, though.

IIRC we experimented with an auto-tuning ring size when we worked on the
original ring buffer patch. The idea is that you start with a small
ring, and enlarge it in StrategyRejectBuffer. But that seems too risky
for 8.4.

I wonder if using the small ring showed any benefit when the COPY is not
WAL-logged? In that scenario block-on-WAL-flush behavior doesn't happen,
so the small ring might have some L2 cache benefits.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alan Li <ali(at)truviso(dot)com>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-21 08:28:40
Message-ID:	4A3DEF38.8060503@kaltenbrunner.cc
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas wrote:
> Tom Lane wrote:
>> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>>> On Sat, 2009-06-20 at 13:15 +0200, Stefan Kaltenbrunner wrote:
>>>> 8192 6m43.203s/6m48.293s
>>>> 16384 6m24.980s/6m24.116s
>>>> 32768 6m20.753s/6m22.083s
>>>> 65536 6m22.913s/6m22.449s
>>>> 1048576 6m23.765s/6m24.645s
>>
>>> The rest of the patch should have had a greater effect on tables with
>>> thinner rows. Your results match my expectations, though I read from
>>> them that we should use 16384, since that provides some gain, not just a
>>> cancellation of the regression.
>>
>> +1 for using 16384 (ie, max ring buffer size 16MB). Maybe even more.
>> It seems likely that other cases might have an even bigger issue than
>> is exhibited in the couple of test cases we have here, so we should
>> leave some margin for error. Also, there's code in there to limit the
>> ring buffer to 1/8th of shared buffers, so we don't have to worry about
>> trashing the whole buffer arena in small configurations. Any limitation
>> at all is still a step forward over previous releases as far as not
>> trashing the arena is concerned.
>
> +1. You might get away with a smaller ring with narrow tables, where
> writing 16MB of data produces more than 16MB of WAL, but I don't think
> it can ever be the other way round. Leaving a little bit of room for
> error doesn't seem like a bad idea, though.

yeah 16MB seems like the best choice given the available data and how
far we are into the release cycle.

>
> IIRC we experimented with an auto-tuning ring size when we worked on the
> original ring buffer patch. The idea is that you start with a small
> ring, and enlarge it in StrategyRejectBuffer. But that seems too risky
> for 8.4.

agreed.

>
> I wonder if using the small ring showed any benefit when the COPY is not
> WAL-logged? In that scenario block-on-WAL-flush behavior doesn't happen,
> so the small ring might have some L2 cache benefits.

I did some limited testing on that but I was unable to measure any
significant effect - especially since the difference between wal-logged
and not is rather small for a non-parallel COPY (ie in the above example
you get around 6m20s runtime for wal-logged and ~5m40s in the other case).

Stefan

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Smith <gsmith(at)gregsmith(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alan Li <ali(at)truviso(dot)com>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-21 08:52:42
Message-ID:	1245574362.31430.18.camel@ebony.2ndQuadrant
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, 2009-06-21 at 10:28 +0200, Stefan Kaltenbrunner wrote:
> I did some limited testing on that but I was unable to measure any
> significant effect - especially since the difference between
> wal-logged and not is rather small for a non-parallel COPY (ie in the
> above example you get around 6m20s runtime for wal-logged and ~5m40s
> in the other case).

This is a common confusion for small tests.

Non-WAL logged case causes all buffers to be written to disk at end of
COPY. This is roughly the same size as the volume of WAL written. In
logged case we do not write data blocks, they get written at next
checkpoint. So the reduction in I/O is not apparent, since during the
period of the test the I/O is about the same in both cases and less I/O
in the non-WAL logged case. On longer tests the difference shows more
clearly because the data blocks start to migrate out of shared buffers
while the COPY is still running, effecting the test results.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Greg Smith <gsmith(at)gregsmith(dot)com>
Cc:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alan Li <ali(at)truviso(dot)com>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-21 08:53:35
Message-ID:	1245574415.31430.21.camel@ebony.2ndQuadrant
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, 2009-06-21 at 02:45 -0400, Greg Smith wrote:
> On Sat, 20 Jun 2009, Simon Riggs wrote:
>
> > I would suggest that we check how much WAL has been written. There may
> > be a secondary effect or a different regression hidden in these results.
>
> What's the easiest way to do that?

pg_current_xlog_insert_location()

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

From:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Smith <gsmith(at)gregsmith(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alan Li <ali(at)truviso(dot)com>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-21 09:17:01
Message-ID:	4A3DFA8D.2070805@kaltenbrunner.cc
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Simon Riggs wrote:
> On Sun, 2009-06-21 at 10:28 +0200, Stefan Kaltenbrunner wrote:
>> I did some limited testing on that but I was unable to measure any
>> significant effect - especially since the difference between
>> wal-logged and not is rather small for a non-parallel COPY (ie in the
>> above example you get around 6m20s runtime for wal-logged and ~5m40s
>> in the other case).
>
> This is a common confusion for small tests.
>
> Non-WAL logged case causes all buffers to be written to disk at end of
> COPY. This is roughly the same size as the volume of WAL written. In
> logged case we do not write data blocks, they get written at next
> checkpoint. So the reduction in I/O is not apparent, since during the
> period of the test the I/O is about the same in both cases and less I/O
> in the non-WAL logged case. On longer tests the difference shows more
> clearly because the data blocks start to migrate out of shared buffers
> while the COPY is still running, effecting the test results.

I was actually testing with and without explicit CHECKPOINTing
before/after the load(and also with longer runs) too - the difference is
negligible especially with only one process involved.
I think the difference is simply not that large because we are still
mostly CPU bound within COPY on reasonably fast IO-subsystems.

Stefan

From:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To:
Cc:	Simon Riggs <simon(at)2ndQuadrant(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Smith <gsmith(at)gregsmith(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alan Li <ali(at)truviso(dot)com>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-21 10:48:08
Message-ID:	4A3E0FE8.6000508@kaltenbrunner.cc
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Stefan Kaltenbrunner wrote:
> Simon Riggs wrote:
>> On Sun, 2009-06-21 at 10:28 +0200, Stefan Kaltenbrunner wrote:
>>> I did some limited testing on that but I was unable to measure any
>>> significant effect - especially since the difference between
>>> wal-logged and not is rather small for a non-parallel COPY (ie in the
>>> above example you get around 6m20s runtime for wal-logged and ~5m40s
>>> in the other case).
>>
>> This is a common confusion for small tests.
>>
>> Non-WAL logged case causes all buffers to be written to disk at end of
>> COPY. This is roughly the same size as the volume of WAL written. In
>> logged case we do not write data blocks, they get written at next
>> checkpoint. So the reduction in I/O is not apparent, since during the
>> period of the test the I/O is about the same in both cases and less I/O
>> in the non-WAL logged case. On longer tests the difference shows more
>> clearly because the data blocks start to migrate out of shared buffers
>> while the COPY is still running, effecting the test results.
>
> I was actually testing with and without explicit CHECKPOINTing
> before/after the load(and also with longer runs) too - the difference is
> negligible especially with only one process involved.
> I think the difference is simply not that large because we are still
> mostly CPU bound within COPY on reasonably fast IO-subsystems.

hmm to further demonstrate that I just did some testing(same config as
before and the 16MB for the buffer) by loading those 60M rows into a
20GB ramdisk instead of the SAN(with a CHECKPOINT before and after).

this results in the following "improvements":

16384:
wal bypass: 5min40s -> 5min10s (~9%)
wal logged: 6min20s -> 6min8s (~3%)

vmstat 5 output shows that the system is in fact CPU bound (ie using ~6%
which is more or less a full core on a 16 core box) and not doing
anything IO-wise.

r b swpd free buff cache si so bi bo in cs us sy
id wa st
1 0 0 19010444 118648 15415684 0 0 0 6 1068 2151
6 0 94 0 0
1 0 0 18870448 118652 15555204 0 0 0 6 1069 2142
6 0 94 0 0
1 0 0 18730568 118684 15694872 0 0 0 185 1080 2151
6 0 94 0 0
1 0 0 18591236 118692 15834516 0 0 0 19 1072 2144
6 0 94 0 0
1 0 0 18451472 118696 15973532 0 0 0 46 1073 2152
6 0 94 0 0
1 0 0 18311720 118704 16113204 0 0 0 7 1059 2136
6 0 94 0 0
1 0 0 18171968 118704 16252944 0 0 0 0 1077 2171
6 0 94 0 0
1 0 0 18032088 118712 16392300 0 0 0 54 1062 2138
6 0 94 0 0
1 0 0 17891716 118720 16532060 0 0 0 8 1078 2176
6 0 94 0 0

So I do think that IO is in fact not too significant for this kind of
testing and we still have ways to go in terms of CPU efficiency in COPY.

Stefan

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Smith <gsmith(at)gregsmith(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alan Li <ali(at)truviso(dot)com>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-21 15:16:35
Message-ID:	603c8f070906210816x10cfca35m6ceb5aa05f56b7dd@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, Jun 21, 2009 at 6:48 AM, Stefan
Kaltenbrunner<stefan(at)kaltenbrunner(dot)cc> wrote:
> So I do think that IO is in fact not too significant for this kind of
> testing and we still have ways to go in terms of CPU efficiency in COPY.

It would be interesting to see some gprof or oprofile output from that
test. I went back and dug up the results that I got when I profiled
this patch during initial development, and my version of the patch
applied, the profile looked like this on my system:

% cumulative self self total
time seconds seconds calls s/call s/call name
14.48 0.85 0.85 1 0.85 5.47 DoCopy
10.05 1.44 0.59 10000001 0.00 0.00 CopyReadLine
5.62 1.77 0.33 10000039 0.00 0.00 PageAddItem
5.11 2.07 0.30 10400378 0.00 0.00 LWLockRelease
4.68 2.35 0.28 10000013 0.00 0.00 heap_insert
4.34 2.60 0.26 10000012 0.00 0.00 heap_formtuple
3.83 2.83 0.23 10356158 0.00 0.00 LWLockAcquire
3.83 3.05 0.23 10000054 0.00 0.00 MarkBufferDirty
3.32 3.25 0.20 10000013 0.00 0.00 RelationGetBufferForTuple
3.07 3.43 0.18 10000005 0.00 0.00 pg_verify_mbstr_len
2.90 3.60 0.17 10000002 0.00 0.00 CopyGetData
2.73 3.76 0.16 20000030 0.00 0.00 enlargeStringInfo
2.73 3.92 0.16 20000014 0.00 0.00 pq_getbytes
2.04 4.04 0.12 10000000 0.00 0.00 InputFunctionCall

...but this might not be very representative, since I think I may have
tested it on a single-column table. It would be interesting to see
some other results.

Simon had the idea of further improving performance by keeping the
current buffer locked (this patch just kept it pinned, but not
locked), but I didn't see an obvious clean design for that. Heikki
also had a patch for speeding up copy, but it got dropped for 8.4 due
to time constraints.

...Robert

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alan Li <ali(at)truviso(dot)com>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-21 15:31:54
Message-ID:	12404.1245598314@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> It would be interesting to see some gprof or oprofile output from that
> test. I went back and dug up the results that I got when I profiled
> this patch during initial development, and my version of the patch
> applied, the profile looked like this on my system:

Were you testing with a temp table? The lack of XLogInsert in your
profile is striking. Stefan's results upthread had it at the top,
and I got more or less the same thing here (didn't keep my numbers
unfortunately).

> Simon had the idea of further improving performance by keeping the
> current buffer locked (this patch just kept it pinned, but not
> locked), but I didn't see an obvious clean design for that.

The potential for deadlock seems sufficient reason not to do that.

regards, tom lane

From:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alan Li <ali(at)truviso(dot)com>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-21 15:35:05
Message-ID:	4A3E5329.2000105@kaltenbrunner.cc
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tom Lane wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> It would be interesting to see some gprof or oprofile output from that
>> test. I went back and dug up the results that I got when I profiled
>> this patch during initial development, and my version of the patch
>> applied, the profile looked like this on my system:
>
> Were you testing with a temp table? The lack of XLogInsert in your
> profile is striking. Stefan's results upthread had it at the top,
> and I got more or less the same thing here (didn't keep my numbers
> unfortunately).

I guess that profile was for the wal bypass case and it looks fairly
similiar to what I get here(lineitem table into tmpfs - though only 30M
rows this time to keep VM pressure low):

samples % symbol name
286197 17.1997 DoCopy
232958 14.0002 CopyReadLine
99762 5.9954 DecodeNumber
92751 5.5741 heap_fill_tuple
84439 5.0746 pg_verify_mbstr_len
65421 3.9316 InputFunctionCall
62502 3.7562 DecodeDate
53565 3.2191 heap_form_tuple
47731 2.8685 ParseDateTime
41206 2.4764 DecodeDateTime
39936 2.4001 pg_next_dst_boundary
36093 2.1691 AllocSetAlloc
33967 2.0413 heap_compute_data_size
29921 1.7982 float4in
27227 1.6363 DetermineTimeZoneOffset
25622 1.5398 pg_atoi
24703 1.4846 pg_mblen
24495 1.4721 .plt
23912 1.4371 pg_mbstrlen_with_len
23448 1.4092 bpchar_input
20033 1.2039 date2j
16331 0.9815 date_in
15684 0.9426 bpcharin
14819 0.8906 PageAddItem
14261 0.8571 ValidateDate

Stefan

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Simon Riggs <simon(at)2ndQuadrant(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Greg Smith <gsmith(at)gregsmith(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alan Li <ali(at)truviso(dot)com>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-21 15:52:48
Message-ID:	12771.1245599568@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> I wonder if using the small ring showed any benefit when the COPY is not
> WAL-logged? In that scenario block-on-WAL-flush behavior doesn't happen,
> so the small ring might have some L2 cache benefits.

I think the notion that we might get a cache win from a smaller ring
is an illusion. We're not expecting to go back and re-read from a
previously filled page in this scenario. In any case, all of the
profiling results so far show that the CPU bottlenecks are elsewhere.
Until we can squeeze an order of magnitude out of COPY's data parsing
and/or XLogInsert, any possible cache effects will be down in the noise.

So to my mind, the only question left to answer (at least for the 8.4
cycle) is "is 16MB enough, or do we want to make the ring even bigger?".
Right at the moment I'd be satisfied with 16, but I wonder whether there
are scenarios where 32MB would show a significant advantage.

regards, tom lane

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alan Li <ali(at)truviso(dot)com>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-21 16:07:00
Message-ID:	603c8f070906210907h4388a84fxa89340884852c7a6@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, Jun 21, 2009 at 11:31 AM, Tom Lane<tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> It would be interesting to see some gprof or oprofile output from that
>> test. I went back and dug up the results that I got when I profiled
>> this patch during initial development, and my version of the patch
>> applied, the profile looked like this on my system:
>
> Were you testing with a temp table?

No.

> The lack of XLogInsert in your
> profile is striking. Stefan's results upthread had it at the top,
> and I got more or less the same thing here (didn't keep my numbers
> unfortunately).

As Stephen guessed, I created the table in the same transaction that I
inserted into it...

>> Simon had the idea of further improving performance by keeping the
>> current buffer locked (this patch just kept it pinned, but not
>> locked), but I didn't see an obvious clean design for that.
>
> The potential for deadlock seems sufficient reason not to do that.

Yep, that was the problem. *thinks*

I think we had the idea of buffering up enough tuples to fill a page
(estimating conservatively so as to make sure we actually fill it),
and then inserting them all at once. But I'm not sure how much
trouble that causes in terms of the timing of inserting index entries
and firing of after row insert triggers. If the command ID doesn't
change, it seems like it might be OK. Or at worst, even if the
optimization would only work in cases where there are no after row
triggers, that would still be useful to a lot of people, I think.

I haven't really spent much time on this angle of attack and
completely confess to not understanding all of the issues...

...Robert

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Greg Smith <gsmith(at)gregsmith(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alan Li <ali(at)truviso(dot)com>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-21 16:11:36
Message-ID:	603c8f070906210911mf797cfbxec2fc70d8a8e628f@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, Jun 21, 2009 at 11:52 AM, Tom Lane<tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>> I wonder if using the small ring showed any benefit when the COPY is not
>> WAL-logged? In that scenario block-on-WAL-flush behavior doesn't happen,
>> so the small ring might have some L2 cache benefits.
>
> I think the notion that we might get a cache win from a smaller ring
> is an illusion. We're not expecting to go back and re-read from a
> previously filled page in this scenario. In any case, all of the
> profiling results so far show that the CPU bottlenecks are elsewhere.
> Until we can squeeze an order of magnitude out of COPY's data parsing
> and/or XLogInsert, any possible cache effects will be down in the noise.
>
> So to my mind, the only question left to answer (at least for the 8.4
> cycle) is "is 16MB enough, or do we want to make the ring even bigger?".
> Right at the moment I'd be satisfied with 16, but I wonder whether there
> are scenarios where 32MB would show a significant advantage.

Even 32MB is not that much. It seems to me that in any realistic
production scenario you're going to have at least half a gig of shared
buffers, so we're really talking about at most one-sixteenth of the
shared buffer arena, and possibly quite a bit less. I think that's
pretty conservative.

...Robert

From:	Greg Stark <gsstark(at)mit(dot)edu>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alan Li <ali(at)truviso(dot)com>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-21 16:15:14
Message-ID:	407d949e0906210915i3c8e2682kf8d8939ee38efd2d@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, Jun 21, 2009 at 5:07 PM, Robert Haas<robertmhaas(at)gmail(dot)com> wrote:
>
> I think we had the idea of buffering up enough tuples to fill a page
> (estimating conservatively so as to make sure we actually fill it),
> and then inserting them all at once. But I'm not sure how much
> trouble that causes in terms of the timing of inserting index entries
> and firing of after row insert triggers. If the command ID doesn't
> change, it seems like it might be OK. Or at worst, even if the
> optimization would only work in cases where there are no after row
> triggers, that would still be useful to a lot of people, I think.
>
> I haven't really spent much time on this angle of attack and
> completely confess to not understanding all of the issues...

There was some discussion of doing this in general for all inserts
inside the indexam. The btree indexam could buffer up any inserts done
within the transaction and keep them in an in-memory btree. Any
lookups done within the transaction first look up in the in-memory
tree then the disk. If the in-memory buffer fills up then we flush
them to the index.

The reason this is tempting is that we could then insert them all in a
single index-merge operation which would often be more efficient than
retail inserts.

--
greg
http://mit.edu/~gsstark/resume.pdf

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Greg Stark <gsstark(at)mit(dot)edu>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alan Li <ali(at)truviso(dot)com>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-21 16:38:34
Message-ID:	14308.1245602314@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Greg Stark <gsstark(at)mit(dot)edu> writes:
> There was some discussion of doing this in general for all inserts
> inside the indexam. The btree indexam could buffer up any inserts done
> within the transaction and keep them in an in-memory btree. Any
> lookups done within the transaction first look up in the in-memory
> tree then the disk. If the in-memory buffer fills up then we flush
> them to the index.

> The reason this is tempting is that we could then insert them all in a
> single index-merge operation which would often be more efficient than
> retail inserts.

That's not gonna work for a unique index, which unfortunately is a
pretty common case ...

regards, tom lane

From:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alan Li <ali(at)truviso(dot)com>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-21 16:53:56
Message-ID:	4A3E65A4.5030101@kaltenbrunner.cc
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tom Lane wrote:
> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>> I wonder if using the small ring showed any benefit when the COPY is not
>> WAL-logged? In that scenario block-on-WAL-flush behavior doesn't happen,
>> so the small ring might have some L2 cache benefits.
>
> I think the notion that we might get a cache win from a smaller ring
> is an illusion. We're not expecting to go back and re-read from a
> previously filled page in this scenario. In any case, all of the
> profiling results so far show that the CPU bottlenecks are elsewhere.
> Until we can squeeze an order of magnitude out of COPY's data parsing
> and/or XLogInsert, any possible cache effects will be down in the noise.

we also need to take a serious look at our locking overhead - WAL logged
COPY is already taking a significant performance hit with just a second
process running in parallel(into a seperate table).
I just did some testing using those 16MB buffer, the upthread mentioned
postgresql.conf and a 20GB tmpfs.

The following copying 3M rows(each) into a seperate table of the same
database.

processes total time(s) rows/s rows/s - per core

1 17.5 171428.57 171428.57
2 20.8 288461.54 144230.77
4 25.5 470588.24 117647.06
6 31.1 578778.14 96463.02
8 41.4 579710.14 72463.77
10 63 476190.48 47619.05
12 89 404494.38 33707.87
14 116 362068.97 25862.07
16 151 317880.79 19867.55

the higher the process count the more erratic the box behaves - it will
show a very high context switch rate (between 300000 and 400000/s) a
large amount of idle time (>60%!).

example vmstat 5 output for the 12 process test:

7 0 0 21654500 45436 12932516 0 0 0 3 1079 336941
34 7 59 0 0
6 0 0 21354044 45444 13232444 0 0 0 52 1068 341836
35 7 59 0 0
4 0 0 21053832 45452 13531472 0 0 0 23 1082 341672
35 7 59 0 0
9 0 0 20751136 45460 13833336 0 0 0 41 1063 344117
35 7 59 0 0
6 0 0 20443856 45468 14138116 0 0 0 14 1079 349398
35 7 58 0 0
8 0 0 20136592 45476 14444644 0 0 0 8 1060 351569
35 7 58 0 0
10 0 0 19836600 45484 14743320 0 0 0 144 1086 341533
35 7 58 0 0
7 0 0 19540472 45492 15039616 0 0 0 94 1067 337731
36 7 58 0 0
2 0 0 19258244 45500 15321156 0 0 0 15 1079 311394
34 6 60 0 0

Stefan

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alan Li <ali(at)truviso(dot)com>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-21 17:00:09
Message-ID:	14771.1245603609@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
> The following copying 3M rows(each) into a seperate table of the same
> database.

Is this with WAL, or bypassing WAL? Given what we've already seen,
a lot of contention for WALInsertLock wouldn't surprise me much here.
It should be possible to bypass that though.

regards, tom lane

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Greg Smith <gsmith(at)gregsmith(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alan Li <ali(at)truviso(dot)com>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-21 17:37:28
Message-ID:	4A3E6FD8.9020607@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Robert Haas wrote:
> On Sun, Jun 21, 2009 at 11:52 AM, Tom Lane<tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> So to my mind, the only question left to answer (at least for the 8.4
>> cycle) is "is 16MB enough, or do we want to make the ring even bigger?".
>> Right at the moment I'd be satisfied with 16, but I wonder whether there
>> are scenarios where 32MB would show a significant advantage.
>
> Even 32MB is not that much. It seems to me that in any realistic
> production scenario you're going to have at least half a gig of shared
> buffers, so we're really talking about at most one-sixteenth of the
> shared buffer arena, and possibly quite a bit less. I think that's
> pretty conservative.

I was going to say that since we flush the WAL every 16MB anyway (at
every XLOG file switch), you shouldn't see any benefit with larger ring
buffers, since to fill 16MB of data you're not going to write more than
16MB WAL. But then I realized that that's not true if you have an
unusually low fillfactor. If you only fill each page say 50% full,
you're going to use 32MB worth of data pages but only write 16MB of WAL.
And maybe you could have a situation like that with very wide rows as
well, with wasted space on each page that's not enough to store one more
row.

Could you test that scenario?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Greg Smith <gsmith(at)gregsmith(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alan Li <ali(at)truviso(dot)com>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-21 17:42:50
Message-ID:	15469.1245606170@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> I was going to say that since we flush the WAL every 16MB anyway (at
> every XLOG file switch), you shouldn't see any benefit with larger ring
> buffers, since to fill 16MB of data you're not going to write more than
> 16MB WAL.

I'm not convinced that WAL segment boundaries are particularly relevant
to this. The unit of flushing is an 8K page, not a segment.

I wonder though whether the wal_buffers setting interacts with the
ring size. Has everyone who's tested this used the same 16MB
wal_buffers setting as in Alan's original scenario?

regards, tom lane

From:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alan Li <ali(at)truviso(dot)com>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-21 17:58:32
Message-ID:	4A3E74C8.8070208@kaltenbrunner.cc
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tom Lane wrote:
> Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
>> The following copying 3M rows(each) into a seperate table of the same
>> database.
>
> Is this with WAL, or bypassing WAL? Given what we've already seen,
> a lot of contention for WALInsertLock wouldn't surprise me much here.
> It should be possible to bypass that though.

this was with WAL. here are the numbers for bypass:

processes total time(s) rows/s rows/s - per core

1 15 200000 200000
2 15.1 397350.99 198675.5
4 15.2 789473.68 197368.42
6 15.3 1176470.59 196078.43
8 16.2 1481481.48 185185.19
10 21.9 1369863.01 136986.3
12 22.7 1585903.08 132158.59
14 25.2 1666666.67 119047.62
16 27.9 1720430.11 107526.88

runtimes grew very short here but the numbers still seem sane and if you
compare them to what I did on real storage(though without the 16MB
ringbuffer fix!):

http://www.kaltenbrunner.cc/blog/index.php?/archives/27-Benchmarking-8.4-Chapter-2bulk-loading.html

you will see that for a single core there is almost no performance
difference between ramdisk and real disk, at 8 cores there is the
largest gap at around 45% but on 16 cores we are down to a mere 20%
difference.
All in all it seems that we have a big locking issue with WALInsertLock
and even with that removed we are mostly CPU limited and not IO limited
for COPY.

Stefan

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Greg Stark <gsstark(at)mit(dot)edu>, Robert Haas <robertmhaas(at)gmail(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alan Li <ali(at)truviso(dot)com>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-21 18:26:02
Message-ID:	1245608762.31430.29.camel@ebony.2ndQuadrant
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, 2009-06-21 at 12:38 -0400, Tom Lane wrote:
> Greg Stark <gsstark(at)mit(dot)edu> writes:
> > There was some discussion of doing this in general for all inserts
> > inside the indexam. The btree indexam could buffer up any inserts done
> > within the transaction and keep them in an in-memory btree. Any
> > lookups done within the transaction first look up in the in-memory
> > tree then the disk. If the in-memory buffer fills up then we flush
> > them to the index.
>
> > The reason this is tempting is that we could then insert them all in a
> > single index-merge operation which would often be more efficient than
> > retail inserts.
>
> That's not gonna work for a unique index, which unfortunately is a
> pretty common case ...

I think it can. If we fail on a unique index we fail. We aren't
expecting that, else we wouldn't be using COPY. So I reckon its
acceptable to load a whole block of rows and then load a whole blocks's
worth of index entries. The worst thing that can happen is we insert a
few extra heap rows that get aborted, which is small in comparison to
the potential gains from buffering.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Greg Smith <gsmith(at)gregsmith(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alan Li <ali(at)truviso(dot)com>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-21 18:29:09
Message-ID:	1245608949.31430.33.camel@ebony.2ndQuadrant
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, 2009-06-21 at 20:37 +0300, Heikki Linnakangas wrote:
> Robert Haas wrote:
> > On Sun, Jun 21, 2009 at 11:52 AM, Tom Lane<tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> So to my mind, the only question left to answer (at least for the 8.4
> >> cycle) is "is 16MB enough, or do we want to make the ring even bigger?".
> >> Right at the moment I'd be satisfied with 16, but I wonder whether there
> >> are scenarios where 32MB would show a significant advantage.
> >
> > Even 32MB is not that much. It seems to me that in any realistic
> > production scenario you're going to have at least half a gig of shared
> > buffers, so we're really talking about at most one-sixteenth of the
> > shared buffer arena, and possibly quite a bit less. I think that's
> > pretty conservative.
>
> I was going to say that since we flush the WAL every 16MB anyway (at
> every XLOG file switch), you shouldn't see any benefit with larger ring
> buffers, since to fill 16MB of data you're not going to write more than
> 16MB WAL. But then I realized that that's not true if you have an
> unusually low fillfactor. If you only fill each page say 50% full,
> you're going to use 32MB worth of data pages but only write 16MB of WAL.
> And maybe you could have a situation like that with very wide rows as
> well, with wasted space on each page that's not enough to store one more
> row.

If walwriter is working correctly then it should be writing and fsyncing
WAL, while the COPY process just inserts WAL. I don't see that as an
argument to limit us to 16MB. But I take your point as being an argument
in favour of that as a consensus value for us to choose.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

From:	Alan Li <ali(at)truviso(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-22 06:42:40
Message-ID:	782056770906212342n2e04c3b7k5987b6e59a1f8374@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi Tom,

How much concern is there for the contention for use cases where the WAL
can't be bypassed?

Thanks, Alan

On Sun, Jun 21, 2009 at 10:00 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
> > The following copying 3M rows(each) into a seperate table of the same
> > database.
>
> Is this with WAL, or bypassing WAL? Given what we've already seen,
> a lot of contention for WALInsertLock wouldn't surprise me much here.
> It should be possible to bypass that though.
>
> regards, tom lane
>
>
>

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Greg Smith <gsmith(at)gregsmith(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alan Li <ali(at)truviso(dot)com>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-22 07:52:47
Message-ID:	4A3F384F.6080908@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tom Lane wrote:
> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>> I was going to say that since we flush the WAL every 16MB anyway (at
>> every XLOG file switch), you shouldn't see any benefit with larger ring
>> buffers, since to fill 16MB of data you're not going to write more than
>> 16MB WAL.
>
> I'm not convinced that WAL segment boundaries are particularly relevant
> to this. The unit of flushing is an 8K page, not a segment.

We fsync() the old WAL segment every time we switch to a new WAL
segment. That's what I meant by "flush".

If the walwriter is keeping up, it will fsync() the WAL more often, but
16MB is the maximum distance between fsync()s.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Greg Smith <gsmith(at)gregsmith(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alan Li <ali(at)truviso(dot)com>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-22 08:20:51
Message-ID:	1245658851.31430.45.camel@ebony.2ndQuadrant
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, 2009-06-22 at 10:52 +0300, Heikki Linnakangas wrote:
> Tom Lane wrote:
> > Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> >> I was going to say that since we flush the WAL every 16MB anyway (at
> >> every XLOG file switch), you shouldn't see any benefit with larger ring
> >> buffers, since to fill 16MB of data you're not going to write more than
> >> 16MB WAL.
> >
> > I'm not convinced that WAL segment boundaries are particularly relevant
> > to this. The unit of flushing is an 8K page, not a segment.
>
> We fsync() the old WAL segment every time we switch to a new WAL
> segment. That's what I meant by "flush".
>
> If the walwriter is keeping up, it will fsync() the WAL more often, but
> 16MB is the maximum distance between fsync()s.

Yes, but the fsync is performed by the process that writes the WAL, not
necessarily by the process that inserts the WAL. In perfect balance, an
inserter-of-WAL could insert an infinite amount of WAL and never need to
fsync the WAL. So the question is are we in balance between WALWriter
and COPY?

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Alan Li <ali(at)truviso(dot)com>
Cc:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-22 14:16:15
Message-ID:	16210.1245680175@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Alan Li <ali(at)truviso(dot)com> writes:
> How much concern is there for the contention for use cases where the WAL
> can't be bypassed?

If you mean "is something going to be done about it in 8.4", the
answer is "no". This is a pre-existing issue that there is no simple
fix for.

regards, tom lane

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Alan Li <ali(at)truviso(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-22 14:20:51
Message-ID:	4A3F9343.4080409@dunslane.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tom Lane wrote:
> Alan Li <ali(at)truviso(dot)com> writes:
>
>> How much concern is there for the contention for use cases where the WAL
>> can't be bypassed?
>>
>
> If you mean "is something going to be done about it in 8.4", the
> answer is "no". This is a pre-existing issue that there is no simple
> fix for.
>
>
>

I thought he was asking if we intend to provide for WAL bypass on a
table by table basis in future.

cheers

andrew

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	Alan Li <ali(at)truviso(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-22 14:27:17
Message-ID:	16598.1245680837@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Tom Lane wrote:
>> Alan Li <ali(at)truviso(dot)com> writes:
>>> How much concern is there for the contention for use cases where the WAL
>>> can't be bypassed?
>>
>> If you mean "is something going to be done about it in 8.4", the
>> answer is "no". This is a pre-existing issue that there is no simple
>> fix for.

> I thought he was asking if we intend to provide for WAL bypass on a
> table by table basis in future.

I thought he was asking for a solution to the problem of WALInsertLock
contention. In any case, we have "WAL bypass on a table by table basis"
now, don't we?

regards, tom lane

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Alan Li <ali(at)truviso(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-22 14:47:13
Message-ID:	4A3F9971.3010603@dunslane.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tom Lane wrote:
> I thought he was asking for a solution to the problem of WALInsertLock
> contention. In any case, we have "WAL bypass on a table by table basis"
> now, don't we?
>
>
>

If we do I'm ignorant of it ;-) How do we say "Never WAL this table"?

cheers

andrew

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	Alan Li <ali(at)truviso(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-22 14:51:20
Message-ID:	17657.1245682280@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Tom Lane wrote:
>> I thought he was asking for a solution to the problem of WALInsertLock
>> contention. In any case, we have "WAL bypass on a table by table basis"
>> now, don't we?

> If we do I'm ignorant of it ;-) How do we say "Never WAL this table"?

Make it a temporary table.

The more useful case for data load is "create or truncate it in the
same transaction", of course.

regards, tom lane

From:	Aidan Van Dyk <aidan(at)highrise(dot)ca>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alan Li <ali(at)truviso(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-22 14:53:28
Message-ID:	20090622145328.GZ8417@yugib.highrise.ca
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

* Andrew Dunstan <andrew(at)dunslane(dot)net> [090622 10:47]:

> If we do I'm ignorant of it ;-) How do we say "Never WAL this table"?

CREATE TEMPORARY TABLE ...

--
Aidan Van Dyk Create like a god,
aidan(at)highrise(dot)ca command like a king,
http://www.highrise.ca/ work like a slave.

From:	Stephen Frost <sfrost(at)snowman(dot)net>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, Alan Li <ali(at)truviso(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-22 15:11:08
Message-ID:	20090622151108.GA20436@tamriel.snowman.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

* Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
> The more useful case for data load is "create or truncate it in the
> same transaction", of course.

Unfortunately, WAL bypass also requires not being in archive mode with
no way to turn that off w/o a server restart, aiui.

Thanks,

Stephen

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Stephen Frost <sfrost(at)snowman(dot)net>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, Alan Li <ali(at)truviso(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-22 15:14:24
Message-ID:	18196.1245683664@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Stephen Frost <sfrost(at)snowman(dot)net> writes:
> * Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
>> The more useful case for data load is "create or truncate it in the
>> same transaction", of course.

> Unfortunately, WAL bypass also requires not being in archive mode with
> no way to turn that off w/o a server restart, aiui.

Well, if you're trying to archive then you certainly wouldn't want WAL
off, so I'm failing to see where this thread is going ...

regards, tom lane

From:	Stephen Frost <sfrost(at)snowman(dot)net>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, Alan Li <ali(at)truviso(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-22 15:18:42
Message-ID:	20090622151842.GB20436@tamriel.snowman.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

* Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
> Stephen Frost <sfrost(at)snowman(dot)net> writes:
> > Unfortunately, WAL bypass also requires not being in archive mode with
> > no way to turn that off w/o a server restart, aiui.
>
> Well, if you're trying to archive then you certainly wouldn't want WAL
> off, so I'm failing to see where this thread is going ...

I disagree. I'd love to be able to say "please bypass WAL logging for
this bulk load" because I know that I'll pick up the data during my next
full dump and I can reload it from original if I get disrupted before
then. This is especially true when you're doing bulk loads of static or
reference data from another data source.

Thanks,

Stephen

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc:	Stephen Frost <sfrost(at)snowman(dot)net>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Alan Li <ali(at)truviso(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-22 15:24:10
Message-ID:	18460.1245684250@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> I was thinking it might be beneficial to be able to defer writing WAL
> until COPY is complete, so heap_sync would either fsync the whole heap
> file or copy the whole file to WAL.

What about indexes?

regards, tom lane

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Stephen Frost <sfrost(at)snowman(dot)net>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Alan Li <ali(at)truviso(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-22 15:25:16
Message-ID:	1245684316.31430.126.camel@ebony.2ndQuadrant
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, 2009-06-22 at 11:14 -0400, Tom Lane wrote:
> Stephen Frost <sfrost(at)snowman(dot)net> writes:
> > * Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
> >> The more useful case for data load is "create or truncate it in the
> >> same transaction", of course.
>
> > Unfortunately, WAL bypass also requires not being in archive mode with
> > no way to turn that off w/o a server restart, aiui.
>
> Well, if you're trying to archive then you certainly wouldn't want WAL
> off, so I'm failing to see where this thread is going ...

I was thinking it might be beneficial to be able to defer writing WAL
until COPY is complete, so heap_sync would either fsync the whole heap
file or copy the whole file to WAL.

That would avoid writing WAL piecemeal because we could just backup the
whole block, plus we wouldn't write anything at all if the COPY failed.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Stephen Frost <sfrost(at)snowman(dot)net>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Alan Li <ali(at)truviso(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-22 15:34:49
Message-ID:	1245684889.31430.131.camel@ebony.2ndQuadrant
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, 2009-06-22 at 11:24 -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> > I was thinking it might be beneficial to be able to defer writing WAL
> > until COPY is complete, so heap_sync would either fsync the whole heap
> > file or copy the whole file to WAL.
>
> What about indexes?

I was thinking we could do exactly as stated for the cases that would be
WAL-bypass now, but need to write WAL because XLogArchivingActive().
i.e. improve the exact case we are measuring here.

Yes, it is more complex than that for loading to existing tables.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Alan Li <ali(at)truviso(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-22 16:42:47
Message-ID:	4A3FB487.30409@dunslane.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tom Lane wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>
>> Tom Lane wrote:
>>
>>> I thought he was asking for a solution to the problem of WALInsertLock
>>> contention. In any case, we have "WAL bypass on a table by table basis"
>>> now, don't we?
>>>
>
>
>> If we do I'm ignorant of it ;-) How do we say "Never WAL this table"?
>>
>
> Make it a temporary table.
>
>
>

That doesn't help if you need the data visible in multiple sessions. But
we're digressing from the original topic. Sorry.

cheers

andrew

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Greg Smith <gsmith(at)gregsmith(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alan Li <ali(at)truviso(dot)com>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-22 17:00:04
Message-ID:	20492.1245690004@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> Tom Lane wrote:
>> I'm not convinced that WAL segment boundaries are particularly relevant
>> to this. The unit of flushing is an 8K page, not a segment.

> We fsync() the old WAL segment every time we switch to a new WAL
> segment. That's what I meant by "flush".

> If the walwriter is keeping up, it will fsync() the WAL more often, but
> 16MB is the maximum distance between fsync()s.

I'm still not convinced --- to my mind the issue is not whether fsyncs
happen but whether the COPY process has to wait for 'em, and I don't
think that segment boundaries directly affect that. I'd still be
interested to see similar measurements done with different wal_buffer
settings.

However, in the interests of getting this resolved in time for 8.4.0,
I propose that we just settle on 16MB as the bulkwrite ring buffer size.
There doesn't seem to be any evidence that a larger size will make for
a significant improvement, and we shouldn't allow COPY to trash a bigger
fraction of the arena than it really has to.

regards, tom lane

From:	Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndQuadrant(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Alan Li <ali(at)truviso(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-22 17:37:04
Message-ID:	39156D9B-6755-4EF0-99AD-C4539E45A952@hi-media.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Le 22 juin 2009 à 17:24, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> a écrit :

> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>> I was thinking it might be beneficial to be able to defer writing WAL
>> until COPY is complete, so heap_sync would either fsync the whole
>> heap
>> file or copy the whole file to WAL.
>
> What about indexes?

Skip this optimisation if there are any.
It's already Common practise to create them only after copy succeeded
when possible for better bulk loading perfs.

Then there's also the copy + Insert ... Select ... technique.

--
dim

From:	"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To:	"Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Greg Smith" <gsmith(at)gregsmith(dot)com>, "Stefan Kaltenbrunner" <stefan(at)kaltenbrunner(dot)cc>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>, "Alan Li" <ali(at)truviso(dot)com>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-22 20:18:35
Message-ID:	4A3FA0CB0200002500027E52@gw.wicourts.gov
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> I wonder though whether the wal_buffers setting interacts with the
> ring size. Has everyone who's tested this used the same 16MB
> wal_buffers setting as in Alan's original scenario?

I had been using his postgresql.conf file, then added autovacuum =
off. When I tried with setting the ring size to 16MB, I accidentally
left off the step to copy the postgresql.conf file, and got better
performance. I alternated between the postgresql.conf file from
earlier tests and the default file left there by the initdb, and got
this:

8.4rc1 with 16MB ring, default postgresql.conf
0m23.223s
0m23.489s
0m23.921s

8.4rc1 with 16MB ring, Alan's postgresql.conf
0m28.678s
0m26.171s
0m27.513s

default postgresql.conf (comments stripped)
max_connections = 100
shared_buffers = 32MB
datestyle = 'iso, mdy'
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'
default_text_search_config = 'pg_catalog.english'

Alan's postgresql.conf (comments stripped)
shared_buffers = 256MB
wal_buffers = 16MB
checkpoint_segments = 100
autovacuum = off

I'm not going to claim I know why, but I thought I should report it.

Oh, and the 8.3.7 numbers and pre-patch numbers were averaging the
same under the day-time load as the replication sync mode. So, with
the ring size at 16MB this load is faster under 8.4 than 8.3.

-Kevin

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc:	"Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Greg Smith" <gsmith(at)gregsmith(dot)com>, "Stefan Kaltenbrunner" <stefan(at)kaltenbrunner(dot)cc>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>, "Alan Li" <ali(at)truviso(dot)com>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-22 20:28:37
Message-ID:	556.1245702517@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> I wonder though whether the wal_buffers setting interacts with the
>> ring size. Has everyone who's tested this used the same 16MB
>> wal_buffers setting as in Alan's original scenario?

> I had been using his postgresql.conf file, then added autovacuum =
> off. When I tried with setting the ring size to 16MB, I accidentally
> left off the step to copy the postgresql.conf file, and got better
> performance.

Huh, that's bizarre. I can see that increasing shared_buffers should
make no difference in this test case (we're not using them all anyway).
But why should increasing wal_buffers make it slower? I forget the
walwriter's control algorithm at the moment ... maybe it works harder
when wal buffers are full?

BTW, I committed the change to use 16MB; that will be in RC2.

regards, tom lane

From:	Greg Smith <gsmith(at)gregsmith(dot)com>
To:	Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alan Li <ali(at)truviso(dot)com>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-22 20:33:23
Message-ID:	alpine.GSO.2.01.0906221626500.2428@westnet.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, 22 Jun 2009, Kevin Grittner wrote:

> When I tried with setting the ring size to 16MB, I accidentally left off
> the step to copy the postgresql.conf file, and got better performance.

Do you have happen to have a build with assertions turned on? That is one
common cause of performance going down via increased shared_buffers that
people tend to run into during beta.

You can check via psql with

show debug_assertions;

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD

From:	"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To:	"Greg Smith" <gsmith(at)gregsmith(dot)com>
Cc:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Stefan Kaltenbrunner" <stefan(at)kaltenbrunner(dot)cc>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>,"Alan Li" <ali(at)truviso(dot)com>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-22 21:06:40
Message-ID:	4A3FAC100200002500027E5E@gw.wicourts.gov
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Greg Smith <gsmith(at)gregsmith(dot)com> wrote:

> Do you have happen to have a build with assertions turned on?

Nope. I showed my ./configure options upthread, but can confirm with
pg_config:

BINDIR = /usr/local/pgsql-8.4rc1/bin
DOCDIR = /usr/local/pgsql-8.4rc1/share/doc
HTMLDIR = /usr/local/pgsql-8.4rc1/share/doc
INCLUDEDIR = /usr/local/pgsql-8.4rc1/include
PKGINCLUDEDIR = /usr/local/pgsql-8.4rc1/include
INCLUDEDIR-SERVER = /usr/local/pgsql-8.4rc1/include/server
LIBDIR = /usr/local/pgsql-8.4rc1/lib
PKGLIBDIR = /usr/local/pgsql-8.4rc1/lib
LOCALEDIR = /usr/local/pgsql-8.4rc1/share/locale
MANDIR = /usr/local/pgsql-8.4rc1/share/man
SHAREDIR = /usr/local/pgsql-8.4rc1/share
SYSCONFDIR = /usr/local/pgsql-8.4rc1/etc
PGXS = /usr/local/pgsql-8.4rc1/lib/pgxs/src/makefiles/pgxs.mk
CONFIGURE = '--prefix=/usr/local/pgsql-8.4rc1'
'--enable-integer-datetimes' '--enable-debug' '--disable-nls'
'--with-libxml'
CC = gcc
CPPFLAGS = -D_GNU_SOURCE -I/usr/include/libxml2
CFLAGS = -O2 -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels -fno-strict-aliasing
-fwrapv -g
CFLAGS_SL = -fpic
LDFLAGS = -Wl,--as-needed -Wl,-rpath,'/usr/local/pgsql-8.4rc1/lib'
LDFLAGS_SL =
LIBS = -lpgport -lxml2 -lz -lreadline -lcrypt -ldl -lm
VERSION = PostgreSQL 8.4rc1

> You can check via psql with
>
> show debug_assertions;

OK, we'll do it your way. :-)

kgrittn(at)ATHENA:~/postgresql-8.4rc1> psql postgres
psql (8.4rc1)
Type "help" for help.

postgres=# show debug_assertions;
debug_assertions
------------------
off
(1 row)

-Kevin

From:	"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Greg Smith" <gsmith(at)gregsmith(dot)com>, "Stefan Kaltenbrunner" <stefan(at)kaltenbrunner(dot)cc>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>, "Alan Li" <ali(at)truviso(dot)com>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-22 21:36:22
Message-ID:	4A3FB3060200002500027E6E@gw.wicourts.gov
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Huh, that's bizarre. I can see that increasing shared_buffers
> should make no difference in this test case (we're not using them
> all anyway). But why should increasing wal_buffers make it slower?
> I forget the walwriter's control algorithm at the moment ... maybe
> it works harder when wal buffers are full?

I created a postgresql.conf file with the options from the default
file, and then tried that by itself again, and with each of three
other options:

<none>
0m24.540s
0m24.630s
0m23.778s

checkpoint_segments = 100
0m30.251s
0m29.474s
0m26.604s

wal_buffers = 16MB
0m24.487s
0m23.939s
0m23.557s

shared_buffers = 256MB
0m25.885s
0m25.654s
0m24.025s

So the big hit seems to come from boosting checkpoint_segments,
although boosting shared_buffers seems to cause a slight slowdown.
Boosting wal_buffers seemed to help a little. Both of these last two,
though, are within the noise, so low confidence on those without a lot
more tests.

The checkpoint_segments seems dramatic enough to be real. I wonder if
the test is short enough that it never got around to re-using any of
them, so it was doing extra writes for the initial creation during the
test?

-Kevin

From:	"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To:	"Stefan Kaltenbrunner" <stefan(at)kaltenbrunner(dot)cc>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>, "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-22 21:40:54
Message-ID:	4A3FB4160200002500027E80@gw.wicourts.gov
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

I wrote:
> Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> wrote:
>
>> A 25-30% performance regression in our main bulk loading mechanism
>> should at least be explained before the release...
>
> I think a performance regression of that magnitude merits holding
> up a release to resolve.

Wow. That sure took a long time to come through. I posted that
days ago....

-Kevin

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Greg Smith" <gsmith(at)gregsmith(dot)com>, "Stefan Kaltenbrunner" <stefan(at)kaltenbrunner(dot)cc>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>, "Alan Li" <ali(at)truviso(dot)com>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-22 23:01:42
Message-ID:	3147.1245711702@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
> The checkpoint_segments seems dramatic enough to be real. I wonder if
> the test is short enough that it never got around to re-using any of
> them, so it was doing extra writes for the initial creation during the
> test?

That's exactly what I was about to suggest. Are you starting each run
from a fresh initdb? If so, try running the load long enough that the
number of WAL files stabilizes (should happen at 2x checkpoint_segments)
and then start the test measurement.

regards, tom lane

From:	Alan Li <ali(at)truviso(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-23 00:29:44
Message-ID:	782056770906221729o573f9511s184b966fc7118e2@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Jun 22, 2009 at 7:16 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Alan Li <ali(at)truviso(dot)com> writes:
> > How much concern is there for the contention for use cases where the WAL
> > can't be bypassed?
>
> If you mean "is something going to be done about it in 8.4", the
> answer is "no". This is a pre-existing issue that there is no simple
> fix for.
>
> regards, tom lane
>
>
> No no, I am certainly not implying anything for the 8.4 timeframe.

Moving forward, I imagine this being more of a problem for data warehouse
applications, where bulk inserts occur on existing fact tables. In this
case, the WAL cannot be bypassed (unless the bulk insert occurs on a newly
created partition). And since COPY is cpu-bound, it would perhaps be
advantageous to do parallel COPY's on the same table on multi-core systems,
which won't work with WAL bypassing either.

Thanks, Alan

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Alan Li <ali(at)truviso(dot)com>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-23 06:20:56
Message-ID:	1245738056.31430.172.camel@ebony.2ndQuadrant
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, 2009-06-22 at 15:18 -0500, Kevin Grittner wrote:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
>
> default postgresql.conf (comments stripped)
> max_connections = 100
> shared_buffers = 32MB

This forces ring size to be 4MB, since min(32MB/8, ringsize).

Please re-run tests with your config, ring size 4MB, which should give
same answer

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

From:	"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Greg Smith" <gsmith(at)gregsmith(dot)com>, "Stefan Kaltenbrunner" <stefan(at)kaltenbrunner(dot)cc>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>, "Alan Li" <ali(at)truviso(dot)com>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-26 20:13:16
Message-ID:	4A44E58C0200002500027FCF@gw.wicourts.gov
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
>> The checkpoint_segments seems dramatic enough to be real. I wonder
>> if the test is short enough that it never got around to re-using
>> any of them, so it was doing extra writes for the initial creation
>> during the test?
>
> That's exactly what I was about to suggest. Are you starting each
> run from a fresh initdb? If so, try running the load long enough
> that the number of WAL files stabilizes (should happen at 2x
> checkpoint_segments) and then start the test measurement.

default conf (xlogs not populated)
real 3m49.604s
real 3m47.225s
real 3m45.831s

default conf (xlogs populated)
real 3m45.603s
real 3m45.284s
real 3m45.906s

default conf + checkpoint_segments = 100 (xlogs not populated)
real 4m27.629s
real 4m24.496s
real 4m22.832s

default conf + checkpoint_segments = 100 (xlogs populated)
real 3m52.746s
real 3m52.619s
real 3m50.418s

I used ten times the number of rows, to get more meaningful results.
To get the "populated" times, I just dropped the target table and
created it again; otherwise identical runs. Clearly, pre-populating
the xlog files reduces run time, especially for a large number of xlog
files; however, I still got better performance with a smaller set of
xlog files.

Regarding the fact that even with the xlog files pre-populated, the
smaller set of xlog files is faster: I'm only guessing, but I suspect
the battery backed RAID controller is what's defeating conventional
wisdom here. By writing to the same, relatively small, set of xlog
files repeatedly, some of the actual disk writes probably evaporate in
the BBU cache. More frequent checkpoints from the smaller number of
xlog files might also have caused data to start streaming to the disk
a little sooner, minimizing write gluts later.

I've often seen similar benefits to the BBU cache which cause some of
the frequently-given advice here to have no discernible affect or be
counter-productive in our environment. (I know that some doubted that
my aggressive background writer settings didn't increase disk writes,
but I couldn't even measure a difference there in the writes from OS
cache to the controller cache, much less anything which indicated it
actually increased physical disk writes.)

By the way, the number of xlog files seemed to always go to two above
2x checkpoint_segments.

-Kevin

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Greg Smith" <gsmith(at)gregsmith(dot)com>, "Stefan Kaltenbrunner" <stefan(at)kaltenbrunner(dot)cc>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>, "Alan Li" <ali(at)truviso(dot)com>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-26 20:30:34
Message-ID:	28355.1246048234@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
> Regarding the fact that even with the xlog files pre-populated, the
> smaller set of xlog files is faster: I'm only guessing, but I suspect
> the battery backed RAID controller is what's defeating conventional
> wisdom here. By writing to the same, relatively small, set of xlog
> files repeatedly, some of the actual disk writes probably evaporate in
> the BBU cache.

Yeah, sounds plausible. How big is your BBU cache?

regards, tom lane

From:	"Andrew Dunstan" <andrew(at)dunslane(dot)net>
To:	"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Greg Smith" <gsmith(at)gregsmith(dot)com>, "Stefan Kaltenbrunner" <stefan(at)kaltenbrunner(dot)cc>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>, "Alan Li" <ali(at)truviso(dot)com>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-26 20:37:25
Message-ID:	50838.71.76.58.95.1246048645.squirrel@dunslane.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, June 26, 2009 4:13 pm, Kevin Grittner wrote:
> By the way, the number of xlog files seemed to always go to two above
> 2x checkpoint_segments.

The docs say:

"There will always be at least one WAL segment file, and will normally not
be more than (2 + checkpoint_completion_target) * checkpoint_segments + 1
files."

cheers

andrew

From:	"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Greg Smith" <gsmith(at)gregsmith(dot)com>, "Stefan Kaltenbrunner" <stefan(at)kaltenbrunner(dot)cc>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>, "Alan Li" <ali(at)truviso(dot)com>
Subject:	Re: 8.4 open item: copy performance regression?
Date:	2009-06-26 21:20:23
Message-ID:	4A44F5470200002500027FDF@gw.wicourts.gov
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> How big is your BBU cache?

On this machine, I guess it is 512MB. (Possibly 1GB, but I'm having
trouble finding the right incantation to check it at the moment, so
I'm going by what the hardware tech remembers.)

-Kevin