Re: Move PinBuffer and UnpinBuffer to atomics

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Merlin Moncure <mmoncure(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, YUriy Zhuravlev <u(dot)zhuravlev(at)postgrespro(dot)ru>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Move PinBuffer and UnpinBuffer to atomics
Date: 2016-04-07 01:58:50
Message-ID: CA+TgmoZJdA6K7-17K4A48rVB0UPR98HVuaNcfNNLrGsdb1uChg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Apr 6, 2016 at 10:04 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> On Wed, Apr 6, 2016 at 3:22 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>> Which scale did you initialize with? I'm trying to reproduce the
>> workload on hydra as precisely as possible...
>
> I tested with scale factor 300, shared buffer 8GB.
>
> My test script is attached with the mail (perf_pgbench_ro.sh).
>
> I have done some more tests on power (same machine)

I spent a lot of time testing things on power2 today and also
discussed with Andres via IM. Andres came up with a few ideas to
reduce the variability, which I tried. One of those was to run the
server under numactl --interleave=all (that is, numactl
--interleave=all pg_ctl start etc.) and another was to set
kernel.numa_balancing = 0 (it was set to 1). Neither of those things
seemed to prevent the problem of run-to-run variability. Andres also
suggested running pgbench with "-P 1", which revealed that it was
generally possible to tell what the overall performance of a run was
going to be like within 10-20 seconds. Runs that started out fast
stayed fast, and those that started out slower remained slower.
Therefore, long runs didn't seem to be necessary for testing, so I
switched to using 2-minute test runs launched via pgbench -T 120 -c 64
-j 64 -n -M prepared -S -P1.
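
For anyone trying to reproduce this, those knobs translate into roughly
the following commands (illustrative only; the initialization step and
shared_buffers = 8GB come from Dilip's description upthread, and the
data directory is a placeholder):

# initialize the benchmark database: scale factor 300, shared_buffers = 8GB
pgbench -i -s 300
# disable automatic NUMA balancing (it had been set to 1)
sysctl -w kernel.numa_balancing=0
# start the server with memory interleaved across all NUMA nodes
numactl --interleave=all pg_ctl -D $PGDATA start
# one 2-minute read-only run, reporting per-second progress
pgbench -T 120 -c 64 -j 64 -n -M prepared -S -P 1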

After quite a bit of experimentation, Andres hit on an idea that did
succeed in drastically reducing the run-to-run variation: prewarming
all of the relations in a deterministic order before starting the
test. I used this query:

psql -c "select sum(x.x) from (select pg_prewarm(oid) as x from
pg_class where relkind in ('i', 'r') order by oid) x;"
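
As an aside, for anyone replaying this: pg_prewarm ships as a contrib
module, so it has to be installed in the benchmark database once before
that query will work, along these lines:

# install the contrib extension that provides pg_prewarm()
psql -c "create extension if not exists pg_prewarm;"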

With that change to my test script, the results became much more
stable. I tested four different builds of the server: commit 3fed4174
(that is, the one just before the point where you have reported the
variability to have begun), commit 6150a1b0 (the one at which you
reported that the variability actually began), master as of this morning
(actually commit cac0e366), and master + pinunpin-cas-9.patch +
0001-WIP-Avoid-the-use-of-a-separate-spinlock-to-protect-.patch +
backoff.patch (herein called "andres"). The first and last of these
have 64-byte BufferDescs, and the others have 80-byte BufferDescs.
Without prewarming, I see high and low results on *all* of these
builds, even 3fed4174. I did nine test runs with each configuration
with and without prewarming, and here are the results. With each
result I have reported the raw numbers, plus the median and the range
(highest result - lowest result).
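
(If it's useful to anyone, the medians and ranges can be recomputed from
the raw numbers with a query along these lines - illustrative only,
shown here with the nine 6150a1b0 results from the first set below:)

select percentile_cont(0.5) within group (order by tps) as median,
       max(tps) - min(tps) as range
from unnest(array[
    455853.061092, 438628.888369, 353643.017507,
    419850.232715, 424353.870515, 440219.581180,
    431805.001465, 237528.175877, 431789.666417
]::float8[]) as tps;
-- median: 431789.666417, range: 218324.885215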

-- without prewarming --
3fed4174
tps by run: 249165.928992 300958.039880 501281.083247
488073.289603 251033.038275 272086.063197
522287.023374 528855.922328 266057.502255
median: 300958.039880, range: 206687.132100

6150a1b0
tps by run: 455853.061092 438628.888369 353643.017507
419850.232715 424353.870515 440219.581180
431805.001465 237528.175877 431789.666417
median: 431789.666417, range: 218324.885215

master
tps by run: 427808.559919 366625.640433 376165.188508
441349.141152 363381.095198 352925.252345
348975.712841 446626.284308 444633.921009
median: 376165.188508, range: 97650.571467

andres
tps by run: 391123.866928 423525.415037 496103.017599
346707.246825 531194.321999 466100.337795
517708.470146 355392.837942 510817.921728
median: 466100.337795, range: 184487.075174

-- with prewarming --
3fed4174
tps by run: 413239.471751 428996.202541 488206.103941
493788.533491 497953.728875 498074.092686
501719.406720 508880.505416 509868.266778
median: 497953.728875, range: 96628.795027

6150a1b0
tps by run: 421589.299889 438765.851782 440601.270742
440649.818900 443033.460295 447317.269583
452831.480337 456387.316178 460140.159903
median: 443033.460295, range: 38550.860014

master
tps by run: 427211.917303 427796.174209 435601.396857
436581.431219 442329.269335 446438.814270
449970.003595 450085.123059 456231.229966
median: 442329.269335, range: 29019.312663

andres
tps by run: 425513.771259 429219.907952 433838.084721
451354.257738 494735.045808 495301.319716
517166.054466 531655.887669 546984.476602
median: 494735.045808, range: 121470.705343

My belief is that the first set of numbers has so much jitter that you
can't really draw any meaningful conclusions. For two of the four
branches, the range is more than 50% of the median, which is enormous.
You could probably draw some conclusions if you took enough
measurements, but it's pretty hard. Notice that the first set of numbers
makes 6150a1b0 look like a performance improvement, whereas the second
set makes it pretty clear that 6150a1b0 was a regression.

Also, the patch set is clearly winning here. It picks up 90k TPS
median on the first set of numbers and 50k TPS median on the second
set.

It's fairly mysterious to me why there is so much jitter in the
results on this machine. By doing prewarming in a consistent fashion,
we make sure that every run puts the same disk blocks in the same
buffers. Andres guessed that maybe the degree of associativity of the
CPU caches comes into play here: depending on where the hot data is we
either get the important cache lines in places where they can all be
cached at once, or we get them in places where they can't all be
cached at once. But if that's really what is going on here, it's
shocking that it makes this much difference.

However, my conclusion based on these results is that (1) the patch is
a win and (2) the variability on this machine didn't begin with
6150a1b0. YMMV, of course, but that's what I think.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
