From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Mark Wong <mark(at)2ndquadrant(dot)com>
Subject: Re: Scaling shared buffer eviction
Date: 2014-10-01 18:54:39
Message-ID: 20141001185439.GD7158@awork2.anarazel.de
Lists: pgsql-hackers

On 2014-09-25 16:50:44 +0200, Andres Freund wrote:
> On 2014-09-25 10:44:40 -0400, Robert Haas wrote:
> > On Thu, Sep 25, 2014 at 10:42 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> > > On Thu, Sep 25, 2014 at 10:24 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > >> On 2014-09-25 10:22:47 -0400, Robert Haas wrote:
> > >>> On Thu, Sep 25, 2014 at 10:14 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > >>> > That leads me to wonder: Have you measured different, lower, numbers
> > >>> > of buffer mapping locks? 128 locks is, if we align them properly as
> > >>> > we should (one 64-byte cache line each), 8KB of memory. Common L1
> > >>> > cache sizes are around 32KB...
> > >>>
> > >>> Amit has some results upthread showing 64 being good, but not as good
> > >>> as 128. I haven't verified that myself, but have no reason to doubt
> > >>> it.
> > >>
> > >> How about you push the spinlock change and I crosscheck the partition
> > >> number on a multi-socket x86 machine? Seems worthwhile to make sure that
> > >> it doesn't cause problems on x86. I seriously doubt it will, but ...
> > >
> > > OK.
> >
> > Another thought is that we should test what impact your atomics-based
> > lwlocks have on this.
>
> Yes, I'd planned to test that as well. I think that it will noticeably
> reduce the need to increase the number of partitions for workloads that
> fit into shared_buffers. But it won't do much about exclusive
> acquisitions of the buffer mapping locks. So I think there's independent
> benefit in increasing the number.
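
To recap what the partition count actually controls: a buffer tag's hash
code selects one of NUM_BUFFER_PARTITIONS buffer mapping locks, so lookups
whose hashes land in different partitions don't contend. Here's a rough,
self-contained sketch (illustrative names only, not the actual code; the
real tree does this via BufTableHashPartition() and
BufMappingPartitionLock() in src/include/storage/buf_internals.h), which
also shows the 8KB arithmetic from above:

/* Illustrative sketch, not PostgreSQL source. */
#include <stdio.h>
#include <stdint.h>

#define CACHELINE_SIZE        64
#define NUM_BUFFER_PARTITIONS 128   /* the constant being varied below */

/* One lock padded out to a full cache line, per the discussion above. */
typedef union
{
    int32_t lock_state;             /* stand-in for the real LWLock */
    char    pad[CACHELINE_SIZE];
} PaddedLock;

static PaddedLock partition_locks[NUM_BUFFER_PARTITIONS];

/* Same shape as BufTableHashPartition(): hash modulo partition count. */
static inline PaddedLock *
partition_lock_for(uint32_t hashcode)
{
    return &partition_locks[hashcode % NUM_BUFFER_PARTITIONS];
}

int
main(void)
{
    /* 128 locks * 64 bytes = 8KB, a quarter of a typical 32KB L1d. */
    printf("lock array: %zu bytes\n", sizeof(partition_locks));
    printf("hash 0xdeadbeef -> partition %td\n",
           partition_lock_for(0xdeadbeef) - partition_locks);
    return 0;
}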

Here we go.

Postgres was configured with:
-c shared_buffers=8GB \
-c log_line_prefix="[%m %p] " \
-c log_min_messages=debug1 \
-p 5440 \
-c checkpoint_segments=600 \
-c max_connections=200

Each individual measurement (each of the three #TPS values per table row)
is the result of a separate
pgbench -h /tmp/ -p 5440 postgres -n -M prepared -c $clients -j $clients -S -T 10
run.

Master is as of ef8863844bb0b0dab7b92c5f278302a42b4bf05a.

First, a scale 200 run. At roughly 3GB of data, that fits entirely into shared_buffers:

#scale #client #partitions #TPS
200 1 16 8353.547724 8145.296655 8263.295459
200 16 16 171014.763118 193971.091518 133992.128348
200 32 16 259119.988034 234619.421322 201879.618322
200 64 16 178909.038670 179425.091562 181391.354613
200 96 16 141402.895201 138392.705402 137216.416951
200 128 16 125643.089677 124465.288860 122527.209125

(Other runs here were stricken; they were distorted by some concurrent
activity. But nothing interesting in them.)

So, there's quite a bit of variation in here. Not very surprising given the
short runtimes, but still.

Looking at a profile, nearly all the contention is around
GetSnapshotData(). That might hide the interesting scalability effects
of the partition number, so I next tried my rwlock-contention branch.
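
To make clear what that branch changes: the idea is that a shared LWLock
acquisition becomes a single atomic compare-and-swap on the lock's state
word, instead of first taking the lock's internal spinlock. A rough sketch
of that shape (illustrative only, not the actual patch, which additionally
handles exclusive mode, queueing and wakeups):

/* Illustrative sketch of an atomics-based shared lock acquire. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define EXCLUSIVE_BIT (1u << 31)

typedef struct
{
    _Atomic uint32_t state;         /* 0 = free, low bits = reader count */
} AtomicRWLock;

/* Try to take the lock in shared mode; fail if exclusively held. */
static bool
rwlock_acquire_shared(AtomicRWLock *lock)
{
    uint32_t old = atomic_load(&lock->state);

    while ((old & EXCLUSIVE_BIT) == 0)
    {
        /* CAS in one more reader; on failure, old is refreshed. */
        if (atomic_compare_exchange_weak(&lock->state, &old, old + 1))
            return true;
    }
    return false;                   /* the real thing queues and sleeps */
}

static void
rwlock_release_shared(AtomicRWLock *lock)
{
    atomic_fetch_sub(&lock->state, 1);
}

int
main(void)
{
    AtomicRWLock lock = { 0 };

    if (rwlock_acquire_shared(&lock))
        rwlock_release_shared(&lock);
    return 0;
}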

#scale #client #partitions #TPS
200 1 1 8540.390223 8285.628397 8497.022656
200 16 1 136875.484896 164302.769380 172053.413980
200 32 1 308624.650724 240502.019046 260825.231470
200 64 1 453004.188676 406226.943046 406973.325822
200 96 1 442608.459701 450185.431848 445549.710907
200 128 1 487138.077973 496233.594356 457877.992783

200 1 16 9477.217454 8181.098317 8457.276961
200 16 16 154224.573476 170238.637315 182941.035416
200 32 16 302230.215403 285124.708236 265917.729628
200 64 16 405151.647136 443473.797835 456072.782722
200 96 16 443360.377281 457164.981119 474049.685940
200 128 16 490616.257063 458273.380238 466429.948417

200 1 64 8410.981874 11554.708966 8359.294710
200 16 64 139378.312883 168398.919590 166184.744944
200 32 64 288657.701012 283588.901083 302241.706222
200 64 64 424838.919754 416926.779367 436848.292520
200 96 64 462352.017671 446384.114441 483332.592663
200 128 64 471578.594596 488862.395621 466692.726385

200 1 128 8350.274549 8140.699687 8305.975703
200 16 128 144553.966808 154711.927715 202437.837908
200 32 128 290193.349170 213242.292597 261016.779185
200 64 128 413792.389493 431267.716855 456587.450294
200 96 128 490459.212833 456375.442210 496430.996055
200 128 128 470067.179360 464513.801884 483485.000502

Not much there either.

So, on to the next scale, 1000. At roughly 15GB of data, that doesn't fit into s_b (8GB) anymore.

master:
#scale #client #partitions #TPS
1000 1 1 7378.370717 7110.988121 7164.977746
1000 16 1 66439.037413 85151.814130 85047.296626
1000 32 1 71505.487093 75687.291060 69803.895496
1000 64 1 42148.071099 41934.631603 43253.528849
1000 96 1 33760.812746 33969.800564 33598.640121
1000 128 1 30382.414165 30047.284982 30144.576494

1000 1 16 7228.883843 9479.793813 7217.657145
1000 16 16 105203.710528 112375.187471 110919.986283
1000 32 16 146294.286762 145391.938025 144620.709764
1000 64 16 134411.772164 134536.943367 136196.793573
1000 96 16 107626.878208 105289.783922 96480.468107
1000 128 16 92597.909379 86128.040557 92417.727720

1000 1 64 7130.392436 12801.641683 7019.999330
1000 16 64 120180.196384 125319.373819 126137.930478
1000 32 64 181876.697461 190578.106760 189412.973015
1000 64 64 216233.590299 222561.774501 225802.194056
1000 96 64 171928.358031 165922.395721 168283.712990
1000 128 64 139303.139631 137564.877450 141534.449640

1000 1 128 8215.702354 7209.520152 7026.888706
1000 16 128 116196.740200 123018.284948 127045.761518
1000 32 128 183391.488566 185428.757458 185732.926794
1000 64 128 218547.133675 218096.002473 208679.436158
1000 96 128 155209.830821 156327.200412 157542.582637
1000 128 128 131127.769076 132084.933955 124706.336737

rwlock:
#scale #client #partitions #TPS
1000 1 1 7377.270393 7494.260136 7207.898866
1000 16 1 79289.755569 88032.480145 86810.772569
1000 32 1 83006.336151 88961.964680 88508.832253
1000 64 1 44135.036648 46582.727314 45119.421278
1000 96 1 35036.174438 35687.025568 35469.127697
1000 128 1 30597.870830 30782.335225 30342.454439

1000 1 16 7114.602838 7265.863826 7205.225737
1000 16 16 128507.292054 131868.678603 124507.097065
1000 32 16 212779.122153 185666.608338 210714.373254
1000 64 16 239776.079534 239923.393293 242476.922423
1000 96 16 169240.934839 166021.430680 169187.643644
1000 128 16 136601.409985 139340.961857 141731.068752

1000 1 64 13271.722885 11348.028311 12531.188689
1000 16 64 129074.053482 125334.720264 125140.499619
1000 32 64 198405.463848 196605.923684 198354.818005
1000 64 64 250463.474112 249543.622897 251517.159399
1000 96 64 251715.751133 254168.028451 251502.783058
1000 128 64 243596.368933 234671.592026 239123.259642

1000 1 128 7376.371403 7301.077478 7240.526379
1000 16 128 127992.070372 133537.637394 123382.418747
1000 32 128 185807.703422 194303.674428 184919.586634
1000 64 128 270233.496350 271576.483715 262281.662510
1000 96 128 266023.529574 272484.352878 271921.597420
1000 128 128 260004.301457 266710.469926 263713.245868

Based on this I think we can fairly conclude that increasing the number
of partitions is quite a win on larger x86 machines too, independent of
the rwlock patch, although that patch moves the contention points around
to some degree.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
