From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Reasoning behind LWLOCK_PADDED_SIZE/increase it to a full cacheline
Date: 2013-09-23 20:14:56
Message-ID: 20130923201456.GC32659@awork2.anarazel.de
Lists: pgsql-hackers

Hi,

Currently LWLOCK_PADDED_SIZE is defined as:

/*
 * All the LWLock structs are allocated as an array in shared memory.
 * (LWLockIds are indexes into the array.)  We force the array stride to
 * be a power of 2, which saves a few cycles in indexing, but more
 * importantly also ensures that individual LWLocks don't cross cache line
 * boundaries.  This reduces cache contention problems, especially on AMD
 * Opterons.  (Of course, we have to also ensure that the array start
 * address is suitably aligned.)
 *
 * LWLock is between 16 and 32 bytes on all known platforms, so these two
 * cases are sufficient.
 */
#define LWLOCK_PADDED_SIZE (sizeof(LWLock) <= 16 ? 16 : 32)

typedef union LWLockPadded
{
    LWLock      lock;
    char        pad[LWLOCK_PADDED_SIZE];
} LWLockPadded;

So, what we do is we guarantee that LWLocks are aligned to 16 or 32byte
boundaries. That means that on x86-64 (64byte cachelines, 24bytes
unpadded lwlock) two lwlocks share a cacheline. As struct LWLock
contains a spinlock, and important lwlocks are often beside each other,
that strikes me as a bad idea.
Take, for example, the partitioned buffer mapping lock: this coding
essentially halves the effect of the partitioning in a read-only
workload where the only contention is on the LWLock's spinlock itself.
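
To make the cacheline sharing concrete, here is a tiny standalone C
illustration (not PostgreSQL code; it just assumes the 64 byte line size
and compares the current 32 byte stride with a 64 byte one):

#include <stdio.h>

int
main(void)
{
    const unsigned cacheline = 64;          /* typical x86-64 line size */
    const unsigned strides[] = {32, 64};    /* current padding vs. full line */

    for (int s = 0; s < 2; s++)
    {
        unsigned stride = strides[s];

        printf("stride %u:\n", stride);
        for (unsigned id = 0; id < 4; id++)
            printf("  lock %u -> byte offset %u, cacheline %u\n",
                   id, id * stride, (id * stride) / cacheline);
    }
    return 0;
}

With the 32 byte stride, locks 0/1 and 2/3 pair up on the same line; with
a 64 byte stride every lock gets one to itself.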

Does anybody remember why this is done that way? The padding itself was
introduced in commit dc06734a.

In my benchmarks changing the padding to 64byte increases performance in
workloads with contended lwlocks considerably. 11% for a workload where
the buffer mapping lock is the major contention point, on a 2-socket
system.
Unfortunately, increasing it to CACHE_LINE_SIZE/128 results in only a
2-3% increase.
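
Concretely, by 64 byte padding I mean something like the following sketch
(the hard-coded 64 merely stands in for a real cacheline-size define; this
is just the shape of the change, not the exact patch):

#define LWLOCK_PADDED_SIZE  64      /* assumed x86-64 cacheline size */

typedef union LWLockPadded
{
    LWLock      lock;
    char        pad[LWLOCK_PADDED_SIZE];
} LWLockPadded;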

Comments?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Reasoning behind LWLOCK_PADDED_SIZE/increase it to a full cacheline
Date: 2013-09-24 10:39:39
Message-ID: 5840.1380019179@sss.pgh.pa.us
Lists: pgsql-hackers

Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> So, what we do is we guarantee that LWLocks are aligned to 16 or 32byte
> boundaries. That means that on x86-64 (64byte cachelines, 24bytes
> unpadded lwlock) two lwlocks share a cacheline.

Yup.

> In my benchmarks changing the padding to 64byte increases performance in
> workloads with contended lwlocks considerably.

At a huge cost in RAM. Remember we make two LWLocks per shared buffer.

I think that rather than using a blunt instrument like that, we ought to
see if we can identify pairs of hot LWLocks and make sure they're not
adjacent.
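
(Back-of-the-envelope, assuming the default 8kB block size: going from 32
to 64 bytes of padding on each of those two locks adds 64 bytes per
buffer, i.e. roughly 64MB of extra shared memory per 8GB of
shared_buffers.)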

regards, tom lane


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Reasoning behind LWLOCK_PADDED_SIZE/increase it to a full cacheline
Date: 2013-09-24 10:48:11
Message-ID: 20130924104811.GA11964@awork2.anarazel.de
Lists: pgsql-hackers

On 2013-09-24 12:39:39 +0200, Tom Lane wrote:
> Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> > So, what we do is we guarantee that LWLocks are aligned to 16 or 32byte
> > boundaries. That means that on x86-64 (64byte cachelines, 24bytes
> > unpadded lwlock) two lwlocks share a cacheline.

> > In my benchmarks changing the padding to 64byte increases performance in
> > workloads with contended lwlocks considerably.
>
> At a huge cost in RAM. Remember we make two LWLocks per shared buffer.

> I think that rather than using a blunt instrument like that, we ought to
> see if we can identify pairs of hot LWLocks and make sure they're not
> adjacent.

That's a good point. What about making all but the shared buffer lwlocks
64 bytes? It seems hard to analyze the interactions between all the locks
and to keep such an analysis maintained.
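
Roughly what I have in mind, as a sketch only (the second union and its
name are invented here for illustration, not an existing type):

/* main LWLock array: pad each lock to a full (assumed 64 byte) cacheline */
typedef union LWLockPadded
{
    LWLock      lock;
    char        pad[64];
} LWLockPadded;

/* the per-buffer content and I/O locks: keep the frugal 32 byte stride */
typedef union LWLockMinimallyPadded
{
    LWLock      lock;
    char        pad[32];
} LWLockMinimallyPadded;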

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Reasoning behind LWLOCK_PADDED_SIZE/increase it to a full cacheline
Date: 2013-09-24 13:14:38
Message-ID: CA+TgmoZK0FHxNx21yaM1yW=eR-j1dssahM7h3zWj6TB+WwA=Zw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Sep 24, 2013 at 6:48 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2013-09-24 12:39:39 +0200, Tom Lane wrote:
>> Andres Freund <andres(at)2ndquadrant(dot)com> writes:
>> > So, what we do is we guarantee that LWLocks are aligned to 16 or 32byte
>> > boundaries. That means that on x86-64 (64byte cachelines, 24bytes
>> > unpadded lwlock) two lwlocks share a cacheline.
>
>> > In my benchmarks changing the padding to 64byte increases performance in
>> > workloads with contended lwlocks considerably.
>>
>> At a huge cost in RAM. Remember we make two LWLocks per shared buffer.
>
>> I think that rather than using a blunt instrument like that, we ought to
>> see if we can identify pairs of hot LWLocks and make sure they're not
>> adjacent.
>
> That's a good point. What about making all but the shared buffer lwlocks
> 64bytes? It seems hard to analyze the interactions between all the locks
> and keep it maintained.

I think somebody had a patch a few years ago that made it so that the
LWLocks didn't have to be in a single array, but could instead be
anywhere in shared memory. Internally, lwlock.c held onto LWLock
pointers instead of LWLockIds. That idea seems like it might be worth
revisiting, in terms of giving us more options as to how LWLocks can
be laid out in shared memory.
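
A rough sketch of that API shift (header excerpt only, types elided; the
pointer-taking signatures are hypothetical here, not an existing patch):

/*
 * Today lwlock.c exposes the locks by index into the one shared array:
 *
 *     extern void LWLockAcquire(LWLockId lockid, LWLockMode mode);
 *     extern void LWLockRelease(LWLockId lockid);
 *
 * With a pointer-based scheme the lock itself is handed around, so the
 * struct can live anywhere in shared memory -- inside a buffer header,
 * in a separately allocated array with whatever padding that array
 * wants, etc.:
 */
extern void LWLockAcquire(LWLock *lock, LWLockMode mode);
extern void LWLockRelease(LWLock *lock);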

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company