BufFreelistLock

Lists: pgsql-hackers
From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: BufFreelistLock
Date: 2010-12-09 04:28:00
Message-ID: AANLkTin+jt+fV9SKqFdkxy0RbbPg_LPeTpvCQVBwYQiJ@mail.gmail.com

I think that the BufFreelistLock can be a contention bottleneck on a
system with a lot of CPUs that do a lot of shared-buffer allocations
which can be fulfilled by the OS buffer cache. That is, read-mostly
queries where the working data set fits in RAM, but not in
shared_buffers. (You can always increase shared_buffers, but that
leads to other problems, and who wants to spend their time
micromanaging the size of shared_buffers as work loads slowly change?)

I can't prove it is a contention bottleneck without first solving the
putative problem and timing the difference, but it is the dominant
blocking lock showing up under LWLOCK_STATS for one benchmark I've
done using 8 CPUs.

So I had two questions.

1) Would it be useful for BufFreelistLock to be partitioned, like
BufMappingLock, or via some kind of clever "virtual partitioning" that
could get the same benefit via another means? I don't know if both
the linked list and the clock sweep would have to be partitioned, or
if some other arrangement could be made.

2) Could BufFreelistLock simply go away, by reducing it from an
lwlock to a spinlock? Or at least in most common paths?

For doing away with it, I think that any manipulation of the freelist
is short enough (just a few instructions) that it could be done under
a spinlock. If the buffer you obtained turned out to be pinned or to
have a nonzero usage_count, you would have to retake the spinlock to
look at the new head of the chain, but the comments in
StrategyGetBuffer suggest that that should be rare or impossible.
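To make the idea concrete, here is a toy Python simulation of that
fast path (this is not PostgreSQL's C code; BufferPool,
get_free_buffer, and the list-based freelist are invented stand-ins,
with threading.Lock playing the role of a spinlock):

```python
import threading

class BufferPool:
    """Toy model of the shared buffer freelist; not PostgreSQL code."""
    def __init__(self, nbuffers):
        self.spinlock = threading.Lock()       # stand-in for a spinlock
        self.freelist = list(range(nbuffers))  # buffer ids on the freelist
        self.pinned = [False] * nbuffers
        self.usage_count = [0] * nbuffers

    def get_free_buffer(self):
        """Pop freelist entries under a short lock hold, retaking the
        lock whenever the candidate is pinned or recently used."""
        while True:
            with self.spinlock:                # only a few instructions held
                if not self.freelist:
                    return None                # caller falls back to clock sweep
                buf = self.freelist.pop(0)
            # Inspect the candidate outside the lock; if it is unusable,
            # loop and retake the spinlock to look at the new head.
            if not self.pinned[buf] and self.usage_count[buf] == 0:
                return buf
```

The point is that each hold of the lock is only a few instructions:
pop the head, release, and only retake the lock in the (putatively
rare) case that the candidate turns out to be unusable.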

For the clock sweep algorithm, I think you could access
nextVictimBuffer without any type of locking. If a non-atomic
increment causes an occasional buffer to be skipped or examined twice,
that doesn't seem like a correctness problem. When nextVictimBuffer
gets reset to zero and completePasses gets incremented, that would
probably need to be protected to prevent a double-increment of
completePasses from throwing off the background writer's usage
estimations. But again, a spinlock should be enough for that. And it
shouldn't occur all that often.
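A sketch of that division of labor, again as a toy Python model rather
than the real StrategyGetBuffer (names like next_victim and
complete_passes just mirror the C fields; the deliberate lack of
synchronization on the common path is the whole point of the sketch):

```python
import threading

class ClockSweep:
    """Toy clock sweep: next_victim is read and incremented without a
    lock (a skipped or doubly-examined buffer is harmless); only the
    wraparound, which bumps complete_passes, takes a spinlock so the
    bgwriter's usage estimates aren't thrown off by a double increment."""
    def __init__(self, nbuffers):
        self.nbuffers = nbuffers
        self.next_victim = 0
        self.complete_passes = 0
        self.wrap_lock = threading.Lock()
        self.usage_count = [1] * nbuffers

    def advance(self):
        victim = self.next_victim
        if victim + 1 >= self.nbuffers:
            with self.wrap_lock:          # rare path: once per full pass
                if self.next_victim + 1 >= self.nbuffers:
                    self.next_victim = 0
                    self.complete_passes += 1
                else:
                    self.next_victim += 1
        else:
            self.next_victim += 1         # non-atomic on purpose here
        return victim % self.nbuffers

    def sweep(self):
        while True:
            buf = self.advance()
            if self.usage_count[buf] == 0:
                return buf
            self.usage_count[buf] -= 1
```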

If potentially inaccurate non-atomic increments of numBufferAllocs are
a problem, it could be incremented under the same spinlock used to
protect the test firstFreeBuffer>0 to determine if the freelist is
empty.

Doing away with the lock without some form of partitioning might just
move the contention to the BufHdr spinlocks. But if most of the
processes entering the code at about the same time perceive each
other's increments to nextVictimBuffer, they would all start out offset
from each other and shouldn't collide too badly.

Does any of this sound like it might be fruitful to look into?

Cheers,

Jeff


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: BufFreelistLock
Date: 2010-12-09 04:49:21
Message-ID: 28956.1291870161@sss.pgh.pa.us
Lists: pgsql-hackers

Jeff Janes <jeff(dot)janes(at)gmail(dot)com> writes:
> I think that the BufFreelistLock can be a contention bottleneck on a
> system with a lot of CPUs that do a lot of shared-buffer allocations
> which can be fulfilled by the OS buffer cache.

Really? buffer/README says

The buffer
management policy is designed so that BufFreelistLock need not be taken
except in paths that will require I/O, and thus will be slow anyway.

It's hard to see how it's going to be much of a problem if you're going
to be doing kernel calls as well. Is the test case you're looking at
really representative of any common situation?

> 1) Would it be useful for BufFreelistLock to be partitioned, like
> BufMappingLock, or via some kind of clever "virtual partitioning" that
> could get the same benefit via another means?

Maybe, but you could easily end up with a net loss if the partitioning
makes buffer allocation significantly stupider (ie, higher probability
of picking a less-than-optimal buffer to recycle).

> For the clock sweep algorithm, I think you could access
> nextVictimBuffer without any type of locking.

This is wrong, mainly because you wouldn't have any security against two
processes decrementing the usage count of the same buffer because they'd
fetched the same value of nextVictimBuffer. That would probably happen
often enough to severely compromise the accuracy of the usage counts and
thus the accuracy of the LRU eviction behavior. See above.

It might be worth looking into actual partitioning, so that more than
one processor can usefully be working on the usage count management.
But simply dropping the locking primitives isn't going to lead to
anything except severe screw-ups.
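The lost-update scenario can be replayed deterministically; this toy
Python interleaving (hand-scheduled, not real backend code) shows two
readers fetching the same nextVictimBuffer and double-decrementing one
buffer while its neighbor is never examined:

```python
# Hand-scheduled interleaving of two simulated backends, both doing an
# unlocked read-then-store of nextVictimBuffer. One increment is lost,
# so both decrement buffer 0's usage count and buffer 1 is skipped.
usage_count = [2, 2, 2]
next_victim = 0

a_read = next_victim        # backend A loads nextVictimBuffer: 0
b_read = next_victim        # backend B loads it too, before A stores
next_victim = a_read + 1    # A stores 1
next_victim = b_read + 1    # B stores 1; A's increment is lost
usage_count[a_read] -= 1    # A decrements buffer 0
usage_count[b_read] -= 1    # B decrements buffer 0 again

assert a_read == b_read == 0
assert usage_count == [0, 2, 2]  # double-decremented; counts now skewed
```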

regards, tom lane


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: BufFreelistLock
Date: 2010-12-09 05:44:13
Message-ID: AANLkTin3zLXNeft_55BTyEMc3Y9sX4aJBjxcASbxBO4b@mail.gmail.com
Lists: pgsql-hackers

On Wed, Dec 8, 2010 at 8:49 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Jeff Janes <jeff(dot)janes(at)gmail(dot)com> writes:
>> I think that the BufFreelistLock can be a contention bottleneck on a
>> system with a lot of CPUs that do a lot of shared-buffer allocations
>> which can be fulfilled by the OS buffer cache.
>
> Really?  buffer/README says
>
>  The buffer
>  management policy is designed so that BufFreelistLock need not be taken
>  except in paths that will require I/O, and thus will be slow anyway.

True, but very large memory means they often don't require true disk I/O anyway.

> It's hard to see how it's going to be much of a problem if you're going
> to be doing kernel calls as well.

Are kernel calls really all that slow? I thought they had been
greatly optimized on recent hardware and kernels.
I'm not sure how to create a test case to distinguish that.

> Is the test case you're looking at
> really representative of any common situation?

That's always the question. I took the "pick a random number and use
it to look up a pgbench_accounts row by primary key" logic from
pgbench -S, and put it into a stored procedure where it loops 10,000
times, to remove the overhead of ping-ponging messages back and forth
for every query.
(But doing so also removes the overhead of taking AccessShareLock for
every select, so those two changes are entangled.)

This type of workload could be representative of a nested loop join.

I started looking into it because someone
(http://archives.postgresql.org/pgsql-performance/2010-11/msg00350.php)
thought that pgbench -S might more or less match their real-world
workload. But by the time I moved most of the selecting into a stored
procedure, maybe it no longer does (it's not even clear if they were
using prepared statements). But when separating things into their
component potential bottlenecks, which do you tackle first? The most
fundamental? The easiest to analyze? The one that can't be gotten
around by fine-tuning? The most interesting? :)

>> 1) Would it be useful for BufFreelistLock to be partitioned, like
>> BufMappingLock, or via some kind of clever "virtual partitioning" that
>> could get the same benefit via another means?
>
> Maybe, but you could easily end up with a net loss if the partitioning
> makes buffer allocation significantly stupider (ie, higher probability
> of picking a less-than-optimal buffer to recycle).
>
>> For the clock sweep algorithm, I think you could access
>> nextVictimBuffer without any type of locking.
>
> This is wrong, mainly because you wouldn't have any security against two
> processes decrementing the usage count of the same buffer because they'd
> fetched the same value of nextVictimBuffer.  That would probably happen
> often enough to severely compromise the accuracy of the usage counts and
> thus the accuracy of the LRU eviction behavior.  See above.

Ah, I hadn't considered that.

Cheers,

Jeff


From: Jim Nasby <jim(at)nasby(dot)net>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: BufFreelistLock
Date: 2010-12-09 19:54:24
Message-ID: 0449DD6E-83E9-445B-8850-8F3402C1CF56@nasby.net
Lists: pgsql-hackers

On Dec 8, 2010, at 11:44 PM, Jeff Janes wrote:
>>> For the clock sweep algorithm, I think you could access
>>> nextVictimBuffer without any type of locking.
>>
>> This is wrong, mainly because you wouldn't have any security against two
>> processes decrementing the usage count of the same buffer because they'd
>> fetched the same value of nextVictimBuffer. That would probably happen
>> often enough to severely compromise the accuracy of the usage counts and
>> thus the accuracy of the LRU eviction behavior. See above.
>
> Ah, I hadn't considered that.

Ideally, the clock sweep would be run by bgwriter and not individual backends. In that case it shouldn't matter much what the performance of the sweep is. To do that I think we'd want the bgwriter to target there being X number of buffers on the free list instead of (or in addition to) targeting how many dirty buffers need to be written. This would mirror what operating systems do; they strive to keep X number of pages on the free list so that when a process needs memory it can get it quickly.
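As a sketch of what that targeting might look like (toy Python;
refill_freelist and run_clock_sweep are entirely hypothetical names,
not anything in the tree):

```python
def refill_freelist(freelist, target, run_clock_sweep):
    """One hypothetical bgwriter round: top the freelist back up to
    `target` entries, the way an OS VM keeps X pages on its free page
    list so processes can get memory quickly."""
    deficit = target - len(freelist)
    for _ in range(max(deficit, 0)):
        buf = run_clock_sweep()        # evict-candidate search, done here
        if buf is None:                # nothing evictable this round
            break
        freelist.append(buf)
    return freelist
```

Backends would then only pop from the list; the sweep cost is paid by
the bgwriter between their allocations.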
--
Jim C. Nasby, Database Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Jim Nasby <jim(at)nasby(dot)net>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: BufFreelistLock
Date: 2010-12-10 13:45:44
Message-ID: 1291988678-sup-5714@alvh.no-ip.org
Lists: pgsql-hackers

Excerpts from Jim Nasby's message of jue dic 09 16:54:24 -0300 2010:

> Ideally, the clock sweep would be run by bgwriter and not individual backends. In that case it shouldn't matter much what the performance of the sweep is. To do that I think we'd want the bgwriter to target there being X number of buffers on the free list instead of (or in addition to) targeting how many dirty buffers need to be written. This would mirror what operating systems do; they strive to keep X number of pages on the free list so that when a process needs memory it can get it quickly.

Isn't that what it does if you set bgwriter_lru_maxpages to some very
large value?

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Jim Nasby <jim(at)nasby(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: BufFreelistLock
Date: 2010-12-10 15:24:34
Message-ID: AANLkTinnGtb-EsZbggLXFF_V=eYNhW14eOAcxZ9Sn2DR@mail.gmail.com
Lists: pgsql-hackers

On Fri, Dec 10, 2010 at 5:45 AM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:
> Excerpts from Jim Nasby's message of jue dic 09 16:54:24 -0300 2010:
>
>> Ideally, the clock sweep would be run by bgwriter and not individual backends. In that case it shouldn't matter much what the performance of the sweep is.

Lock contention between the bgwriter and the individual backends would
matter very much. This might actually make things worse: now you
need two BufFreelistLocks, one to put a buffer on the freelist, and
one to take it off.

>> To do that I think we'd want the bgwriter to target there being X number of buffers on the free list instead of (or in addition to) targeting how many dirty buffers need to be written. This would mirror what operating systems do; they strive to keep X number of pages on the free list so that when a process needs memory it can get it quickly.
>
> Isn't it what it does if you set bgwriter_lru_maxpages to some very
> large value?

As far as I can tell, bgwriter never adds things to the freelist.
That is only done at start up, and when a relation or a database is
dropped. The clock sweep does the vast majority of the work.

But I could be wrong.

Cheers,

Jeff


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Jim Nasby <jim(at)nasby(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: BufFreelistLock
Date: 2010-12-10 16:39:51
Message-ID: 1291998960-sup-5081@alvh.no-ip.org
Lists: pgsql-hackers

Excerpts from Jeff Janes's message of vie dic 10 12:24:34 -0300 2010:
> On Fri, Dec 10, 2010 at 5:45 AM, Alvaro Herrera
> <alvherre(at)commandprompt(dot)com> wrote:
> > Excerpts from Jim Nasby's message of jue dic 09 16:54:24 -0300 2010:

> >> To do that I think we'd want the bgwriter to target there being X number of buffers on the free list instead of (or in addition to) targeting how many dirty buffers need to be written. This would mirror what operating systems do; they strive to keep X number of pages on the free list so that when a process needs memory it can get it quickly.
> >
> > Isn't it what it does if you set bgwriter_lru_maxpages to some very
> > large value?
>
> As far as I can tell, bgwriter never adds things to the freelist.
> That is only done at start up, and when a relation or a database is
> dropped. The clock sweep does the vast majority of the work.

AFAIU bgwriter runs the clock sweep most of the time (BgBufferSync).

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: BufFreelistLock
Date: 2010-12-10 16:49:40
Message-ID: 25070.1291999780@sss.pgh.pa.us
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Excerpts from Jeff Janes's message of vie dic 10 12:24:34 -0300 2010:
>> As far as I can tell, bgwriter never adds things to the freelist.
>> That is only done at start up, and when a relation or a database is
>> dropped. The clock sweep does the vast majority of the work.

> AFAIU bgwriter runs the clock sweep most of the time (BgBufferSync).

I think bgwriter just tries to write out dirty buffers so they'll be
clean when the clock sweep reaches them. It doesn't try to move them to
the freelist. There might be some advantage in having it move buffers
to a freelist that's just protected by a simple spinlock (or at least,
a lock different from the one that protects the clock sweep). The
idea would be that most of the time, backends just need to lock the
freelist for long enough to take a buffer off it, and don't run clock
sweep at all.
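That producer/consumer split might look roughly like this toy Python
model (TwoLockStrategy and its methods are invented for illustration;
threading.Lock stands in for the spinlocks, and the fallback sweep is
reduced to a bare cursor):

```python
import threading

class TwoLockStrategy:
    """Sketch: the freelist gets its own cheap lock, separate from the
    clock-sweep lock, so the common backend path never runs the sweep."""
    def __init__(self, nbuffers):
        self.nbuffers = nbuffers
        self.freelist_lock = threading.Lock()  # guards only the freelist
        self.sweep_lock = threading.Lock()     # guards only the sweep state
        self.freelist = []
        self.sweep_cursor = 0

    def bgwriter_fill(self, clean_buffers):
        # Producer: bgwriter moves buffers it has cleaned onto the freelist.
        with self.freelist_lock:
            self.freelist.extend(clean_buffers)

    def backend_get_buffer(self):
        # Fast path: lock the freelist just long enough to take a buffer
        # off it, without touching the clock sweep at all.
        with self.freelist_lock:
            if self.freelist:
                return self.freelist.pop()
        # Slow path: freelist was empty, fall back to the sweep.
        with self.sweep_lock:
            buf = self.sweep_cursor
            self.sweep_cursor = (self.sweep_cursor + 1) % self.nbuffers
            return buf
```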

regards, tom lane


From: Jim Nasby <jim(at)nasby(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: BufFreelistLock
Date: 2010-12-13 02:48:37
Message-ID: DAED5995-42C5-488B-B1A6-C251968937E2@nasby.net
Lists: pgsql-hackers

On Dec 10, 2010, at 10:49 AM, Tom Lane wrote:
> Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
>> Excerpts from Jeff Janes's message of vie dic 10 12:24:34 -0300 2010:
>>> As far as I can tell, bgwriter never adds things to the freelist.
>>> That is only done at start up, and when a relation or a database is
>>> dropped. The clock sweep does the vast majority of the work.
>
>> AFAIU bgwriter runs the clock sweep most of the time (BgBufferSync).
>
> I think bgwriter just tries to write out dirty buffers so they'll be
> clean when the clock sweep reaches them. It doesn't try to move them to
> the freelist.

Yeah, it calls SyncOneBuffer, which does nothing for the clock sweep.

> There might be some advantage in having it move buffers
> to a freelist that's just protected by a simple spinlock (or at least,
> a lock different from the one that protects the clock sweep). The
> idea would be that most of the time, backends just need to lock the
> freelist for long enough to take a buffer off it, and don't run clock
> sweep at all.

Yeah, the clock sweep code is very intensive compared to pulling a buffer from the freelist, yet AFAICT nothing will run the clock sweep except backends. Unless I'm missing something, the free list is practically useless because buffers are only put there by InvalidateBuffer, which is only called by DropRelFileNodeBuffers and DropDatabaseBuffers. So we make backends queue up behind the freelist lock with very low odds of getting a buffer, then we make them queue up for the clock sweep lock and make them actually run the clock sweep.

BTW, when we moved from 96G to 192G servers I tried increasing shared buffers from 8G to 28G and performance went down enough to be noticeable (we don't have any good benchmarks, so I can't really quantify the degradation). Going back to 8G brought performance back up, so it seems like it was the change in shared buffers that caused the issue (the larger servers also have 24 cores vs 16). My immediate thought was that we needed more lock partitions, but I haven't had the chance to see if that helps. ISTM the issue could just as well be due to clock sweep suddenly taking over 3x longer than before.

We're working on getting a performance test environment setup, so hopefully in a month or two we'd be able to actually run some testing on this.
--
Jim C. Nasby, Database Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net


From: Jim Nasby <jim(at)nasby(dot)net>
To: Jim Nasby <Jim(at)Nasby(dot)net>, Greg Stark <gsstark(at)mit(dot)edu>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: BufFreelistLock
Date: 2010-12-13 21:12:04
Message-ID: 39023AD9-494D-43F3-A8DF-54DCEF9A8366@nasby.net
Lists: pgsql-hackers

On Dec 12, 2010, at 8:48 PM, Jim Nasby wrote:
>> There might be some advantage in having it move buffers
>> to a freelist that's just protected by a simple spinlock (or at least,
>> a lock different from the one that protects the clock sweep). The
>> idea would be that most of the time, backends just need to lock the
>> freelist for long enough to take a buffer off it, and don't run clock
>> sweep at all.
>
> Yeah, the clock sweep code is very intensive compared to pulling a buffer from the freelist, yet AFAICT nothing will run the clock sweep except backends. Unless I'm missing something, the free list is practically useless because buffers are only put there by InvalidateBuffer, which is only called by DropRelFileNodeBuffers and DropDatabaseBuffers. So we make backends queue up behind the freelist lock with very little odds of getting a buffer, then we make them queue up for the clock sweep lock and make them actually run the clock sweep.

Looking at the code, it seems to be pretty trivial to have SyncOneBuffer decrement the usage count of every buffer it's handed. The challenge is that the code that estimates how many buffers we need to sync looks at where the clock hand is, and I think it uses that information as part of its calculation.

So the real challenge here is coming up with a good model for how many buffers we need to sync on each pass *and* how far the clock needs to be swept. There is also (currently) an interdependency here: the LRU scan will not sync buffers that have a usage_count > 0. So unless the clock sweep is keeping up well enough, the LRU scan becomes completely useless.

My thought is that the clock sweep should be scheduled the same way that OS VMs handle their free list: they attempt to keep X number of pages on the free list at all times. We already track the rate of buffer allocations, so that can be used to estimate how many pages are being consumed per cycle. Plus we'd want some number of extra pages as a buffer.
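A back-of-the-envelope version of that estimate (toy Python; the
function name and the 25% slack figure are arbitrary placeholders, not
anything PostgreSQL does):

```python
def freelist_target(allocs_per_cycle_history, slack_fraction=0.25):
    """Estimate buffers consumed per bgwriter cycle from recent
    allocation counts (a rate PostgreSQL already tracks) and add a
    safety margin, OS-VM style."""
    if not allocs_per_cycle_history:
        return 0
    recent_rate = sum(allocs_per_cycle_history) / len(allocs_per_cycle_history)
    return int(recent_rate * (1 + slack_fraction))
```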
--
Jim C. Nasby, Database Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Jim Nasby <jim(at)nasby(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: BufFreelistLock
Date: 2010-12-14 17:08:19
Message-ID: AANLkTintDS7jDwRouBHxHQXnoJyaRHpGSzGCWw3Oyr+u@mail.gmail.com
Lists: pgsql-hackers

On Sun, Dec 12, 2010 at 6:48 PM, Jim Nasby <jim(at)nasby(dot)net> wrote:
> On Dec 10, 2010, at 10:49 AM, Tom Lane wrote:
>> Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
>>> Excerpts from Jeff Janes's message of vie dic 10 12:24:34 -0300 2010:
>>>> As far as I can tell, bgwriter never adds things to the freelist.
>>>> That is only done at start up, and when a relation or a database is
>>>> dropped.  The clock sweep does the vast majority of the work.
>>
>>> AFAIU bgwriter runs the clock sweep most of the time (BgBufferSync).
>>
>> I think bgwriter just tries to write out dirty buffers so they'll be
>> clean when the clock sweep reaches them.  It doesn't try to move them to
>> the freelist.
>
> Yeah, it calls SyncOneBuffer which does nothing for the clock sweep.
>
>> There might be some advantage in having it move buffers
>> to a freelist that's just protected by a simple spinlock (or at least,
>> a lock different from the one that protects the clock sweep).  The
>> idea would be that most of the time, backends just need to lock the
>> freelist for long enough to take a buffer off it, and don't run clock
>> sweep at all.
>
> Yeah, the clock sweep code is very intensive compared to pulling a buffer from the freelist, yet AFAICT nothing will run the clock sweep except backends. Unless I'm missing something, the free list is practically useless because buffers are only put there by InvalidateBuffer, which is only called by DropRelFileNodeBuffers and DropDatabaseBuffers.

Buffers are also put on the freelist at start up (all of them). But
of course any busy system with more data than buffers will rapidly
deplete them, and DropRelFileNodeBuffers and DropDatabaseBuffers are
generally not going to happen enough to be meaningful on most setups,
I would think. I was wondering whether, if the steady-state condition
is to always use the clock sweep, that shouldn't be the only mechanism
that exists.

> So we make backends queue up behind the freelist lock with very little odds of getting a buffer, then we make them queue up for the clock sweep lock and make them actually run the clock sweep.

It is the same lock that governs both. Given how simple the check for
an empty freelist is, I don't think it adds much overhead.

>
> BTW, when we moved from 96G to 192G servers I tried increasing shared buffers from 8G to 28G and performance went down enough to be noticeable (we don't have any good benchmarks, so I cant really quantify the degradation). Going back to 8G brought performance back up, so it seems like it was the change in shared buffers that caused the issue (the larger servers also have 24 cores vs 16).

What kind of work load do you have (intensity of reading versus
writing)? How intensely concurrent is the access?

> My immediate thought was that we needed more lock partitions, but I haven't had the chance to see if that helps. ISTM the issue could just as well be due to clock sweep suddenly taking over 3x longer than before.

It would surprise me if most clock sweeps need to make anything near a
full pass over the buffers for each allocation (but technically it
wouldn't need a full pass for allocation to take 3x longer; it could
be that the fraction of a pass it needs to make is merely proportional
to shared_buffers. That too would surprise me, though). You could
compare the number of passes with the number of allocations to see how
much sweeping is done per allocation. However, I don't think the
number of passes is reported anywhere, unless you compile with
#define BGW_DEBUG and run with debug2.

I wouldn't expect an increase in shared_buffers to make contention on
BufFreelistLock worse. If the increased buffers are used to hold
heavily-accessed data, then you will find the pages you want in
shared_buffers more often, and so need to run the clock-sweep less
often. That should make up for longer sweeps. But if the increased
buffers are used to hold data that is just read once and thrown away,
then the clock sweep shouldn't need to sweep very far before finding a
candidate.

But of course being able to test would be better than speculation.

Cheers,

Jeff


From: Jim Nasby <jim(at)nasby(dot)net>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: BufFreelistLock
Date: 2010-12-14 21:42:06
Message-ID: DC555169-6758-4996-B51C-E9B3845385BC@nasby.net
Lists: pgsql-hackers


On Dec 14, 2010, at 11:08 AM, Jeff Janes wrote:

> On Sun, Dec 12, 2010 at 6:48 PM, Jim Nasby <jim(at)nasby(dot)net> wrote:
>>
>> BTW, when we moved from 96G to 192G servers I tried increasing shared buffers from 8G to 28G and performance went down enough to be noticeable (we don't have any good benchmarks, so I cant really quantify the degradation). Going back to 8G brought performance back up, so it seems like it was the change in shared buffers that caused the issue (the larger servers also have 24 cores vs 16).
>
> What kind of work load do you have (intensity of reading versus
> writing)? How intensely concurrent is the access?

It writes at the rate of ~3-5MB/s, doing ~700TPS on average. It's hard to judge the exact read mix, because it's running on a 192G server (actually, 512G now, but 192G when I tested). The working set is definitely between 96G and 192G; we saw a major performance improvement last year when we went to 192G, but we haven't seen any improvement moving to 512G.

We typically have 10-20 active queries at any point.

>> My immediate thought was that we needed more lock partitions, but I haven't had the chance to see if that helps. ISTM the issue could just as well be due to clock sweep suddenly taking over 3x longer than before.
>
> It would surprise me if most clock sweeps need to make anything near a
> full pass over the buffers for each allocation (but technically it
> wouldn't need to do that take 3x longer. It could be that the
> fraction of a pass it needs to make is merely proportional to
> shared_buffers. That too would surprise me, though). You could
> compare the number of passes with the number of allocations to see how
> much sweeping is done per allocation. However, I don't think the
> number of passes is reported anywhere, unless you compile with #define
> BGW_DEBUG and
> run with debug2.
>
> I wouldn't expect an increase in shared_buffers to make contention on
> BufFreelistLock worse. If the increased buffers are used to hold
> heavily-accessed data, then you will find the pages you want in
> shared_buffers more often, and so need to run the clock-sweep less
> often. That should make up for longer sweeps. But if the increased
> buffers are used to hold data that is just read once and thrown away,
> then the clock sweep shouldn't need to sweep very far before finding a
> candidate.

Well, we're talking about a working set that's between 96 and 192G, but only 8G (or 28G) of shared buffers. So there's going to be a pretty large amount of buffer replacement happening. We also have 210 tables where the ratio of heap buffer hits to heap reads is over 1000, so the stuff that is in shared buffers probably keeps usage_count quite high. Put these two together, and we're probably spending a fairly significant amount of time running the clock sweep.

Even excluding our admittedly unusual workload, there is still significant overhead in running the clock sweep vs just grabbing something off of the free list (assuming we had separate locks for the two operations). Does anyone know what the overhead of getting a block from the filesystem cache is? I wonder how many buffers you can move through in the same amount of time. Put another way, at some point you have to check so many buffers to find a free one that you have doubled the amount of time it takes to get data from the filesystem cache into a shared buffer.

> But of course being able to test would be better than speculation.

Yeah, I'm working on getting pg_buffercache installed so we can see what's actually in the cache.

Hmm... I wonder how hard it would be to hack something up that has a separate process that does nothing but run the clock sweep. We'd obviously not run a hack in production, but we're working on being able to reproduce a production workload. If we had a separate clock-sweep process we could get an idea of exactly how much work was involved in keeping free buffers available.

BTW, given our workload I can't see any way of running at debug2 without having a large impact on performance.
--
Jim C. Nasby, Database Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Jim Nasby <jim(at)nasby(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: BufFreelistLock
Date: 2010-12-15 20:40:14
Message-ID: AANLkTimUr3KHCXXk6TmbpsAFqOj20_M3qAZsSSk6s7TL@mail.gmail.com
Lists: pgsql-hackers

On Tue, Dec 14, 2010 at 1:42 PM, Jim Nasby <jim(at)nasby(dot)net> wrote:
>
> On Dec 14, 2010, at 11:08 AM, Jeff Janes wrote:
>

>> I wouldn't expect an increase in shared_buffers to make contention on
>> BufFreelistLock worse.  If the increased buffers are used to hold
>> heavily-accessed data, then you will find the pages you want in
>> shared_buffers more often, and so need to run the clock-sweep less
>> often.  That should make up for longer sweeps.  But if the increased
>> buffers are used to hold data that is just read once and thrown away,
>> then the clock sweep shouldn't need to sweep very far before finding a
>> candidate.
>
> Well, we're talking about a working set that's between 96 and 192G, but
> only 8G (or 28G) of shared buffers. So there's going to be a pretty
> large amount of buffer replacement happening. We also have
> 210 tables where the ratio of heap buffer hits to heap reads is
> over 1000, so the stuff that is in shared buffers probably keeps
> usage_count quite high. Put these two together, and we're probably
> spending a fairly significant amount of time running the clock sweep.

The thing that makes me think the bottleneck is elsewhere is that
increasing from 8G to 28G made it worse. If buffer unpins are
happening at about the same rate, then my gut feeling is that the
clock sweep has to do about the same amount of decrementing before it
gets to a free buffer under steady state conditions. Whether it has
to decrement 8G of buffers three and a half times each, or 28G of
buffers one time each, it would do about the same amount of work.
This is all hand waving, of course.

> Even excluding our admittedly unusual workload, there is still significant overhead in running the clock sweep vs just grabbing something off of the free list (assuming we had separate locks for the two operations).

But do we actually know that? Doing a clock sweep is only a lot of
overhead if it has to pass over many buffers in order to find a good
one, and we don't know the numbers on that. I think you can sweep a
lot of buffers for the overhead of a single contended lock.

If the sweep and the freelist had separate locks, you still need to
lock the freelist to add to it things discovered during the sweep.

> Does anyone know what the overhead of getting a block from the filesystem cache is?

I did tests on this a few days ago. It took on average 20
microseconds per row to select one row via primary key when everything
was in shared buffers.
When everything was in RAM but not shared buffers, it took 40
microseconds. Of this, about 10 microseconds were the kernel calls to
seek and read from OS cache to shared_buffers, and the other 10
microseconds is some kind of PG overhead, I don't know where. The
timings are per select, not per page, and one select usually reads two
pages, one for the index leaf and one for the table.

This was all single-client usage on a 2.8GHz AMD Opteron. Not all the
components of the timings will scale equally with additional clients
on additional CPUs of course. I think the time spent in the kernel
calls to do the seek and read will scale better than most other parts.

> BTW, given our workload I can't see any way of running at debug2 without having a large impact on performance.

As long as you are adding #define BGW_DEBUG and recompiling, you might
as well promote all the DEBUG2 in src/backend/storage/buffer/bufmgr.c
to DEBUG1 or LOG. I think this will only generate a couple of log
messages per bgwriter_delay. That should be tolerable, especially for
testing purposes.

Cheers,

Jeff


From: Jim Nasby <jim(at)nasby(dot)net>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: BufFreelistLock
Date: 2010-12-15 23:02:24
Message-ID: 4C7C05F7-9360-4709-99EA-57E7B58199AB@nasby.net
Lists: pgsql-hackers

On Dec 15, 2010, at 2:40 PM, Jeff Janes wrote:
> On Tue, Dec 14, 2010 at 1:42 PM, Jim Nasby <jim(at)nasby(dot)net> wrote:
>>
>> On Dec 14, 2010, at 11:08 AM, Jeff Janes wrote:
>>> I wouldn't expect an increase in shared_buffers to make contention on
>>> BufFreelistLock worse. If the increased buffers are used to hold
>>> heavily-accessed data, then you will find the pages you want in
>>> shared_buffers more often, and so need to run the clock-sweep less
>>> often. That should make up for longer sweeps. But if the increased
>>> buffers are used to hold data that is just read once and thrown away,
>>> then the clock sweep shouldn't need to sweep very far before finding a
>>> candidate.
>>
>> Well, we're talking about a working set that's between 96 and 192G, but
>> only 8G (or 28G) of shared buffers. So there's going to be a pretty
>> large amount of buffer replacement happening. We also have
>> 210 tables where the ratio of heap buffer hits to heap reads is
>> over 1000, so the stuff that is in shared buffers probably keeps
>> usage_count quite high. Put these two together, and we're probably
>> spending a fairly significant amount of time running the clock sweep.
>
> The thing that makes me think the bottleneck is elsewhere is that
> increasing from 8G to 28G made it worse. If buffer unpins are
> happening at about the same rate, then my gut feeling is that the
> clock sweep has to do about the same amount of decrementing before it
> gets to a free buffer under steady state conditions. Whether it has
> to decrement 8G of buffers three and a half times each, or 28G of
> buffers one time each, it would do about the same amount of work.
> This is all hand waving, of course.

While we're waving hands... I think the issue is that our working set is massive, which means there is a lot of activity driving usage_count up on buffers. Increasing shared_buffers would reduce that effect once they hold more and more of the working set, but I suspect going from 8G to 28G wasn't enough to get there. Instead, we now have *more* buffers with a high usage count that the sweep has to slog through.

Anyway, once I'm able to get the buffer stats contrib module installed we'll have a better idea of what's actually happening.

>> Even excluding our admittedly unusual workload, there is still significant overhead in running the clock sweep vs just grabbing something off of the free list (assuming we had separate locks for the two operations).
>
> But do we actually know that? Doing a clock sweep is only a lot of
> overhead if it has to pass over many buffers in order to find a good
> one, and we don't know the numbers on that. I think you can sweep a
> lot of buffers for the overhead of a single contended lock.
>
> If the sweep and the freelist had separate locks, you still need to
> lock the freelist to add to it things discovered during the sweep.

I'm hoping we could use separate locks for adding and removing, assuming we discover this is actually a consideration.

>> Does anyone know what the overhead of getting a block from the filesystem cache is?
>
> I did tests on this a few days ago. It took on average 20
> microseconds per row to select one row via primary key when everything
> was in shared buffers.
> When everything was in RAM but not shared buffers, it took 40
> microseconds. Of this, about 10 microseconds were the kernel calls to
> seek and read from OS cache to shared_buffers, and the other 10
> microseconds is some kind of PG overhead, I don't know where. The
> timings are per select, not per page, and one select usually reads two
> pages, one for the index leaf and one for the table.
>
> This was all single-client usage on a 2.8GHz AMD Opteron. Not all the
> components of the timings will scale equally with additional clients
> on additional CPUs of course. I think the time spent in the kernel
> calls to do the seek and read will scale better than most other parts.

Interesting info. I wonder if that 10us of unknown overhead was related to shared buffers. Do you know if you had room in shared buffers when you ran that test? It would be interesting to see the differences between having buffers on the free list, no buffers on the free list but buffers with 0 usage count (though, I'm not sure how you could set that up), and shared buffers with high usage count.

>> BTW, given our workload I can't see any way of running at debug2 without having a large impact on performance.
>
> As long as you are adding #define BGW_DEBUG and recompiling, you might
> as well promote all the DEBUG2 in src/backend/storage/buffer/bufmgr.c
> to DEBUG1 or LOG. I think this will only generate a couple of log
> messages per bgwriter_delay. That should be tolerable, especially for
> testing purposes.

Good ideas; I'll try to get that in place once we can benchmark, though it'll be easier to get pg_buffercache in place, so I'll focus on that first.
--
Jim C. Nasby, Database Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net