Re: Further reduction of bufmgr lock contention

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Cc: "Gavin Hamill" <gdh(at)acentral(dot)co(dot)uk>
Subject: Further reduction of bufmgr lock contention
Date: 2006-04-21 17:01:36
Message-ID: 12051.1145638896@sss.pgh.pa.us
Lists: pgsql-hackers

I've been looking into Gavin Hamill's recent report of poor performance
with PG 8.1 on an 8-way IBM PPC64 box. strace'ing backends shows a lot
of semop() calls, indicating blocking at the LWLock or lmgr-lock levels,
but not a lot of select() delays, suggesting we don't have too much of a
problem at the hardware spinlock level. A typical breakdown of
different kernel call types is

566 _llseek
10 brk
10 gettimeofday
4 mmap
4 munmap
562 read
4 recv
8 select
3014 semop
12 send
1 time
3 write

(I'm hoping to get some oprofile results to confirm there's nothing
strange going on at the hardware level, but no luck yet on getting
oprofile to work on Debian/PPC64 ... anyone know anything about suitable
kernels to use for that?)

Instrumenting LWLockAcquire (with a patch I had developed last fall,
but just now got around to cleaning up and committing to CVS) shows
that the contention is practically all for the BufMappingLock:

$ grep ^PID postmaster.log | sort +9nr | head -20
PID 23820 lwlock 0: shacq 2446470 exacq 6154 blk 12755
PID 23823 lwlock 0: shacq 2387597 exacq 4297 blk 9255
PID 23824 lwlock 0: shacq 1678694 exacq 4433 blk 8692
PID 23826 lwlock 0: shacq 1221221 exacq 3224 blk 5893
PID 23821 lwlock 0: shacq 1892453 exacq 1665 blk 5766
PID 23835 lwlock 0: shacq 2390685 exacq 1453 blk 5511
PID 23822 lwlock 0: shacq 1669419 exacq 1615 blk 4926
PID 23830 lwlock 0: shacq 1039468 exacq 1248 blk 2946
PID 23832 lwlock 0: shacq 698622 exacq 397 blk 1818
PID 23836 lwlock 0: shacq 544472 exacq 530 blk 1300
PID 23839 lwlock 0: shacq 497505 exacq 46 blk 885
PID 23842 lwlock 0: shacq 305281 exacq 1 blk 720
PID 23840 lwlock 0: shacq 317554 exacq 226 blk 355
PID 23840 lwlock 2: shacq 0 exacq 2872 blk 7
PID 23835 lwlock 2: shacq 0 exacq 3434 blk 6
PID 23835 lwlock 1: shacq 0 exacq 1452 blk 4
PID 23822 lwlock 1: shacq 0 exacq 1614 blk 3
PID 23820 lwlock 2: shacq 0 exacq 3582 blk 2
PID 23821 lwlock 1: shacq 0 exacq 1664 blk 2
PID 23830 lwlock 1: shacq 0 exacq 1247 blk 2

These numbers show that our rewrite of the bufmgr has done a great job
of cutting down the amount of potential contention --- most of the
traffic on this lock is shared rather than exclusive acquisitions ---
but it seems that if you have enough CPUs it's still not good enough.
(My best theory as to why Gavin is seeing better performance from a
dual Opteron is simply that 2 processors will have 1/4th as much
contention as 8 processors.)

I have an idea about how to improve matters: I think we could break the
buffer-tag-to-buffer mapping hashtable into multiple partitions based on
some hash value of the buffer tags, and protect each partition under a
separate LWLock, similar to what we did with the lmgr lock table not
long ago. Anyone have a comment on this strategy, or a better idea?
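
As a rough sketch of the idea (illustrative names and types only, not
actual backend code), the partition for a given buffer tag would be
derived from its hash, and only that partition's lock would be taken:

#include <stdint.h>

#define NUM_BUF_PARTITIONS 16           /* would be a power of two */

typedef struct SketchBufferTag          /* stand-in for the real buffer tag */
{
    uint32_t    relNode;                /* which relation */
    uint32_t    blockNum;               /* which block within it */
} SketchBufferTag;

/* Hash the tag (any reasonable mixing will do for the sketch). */
static uint32_t
sketch_tag_hash(const SketchBufferTag *tag)
{
    return (tag->relNode * 2654435761u) ^ (tag->blockNum * 40503u);
}

/* Keep the low-order bits of the hash to pick a partition. */
static int
sketch_tag_partition(const SketchBufferTag *tag)
{
    return (int) (sketch_tag_hash(tag) & (NUM_BUF_PARTITIONS - 1));
}

/*
 * Each partition gets its own hash table protected by its own LWLock, so
 * lookups that fall into different partitions never contend:
 *
 *     part = sketch_tag_partition(&tag);
 *     LWLockAcquire(partition_lock[part], LW_SHARED);
 *     ... probe partition_table[part] ...
 *     LWLockRelease(partition_lock[part]);
 */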

regards, tom lane


From: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org, "Gavin Hamill" <gdh(at)acentral(dot)co(dot)uk>
Subject: Re: Further reduction of bufmgr lock contention
Date: 2006-04-21 17:39:45
Message-ID: 36e682920604211039h2d25a25h631023a93188803f@mail.gmail.com
Lists: pgsql-hackers

On 4/21/06, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I've been looking into Gavin Hamill's recent report of poor performance
> with PG 8.1 on an 8-way IBM PPC64 box.

We have recently encountered some odd performance with 8.2dev on a
16-way Opteron. In the next few days we'll look into it and see if it
may be related.

Otherwise, it sounds good to me; a trial of the new strategy should
definitely prove its value one way or the other.

--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, pgsql-hackers(at)postgreSQL(dot)org, Gavin Hamill <gdh(at)acentral(dot)co(dot)uk>
Subject: Re: Further reduction of bufmgr lock contention
Date: 2006-04-21 20:27:48
Message-ID: 1145651268.3112.95.camel@localhost.localdomain
Lists: pgsql-hackers

On Fri, 2006-04-21 at 13:01 -0400, Tom Lane wrote:
> I've been looking into Gavin Hamill's recent report of poor performance
> with PG 8.1 on an 8-way IBM PPC64 box.

Ah good.

> Instrumenting LWLockAcquire (with a patch I had developed last fall,
> but just now got around to cleaning up and committing to CVS) shows
> that the contention is practically all for the BufMappingLock:

> $ grep ^PID postmaster.log | sort +9nr | head -20
> PID 23820 lwlock 0: shacq 2446470 exacq 6154 blk 12755
> PID 23823 lwlock 0: shacq 2387597 exacq 4297 blk 9255
> PID 23824 lwlock 0: shacq 1678694 exacq 4433 blk 8692
> PID 23826 lwlock 0: shacq 1221221 exacq 3224 blk 5893

BufMappingLock contention can be made worse by a poorly tuned bgwriter
or if the cache hit rate is low. Perhaps in this case, increasing
shared_buffers (again) might be enough to further reduce the contention?

When we discussed this before
http://archives.postgresql.org/pgsql-hackers/2005-02/msg00702.php
ISTM then that a low shared_buffers cache hit rate combined with a high
OS cache hit rate will cause high contention in an SMP environment.

> These numbers show that our rewrite of the bufmgr has done a great job
> of cutting down the amount of potential contention --- most of the
> traffic on this lock is shared rather than exclusive acquisitions ---
> but it seems that if you have enough CPUs it's still not good enough.
> (My best theory as to why Gavin is seeing better performance from a
> dual Opteron is simply that 2 processors will have 1/4th as much
> contention as 8 processors.)

Jonah mentions some 16-way CPU testing we have just begun. There are
some interesting effects to decode, but most importantly all the CPUs do
stay at 100% for much of the time (when other tuning has been done). So
my feeling is that the BufMappingLock contention seen by Gavin is much
worse than what we see. (...and I had been meaning to investigate that
point further with him, though I have just arrived back in the UK).

Another difference is the amount of read/write. My understanding is that
Gavin's workload is mostly read-only, which will greatly increase the
buffer request rate, since backends will spend proportionally more time
consuming data and less time in xlog (etc).

My understanding is that contention increases geometrically with the
number of potential lock holders (i.e. CPUs).

> I have an idea about how to improve matters: I think we could break the
> buffer tag to buffer mapping hashtable into multiple partitions based on
> some hash value of the buffer tags, and protect each partition under a
> separate LWLock, similar to what we did with the lmgr lock table not
> long ago. Anyone have a comment on this strategy, or a better idea?

I think this is the right way to go
http://archives.postgresql.org/pgsql-hackers/2005-02/msg00240.php
though the work for 8.1 was right to have been performed first.

The earlier lmgr lock partitioning had a hard-coded number of
partitions, which was sensible because of the reduced likelihood of
effectiveness beyond a certain number of partitions. That doesn't follow
here since the BufMappingLock contention will vary with the size of
shared_buffers and with the number of CPUs in use (for a given
workload). I'd like to see the partitioning calculated at server startup
either directly from shared_buffers or via a parameter. We may not be
restricted to using only a hash function as we were with lmgr; perhaps
a simple range partitioning would do.
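
To sketch what that might look like (purely illustrative; the names,
sizing rule, and cap below are assumptions rather than a concrete
proposal), the partition count could be derived from NBuffers at server
startup, and a range-based mapping could stand in for the hash:

#include <stdint.h>

/* Sketch: pick a partition count at startup based on shared_buffers. */
static int
sketch_partition_count(int n_buffers)
{
    int     parts = 1;

    /* e.g. roughly one partition per 8192 buffers, capped at 64 */
    while (parts < 64 && parts * 8192 < n_buffers)
        parts *= 2;
    return parts;
}

/*
 * Sketch of simple range partitioning: consecutive ranges of block
 * numbers map to consecutive partitions, wrapping around.
 */
static int
sketch_range_partition(uint32_t block_num, int n_partitions)
{
    const uint32_t blocks_per_range = 1024;

    return (int) ((block_num / blocks_per_range) % (uint32_t) n_partitions);
}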

Test-wise: we may be able to trial something next week, though system
access is not yet confirmed, and I'm not sure we'll see an improvement on
the workload we're currently testing. I'll have a think about a pure
test that we can run on both systems to measure the contention.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com/


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, pgsql-hackers(at)postgreSQL(dot)org, Gavin Hamill <gdh(at)acentral(dot)co(dot)uk>
Subject: Re: Further reduction of bufmgr lock contention
Date: 2006-04-21 21:38:01
Message-ID: 13960.1145655481@sss.pgh.pa.us
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> On Fri, 2006-04-21 at 13:01 -0400, Tom Lane wrote:
>> I've been looking into Gavin Hamill's recent report of poor performance
>> with PG 8.1 on an 8-way IBM PPC64 box.

> BufMappingLock contention can be made worse by a poorly tuned bgwriter
> or if the cache hit rate is low. Perhaps in this case, increasing
> shared_buffers (again) might be enough to further reduce the contention?

Well, the cache hit rate is evidently pretty high, since so few of the
BufMappingLock accesses are exclusive (any miss would result in an
exclusive access). I did try increasing shared_buffers (from 40000 to
80000) but it didn't change performance noticeably.

> Another difference is the amount of read/write. My understanding is that
> Gavin's workload is mostly read-only which will greatly increase the
> buffer request rate since backends will spend proportionally more time
> consuming data and less time in xlog (etc).

I believe the particular test case being looked at here is read-only
(Gavin, is that correct?)

> The earlier lmgr lock partitioning had a hard-coded number of
> partitions, which was sensible because of the reduced likelihood of
> effectiveness beyond a certain number of partitions. That doesn't follow
> here since the BufMappingLock contention will vary with the size of
> shared_buffers and with the number of CPUs in use (for a given
> workload). I'd like to see the partitioning calculated at server startup
> either directly from shared_buffers or via a parameter. We may not be
> restricted to only using a hash function as we were with lmgr, perhaps
> using a simple range partitioning.

I don't think any of that follows; and a large number of partitions is
risky because it increases the probability of exhausting shared memory
(due to transient variations in the actual size of the hashtables for
different partitions).

> Test-wise: May be able to trial something next week, though system
> access not yet confirmed and I'm not sure we'll see an improvement on
> the workload we're testing on currently. I'll have a think about a pure
> test that we can run on both systems to measure the contention.

Keep in mind that Gavin's 8-way turns back into a pumpkin on Monday :-(

regards, tom lane


From: Gavin Hamill <gdh(at)acentral(dot)co(dot)uk>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Further reduction of bufmgr lock contention
Date: 2006-04-21 22:09:42
Message-ID: 20060421230942.85e1342e.gdh@acentral.co.uk
Lists: pgsql-hackers

On Fri, 21 Apr 2006 17:38:01 -0400
Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> I believe the particular test case being looked at here is read-only
> (Gavin, is that correct?)

Yes - I made sure the devels made it read-only so I could farm search
requests out to Slony-replicated machines (ended up running live
searches for the whole site on a host hundreds of miles away :)

> Keep in mind that Gavin's 8-way turns back into a pumpkin on
> Monday :-(

Aye, it would've been gone earlier today but the rental company were
being a bit slack, so it's been pushed back to Monday. The pickup is
already arranged, so I can't stall them at this stage.

I guess I could play the 'help the greater good by lending your kit for
open source devel' card with them once they get it back to their office.

Otherwise, I've seen at least one offer of pSeries hardware on
the performance list - I'm sure I could make available a version of our
data with all sensitive stuff removed so that it could be tested on
other machines.

.. and to top it all off, I didn't even get to go to the ball - and I
doubt there'll be a glass slipper on offer...

gdh


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, pgsql-hackers(at)postgreSQL(dot)org, Gavin Hamill <gdh(at)acentral(dot)co(dot)uk>
Subject: Re: Further reduction of bufmgr lock contention
Date: 2006-04-21 22:28:00
Message-ID: 1145658480.3112.108.camel@localhost.localdomain
Lists: pgsql-hackers

On Fri, 2006-04-21 at 17:38 -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:

> > The earlier lmgr lock partitioning had a hard-coded number of
> > partitions, which was sensible because of the reduced likelihood of
> > effectiveness beyond a certain number of partitions. That doesn't follow
> > here since the BufMappingLock contention will vary with the size of
> > shared_buffers and with the number of CPUs in use (for a given
> > workload). I'd like to see the partitioning calculated at server startup
> > either directly from shared_buffers or via a parameter. We may not be
> > restricted to only using a hash function as we were with lmgr, perhaps
> > using a simple range partitioning.
>
> I don't think any of that follows; and a large number of partitions is
> risky because it increases the probability of exhausting shared memory
> (due to transient variations in the actual size of the hashtables for
> different partitions).

lmgr partitioning uses either 4 or 16, restricted by the hash function,
for various reasons. I see no similar restriction on using a hash
function here - we could equally well use range partitioning. That
relieves the restriction on the number of partitions, allowing us either
more or fewer partitions, according to need. We can place a limit on that
if you see a problem - at what level do you see a problem?

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com/


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, pgsql-hackers(at)postgreSQL(dot)org, Gavin Hamill <gdh(at)acentral(dot)co(dot)uk>
Subject: Re: Further reduction of bufmgr lock contention
Date: 2006-04-22 01:39:53
Message-ID: 2778.1145669993@sss.pgh.pa.us
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> lmgr partitioning uses either 4 or 16, restricted by the hash function,
> for various reasons. I see no similar restriction on using a hash
> function here - we could equally well use range partitioning.

I don't really see any difference at all between the two cases as far as
what hashing we use.

The thing that's nagging at me at the moment is the realization that a
partitioned hashtable will eat more shared memory than a single
hashtable. It wasn't that long ago that we had to do some hacking to
ensure that the buffer hashtable couldn't run out of memory after
startup, and I'm afraid of re-introducing that failure mode. The lock
manager can run out of memory without crashing the system, but the
bufmgr can't (or at least could not in the recent past...)

Now that we're considering using partitioning methods for both the
buffer and lock hashtables, I wonder if there is any value in teaching
dynahash.c itself about concurrent access --- that is, find a way to use
a single shared hashtable instead of separate hashtables for the
different partitions, by having the hashtable itself deal with
concurrency to some extent. This is just handwaving at the moment...

regards, tom lane


From: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Further reduction of bufmgr lock contention
Date: 2006-04-24 07:09:30
Message-ID: e2htpt$2j5l$1@news.hub.org
Lists: pgsql-hackers


"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote
>
> The thing that's nagging at me at the moment is the realization that a
> partitioned hashtable will eat more shared memory than a single
> hashtable. It wasn't that long ago that we had to do some hacking to
> ensure that the buffer hashtable couldn't run out of memory after
> startup, and I'm afraid of re-introducing that failure mode. The lock
> manager can run out of memory without crashing the system, but the
> bufmgr can't (or at least could not in the recent past...)
>

IMHO overflow is not avoidable no matter whether we use hash or range.
Theoretically it seems we could have a data structure like this: (1) a set
of k partition tables, each with an LWLock and size NBuffers/k; (2) a set
of k overflow tables (actually we only need k-1) plus an LWLock protecting
them, each with size NBuffers/k. If any partition table overflows, we can
assign an overflow table to it to hold the extra hash elements. At run
time, the hash tables for the buffer pool may look like this:

[partition 0]
[partition 1][overflow 2]
[partition 2][overflow 0]
[partition 3]

But I am not sure how difficult it would be to implement in the current
hash code - more handwaving ...
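
A very rough sketch of the shape such a structure might take (placeholder
types only, not dynahash code):

#include <stddef.h>

#define K_PARTITIONS 4

typedef struct SketchHashTable SketchHashTable;  /* opaque stand-in for a hash table */

typedef struct SketchBufMapping
{
    /* k partition tables, each with its own LWLock, each sized NBuffers/k */
    SketchHashTable *partition[K_PARTITIONS];

    /*
     * k-1 spare overflow tables, plus one LWLock protecting their
     * assignment; overflow_for[i] is the overflow table currently lent
     * to partition i, or NULL if that partition has not overflowed.
     */
    SketchHashTable *spare[K_PARTITIONS - 1];
    SketchHashTable *overflow_for[K_PARTITIONS];
} SketchBufMapping;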

Regards,
Qingqing


From: Mark Wong <markw(at)osdl(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Gavin Hamill <gdh(at)acentral(dot)co(dot)uk>
Subject: Re: Further reduction of bufmgr lock contention
Date: 2006-04-24 16:29:11
Message-ID: 444CFCD7.1060403@osdl.org
Lists: pgsql-hackers

Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
>> On Fri, 2006-04-21 at 13:01 -0400, Tom Lane wrote:
>>> I've been looking into Gavin Hamill's recent report of poor performance
>>> with PG 8.1 on an 8-way IBM PPC64 box.
>
> Keep in mind that Gavin's 8-way turns back into a pumpkin on Monday :-(

I have a 4-way dual-core POWER5 system available...

Mark


From: Gavin Hamill <gdh(at)acentral(dot)co(dot)uk>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Further reduction of bufmgr lock contention
Date: 2006-05-23 15:54:22
Message-ID: 4473302E.9030502@acentral.co.uk
Lists: pgsql-hackers

Tom Lane wrote:
> I've been looking into Gavin Hamill's recent report of poor performance
> with PG 8.1 on an 8-way IBM PPC64 box.

[...]

Hullo again :)

I'm unfamiliar with postgres development practices, so this is more a
request for information than anything else.

It's been about a month since the last activity on bufmgr as documented
on the hackers list, and I was just concerned that this issue had been
filed as an interesting toy at the time but has now been left for the
circular filing cabinet :)

Tom + Simon were able to see a fairly easy 25% performance boost against
our dataset and I'd obv. be very keen to see this work make it into
8.1.4 or 8.2.0 :)

Cheers,
Gavin.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gavin Hamill <gdh(at)acentral(dot)co(dot)uk>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Further reduction of bufmgr lock contention
Date: 2006-05-23 16:05:34
Message-ID: 14969.1148400334@sss.pgh.pa.us
Lists: pgsql-hackers

Gavin Hamill <gdh(at)acentral(dot)co(dot)uk> writes:
> It's been about a month since the last activity on bufmgr as documented
> on the hackers list and I was just concerned that this issue had been
> filed as an interesting toy at the time, but now left for the circular
> filing cabinet :)

> Tom + Simon were able to see a fairly easy 25% performance boost against
> our dataset and I'd obv. be very keen to see this work make it into
> 8.1.4 or 8.2.0 :)

We're certainly not putting any such thing into 8.1.*. The proposed
patch for 8.2 is stalled ATM because of the problem of not having a
predictable size for the per-partition hash tables. Fixed-size shared
memory is a harsh mistress :-(

regards, tom lane


From: Gavin Hamill <gdh(at)acentral(dot)co(dot)uk>
To:
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Further reduction of bufmgr lock contention
Date: 2006-05-24 08:14:47
Message-ID: 447415F7.4030501@acentral.co.uk
Lists: pgsql-hackers

Tom Lane wrote:

> We're certainly not putting any such thing into 8.1.*. The proposed
> patch for 8.2 is stalled ATM because of the problem of not having a
> predictable size for the per-partition hash tables. Fixed-size shared
> memory is a harsh mistress :-(

Fair enough :)

Just wanted to ascertain that it was still a going concern - I have full
confidence that you'll have a brainwave one morning as to the perfect
solution =)

gdh


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gavin Hamill <gdh(at)acentral(dot)co(dot)uk>
Subject: Re: Further reduction of bufmgr lock contention
Date: 2006-05-24 18:54:06
Message-ID: 200605241154.07177.josh@agliodbs.com
Lists: pgsql-hackers

Tom,

BTW, we're going to be testing this patch on Sun Niagara servers. What's
the outstanding bug with it? I don't quite follow. I think I can get
some of the Sun MDEs to take a stab at it if I can understand the issue.
Links are fine if I've perhaps missed part of this thread in the archives.

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: josh(at)agliodbs(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org, Gavin Hamill <gdh(at)acentral(dot)co(dot)uk>
Subject: Re: Further reduction of bufmgr lock contention
Date: 2006-05-24 19:25:26
Message-ID: 909.1148498726@sss.pgh.pa.us
Lists: pgsql-hackers

Josh Berkus <josh(at)agliodbs(dot)com> writes:
> BTW, we're going to be testing this patch on Sun Niagara servers. What's
> the outstanding bug with it? I don't quite follow.

It's not acceptable as-is because of the risk of running out of shared
memory for hashtable entries. In the existing code, there's a clear
upper bound on the number of entries in the block-number-to-buffer hash
table, ie, shared_buffers + 1 (the +1 because we acquire the new entry
before releasing the old when reassigning a buffer). With multiple
hashtables serving subsets of the buffers, the different tables might
at different times need different numbers of entries, and that makes it
a lot harder to be sure you won't run out of memory. I don't say it's
insoluble, but the current patch wasn't even claimed to be safe by its
author...

regards, tom lane


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org, Gavin Hamill <gdh(at)acentral(dot)co(dot)uk>
Subject: Re: Further reduction of bufmgr lock contention
Date: 2006-05-30 01:29:08
Message-ID: 200605300129.k4U1T8022210@candle.pha.pa.us
Lists: pgsql-hackers


Is this a TODO?

---------------------------------------------------------------------------

Tom Lane wrote:
> Josh Berkus <josh(at)agliodbs(dot)com> writes:
> > BTW, we're going to be testing this patch on Sun Niagara servers. What's
> > the outstanding bug with it? I don't quite follow.
>
> It's not acceptable as-is because of the risk of running out of shared
> memory for hashtable entries. In the existing code, there's a clear
> upper bound on the number of entries in the block-number-to-buffer hash
> table, ie, shared_buffers + 1 (the +1 because we acquire the new entry
> before releasing the old when reassigning a buffer). With multiple
> hashtables serving subsets of the buffers, the different tables might
> at different times need different numbers of entries, and that makes it
> a lot harder to be sure you won't run out of memory. I don't say it's
> insoluble, but the current patch wasn't even claimed to be safe by its
> author...
>
> regards, tom lane
>

--
Bruce Momjian http://candle.pha.pa.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Gavin Hamill <gdh(at)acentral(dot)co(dot)uk>
Subject: Re: Further reduction of bufmgr lock contention
Date: 2006-07-20 19:41:28
Message-ID: 9019.1153424488@sss.pgh.pa.us
Lists: pgsql-hackers

A couple months ago we were discussing partitioning the buffer mapping
table so as to reduce contention for the BufMappingLock. The discussion
stalled after I complained about the uncertainty of shared memory usage
arising from having a separate hashtable for each partition (since
different numbers of buffers might fall into each partition at different
times). I've been thinking about it more after seeing an example from
Tatsuo that seems to be exactly the same problem as Gavin saw.

We could fix the uncertain-usage objection if we stick with a single
hashtable and ensure that concurrent access to different partitions of
it is safe. I believe this is easily doable, if we make hashtables
used in this mode allocate the maximum allowed number of buckets
immediately during hashtable initialization. (Since shared hashtables
already have a fixed maximum directory size, they already have an upper
bound on the number of buckets, so this loses no flexibility.) Then
there will be no on-the-fly bucket splits, and that means that accesses
to different hash buckets operate completely independently. Therefore,
we can regard the hashtable as logically partitioned on the basis of any
classification of entries that will uniquely assign hash buckets to
classes --- taking the low order bits of entry hash codes will do fine.

The only changeable state that is shared across all buckets is the entry
freelist and the "nentries" counter. We could protect these with a
spinlock (one spinlock is sufficient since changes to nentries go along
with addition or removal of freelist entries).

Usage of a partitioned hash table would then be like

compute hashtable lookup key;
entryhashcode = calc_hash(lookup key);
partitionnumber = entryhashcode % NumPartitions;
LWLockAcquire(PartitionLock[partitionnumber], ...);
manipulate hashtable;
LWLockRelease(PartitionLock[partitionnumber]);

We could do this without changing the API of hash_search, but then we'd
be computing the entry hashcode twice, so I'm inclined to provide an
additional entry point that takes a precalculated hashcode.
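
For illustration, such an entry point might have a shape like this (a
sketch only; the names and signature are placeholders, not the actual
dynahash API):

#include <stdbool.h>
#include <stdint.h>

typedef struct SketchHTAB SketchHTAB;   /* opaque hash table handle */
typedef enum { SKETCH_FIND, SKETCH_ENTER, SKETCH_REMOVE } SketchAction;

/* Existing style of entry point: hashes the key internally. */
void *sketch_hash_search(SketchHTAB *hashp, const void *key,
                         SketchAction action, bool *found);

/*
 * Additional entry point taking a precalculated hash code, so the caller
 * can use the same value both to choose the partition lock and to probe
 * the table, without hashing the key twice.
 */
void *sketch_hash_search_with_hash(SketchHTAB *hashp, const void *key,
                                   uint32_t hashcode,
                                   SketchAction action, bool *found);

The caller pattern then matches the pseudocode above: compute the hash
once, pick and take the partition lock from it, and pass the same hash
into the search call.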

Potential downsides of applying this idea to the buffer mapping table:

1. Reassigning a buffer to a new page will (usually) require two cycles
of LWLockAcquire/Release for the two different partitions involved.
Since this path also requires at least a read() kernel call (maybe a
write() too), I don't think there'll be any meaningful slowdown.

2. The current logic for reassigning a buffer attempts to make a
hashtable entry for its new page number (to detect collisions) before
releasing the old hashtable entry. This would only be safe if we held
both partition LWLocks concurrently; which seems bad for concurrency,
plus avoiding deadlock would complicate the code significantly. I'm
inclined to release the old entry and then try to insert the new one,
holding only one lock at a time. If the insertion fails (because
someone was concurrently loading the same page), we'd have to throw the
buffer onto the freelist rather than allowing it to retain its previous
valid data. Which is slightly annoying, but in practice the case
probably doesn't happen often enough to be worth worrying about (the
sequence is sketched just after this list).

3. Taking the freelist spinlock is new computation that wasn't there
before. But, again, it's only needed in code paths that will also be
doing a kernel call.
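
The reassignment sequence from point 2 would look roughly like this
(illustrative pseudocode in the style of the fragment above, not actual
bufmgr code):

oldpartition = calc_hash(old tag) % NumPartitions;
LWLockAcquire(PartitionLock[oldpartition], LW_EXCLUSIVE);
remove the old tag -> buffer entry;
LWLockRelease(PartitionLock[oldpartition]);

newpartition = calc_hash(new tag) % NumPartitions;
LWLockAcquire(PartitionLock[newpartition], LW_EXCLUSIVE);
try to insert the new tag -> buffer entry;
LWLockRelease(PartitionLock[newpartition]);

if the insert failed (someone else already loaded the page)
    put the buffer on the freelist;    /* its old contents are lost */
else
    read the new page into the buffer;

Only one partition lock is held at any instant, so the two partitions
cannot deadlock against each other.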

If we do this we should probably also handle the lmgr lock tables the
same way (partially reverting my partition-the-LockMgrLock patch of a
couple months ago). However, downside #3 might be a stronger objection
for lmgr, since it can create or destroy lock objects without necessarily
doing any I/O.

Thoughts, objections?

regards, tom lane


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Gavin Hamill <gdh(at)acentral(dot)co(dot)uk>
Subject: Re: Further reduction of bufmgr lock contention
Date: 2006-07-21 12:54:50
Message-ID: 1153486490.2592.276.camel@holly
Lists: pgsql-hackers

On Thu, 2006-07-20 at 15:41 -0400, Tom Lane wrote:

> Usage of a partitioned hash table would then be like
>
> compute hashtable lookup key;
> entryhashcode = calc_hash(lookup key);
> partitionnumber = entryhashcode % NumPartitions;
> LWLockAcquire(PartitionLock[partitionnumber], ...);
> manipulate hashtable;
> LWLockRelease(PartitionLock[partitionnumber]);
>
> We could do this without changing the API of hash_search, but then we'd
> be computing the entry hashcode twice, so I'm inclined to provide an
> additional entry point that takes a precalculated hashcode.

That should be an additional win anyway, since hash_any() uses about 1%
CPU on tests I've seen - so we will hold locks for a slightly shorter
duration.

> Potential downsides of applying this idea to the buffer mapping table:
>
> 1. Reassigning a buffer to a new page will (usually) require two cycles
> of LWLockAcquire/Release for the two different partitions involved.
> Since this path also requires at least a read() kernel call (maybe a
> write() too), I don't think there'll be any meaningful slowdown.

> 3. Taking the freelist spinlock is new computation that wasn't there
> before. But, again, it's only needed in code paths that will also be
> doing a kernel call.

...So the additional overhead sounds acceptable, given that we will save
somewhat on hash_any().

> If we do this we should probably also handle the lmgr lock tables the
> same way (partially reverting my partition-the-LockMgrLock patch of a
> couple months ago). However, downside #3 might be a stronger objection
> for lmgr, since it can create or destroy lock objects without necessarily
> doing any I/O.

We should be in a position to test this soon.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Gavin Hamill <gdh(at)acentral(dot)co(dot)uk>
Subject: Re: Further reduction of bufmgr lock contention
Date: 2006-08-26 03:23:35
Message-ID: 200608260323.k7Q3NZZ02977@momjian.us
Lists: pgsql-hackers


Is this being kept for 8.3?

---------------------------------------------------------------------------

Tom Lane wrote:
> A couple months ago we were discussing partitioning the buffer mapping
> table so as to reduce contention for the BufMappingLock. The discussion
> stalled after I complained about the uncertainty of shared memory usage
> arising from having a separate hashtable for each partition (since
> different numbers of buffers might fall into each partition at different
> times). I've been thinking about it more after seeing an example from
> Tatsuo that seems to be exactly the same problem as Gavin saw.
>
> We could fix the uncertain-usage objection if we stick with a single
> hashtable and ensure that concurrent access to different partitions of
> it is safe. I believe this is easily doable, if we make hashtables
> used in this mode allocate the maximum allowed number of buckets
> immediately during hashtable initialization. (Since shared hashtables
> already have a fixed maximum directory size, they already have an upper
> bound on the number of buckets, so this loses no flexibility.) Then
> there will be no on-the-fly bucket splits, and that means that accesses
> to different hash buckets operate completely independently. Therefore,
> we can regard the hashtable as logically partitioned on the basis of any
> classification of entries that will uniquely assign hash buckets to
> classes --- taking the low order bits of entry hash codes will do fine.
>
> The only changeable state that is shared across all buckets is the entry
> freelist and the "nentries" counter. We could protect these with a
> spinlock (one spinlock is sufficient since changes to nentries go along
> with addition or removal of freelist entries).
>
> Usage of a partitioned hash table would then be like
>
> compute hashtable lookup key;
> entryhashcode = calc_hash(lookup key);
> partitionnumber = entryhashcode % NumPartitions;
> LWLockAcquire(PartitionLock[partitionnumber], ...);
> manipulate hashtable;
> LWLockRelease(PartitionLock[partitionnumber]);
>
> We could do this without changing the API of hash_search, but then we'd
> be computing the entry hashcode twice, so I'm inclined to provide an
> additional entry point that takes a precalculated hashcode.
>
> Potential downsides of applying this idea to the buffer mapping table:
>
> 1. Reassigning a buffer to a new page will (usually) require two cycles
> of LWLockAcquire/Release for the two different partitions involved.
> Since this path also requires at least a read() kernel call (maybe a
> write() too), I don't think there'll be any meaningful slowdown.
>
> 2. The current logic for reassigning a buffer attempts to make a
> hashtable entry for its new page number (to detect collisions) before
> releasing the old hashtable entry. This would only be safe if we held
> both partition LWLocks concurrently; which seems bad for concurrency,
> plus avoiding deadlock would complicate the code significantly. I'm
> inclined to release the old entry and then try to insert the new one,
> holding only one lock at a time. If the insertion fails (because
> someone was concurrently loading the same page), we'd have to throw the
> buffer onto the freelist rather than allowing it to retain its previous
> valid data. Which is slightly annoying, but in practice the case
> probably doesn't happen often enough to be worth worrying about.
>
> 3. Taking the freelist spinlock is new computation that wasn't there
> before. But, again, it's only needed in code paths that will also be
> doing a kernel call.
>
> If we do this we should probably also handle the lmgr lock tables the
> same way (partially reverting my partition-the-LockMgrLock patch of a
> couple months ago). However, downside #3 might be a stronger objection
> for lmgr, since it can create or destroy lock objects without necessarily
> doing any I/O.
>
> Thoughts, objections?
>
> regards, tom lane
>

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Gavin Hamill <gdh(at)acentral(dot)co(dot)uk>
Subject: Re: Further reduction of bufmgr lock contention
Date: 2006-08-26 15:44:53
Message-ID: 12886.1156607093@sss.pgh.pa.us
Lists: pgsql-hackers

Bruce Momjian <bruce(at)momjian(dot)us> writes:
> Is this being kept for 8.3?

No, it was committed a month ago.

regards, tom lane