MIT benchmarks pgsql multicore (up to 48)performance

Lists:	pgsql-hackers pgsql-performance

From:	Hakan Kocaman <hkocam(at)googlemail(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org
Subject:	MIT benchmarks pgsql multicore (up to 48)performance
Date:	2010-10-04 14:44:23
Message-ID:	AANLkTimW2UgVPGw6MRUBj9HabvwmsmkZasL3hKv-nUJ2@mail.gmail.com
Lists:	pgsql-hackers pgsql-performance

Hi,

To whom it may concern:
http://pdos.csail.mit.edu/mosbench/

They tested with 8.3.9; I wonder what results 9.0 would give.

Best regards and keep up the good work

Hakan

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Hakan Kocaman <hkocam(at)googlemail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: MIT benchmarks pgsql multicore (up to 48)performance
Date:	2010-10-04 17:13:36
Message-ID:	AANLkTi=33M9adSmdBQuYBeeaWX-Z=c9+DcUp8oRRG_Ru@mail.gmail.com
Lists:	pgsql-hackers pgsql-performance

On Mon, Oct 4, 2010 at 10:44 AM, Hakan Kocaman <hkocam(at)googlemail(dot)com> wrote:
> for whom it may concern:
> http://pdos.csail.mit.edu/mosbench/
> They tested with 8.3.9, i wonder what results 9.0 would give.
> Best regards and keep up the good work
> Hakan

Here's the most relevant bit to us:

--
The “Stock” line in Figures 7 and 8 shows that PostgreSQL has poor
scalability on the stock kernel. The first bottleneck we encountered,
which caused the read/write workload’s total throughput to peak at
only 28 cores, was due to PostgreSQL’s design. PostgreSQL implements
row- and table-level locks atop user-level mutexes; as a result, even
a non-conflicting row- or table-level lock acquisition requires
exclusively locking one of only 16 global mutexes. This leads to
unnecessary contention for non-conflicting acquisitions of the same
lock—as seen in the read/write workload—and to false contention
between unrelated locks that hash to the same exclusive mutex. We
address this problem by rewriting PostgreSQL’s row- and table-level
lock manager and its mutexes to be lock-free in the uncontended case,
and by increasing the number of mutexes from 16 to 1024.
--

I believe the "one of only 16 global mutexes" comment is referring to
NUM_LOCK_PARTITIONS (there's also NUM_BUFFER_PARTITIONS, but that
wouldn't be relevant for row and table-level locks). Increasing that
from 16 to 1024 wouldn't be free and it's not clear to me that they've
done anything to work around the downsides of such a change. Perhaps
it's worthwhile anyway on a 48-core machine! The use of lock-free
techniques seems quite interesting; unfortunately, I know next to
nothing about the topic and this paper doesn't provide much of an
introduction. Anyone have a reference to a good introductory paper on
the topic?

The other sort of interesting thing that they mention is that
apparently I/O between shared buffers and the underlying data files
causes a lot of kernel contention due to inode locks induced by
lseek(). There's nothing much we can do about that within PG but
surely it would be nice if it got fixed upstream.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

From:	Michael Glaesemann <grzm(at)seespotcode(dot)net>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Hakan Kocaman <hkocam(at)googlemail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: MIT benchmarks pgsql multicore (up to 48)performance
Date:	2010-10-04 17:38:33
Message-ID:	1D0057B0-7151-425C-8096-56836F9AB4BD@seespotcode.net
Lists:	pgsql-hackers pgsql-performance

On Oct 4, 2010, at 13:13 , Robert Haas wrote:

> On Mon, Oct 4, 2010 at 10:44 AM, Hakan Kocaman <hkocam(at)googlemail(dot)com> wrote:
>> for whom it may concern:
>> http://pdos.csail.mit.edu/mosbench/
>> They tested with 8.3.9, i wonder what results 9.0 would give.
>> Best regards and keep up the good work
>> Hakan
>
> Here's the most relevant bit to us:

<snip/>

> The use of lock-free
> techniques seems quite interesting; unfortunately, I know next to
> nothing about the topic and this paper doesn't provide much of an
> introduction. Anyone have a reference to a good introductory paper on
> the topic?

The README in the postgres section of the git repo leads me to think the code that includes the fixes is there, if someone wants to look into it (with respect to the Postgres lock manager changes). Didn't check the licensing.

Michael Glaesemann
grzm seespotcode net

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Michael Glaesemann <grzm(at)seespotcode(dot)net>
Cc:	Hakan Kocaman <hkocam(at)googlemail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: MIT benchmarks pgsql multicore (up to 48)performance
Date:	2010-10-04 17:47:48
Message-ID:	AANLkTi=XWYZhPt+zvJYnSx_orrnjgU7Af=jyZF=MsBW5@mail.gmail.com
Lists:	pgsql-hackers pgsql-performance

On Mon, Oct 4, 2010 at 1:38 PM, Michael Glaesemann <grzm(at)seespotcode(dot)net> wrote:
>
> On Oct 4, 2010, at 13:13 , Robert Haas wrote:
>
>> On Mon, Oct 4, 2010 at 10:44 AM, Hakan Kocaman <hkocam(at)googlemail(dot)com> wrote:
>>> for whom it may concern:
>>> http://pdos.csail.mit.edu/mosbench/
>>> They tested with 8.3.9, i wonder what results 9.0 would give.
>>> Best regards and keep up the good work
>>> Hakan
>>
>> Here's the most relevant bit to us:
>
> <snip/>
>
>> The use of lock-free
>> techniques seems quite interesting; unfortunately, I know next to
>> nothing about the topic and this paper doesn't provide much of an
>> introduction. Anyone have a reference to a good introductory paper on
>> the topic?
>
> The README in the postgres section of the git repo leads me to think the code that includes the fixes is there, if someone wants to look into it (with respect to the Postgres lock manager changes). Didn't check the licensing.

It does, but it's a bunch of x86-specific hacks that break various
important features and include comments like "use usual technique for
lock-free thingamabob". So even if the licensing is/were suitable,
the code's not usable. I think the paper is neat from the point of
view of providing us with some information about where the scalability
bottlenecks might be on hardware to which most of us don't have easy
access, but as far as the implementation goes I think we're on our
own.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

From:	Dan Ports <drkp(at)csail(dot)mit(dot)edu>
To:	Hakan Kocaman <hkocam(at)googlemail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org
Subject:	Re: MIT benchmarks pgsql multicore (up to 48)performance
Date:	2010-10-04 17:55:45
Message-ID:	20101004175545.GA2690@csail.mit.edu
Lists:	pgsql-hackers pgsql-performance

I wasn't involved in this work but I do know a bit about it. Sadly, the
work on Postgres performance was cut down to under a page, complete
with the amazing offhand mention of "rewriting PostgreSQL's lock
manager". Here are a few more details...

The benchmarks in this paper are all about stressing the kernel. The
database is entirely in memory -- it's stored on tmpfs rather than on
disk, and it fits within shared_buffers. The workload consists of index
lookups and inserts on a single table. You can fill in all the caveats
about what conclusions can and cannot be drawn from this workload.

The big takeaway for -hackers, I think, is that lock manager
performance is going to be an issue for large multicore systems, and
the uncontended cases need to be lock-free. That includes cases where
multiple threads are trying to acquire the same lock in compatible
modes.

Currently even acquiring a shared heavyweight lock requires taking out
an exclusive LWLock on the partition, and acquiring shared LWLocks
requires acquiring a spinlock. All of this gets more expensive on
multicores, where even acquiring spinlocks can take longer than the
work being done in the critical section.

Their modifications to Postgres should be available in the code that
was published last night. As I understand it, the approach is to
implement LWLocks with atomic operations on a counter that contains
both the exclusive and shared lock count. Heavyweight locks do
something similar but with counters for each lock mode packed into a
word.
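
To make the counter idea concrete, here is a minimal sketch of a
reader/writer lock whose whole state lives in one atomic word. It is
not the MOSBENCH code and not PostgreSQL's LWLock implementation: the
names (sketch_lwlock, SKETCH_EXCLUSIVE) and the bit layout are
assumptions made for illustration, and it leaves out everything a real
lock needs when it cannot be granted at once (wait queues, sleeping,
fairness, overflow of the shared count).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: bit 31 = exclusive holder, bits 0..30 = shared count. */
#define SKETCH_EXCLUSIVE ((uint32_t) 1 << 31)

typedef struct
{
    _Atomic uint32_t state;
} sketch_lwlock;

static void
sketch_init(sketch_lwlock *l)
{
    atomic_init(&l->state, 0);
}

/* Shared acquire: one CAS when no exclusive holder is present. */
static bool
sketch_try_shared(sketch_lwlock *l)
{
    uint32_t old = atomic_load(&l->state);

    while ((old & SKETCH_EXCLUSIVE) == 0)
    {
        if (atomic_compare_exchange_weak(&l->state, &old, old + 1))
            return true;
        /* A failed CAS reloads 'old'; the loop condition re-checks it. */
    }
    return false;               /* a real lock would queue and sleep here */
}

/* Exclusive acquire: succeeds only if nobody holds the lock at all. */
static bool
sketch_try_exclusive(sketch_lwlock *l)
{
    uint32_t expected = 0;

    return atomic_compare_exchange_strong(&l->state, &expected,
                                          SKETCH_EXCLUSIVE);
}

static void
sketch_release_shared(sketch_lwlock *l)
{
    uint32_t prev = atomic_fetch_sub(&l->state, 1);

    /* The shared count must have been non-zero before the release. */
    assert((prev & ~SKETCH_EXCLUSIVE) > 0);
    (void) prev;
}

static void
sketch_release_exclusive(sketch_lwlock *l)
{
    atomic_fetch_and(&l->state, ~SKETCH_EXCLUSIVE);
}
```

In this sketch two backends taking the lock in shared mode each do one
compare-and-swap and never touch a spinlock, which is the "compatible
modes" case described above. The cache line holding the word still
bounces between cores, so the atomic operations are not free.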

Note that their implementation of the lock manager omits some features
for simplicity, like deadlock detection, 2PC, and probably any
semblance of portability. (These are the sort of things we're allowed
to do in the research world! :-)

The other major bottleneck they ran into was a kernel one: reading from
the heap file requires a couple lseek operations, and Linux acquires a
mutex on the inode to do that. The proper place to fix this is
certainly in the kernel but it may be possible to work around in
Postgres.

Dan

--
Dan R. K. Ports MIT CSAIL http://drkp.net/

From:	Greg Stark <gsstark(at)mit(dot)edu>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Hakan Kocaman <hkocam(at)googlemail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: MIT benchmarks pgsql multicore (up to 48)performance
Date:	2010-10-04 18:06:27
Message-ID:	AANLkTi=GTZEW5JDUcbmq1rrekdszJ0RPfAh0GUueFXpf@mail.gmail.com
Lists:	pgsql-hackers pgsql-performance

Here's a video on lock-free hashing for example:

http://video.google.com/videoplay?docid=2139967204534450862#

I guess by "lock-free in the uncontended case" they mean the buffer
cache manager is lock-free unless you're actually contending on the
same buffer?

From:	Dan Ports <drkp(at)csail(dot)mit(dot)edu>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Hakan Kocaman <hkocam(at)googlemail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: MIT benchmarks pgsql multicore (up to 48)performance
Date:	2010-10-04 18:35:42
Message-ID:	20101004183542.GB2690@csail.mit.edu
Lists:	pgsql-hackers pgsql-performance

On Mon, Oct 04, 2010 at 01:13:36PM -0400, Robert Haas wrote:
> I believe the "one of only 16 global mutexes" comment is referring to
> NUM_LOCK_PARTITIONS (there's also NUM_BUFFER_PARTITIONS, but that
> wouldn't be relevant for row and table-level locks).

Yes -- my understanding is that they hit two lock-related problems:
1) LWLock contention caused by acquiring the same lock in compatible
modes (e.g. multiple shared locks)
2) false contention caused by acquiring two locks that hashed to the
same partition
and the first was the worse problem. The lock-free structures helped
with both, so the impact of changing NUM_LOCK_PARTITIONS was less
interesting.

Dan

--
Dan R. K. Ports MIT CSAIL http://drkp.net/

From:	Josh Berkus <josh(at)agliodbs(dot)com>
To:	Dan Ports <drkp(at)csail(dot)mit(dot)edu>
Cc:	pgsql-performance(at)postgresql(dot)org
Subject:	Re: [HACKERS] MIT benchmarks pgsql multicore (up to 48)performance
Date:	2010-10-04 18:49:43
Message-ID:	4CAA21C7.1030701@agliodbs.com
Lists:	pgsql-hackers pgsql-performance

Dan,

(btw, OpenSQL Conference is going to be at MIT in 2 weeks. Think
anyone from the MOSBENCH team could attend?
http://www.opensqlcamp.org/Main_Page)

> The big takeaway for -hackers, I think, is that lock manager
> performance is going to be an issue for large multicore systems, and
> the uncontended cases need to be lock-free. That includes cases where
> multiple threads are trying to acquire the same lock in compatible
> modes.

Yes; we were aware of this due to work Jignesh did at Sun on TPC-E.

> Currently even acquiring a shared heavyweight lock requires taking out
> an exclusive LWLock on the partition, and acquiring shared LWLocks
> requires acquiring a spinlock. All of this gets more expensive on
> multicores, where even acquiring spinlocks can take longer than the
> work being done in the critical section.

Certainly, the question has always been how to fix it without breaking
major features and endangering data integrity.

> Note that their implementation of the lock manager omits some features
> for simplicity, like deadlock detection, 2PC, and probably any
> semblance of portability. (These are the sort of things we're allowed
> to do in the research world! :-)

Well, nice that you did! We'd never have that much time to experiment
with non-production stuff as a group in the project. So, now we have a
theoretical solution which we can look at maybe implementing parts of in
some watered-down form.

> The other major bottleneck they ran into was a kernel one: reading from
> the heap file requires a couple lseek operations, and Linux acquires a
> mutex on the inode to do that. The proper place to fix this is
> certainly in the kernel but it may be possible to work around in
> Postgres.

Or we could complain to Kernel.org. They've been fairly responsive in
the past. Too bad this didn't get posted earlier; I just got back from
LinuxCon.

So you know someone who can speak technically to this issue? I can put
them in touch with the Linux geeks in charge of that part of the kernel
code.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

From:	Dan Ports <drkp(at)csail(dot)mit(dot)edu>
To:	Greg Stark <gsstark(at)mit(dot)edu>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Hakan Kocaman <hkocam(at)googlemail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: MIT benchmarks pgsql multicore (up to 48)performance
Date:	2010-10-04 19:22:32
Message-ID:	A28AB36A-BA44-45FF-BC05-B8591B71827B@csail.mit.edu
Lists:	pgsql-hackers pgsql-performance

On Oct 4, 2010, at 11:06, Greg Stark <gsstark(at)mit(dot)edu> wrote:

> I guess by "lock-free in the uncontended case" they mean the buffer
> cache manager is lock-free unless you're actually contending on the
> same buffer?

That refers to being able to acquire non-conflicting row/table locks without needing an exclusive LWLock, and acquiring shared LWLocks without spinlocks if possible.

I think the buffer cache manager is the next bottleneck after the row/table lock manager. Seems like it would also be a good candidate for similar techniques, but that's totally uninformed speculation on my part.

Dan

From:	Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
To:	Hakan Kocaman <hkocam(at)googlemail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org
Subject:	Re: [PERFORM] MIT benchmarks pgsql multicore (up to 48)performance
Date:	2010-10-04 19:35:44
Message-ID:	AANLkTikGSfJEj1yyxG9PKLu4EP-rn9kza8fmhqhMCOVS@mail.gmail.com
Lists:	pgsql-hackers pgsql-performance

On Mon, Oct 4, 2010 at 8:44 AM, Hakan Kocaman <hkocam(at)googlemail(dot)com> wrote:
> Hi,
> for whom it may concern:
> http://pdos.csail.mit.edu/mosbench/
> They tested with 8.3.9, i wonder what results 9.0 would give.
> Best regards and keep up the good work

They mention that these tests were run on the older 8xxx series
Opterons, which have much slower memory speed and HT speed as well. I
wonder how much better the newer 6xxx series Magny-Cours would have
done on it... When I tested some simple benchmarks like pgbench, I
got scalability right to 48 processes on our 48 core magny cours
machines.

Still, lots of room for improvement in kernel and pgsql.

--
To understand recursion, one must first understand recursion.

From:	Ivan Voras <ivoras(at)freebsd(dot)org>
To:	pgsql-performance(at)postgresql(dot)org
Subject:	Re: [HACKERS] MIT benchmarks pgsql multicore (up to 48)performance
Date:	2010-10-06 22:31:19
Message-ID:	i8itbn$kre$1@dough.gmane.org
Lists:	pgsql-hackers pgsql-performance

On 10/04/10 20:49, Josh Berkus wrote:

>> The other major bottleneck they ran into was a kernel one: reading from
>> the heap file requires a couple lseek operations, and Linux acquires a
>> mutex on the inode to do that. The proper place to fix this is
>> certainly in the kernel but it may be possible to work around in
>> Postgres.
>
> Or we could complain to Kernel.org. They've been fairly responsive in
> the past. Too bad this didn't get posted earlier; I just got back from
> LinuxCon.
>
> So you know someone who can speak technically to this issue? I can put
> them in touch with the Linux geeks in charge of that part of the kernel
> code.

Hmmm... lseek? As in "lseek() then read() or write()" idiom? It AFAIK
cannot be fixed since you're modifying the global "stream position"
variable and something has got to lock that.

OTOH, pread() / pwrite() don't have to do that.

From:	Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
To:
Cc:	pgsql-performance(at)postgresql(dot)org
Subject:	Re: [HACKERS] MIT benchmarks pgsql multicore (up to 48)performance
Date:	2010-10-06 22:34:20
Message-ID:	AANLkTimpueKcj3VXR30Ecse5sc55qk8b_vsc_4xgxuG-@mail.gmail.com
Lists:	pgsql-hackers pgsql-performance

On Wed, Oct 6, 2010 at 5:31 PM, Ivan Voras <ivoras(at)freebsd(dot)org> wrote:
> On 10/04/10 20:49, Josh Berkus wrote:
>
>>> The other major bottleneck they ran into was a kernel one: reading from
>>> the heap file requires a couple lseek operations, and Linux acquires a
>>> mutex on the inode to do that. The proper place to fix this is
>>> certainly in the kernel but it may be possible to work around in
>>> Postgres.
>>
>> Or we could complain to Kernel.org. They've been fairly responsive in
>> the past. Too bad this didn't get posted earlier; I just got back from
>> LinuxCon.
>>
>> So you know someone who can speak technically to this issue? I can put
>> them in touch with the Linux geeks in charge of that part of the kernel
>> code.
>
> Hmmm... lseek? As in "lseek() then read() or write()" idiom? It AFAIK
> cannot be fixed since you're modifying the global "stream position"
> variable and something has got to lock that.
>
> OTOH, pread() / pwrite() don't have to do that.

While lseek is very "cheap" it is like any other system call in that
when you multiply "cheap" times "a jillion" you end up with "notable"
or even "lots". I've personally seen notable performance improvements
by switching to pread/pwrite instead of lseek+{read,write}. For
platforms that don't implement pread or pwrite, wrapper calls are
trivial to produce. One less system call is, in this case, 50% fewer.
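
For illustration, the two idioms can be put side by side as below.
This is a sketch, not a patch against PostgreSQL: the helper names
(read_at_seek, read_at_pread, demo_read_at) are made up here, and the
demo assumes a POSIX system with a writable /tmp.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Two system calls; the lseek() also moves the descriptor's file offset. */
static ssize_t
read_at_seek(int fd, void *buf, size_t len, off_t offset)
{
    assert(buf != NULL);
    if (lseek(fd, offset, SEEK_SET) == (off_t) -1)
        return -1;
    return read(fd, buf, len);
}

/* One system call; the descriptor's file offset is left untouched. */
static ssize_t
read_at_pread(int fd, void *buf, size_t len, off_t offset)
{
    assert(buf != NULL);
    return pread(fd, buf, len, offset);
}

/*
 * Returns 0 if both helpers read the same bytes and only the lseek()
 * variant moved the file offset; nonzero otherwise.
 */
static int
demo_read_at(void)
{
    char        path[] = "/tmp/read_at_demo_XXXXXX";
    char        a[4] = {0};
    char        b[4] = {0};
    int         fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);               /* file disappears once fd is closed */

    if (write(fd, "0123456789", 10) != 10 ||
        lseek(fd, 0, SEEK_SET) == (off_t) -1)
    {
        close(fd);
        return -1;
    }

    ssize_t     na = read_at_pread(fd, a, sizeof a, 3);
    off_t       pos_after_pread = lseek(fd, 0, SEEK_CUR);   /* still 0 */
    ssize_t     nb = read_at_seek(fd, b, sizeof b, 3);
    off_t       pos_after_seek = lseek(fd, 0, SEEK_CUR);    /* now 7 */

    close(fd);
    return (na == 4 && nb == 4 &&
            memcmp(a, "3456", 4) == 0 && memcmp(a, b, 4) == 0 &&
            pos_after_pread == 0 && pos_after_seek == 7) ? 0 : 1;
}
```

Calling demo_read_at() should return 0 on such a system. On a platform
without pread(), read_at_seek() is essentially the fallback wrapper
mentioned above, with the caveat that it changes the file offset and
pread() does not.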

--
Jon

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Ivan Voras <ivoras(at)freebsd(dot)org>
Cc:	pgsql-performance(at)postgresql(dot)org
Subject:	Re: [HACKERS] MIT benchmarks pgsql multicore (up to 48)performance
Date:	2010-10-07 00:39:48
Message-ID:	AANLkTikXHJL+u9OdgiqCBQLB-bMcrkZeB1eEeyZqBPvA@mail.gmail.com
Lists:	pgsql-hackers pgsql-performance

On Wed, Oct 6, 2010 at 6:31 PM, Ivan Voras <ivoras(at)freebsd(dot)org> wrote:
> On 10/04/10 20:49, Josh Berkus wrote:
>
>>> The other major bottleneck they ran into was a kernel one: reading from
>>> the heap file requires a couple lseek operations, and Linux acquires a
>>> mutex on the inode to do that. The proper place to fix this is
>>> certainly in the kernel but it may be possible to work around in
>>> Postgres.
>>
>> Or we could complain to Kernel.org. They've been fairly responsive in
>> the past. Too bad this didn't get posted earlier; I just got back from
>> LinuxCon.
>>
>> So you know someone who can speak technically to this issue? I can put
>> them in touch with the Linux geeks in charge of that part of the kernel
>> code.
>
> Hmmm... lseek? As in "lseek() then read() or write()" idiom? It AFAIK
> cannot be fixed since you're modifying the global "stream position"
> variable and something has got to lock that.

Well, there are lock free algorithms using CAS, no?

> OTOH, pread() / pwrite() don't have to do that.

Hey, I didn't know about those. That sounds like it might be worth
investigating, though I confess I lack a 48-core machine on which to
measure the alleged benefit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Ivan Voras <ivoras(at)freebsd(dot)org>
Cc:	pgsql-performance(at)postgresql(dot)org
Subject:	Re: [HACKERS] MIT benchmarks pgsql multicore (up to 48)performance
Date:	2010-10-07 01:25:07
Message-ID:	16636.1286414707@sss.pgh.pa.us
Lists:	pgsql-hackers pgsql-performance

Ivan Voras <ivoras(at)freebsd(dot)org> writes:
> On 10/04/10 20:49, Josh Berkus wrote:
>>> The other major bottleneck they ran into was a kernel one: reading from
>>> the heap file requires a couple lseek operations, and Linux acquires a
>>> mutex on the inode to do that.

> Hmmm... lseek? As in "lseek() then read() or write()" idiom? It AFAIK
> cannot be fixed since you're modifying the global "stream position"
> variable and something has got to lock that.

Um, there is no "global stream position" associated with an inode.
A file position is associated with an open-file descriptor.

If Josh quoted the problem correctly, the issue is that the kernel is
locking a file's inode (which may be referenced by quite a lot of file
descriptors) in order to change the state of one file descriptor.
It sure sounds like a possible source of contention to me.

regards, tom lane

From:	Stephen Frost <sfrost(at)snowman(dot)net>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Ivan Voras <ivoras(at)freebsd(dot)org>, pgsql-performance(at)postgresql(dot)org
Subject:	Re: [HACKERS] MIT benchmarks pgsql multicore (up to 48)performance
Date:	2010-10-07 01:30:12
Message-ID:	20101007013012.GN26232@tamriel.snowman.net
Lists:	pgsql-hackers pgsql-performance

* Robert Haas (robertmhaas(at)gmail(dot)com) wrote:
> Hey, I didn't know about those. That sounds like it might be worth
> investigating, though I confess I lack a 48-core machine on which to
> measure the alleged benefit.

I've got a couple 24-core systems, if it'd be sufficiently useful to
test with..

Stephen

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Stephen Frost <sfrost(at)snowman(dot)net>
Cc:	Ivan Voras <ivoras(at)freebsd(dot)org>, pgsql-performance(at)postgresql(dot)org
Subject:	Re: [HACKERS] MIT benchmarks pgsql multicore (up to 48)performance
Date:	2010-10-07 02:01:20
Message-ID:	AANLkTi=AfTAwMv=WXdZeLzY0NNBZ1sJHDsDK+wmRMcVS@mail.gmail.com
Lists:	pgsql-hackers pgsql-performance

On Wed, Oct 6, 2010 at 9:30 PM, Stephen Frost <sfrost(at)snowman(dot)net> wrote:
> * Robert Haas (robertmhaas(at)gmail(dot)com) wrote:
>> Hey, I didn't know about those. That sounds like it might be worth
>> investigating, though I confess I lack a 48-core machine on which to
>> measure the alleged benefit.
>
> I've got a couple 24-core systems, if it'd be sufficiently useful to
> test with..

It's good to be you.

I don't suppose you could try to replicate the lseek() contention?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

From:	Stephen Frost <sfrost(at)snowman(dot)net>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Ivan Voras <ivoras(at)freebsd(dot)org>, pgsql-performance(at)postgresql(dot)org
Subject:	Re: [HACKERS] MIT benchmarks pgsql multicore (up to 48)performance
Date:	2010-10-07 02:07:07
Message-ID:	20101007020707.GP26232@tamriel.snowman.net
Lists:	pgsql-hackers pgsql-performance

* Robert Haas (robertmhaas(at)gmail(dot)com) wrote:
> It's good to be you.

They're HP BL465 G7's w/ 2x 12-core AMD processors and 48G of RAM.
Unfortunately, they currently only have local storage, but it seems
unlikely that would be an issue for this.

> I don't suppose you could try to replicate the lseek() contention?

I can give it a shot, but the impression I had from the paper is that
the lseek() contention wouldn't be seen without the changes to the lock
manager...? Or did I misunderstand?

Thanks,

Stephen

From:	Ivan Voras <ivoras(at)freebsd(dot)org>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-performance(at)postgresql(dot)org
Subject:	Re: [HACKERS] MIT benchmarks pgsql multicore (up to 48)performance
Date:	2010-10-07 12:19:08
Message-ID:	AANLkTim9OcDweLHqqVxWVvbd06p+Vgx7oFHsOzJnqhPc@mail.gmail.com
Lists:	pgsql-hackers pgsql-performance

On 7 October 2010 03:25, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Ivan Voras <ivoras(at)freebsd(dot)org> writes:
>> On 10/04/10 20:49, Josh Berkus wrote:
>>>> The other major bottleneck they ran into was a kernel one: reading from
>>>> the heap file requires a couple lseek operations, and Linux acquires a
>>>> mutex on the inode to do that.
>
>> Hmmm... lseek? As in "lseek() then read() or write()" idiom? It AFAIK
>> cannot be fixed since you're modifying the global "stream position"
>> variable and something has got to lock that.
>
> Um, there is no "global stream position" associated with an inode.
> A file position is associated with an open-file descriptor.

You're right of course, I was pattern matching late last night on the
"lseek()" and "locking problems" keywords and ignored "inode".

> If Josh quoted the problem correctly, the issue is that the kernel is
> locking a file's inode (which may be referenced by quite a lot of file
> descriptors) in order to change the state of one file descriptor.
> It sure sounds like a possible source of contention to me.

Though it does depend on the details of how pg uses it. Forked
processes share their parents' file descriptors.
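
The distinction can be checked with a small experiment. The sketch
below is illustrative only (demo_offsets and current_offset are made-up
names, and a POSIX system with a writable /tmp is assumed). It opens
the same file twice and also dup()s one descriptor: the two open()
calls get independent file offsets even though they refer to the same
inode, while the dup()ed descriptor shares its offset with the
original, as a descriptor inherited across fork() would.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Report a descriptor's current file offset without moving it. */
static off_t
current_offset(int fd)
{
    assert(fd >= 0);
    return lseek(fd, 0, SEEK_CUR);
}

/*
 * Returns 0 if a separately opened descriptor keeps its own offset
 * while a dup()ed descriptor follows the original; nonzero otherwise.
 */
static int
demo_offsets(void)
{
    char        path[] = "/tmp/offset_demo_XXXXXX";
    int         fd1 = mkstemp(path);

    if (fd1 < 0)
        return -1;

    int         fd2 = open(path, O_RDONLY); /* second open of the same inode */
    int         fd3 = dup(fd1);             /* shares fd1's open file */
    int         result = -1;

    unlink(path);

    if (fd2 >= 0 && fd3 >= 0 &&
        write(fd1, "abcdefgh", 8) == 8 &&
        lseek(fd1, 5, SEEK_SET) == 5)
    {
        off_t       separate = current_offset(fd2);     /* expected 0 */
        off_t       shared = current_offset(fd3);       /* expected 5 */

        result = (separate == 0 && shared == 5) ? 0 : 1;
    }

    if (fd2 >= 0)
        close(fd2);
    if (fd3 >= 0)
        close(fd3);
    close(fd1);
    return result;
}
```

Which of the two cases applies to a given backend depends on whether it
opened the file itself or inherited the descriptor. Either way the
offset belongs to the open file, not to the inode.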

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Stephen Frost <sfrost(at)snowman(dot)net>
Cc:	Ivan Voras <ivoras(at)freebsd(dot)org>, pgsql-performance(at)postgresql(dot)org
Subject:	Re: [HACKERS] MIT benchmarks pgsql multicore (up to 48)performance
Date:	2010-10-07 12:33:07
Message-ID:	AANLkTi=yhgL0KGmwx7SK20nBv4jqFufcscutvufkT+CN@mail.gmail.com
Lists:	pgsql-hackers pgsql-performance

On Wed, Oct 6, 2010 at 10:07 PM, Stephen Frost <sfrost(at)snowman(dot)net> wrote:
> * Robert Haas (robertmhaas(at)gmail(dot)com) wrote:
>> It's good to be you.
>
> They're HP BL465 G7's w/ 2x 12-core AMD processors and 48G of RAM.
> Unfortunately, they currently only have local storage, but it seems
> unlikely that would be an issue for this.
>
>> I don't suppose you could try to replicate the lseek() contention?
>
> I can give it a shot, but the impression I had from the paper is that
> the lseek() contention wouldn't be seen without the changes to the lock
> manager...? Or did I misunderstand?

Looks like the lock manager problems hit at 28 cores, and the lseek
problems at 36 cores. So your system might not even be big enough to
manifest either problem.

It's unclear to me whether a 48-core system would be able to see the
lseek issues without improvements to the lock manager, but perhaps it
would be possible by, say, increasing the number of lock partitions by
8x. It would be nice to segregate these issues though, because using
pread/pwrite is probably a lot less work than rewriting our lock
manager. Do you have tools to measure the lseek overhead? If so, we
could prepare a patch to use pread()/pwrite() and just see whether
that reduced the overhead, without worrying so much about whether it
was actually a major bottleneck.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

From:	Ivan Voras <ivoras(at)freebsd(dot)org>
To:	pgsql-performance(at)postgresql(dot)org
Subject:	Re: [HACKERS] MIT benchmarks pgsql multicore (up to 48)performance
Date:	2010-10-07 12:47:06
Message-ID:	i8kfft$e5j$1@dough.gmane.org
Lists:	pgsql-hackers pgsql-performance

On 10/07/10 02:39, Robert Haas wrote:
> On Wed, Oct 6, 2010 at 6:31 PM, Ivan Voras<ivoras(at)freebsd(dot)org> wrote:
>> On 10/04/10 20:49, Josh Berkus wrote:
>>
>>>> The other major bottleneck they ran into was a kernel one: reading from
>>>> the heap file requires a couple lseek operations, and Linux acquires a
>>>> mutex on the inode to do that. The proper place to fix this is
>>>> certainly in the kernel but it may be possible to work around in
>>>> Postgres.
>>>
>>> Or we could complain to Kernel.org. They've been fairly responsive in
>>> the past. Too bad this didn't get posted earlier; I just got back from
>>> LinuxCon.
>>>
>>> So you know someone who can speak technically to this issue? I can put
>>> them in touch with the Linux geeks in charge of that part of the kernel
>>> code.
>>
>> Hmmm... lseek? As in "lseek() then read() or write()" idiom? It AFAIK
>> cannot be fixed since you're modifying the global "stream position"
>> variable and something has got to lock that.
>
> Well, there are lock free algorithms using CAS, no?

Nothing is really "lock free" - in this case the algorithms simply push
the locking down to atomic operations on the CPU (and the memory bus).
Semantically, *something* has to lock the memory region for however
brief period of time and then propagate that update to other CPUs'
caches (i.e. invalidate them).
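
As a concrete picture of what "using CAS" means here, the following
sketch advances a shared position with a compare-and-swap loop instead
of a mutex. It says nothing about how any kernel implements lseek();
sketch_stream and sketch_advance are hypothetical names, and C11
atomics are assumed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

typedef struct
{
    _Atomic int64_t pos;        /* shared position, updated without a mutex */
} sketch_stream;

/*
 * Atomically advance the position by 'len' and return the old value.
 * If another thread changes 'pos' between the load and the CAS, the
 * CAS fails, 'old' is refreshed with the current value, and we retry.
 */
static int64_t
sketch_advance(sketch_stream *s, int64_t len)
{
    assert(len >= 0);

    int64_t     old = atomic_load(&s->pos);

    while (!atomic_compare_exchange_weak(&s->pos, &old, old + len))
    {
        /* nothing to do: 'old' was updated by the failed CAS */
    }
    return old;
}
```

No thread ever blocks while holding anything, which is the sense in
which such code is called lock-free. The hardware still has to
serialize access to the cache line holding pos, as noted above, so the
cost under heavy contention does not disappear.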

>> OTOH, pread() / pwrite() don't have to do that.
>
> Hey, I didn't know about those. That sounds like it might be worth
> investigating, though I confess I lack a 48-core machine on which to
> measure the alleged benefit.

As Jon said, it will in any case reduce the number of these syscalls by
half, and they can be wrapped by a C macro for the platforms which don't
implement them.

http://man.freebsd.org/pread

(and just in case it's needed: pread() is a special case of preadv()).

From:	"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To:	"Robert Haas" <robertmhaas(at)gmail(dot)com>, "Stephen Frost" <sfrost(at)snowman(dot)net>
Cc:	"Ivan Voras" <ivoras(at)freebsd(dot)org>, <pgsql-performance(at)postgresql(dot)org>
Subject:	Re: [HACKERS] MIT benchmarks pgsql multicore (up to 48)performance
Date:	2010-10-07 17:21:21
Message-ID:	4CADBB410200002500036644@gw.wicourts.gov
Lists:	pgsql-hackers pgsql-performance

Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> perhaps it would be possible by, say, increasing the number of
> lock partitions by 8x. It would be nice to segregate these issues
> though, because using pread/pwrite is probably a lot less work
> than rewriting our lock manager.

You mean easier than changing this 4 to a 7?:

#define LOG2_NUM_LOCK_PARTITIONS 4

Or am I missing something?
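
For reference, the arithmetic behind that change: 1 << 4 is 16
partitions (the "16 global mutexes" in the paper) and 1 << 7 is 128,
which is the 8x increase Robert mentioned. The sketch below assumes a
lock's partition is chosen by taking its hash code modulo the partition
count; the function names are illustrative, not PostgreSQL's.

```c
#include <assert.h>
#include <stdint.h>

/* The partition count is a power of two, as in the #define above. */
static uint32_t
num_partitions(unsigned log2_partitions)
{
    assert(log2_partitions < 32);
    return (uint32_t) 1 << log2_partitions;
}

/* Hypothetical mapping from a lock's hash code to its partition. */
static uint32_t
partition_for(uint32_t hashcode, unsigned log2_partitions)
{
    return hashcode % num_partitions(log2_partitions);
}
```

With 16 partitions, hash codes 5 and 21 both map to partition 5, so two
unrelated locks contend for the same partition lock. With 128
partitions they map to 5 and 21 and no longer collide. More partitions
reduce this false contention but do nothing for several backends taking
the same lock in compatible modes.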

-Kevin

From:	Stephen Frost <sfrost(at)snowman(dot)net>
To:	Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Ivan Voras <ivoras(at)freebsd(dot)org>, pgsql-performance(at)postgresql(dot)org
Subject:	Re: [HACKERS] MIT benchmarks pgsql multicore (up to 48)performance
Date:	2010-10-07 18:06:20
Message-ID:	20101007180620.GY26232@tamriel.snowman.net
Lists:	pgsql-hackers pgsql-performance

* Kevin Grittner (Kevin(dot)Grittner(at)wicourts(dot)gov) wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> > perhaps it would be possible by, say, increasing the number of
> > lock partitions by 8x. It would be nice to segregate these issues
> > though, because using pread/pwrite is probably a lot less work
> > than rewriting our lock manager.
>
> You mean easier than changing this 4 to a 7?:
>
> #define LOG2_NUM_LOCK_PARTITIONS 4
>
> Or am I missing something?

I'm pretty sure we were talking about the change described in the paper
of moving to a system which uses atomic changes instead of spinlocks for
certain locking situations..

If that's all the MIT folks did, they certainly made it sound like a lot
more. :)

Stephen

From:	"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To:	"Stephen Frost" <sfrost(at)snowman(dot)net>
Cc:	"Ivan Voras" <ivoras(at)freebsd(dot)org>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, <pgsql-performance(at)postgresql(dot)org>
Subject:	Re: [HACKERS] MIT benchmarks pgsql multicore (up to 48)performance
Date:	2010-10-07 18:22:02
Message-ID:	4CADC97A0200002500036662@gw.wicourts.gov
Lists:	pgsql-hackers pgsql-performance

Stephen Frost <sfrost(at)snowman(dot)net> wrote:
> Kevin Grittner (Kevin(dot)Grittner(at)wicourts(dot)gov) wrote:
>> Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

>>> perhaps it would be possible by, say, increasing the number of
>>> lock partitions by 8x.

>> changing this 4 to a 7?:
>>
>> #define LOG2_NUM_LOCK_PARTITIONS 4

> I'm pretty sure we were talking about the change described in the
> paper of moving to a system which uses atomic changes instead of
> spinlocks for certain locking situations..

Well, they also mentioned increasing the number of lock partitions
to reduce contention, and that seemed to be what Robert was talking
about in the quoted section.

Of course, that's not the *only* thing they did; it's just the point
which seemed to be under discussion just there.

-Kevin

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc:	Stephen Frost <sfrost(at)snowman(dot)net>, Ivan Voras <ivoras(at)freebsd(dot)org>, pgsql-performance(at)postgresql(dot)org
Subject:	Re: [HACKERS] MIT benchmarks pgsql multicore (up to 48)performance
Date:	2010-10-07 20:31:36
Message-ID:	AANLkTikKaNhHHimcO1XPdwvTCa-HYH4UJAGwgHjTKaQM@mail.gmail.com
Lists:	pgsql-hackers pgsql-performance

On Thu, Oct 7, 2010 at 1:21 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
>> perhaps it would be possible by, say, increasing the number of
>> lock partitions by 8x. It would be nice to segregate these issues
>> though, because using pread/pwrite is probably a lot less work
>> than rewriting our lock manager.
>
> You mean easier than changing this 4 to a 7?:
>
> #define LOG2_NUM_LOCK_PARTITIONS 4
>
> Or am I missing something?

Right. They did something more complicated (and, I think, better)
than that, but that change by itself might be enough to ameliorate the
lock contention enough to see the lseek() issue.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company