From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: mosbench revisited
Date: 2011-08-03 18:21:25
Message-ID: CA+TgmoZWdo9XrH=TN59GX8rJM9FgiezpAA-B57ZEVOGof49FVA@mail.gmail.com
Lists: pgsql-hackers

About nine months ago, we had a discussion of some benchmarking that
was done by the mosbench folks at MIT:

http://archives.postgresql.org/pgsql-hackers/2010-10/msg00160.php

Although the authors used PostgreSQL as a test harness for driving
load, it's pretty clear from reading the paper that their primary goal
was to stress the Linux kernel, so the applicability of the paper to
real-world PostgreSQL performance improvement is less than it might
be. Still, having now actually investigated in some detail many of
the same performance issues that they were struggling with, I have a
much clearer understanding of what's really going on here. In
PostgreSQL terms, here are the bottlenecks they ran into:

1. "We configure PostgreSQL to use a 2 Gbyte application-level cache
because PostgreSQL protects its free-list with a single lock and thus
scales poorly with smaller caches." This is a complaint about
BufFreeList lock which, in fact, I've seen as a huge point of
contention on some workloads. In fact, on read-only workloads, with
my lazy vxid lock patch applied, this is, I believe, the only
remaining unpartitioned LWLock that is ever taken in exclusive mode;
or at least the only one that's taken anywhere near often enough to
matter. I think we're going to do something about this, although I
don't have a specific idea in mind at the moment.

2. "PostgreSQL implements row- and table-level locks atop user-level
mutexes; as a result, even a non-conflicting row- or table-level lock
acquisition requires exclusively locking one of only 16 global
mutexes." I think that the reference to row-level locks here is a red
herring; or at least, I haven't seen any evidence that row-level
locking is a meaningful source of contention on any workload I've
tested. Table-level locks clearly are, and this is the problem that
the now-committed fastlock patch addressed. So, fixed!

3. "Our workload creates one PostgreSQL connection per server core and
sends queries (selects or updates) in batches of 256, aggregating
successive read-only transactions into single transactions. This
workload is intended to minimize application-level contention within
PostgreSQL in order to maximize the stress PostgreSQL places on the
kernel." I had no idea what this was talking about at the time, but
it's now obvious in retrospect that they were working around the
overhead imposed by acquiring and releasing relation and virtualxid
locks. My pending "lazy vxids" patch will address the remaining issue
here.

4. "With modified PostgreSQL on stock Linux, throughput for both
workloads collapses at 36 cores ... The main reason is the kernel's
lseek implementation." With the fastlock, sinval-hasmessages, and
lazy-vxid patches applied (the first two are committed now), it's now
much easier to run headlong into this bottleneck. Prior to those
patches, for this to be an issue, you would need to batch your queries
together in big groups to avoid getting whacked by the lock manager
and/or sinval overhead first. With those problems and the recently
discovered bottleneck in glibc's random() implementation fixed, good
old pgbench -S is enough to hit this problem if you have enough
clients and enough cores. And it turns out that the word "collapse"
is not an exaggeration. On a 64-core Intel box running RHEL 6.1,
performance ramped up from 24k TPS at 4 clients to 175k TPS at 32
clients and then to 207k TPS at 44 clients. After that it fell off a
cliff, dropping to 93k TPS at 52 clients and 26k TPS at 64 clients,
consuming truly horrifying amounts of system time in the process. A
somewhat tedious investigation revealed that the problem is, in fact,
contention on the inode mutex caused by lseek(). Results are much
better with -M prepared (310k TPS at 48 clients, 294k TPS at 64
clients). All one-minute tests with scale factor 100, fitting inside
8GB of shared_buffers (clearly not enough for serious benchmarking,
but enough to demonstrate this issue).

It would be nice if the Linux guys would fix this problem for us, but
I'm not sure whether they will. For those who may be curious, the
problem is in generic_file_llseek() in fs/read-write.c. On a platform
with 8-byte atomic reads, it seems like it ought to be very possible
to read inode->i_size without taking a spinlock. A little Googling
around suggests that some patches along these lines have been proposed
and - for reasons that I don't fully understand - rejected. That now
seems unfortunate. Barring a kernel-level fix, we could try to
implement our own cache to work around this problem. However, any
such cache would need to be darn cheap to check and update (since we
can't assume that relation extension is an infrequent event) and must
somehow avoid the same sort of mutex contention that's killing the
kernel in this workload.
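
To make the mechanics concrete, here's a minimal sketch of what's going
on at the file level (this is not the actual PostgreSQL source; the real
code lives in the smgr/md layer):

#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192                 /* PostgreSQL's default block size */

/*
 * Minimal sketch of deriving a segment's length in blocks from
 * lseek(SEEK_END); every such call ends up in the kernel's
 * generic_file_llseek(), which is where the contended lock lives.
 */
static long
segment_nblocks(int fd)
{
    off_t len = lseek(fd, 0, SEEK_END);     /* the contended system call */

    if (len < 0)
        return -1;                          /* caller handles the error */
    return (long) (len / BLCKSZ);
}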

5. With all of the above problems fixed or worked around, the authors
write, "PostgreSQL's overall scalability is primarily limited by
contention for the spinlock protecting the buffer cache page for the
root of the table index". This is the only problem on their list that
I haven't yet encountered in testing. I'm kind of interested by the
result, actually, as I had feared that the spinlock protecting
ProcArrayLock was going to be a bigger problem sooner. But maybe not.
I'm also concerned about the spinlock protecting the buffer mapping
lock that covers the root index page. I'll investigate further if and
when I come up with a way to dodge the lseek() contention problem.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: mosbench revisited
Date: 2011-08-03 18:41:29
Message-ID: 20110803184128.GC24821@svana.org
Lists: pgsql-hackers

On Wed, Aug 03, 2011 at 02:21:25PM -0400, Robert Haas wrote:
> It would be nice if the Linux guys would fix this problem for us, but
> I'm not sure whether they will. For those who may be curious, the
> problem is in generic_file_llseek() in fs/read-write.c. On a platform
> with 8-byte atomic reads, it seems like it ought to be very possible
> to read inode->i_size without taking a spinlock.

Interesting. There's this thread from 2003 suggesting the use of pread
instead, it was rejected on the argument that lseek is cheap so not a
problem.

http://archives.postgresql.org/pgsql-patches/2003-02/msg00197.php

Perhaps we now have a benchmark where the effect can be measured.

There's the issue about whether it screws up the readahead mechanism...

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.
-- Arthur Schopenhauer


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: mosbench revisited
Date: 2011-08-03 18:49:31
Message-ID: 21344.1312397371@sss.pgh.pa.us
Lists: pgsql-hackers

Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> On Wed, Aug 03, 2011 at 02:21:25PM -0400, Robert Haas wrote:
>> It would be nice if the Linux guys would fix this problem for us, but
>> I'm not sure whether they will. For those who may be curious, the
>> problem is in generic_file_llseek() in fs/read-write.c. On a platform
>> with 8-byte atomic reads, it seems like it ought to be very possible
>> to read inode->i_size without taking a spinlock.

> Interesting. There's this thread from 2003 suggesting the use of pread
> instead, it was rejected on the argument that lseek is cheap so not a
> problem.

> http://archives.postgresql.org/pgsql-patches/2003-02/msg00197.php

That seems rather unrelated. The point here is our use of lseek to find
out the current file size --- or at least, I would hope they're not
trying to read the inode's file size in a SEEK_CUR call.

The reason "-M prepared" helps is presumably that it eliminates most of
the RelationGetNumberOfBlocks calls the planner does to check current
table size. While we could certainly consider using a cheaper (possibly
more stale) value there, it's a bit astonishing to think that that's the
main cost in a parse/plan/execute cycle. Perhaps there are more hotspot
calls than that one?

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: mosbench revisited
Date: 2011-08-03 19:11:44
Message-ID: CA+TgmobywULdmOh8Yqc8YHO9qvqD_XASeiLsXCR3mFnDmiXrTg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Aug 3, 2011 at 2:49 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
>> On Wed, Aug 03, 2011 at 02:21:25PM -0400, Robert Haas wrote:
>>> It would be nice if the Linux guys would fix this problem for us, but
>>> I'm not sure whether they will.  For those who may be curious, the
>>> problem is in generic_file_llseek() in fs/read-write.c.  On a platform
>>> with 8-byte atomic reads, it seems like it ought to be very possible
>>> to read inode->i_size without taking a spinlock.
>
>> Interesting. There's this thread from 2003 suggesting the use of pread
>> instead, it was rejected on the argument that lseek is cheap so not a
>> problem.
>
>> http://archives.postgresql.org/pgsql-patches/2003-02/msg00197.php
>
> That seems rather unrelated.  The point here is our use of lseek to find
> out the current file size --- or at least, I would hope they're not
> trying to read the inode's file size in a SEEK_CUR call.

Correct.

> The reason "-M prepared" helps is presumably that it eliminates most of
> the RelationGetNumberOfBlocks calls the planner does to check current
> table size.  While we could certainly consider using a cheaper (possibly
> more stale) value there, it's a bit astonishing to think that that's the
> main cost in a parse/plan/execute cycle.  Perhaps there are more hotspot
> calls than that one?

Nope.

On a straight pgbench -S test, you get four system calls per query:
recvfrom(), lseek(), lseek(), sendto(). Adding -M prepared eliminates
the two lseeks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: mosbench revisited
Date: 2011-08-03 19:38:50
Message-ID: 22145.1312400330@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On a straight pgbench -S test, you get four system calls per query:
> recvfrom(), lseek(), lseek(), sendto(). Adding -M prepared eliminates
> the two lseeks.

[ scratches head... ] Two? Is that one for the table and one for its
lone index, or are we being redundant there?

(If the query ended up being a seqscan, I'd expect a second
lseek(SEEK_END) when the executor starts up, but I gather from the other
complaints that the mosbench people were only testing simple indexscan
queries.)

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: mosbench revisited
Date: 2011-08-03 20:10:28
Message-ID: CA+TgmoZi262u_mX3DPt9bRDvMyn3_Wky5NNpE9JR+ky-7icHYw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Aug 3, 2011 at 3:38 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On a straight pgbench -S test, you get four system calls per query:
>> recvfrom(), lseek(), lseek(), sendto().  Adding -M prepared eliminates
>> the two lseeks.
>
> [ scratches head... ]  Two?

Yep.

> Is that one for the table and one for its
> lone index, or are we being redundant there?

The former. Specifically, it appears we're smart enough to only test
the last segment (in this case, the table is large enough that there
is a .1 file, and that's what we're lseeking).

> (If the query ended up being a seqscan, I'd expect a second
> lseek(SEEK_END) when the executor starts up, but I gather from the other
> complaints that the mosbench people were only testing simple indexscan
> queries.)

Yeah, it seems that for a sequential scan we lseek the heap, then the
index, then the heap again; but for index scans we just hit the heap
and the index.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: mosbench revisited
Date: 2011-08-03 20:38:16
Message-ID: 23078.1312403896@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Wed, Aug 3, 2011 at 3:38 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> (If the query ended up being a seqscan, I'd expect a second
>> lseek(SEEK_END) when the executor starts up, but I gather from the other
>> complaints that the mosbench people were only testing simple indexscan
>> queries.)

> Yeah, it seems that for a sequential scan we lseek the heap, then the
> index, then the heap again; but for index scans we just hit the heap
> and the index.

Sure. The first two come from the planner getting the table and index
sizes for estimation purposes (look in plancat.c). The last is done in
heapam.c's initscan(). We could possibly accept stale values for the
planner estimates, but I think heapam's number had better be accurate.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: mosbench revisited
Date: 2011-08-03 21:04:56
Message-ID: CA+TgmobCpTDeqyBBuy-83b5gxe=qK1gT7afV_Ck4nv27e2yW0Q@mail.gmail.com
Lists: pgsql-hackers

On Wed, Aug 3, 2011 at 4:38 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Wed, Aug 3, 2011 at 3:38 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> (If the query ended up being a seqscan, I'd expect a second
>>> lseek(SEEK_END) when the executor starts up, but I gather from the other
>>> complaints that the mosbench people were only testing simple indexscan
>>> queries.)
>
>> Yeah, it seems that for a sequential scan we lseek the heap, then the
>> index, then the heap again; but for index scans we just hit the heap
>> and the index.
>
> Sure.  The first two come from the planner getting the table and index
> sizes for estimation purposes (look in plancat.c).  The last is done in
> heapam.c's initscan().  We could possibly accept stale values for the
> planner estimates, but I think heapam's number had better be accurate.

I think the exact requirement is that, if the relation turns out to be
larger than the size we read, the extra blocks had better not contain
any tuples our snapshot can see. There's actually no interlock
between smgrnblocks() and smgrextend() right now, so presumably we
don't need to add one. However, a value cached from a few seconds ago
is clearly not going to cut it.

I don't really think there's any sensible way to implement a
per-backend cache, because that would require invalidation events of
some kind to be sent on relation extension, and that seems utterly
insane from a performance standpoint, even if we invented something
less expensive than sinval. I guess it might work for planning
purposes if you only sent out invalidation events on every N'th
extension or something, but penalizing the accuracy of planning to
work around a Linux kernel bug that only manifests itself on machines
with >32 cores doesn't seem very appealing.

A shared cache seems like it could work, but the locking is tricky.
Normally we'd just use a hash table protected by an LWLock, one
LWLock per partition, but here that's clearly not going to work. The
kernel is using a spinlock per file, and that's still too
heavy-weight. I think that if we could prepopulate the cache with all
the keys (i.e. relfilenodes) we care about and then never add or evict
any, we could run it completely unlocked. I believe that the existing
memory barriers in things like LWLockAcquire() would be sufficient to
prevent us from reading a too-old value (i.e. block count). In
particular, you couldn't read a value that predated your snapshot, if
you got your snapshot by holding ProcArrayLock. But the races
involved with adding and removing items from the cache are hard to
deal with without using locks, especially because the keys are 12
bytes or more and therefore can't be read or written atomically.
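
For illustration, here's roughly the shape of cache I have in mind, with
invented names and sizes; the only load-bearing assumption is that the
keys never change once the cache is populated, so readers need just one
8-byte read:

#include <stdint.h>

/*
 * Rough sketch, with invented names, of a prepopulated shared cache; not
 * an actual PostgreSQL data structure.  The key fields are written once
 * at startup and never change, so a reader only needs a single 8-byte
 * read of nblocks, which is assumed to be atomic on the platforms of
 * interest.
 */
typedef struct RelSizeEntry
{
    uint32_t            spcnode;    /* tablespace OID */
    uint32_t            dbnode;     /* database OID */
    uint32_t            relnode;    /* relfilenode */
    volatile uint64_t   nblocks;    /* last known length in blocks */
} RelSizeEntry;

#define RELSIZE_CACHE_SLOTS 1024

static RelSizeEntry relsize_cache[RELSIZE_CACHE_SLOTS];

/* Unlocked lookup: safe only because keys are never added or evicted. */
static int
relsize_lookup(uint32_t spc, uint32_t db, uint32_t rel, uint64_t *nblocks)
{
    for (int i = 0; i < RELSIZE_CACHE_SLOTS; i++)
    {
        RelSizeEntry *e = &relsize_cache[i];

        if (e->spcnode == spc && e->dbnode == db && e->relnode == rel)
        {
            *nblocks = e->nblocks;          /* single 8-byte read */
            return 1;
        }
    }
    return 0;                               /* fall back to lseek() */
}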

I've been mulling over how we might deal with this and actually coded
up an implementation, but it turns out (surprise, surprise) to have
problems with insufficient locking. So I'm thinking it over some
more. And hoping that the Linux guys decide to do something about it.
This isn't really our bug - lseek is quite cheap in the uncontended
case.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: mosbench revisited
Date: 2011-08-03 21:35:57
Message-ID: 23924.1312407357@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Wed, Aug 3, 2011 at 4:38 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> ... We could possibly accept stale values for the
>> planner estimates, but I think heapam's number had better be accurate.

> I think the exact requirement is that, if the relation turns out to be
> larger than the size we read, the extra blocks had better not contain
> any tuples our snapshot can see. There's actually no interlock
> between smgrnblocks() and smgrextend() right now, so presumably we
> don't need to add one.

No interlock in userspace, you mean. We're relying on the kernel to do
it, ie, give us a number that is not older than the time of our (already
taken at this point) snapshot.

> I don't really think there's any sensible way to implement a
> per-backend cache, because that would require invalidation events of
> some kind to be sent on relation extension, and that seems utterly
> insane from a performance standpoint, even if we invented something
> less expensive than sinval.

Yeah, that's the issue. But "relation extension" is not actually a
cheap operation, since it requires a minimum of one kernel call that is
presumably doing something nontrivial in the filesystem. I'm not
entirely convinced that we couldn't make this work --- especially since
we could certainly derate the duty cycle by a factor of ten or more
without giving up anything remotely meaningful in planning accuracy.
(I'd be inclined to make it send an inval only once the relation size
had changed at least, say, 10%.)
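
As a rough sketch of that policy (send_relsize_inval is a hypothetical
helper, not an existing function):

#include <stdint.h>

/* Hypothetical helper: broadcast a new size to other backends somehow. */
extern void send_relsize_inval(uint64_t nblocks);

/* Only advertise a new size once the relation has grown by at least 10%. */
static void
maybe_advertise_new_size(uint64_t *last_advertised, uint64_t new_nblocks)
{
    if (new_nblocks >= *last_advertised + *last_advertised / 10 + 1)
    {
        send_relsize_inval(new_nblocks);
        *last_advertised = new_nblocks;
    }
}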

> A shared cache seems like it could work, but the locking is tricky.
> Normally we'd just use a hash table protected by an LWLock, one
> LWLock per partition, but here that's clearly not going to work. The
> kernel is using a spinlock per file, and that's still too
> heavy-weight.

That still seems utterly astonishing to me. We're touching each of
those files once per query cycle; a cycle that contains two message
sends, who knows how many internal spinlock/lwlock/heavyweightlock
acquisitions inside Postgres (some of which *do* contend with each
other), and a not insignificant amount of plain old computing.
Meanwhile, this particular spinlock inside the kernel is protecting
what, a single doubleword fetch? How is that the bottleneck?

I am wondering whether kernel spinlocks are broken.

regards, tom lane


From: Jim Nasby <jim(at)nasby(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: mosbench revisited
Date: 2011-08-03 22:21:53
Message-ID: E5EE8D60-F035-4474-95CC-9124D88F0C67@nasby.net
Lists: pgsql-hackers

On Aug 3, 2011, at 1:21 PM, Robert Haas wrote:
> 1. "We configure PostgreSQL to use a 2 Gbyte application-level cache
> because PostgreSQL protects its free-list with a single lock and thus
> scales poorly with smaller caches." This is a complaint about
> BufFreeList lock which, in fact, I've seen as a huge point of
> contention on some workloads. In fact, on read-only workloads, with
> my lazy vxid lock patch applied, this is, I believe, the only
> remaining unpartitioned LWLock that is ever taken in exclusive mode;
> or at least the only one that's taken anywhere near often enough to
> matter. I think we're going to do something about this, although I
> don't have a specific idea in mind at the moment.

This has been discussed before: http://archives.postgresql.org/pgsql-hackers/2011-03/msg01406.php (which itself references 2 other threads).

The basic idea is: have a background process that proactively moves buffers onto the free list so that backends should normally never have to run the clock sweep (which is rather expensive). The challenge there is figuring out how to get stuff onto the free list with minimal locking impact. I think one possible option would be to put the freelist under its own lock (IIRC we currently use it to protect the clock sweep as well). Of course, that still means the free list lock could be a point of contention, but presumably it's far faster to add or remove something from the list than it is to run the clock sweep.
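
A minimal sketch of that arrangement (invented names, and a pthread
mutex standing in for whatever lock we'd really use) might look like
this; the point is only that push and pop are tiny critical sections
compared with running the clock sweep:

#include <pthread.h>

#define FREELIST_CAPACITY 1024

static int             freelist[FREELIST_CAPACITY];
static int             freelist_len;
static pthread_mutex_t freelist_lock = PTHREAD_MUTEX_INITIALIZER;

/* Background process: push a buffer it has determined to be reusable. */
static void
freelist_push(int bufid)
{
    pthread_mutex_lock(&freelist_lock);
    if (freelist_len < FREELIST_CAPACITY)
        freelist[freelist_len++] = bufid;
    pthread_mutex_unlock(&freelist_lock);
}

/* Backend: grab a free buffer, or get -1 and fall back to the clock sweep. */
static int
freelist_pop(void)
{
    int bufid = -1;

    pthread_mutex_lock(&freelist_lock);
    if (freelist_len > 0)
        bufid = freelist[--freelist_len];
    pthread_mutex_unlock(&freelist_lock);
    return bufid;
}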
--
Jim C. Nasby, Database Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: mosbench revisited
Date: 2011-08-04 01:16:14
Message-ID: CA+TgmoYeS+RgQvnQEYNpA7JCjjd_0SkjSwF29Lrsy+vkGxcvrQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Aug 3, 2011 at 5:35 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> That still seems utterly astonishing to me.  We're touching each of
> those files once per query cycle; a cycle that contains two message
> sends, who knows how many internal spinlock/lwlock/heavyweightlock
> acquisitions inside Postgres (some of which *do* contend with each
> other), and a not insignificant amount of plain old computing.
> Meanwhile, this particular spinlock inside the kernel is protecting
> what, a single doubleword fetch?  How is that the bottleneck?

Spinlocks seem to have a very ugly "tipping point". When I tested
pgbench -S on a 64-core system with the lazy vxid patch applied and a
patch to use random_r() in lieu of random, the amount of system time
used per SELECT-only transaction at 48 clients was 3.59 times as much
as it was at 4 clients. And the amount used per transaction at 52
clients was 3.63 times the amount used per transaction at 48 clients.
And the amount used at 56 clients was 3.25 times the amount used at 52
clients. You can see the throughput graph starting to flatten out in
the 32-44 client range, but it's not particularly alarming. However,
once you pass that point, things get out of control in a real hurry. A
few more clients and the machine is basically doing nothing but spin.
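
For anyone not familiar with the mechanism, here's a rough sketch of the
test-and-set pattern involved, modeled loosely on the
SpinLockAcquire()/s_lock() split rather than copied from the actual
source; the fast path is a single atomic exchange, and all the CPU time
piles up in the slow path once the lock becomes contended:

#include <sched.h>
#include <stdatomic.h>

typedef atomic_flag slock_t;        /* initialize with ATOMIC_FLAG_INIT */

static void
s_lock_slow(slock_t *lock)
{
    int spins = 0;

    while (atomic_flag_test_and_set(lock))
    {
        if (++spins > 1000)
        {
            sched_yield();          /* crude backoff; real code also sleeps */
            spins = 0;
        }
    }
}

static inline void
spin_lock_acquire(slock_t *lock)
{
    if (atomic_flag_test_and_set(lock))    /* fast path: uncontended TAS */
        s_lock_slow(lock);                 /* contended: spin until free */
}

static inline void
spin_lock_release(slock_t *lock)
{
    atomic_flag_clear(lock);
}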

> I am wondering whether kernel spinlocks are broken.

I don't think so. Stefan Kaltenbrunner had one profile where he
showed something like sixty or eighty percent of the usermode CPU time
in s_lock. I didn't have access to that particular hardware, but the
testing I've done strongly suggests that most of that was the
SInvalReadLock spinlock. And before I patched pgbench to avoid
calling random(), that was doing the same thing - literally flattening
a 64-core box fighting over a single futex that normally costs almost
nothing. (That one wasn't quite as bad because the futex actually
deschedules the waiters, but it was still bad.) I'm actually not
really sure why it shakes out this way (birthday paradox?) but having
seen the effect several times now, I'm disinclined to believe it's an
artifact.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jim Nasby <jim(at)nasby(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: mosbench revisited
Date: 2011-08-04 01:59:57
Message-ID: CA+TgmobWi_tFQAFX13VryaW3ZoSxRxVQOebOPLb0SGNEeLZhuw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Aug 3, 2011 at 6:21 PM, Jim Nasby <jim(at)nasby(dot)net> wrote:
> On Aug 3, 2011, at 1:21 PM, Robert Haas wrote:
>> 1. "We configure PostgreSQL to use a 2 Gbyte application-level cache
>> because PostgreSQL protects its free-list with a single lock and thus
>> scales poorly with smaller caches."  This is a complaint about
>> BufFreeList lock which, in fact, I've seen as a huge point of
>> contention on some workloads.  In fact, on read-only workloads, with
>> my lazy vxid lock patch applied, this is, I believe, the only
>> remaining unpartitioned LWLock that is ever taken in exclusive mode;
>> or at least the only one that's taken anywhere near often enough to
>> matter.  I think we're going to do something about this, although I
>> don't have a specific idea in mind at the moment.
>
> This has been discussed before: http://archives.postgresql.org/pgsql-hackers/2011-03/msg01406.php (which itself references 2 other threads).
>
> The basic idea is: have a background process that proactively moves buffers onto the free list so that backends should normally never have to run the clock sweep (which is rather expensive). The challenge there is figuring out how to get stuff onto the free list with minimal locking impact. I think one possible option would be to put the freelist under it's own lock (IIRC we currently use it to protect the clock sweep as well). Of course, that still means the free list lock could be a point of contention, but presumably it's far faster to add or remove something from the list than it is to run the clock sweep.

Based on recent benchmarking, I'm going to say "no". It doesn't seem
to matter how short you make the critical section: a single
program-wide mutex is a loser. Furthermore, the "free list" is a
joke, because it's nearly always going to be completely empty. We
could probably just rip that out and use the clock sweep and never
miss it, but I doubt it would improve performance much.

I think what we probably need to do is have multiple clock sweeps in
progress at the same time. So, for example, if you have 8GB of
shared_buffers, you might have 8 mutexes, one for each GB. When a
process wants a buffer, it locks one of the mutexes and sweeps through
that 1GB partition. If it finds a buffer before returning to the
point at which it started the scan, it's done. Otherwise, it releases
its mutex, grabs the next one, and continues on until it finds a free
buffer.
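
To make that concrete, here's a rough sketch of the scheme, with
invented names, a pthread mutex standing in for the real lock, and a
hypothetical buffer_is_evictable() standing in for the usual
refcount/usage-count test:

#include <pthread.h>
#include <stdint.h>

#define NBUFFERS      (8 * 1024 * 128)      /* 8 GB of 8 kB buffers */
#define NPARTITIONS   8
#define BUFS_PER_PART (NBUFFERS / NPARTITIONS)

typedef struct SweepPartition
{
    pthread_mutex_t lock;                   /* assume initialized at startup */
    uint32_t        hand;                   /* clock hand for this partition */
} SweepPartition;

static SweepPartition sweeps[NPARTITIONS];

extern int buffer_is_evictable(uint32_t bufid);     /* hypothetical */

static int32_t
get_victim_buffer(void)
{
    for (int p = 0; p < NPARTITIONS; p++)
    {
        SweepPartition *sp = &sweeps[p];

        pthread_mutex_lock(&sp->lock);
        for (uint32_t n = 0; n < BUFS_PER_PART; n++)
        {
            uint32_t bufid = p * BUFS_PER_PART + sp->hand;

            sp->hand = (sp->hand + 1) % BUFS_PER_PART;
            if (buffer_is_evictable(bufid))
            {
                pthread_mutex_unlock(&sp->lock);
                return (int32_t) bufid;
            }
        }
        pthread_mutex_unlock(&sp->lock);    /* nothing here; try the next one */
    }
    return -1;                              /* no victim found anywhere */
}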

The trick with any modification in this area is that pretty much any
degree of increased parallelism is potentially going to reduce the
quality of buffer replacement to some degree. So the trick will be to
try to squeeze out as much concurrency as possible while minimizing
degradation in the quality of buffer replacements.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: mosbench revisited
Date: 2011-08-04 18:03:59
Message-ID: CA+TgmoYA+KGCRs49_xqq1Pep7yjUK55iEpkgTM3iw-+wfQXERQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Aug 3, 2011 at 9:16 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Spinlocks seem to have a very ugly "tipping point".

And on that note, here are oprofile results from "pgbench -n -T 300 -S
-c 64 -j 64 -M prepared" on the latest master branch, compiled with
"-O2 -fno-omit-frame-pointer". shared_buffers=8GB, 64-core machine,
RHEL 6.1. By running with "-M prepared", it dodges the lseek()
problem.

960576  23.7580  postgres    postgres    s_lock
562821  13.9203  no-vmlinux  no-vmlinux  /no-vmlinux
321191   7.9440  postgres    postgres    LWLockRelease
317653   7.8565  postgres    postgres    LWLockAcquire
224812   5.5603  postgres    postgres    GetSnapshotData
 81156   2.0072  postgres    postgres    _bt_compare
 78744   1.9476  postgres    postgres    PinBuffer
 58101   1.4370  postgres    postgres    hash_search_with_hash_value
 43865   1.0849  postgres    postgres    AllocSetAlloc
 25832   0.6389  postgres    postgres    PostgresMain

Since SpinLockAcquire() is an in-line macro that only calls s_lock()
if the initial TAS fails, not only the time directly attributed to
s_lock but also a good chunk of the CPU time attributable to
LWLockAcquire and LWLockRelease() is likely time spent fighting over
spinlocks. Since I compiled with frame pointers, it's pretty easy to
see where those s_lock calls are coming from. Here's an excerpt from
opreport -c:

     5   5.0e-04  postgres  postgres  _bt_getbuf
     6   6.0e-04  postgres  postgres  _bt_relandgetbuf
    14   0.0014   postgres  postgres  ReleaseAndReadBuffer
    85   0.0085   postgres  postgres  ReadBuffer_common
   206   0.0207   postgres  postgres  GetSnapshotData
 18344   1.8437   postgres  postgres  UnpinBuffer
 24977   2.5103   postgres  postgres  PinBuffer
406948  40.9009   postgres  postgres  LWLockRelease
544376  54.7133   postgres  postgres  LWLockAcquire
994947  23.5746   postgres  postgres  s_lock

It's also fairly easy to track down who is calling LWLockAcquire and
LWLockRelease. Nearly all of the calls are from just two
contributors:

241655  27.6830  postgres  postgres  ReadBuffer_common
566434  64.8885  postgres  postgres  GetSnapshotData
328548   7.7847  postgres  postgres  LWLockAcquire

176629  23.8917  postgres  postgres  ReadBuffer_common
524348  70.9259  postgres  postgres  GetSnapshotData
332333   7.8744  postgres  postgres  LWLockRelease

So, most of the s_lock calls come from LWLockAcquire, and most of the
LWLockAcquire calls come from GetSnapshotData. That's not quite
enough to prove that all the spinning going on here is coming from
contention over the spinlock protecting ProcArrayLock, because it
needn't be the case that all calls to LWLockAcquire are equally likely
to end up in s_lock. You could speculate that ProcArrayLock isn't
actually responsible for many of those s_lock calls and that some
other lock, like maybe the buffer mapping locks, is disproportionately
responsible for the s_lock calls. But in fact I think it's exactly
the other way around: the buffer mapping locks are partitioned 16
ways, while there's only one ProcArrayLock. I'm willing to bet that's
where nearly all of the spinning is happening, and I'll further bet
that that spinning accounts for AT LEAST a third of the total CPU time
on this workload. And maybe closer to half.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Aidan Van Dyk <aidan(at)highrise(dot)ca>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: mosbench revisited
Date: 2011-08-04 21:09:02
Message-ID: CAC_2qU_4ftYZVg-e0s2Zvb5GOtyH-Vdg-aS+FsoSx5Tk2dD3Kg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Aug 3, 2011 at 5:04 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

>  And hoping that the Linux guys decide to do something about it.
>  This isn't really our bug - lseek is quite cheap in the uncontended
> case.

Has anyone tried this on a recent kernel (i.e. 2.6.39 or later), where
they've finally removed the BKL from VFS/inode?

I mean, complaining about scalability in linux 2.6.18 is like
complaining about scalability in postgresql 8.2 ;-)

a.

--
Aidan Van Dyk                                             Create like a god,
aidan(at)highrise(dot)ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Aidan Van Dyk <aidan(at)highrise(dot)ca>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: mosbench revisited
Date: 2011-08-04 22:22:07
Message-ID: CA+TgmoYYzv1xxCfhtWYcuuMLitmVRZ58-Zhy0NtrcEY0k+oq+A@mail.gmail.com
Lists: pgsql-hackers

On Thu, Aug 4, 2011 at 5:09 PM, Aidan Van Dyk <aidan(at)highrise(dot)ca> wrote:
> On Wed, Aug 3, 2011 at 5:04 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
>>      And hoping that the Linux guys decide to do something about it.
>>  This isn't really our bug - lseek is quite cheap in the uncontended
>> case.
>
> Has anyone tried this on a recent kernel (i.e. 2.6.39 or later), where
> they've finally removed the BKL from VFS/inode?
>
> I mean, complaining about scalability in linux 2.6.18 is like
> complaining about scalability in postgresql 8.2 ;-)

Hmm. This machine is running 2.6.32-131.6.1.el6.x86_64, not 2.6.18.
Not sure how much the code has changed since then, but the spinlock is
there in the master branch of Linus's repository.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: mosbench revisited
Date: 2011-08-06 17:43:03
Message-ID: CAMkU=1wv=r1eN0Q+mbPY7n2rgQrDS=Gs6ZGaz4ga+q+AtE_8PA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Aug 3, 2011 at 11:21 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> About nine months ago, we had a discussion of some benchmarking that
> was done by the mosbench folks at MIT:
>
> http://archives.postgresql.org/pgsql-hackers/2010-10/msg00160.php
>
> Although the authors used PostgreSQL as a test harness for driving
> load, it's pretty clear from reading the paper that their primary goal
> was to stress the Linux kernel, so the applicability of the paper to
> real-world PostgreSQL performance improvement is less than it might
> be.  Still, having now actually investigated in some detail many of
> the same performance issues that they were struggling with, I have a
> much clearer understanding of what's really going on here.  In
> PostgreSQL terms, here are the bottlenecks they ran into:
>
> 1. "We configure PostgreSQL to use a 2 Gbyte application-level cache
> because PostgreSQL protects its free-list with a single lock and thus
> scales poorly with smaller caches."  This is a complaint about
> BufFreeList lock which, in fact, I've seen as a huge point of
> contention on some workloads.  In fact, on read-only workloads, with
> my lazy vxid lock patch applied, this is, I believe, the only
> remaining unpartitioned LWLock that is ever taken in exclusive mode;
> or at least the only one that's taken anywhere near often enough to
> matter.  I think we're going to do something about this, although I
> don't have a specific idea in mind at the moment.

I was going to ask if you had done any benchmarks with scale such
that the tables fit in RAM but not in shared_buffers. I guess you
have.

The attached experimental patch fixed freelist contention on 8 cores.
It would be nice to see what happens above that.

It has been cherry picked up to HEAD, but not tested against it. (Last
tested in Dec 2010, my how time flies)

The approach is to move the important things from an LWLock to a
spinlock, and to do no locking at all for the clock-hand increment and
numBufferAllocs.
That means that some buffers might occasionally get inspected twice
and some might not get inspected at all during any given clock cycle,
but this should not lead to any correctness problems. (Disclosure:
Tom didn't like this approach when it was last discussed.)

I just offer this for whatever it is worth to you--I'm not proposing
it as an actual patch to be applied.

When data fits in RAM but not shared_buffers, maybe the easiest fix is
to increase shared_buffers. Which brings up the other question I had
for you about your work with Nate's celebrated loaner machine. Have
you tried to reproduce the performance problems that have been
reported (but without public disclosure of how to reproduce) with
shared_buffers > 8GB on machines with RAM >>8GB ?

Cheers,

Jeff

Attachment: freelist.patch (text/x-patch, 4.6 KB)

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Jim Nasby <jim(at)nasby(dot)net>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: mosbench revisited
Date: 2011-08-06 18:00:35
Message-ID: CAMkU=1x=68iPHCXp8qAYGrLYGHHc=i4xHRUdGj-xN3xm1OxQ1A@mail.gmail.com
Lists: pgsql-hackers

On Wed, Aug 3, 2011 at 3:21 PM, Jim Nasby <jim(at)nasby(dot)net> wrote:
> On Aug 3, 2011, at 1:21 PM, Robert Haas wrote:
>> 1. "We configure PostgreSQL to use a 2 Gbyte application-level cache
>> because PostgreSQL protects its free-list with a single lock and thus
>> scales poorly with smaller caches."  This is a complaint about
>> BufFreeList lock which, in fact, I've seen as a huge point of
>> contention on some workloads.  In fact, on read-only workloads, with
>> my lazy vxid lock patch applied, this is, I believe, the only
>> remaining unpartitioned LWLock that is ever taken in exclusive mode;
>> or at least the only one that's taken anywhere near often enough to
>> matter.  I think we're going to do something about this, although I
>> don't have a specific idea in mind at the moment.
>
> This has been discussed before: http://archives.postgresql.org/pgsql-hackers/2011-03/msg01406.php (which itself references 2 other threads).
>
> The basic idea is: have a background process that proactively moves buffers onto the free list so that backends should normally never have to run the clock sweep (which is rather expensive). The challenge there is figuring out how to get stuff onto the free list with minimal locking impact. I think one possible option would be to put the freelist under it's own lock (IIRC we currently use it to protect the clock sweep as well). Of course, that still means the free list lock could be a point of contention, but presumably it's far faster to add or remove something from the list than it is to run the clock sweep.

Hi Jim,

My experiments have shown that the freelist proper is not
substantially faster than the freelist clocksweep--and that is even
under the assumption that putting things back into the freelist is
absolutely free. Under all the workloads I've been able to contrive,
other than ones contrived by actually hacking the code itself to make
it pathological, the average number of buffers inspected per run of
the clock sweep is <2.5. Under contention, the mere act of acquiring
a lock is more traumatic than the actual work carried out under the
lock.

Cheers,

Jeff


From: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: mosbench revisited
Date: 2011-08-06 18:16:13
Message-ID: m2bow2jthu.fsf@2ndQuadrant.fr
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> It would be nice if the Linux guys would fix this problem for us, but
> I'm not sure whether they will. For those who may be curious, the
> problem is in generic_file_llseek() in fs/read-write.c. On a platform
> with 8-byte atomic reads, it seems like it ought to be very possible
> to read inode->i_size without taking a spinlock. A little Googling
> around suggests that some patches along these lines have been proposed
> and - for reasons that I don't fully understand - rejected. That now
> seems unfortunate. Barring a kernel-level fix, we could try to
> implement our own cache to work around this problem. However, any
> such cache would need to be darn cheap to check and update (since we
> can't assume that relation extension is an infrequent event) and must
> somehow avoid the same sort of mutex contention that's killing the
> kernel in this workload.

What about making the relation extension much less frequent? It's been
talked about before here, that instead of extending 8kB at a time we
could (should) extend by much larger chunks. I would go as far as
preallocating the whole next segment (1GB) (in the background) as soon
as the current is more than half full, or such a policy.

Then you have the problem that you can't really use lseek() anymore to
guess'timate a relation size, but Tom said in this thread that the
planner certainly doesn't need something that accurate. Maybe the
reltuples would do? If not, it could be that some adapting of its
accuracy could be done?

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Jim Nasby <jim(at)nasby(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: mosbench revisited
Date: 2011-08-06 19:30:30
Message-ID: 3648.1312659030@sss.pgh.pa.us
Lists: pgsql-hackers

Jeff Janes <jeff(dot)janes(at)gmail(dot)com> writes:
> My experiments have shown that the freelist proper is not
> substantially faster than the freelist clocksweep--and that is even
> under the assumption that putting things back into the freelist is
> absolutely free.

The freelist isn't there to make buffer allocation faster, though;
it's there for allocation efficiency. The point is that when some
buffers have become completely useless (eg, because we dropped the table
they were for), they'll be recycled in preference to reclaiming buffers
that contain still-possibly-useful data. It would certainly be simple
to get rid of the freelist and only recycle dead buffers when the clock
sweep reaches them, but I think we'd be paying for that in extra,
unnecessary I/O.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: mosbench revisited
Date: 2011-08-08 13:22:50
Message-ID: CA+TgmoYF-+9fCJXr1nS3a4QSpj5JKt6Q_JjrmQRNECEj4HVYaQ@mail.gmail.com
Lists: pgsql-hackers

On Sat, Aug 6, 2011 at 1:43 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> The approach is to move the important things from an LWLock to a
> spinlock, and to do no locking at all for the clock-hand increment and
> numBufferAllocs.
> That means that some buffers might occasionally get inspected twice
> and some might not get inspected at all during any given clock cycle,
> but this should not lead to any correctness problems.   (Disclosure:
> Tom didn't like this approach when it was last discussed.)
>
> I just offer this for whatever it is worth to you--I'm not proposing
> it as an actual patch to be applied.

Interesting approach.

> When data fits in RAM but not shared_buffers, maybe the easiest fix is
> to increase shared_buffers.  Which brings up the other question I had
> for you about your work with Nate's celebrated loaner machine.  Have
> you tried to reproduce the performance problems that have been
> reported (but without public disclosure of how to reproduce) with
> shared_buffers > 8GB on machines with RAM >>8GB ?

No. That's on my list, but thus far has not made it to the top of
said list. :-(

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: mosbench revisited
Date: 2011-08-08 13:29:39
Message-ID: CA+TgmoYN2hs4ewHjhcRUH5cCey-OBDSNX9KiEEsUwtEEiVwA-g@mail.gmail.com
Lists: pgsql-hackers

On Sat, Aug 6, 2011 at 2:16 PM, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> It would be nice if the Linux guys would fix this problem for us, but
>> I'm not sure whether they will.  For those who may be curious, the
>> problem is in generic_file_llseek() in fs/read-write.c.  On a platform
>> with 8-byte atomic reads, it seems like it ought to be very possible
>> to read inode->i_size without taking a spinlock.  A little Googling
>> around suggests that some patches along these lines have been proposed
>> and - for reasons that I don't fully understand - rejected.  That now
>> seems unfortunate.  Barring a kernel-level fix, we could try to
>> implement our own cache to work around this problem.  However, any
>> such cache would need to be darn cheap to check and update (since we
>> can't assume that relation extension is an infrequent event) and must
>> somehow avoid the same sort of mutex contention that's killing the
>> kernel in this workload.
>
> What about making the relation extension much less frequent?  It's been
> talked about before here, that instead of extending 8kB at a time we
> could (should) extend by much larger chunks.  I would go as far as
> preallocating the whole next segment (1GB) (in the background) as soon
> as the current is more than half full, or such a policy.
>
> Then you have the problem that you can't really use lseek() anymore to
> guess'timate a relation size, but Tom said in this thread that the
> planner certainly doesn't need something that accurate.  Maybe the
> reltuples would do?  If not, it could be that some adapting of its
> accuracy could be done?

I think that pre-extending relations or extending them in larger
increments is probably a good idea, although I think the AMOUNT of
preallocation you just proposed would be severe overkill. If we
extended the relation in 1MB chunks, we'd reduce the number of
relation extensions by more than 99%, and with far less space wastage
than the approach you are proposing.
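
As a hedged illustration of what chunked extension could look like at
the file level (this is not mdextend; posix_fallocate is just one way to
reserve the space):

#include <fcntl.h>
#include <sys/types.h>

#define EXTEND_CHUNK (1024 * 1024)      /* grow the file 1 MB at a time */

/*
 * Round the segment up to the next 1 MB boundary instead of adding one
 * 8 kB block.  Returns 0 on success, an errno value on failure.
 */
static int
extend_segment(int fd, off_t current_len)
{
    off_t new_len = ((current_len / EXTEND_CHUNK) + 1) * EXTEND_CHUNK;

    return posix_fallocate(fd, current_len, new_len - current_len);
}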

However, it doesn't really do anything to solve this problem. The
problem here is not that the size of the relation is changing too
frequently - indeed, it's not changing at all in this test case. The
problem is rather that testing whether or not the size has in fact
changed is costing too much.

The reason why we are testing the size of the relation here rather
than just using reltuples is because the relation might have been
extended since it was last analyzed. We can't count on analyze to run
often enough to avoid looking at the actual file size. If the file's
grown, we have to scale the number of tuples up proportional to the
growth in relpages.
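
In other words, the estimate gets scaled by the observed growth, roughly
like this (a simplified sketch of what the planner does in
estimate_rel_size; the names here are made up):

/* Scale the analyze-time tuple count by how much the file has grown. */
static double
scaled_tuple_estimate(double reltuples, double relpages, double curpages)
{
    if (relpages <= 0)
        return reltuples;                       /* nothing to scale from */
    return reltuples * (curpages / relpages);   /* tuples/page * pages now */
}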

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: mosbench revisited
Date: 2011-08-08 13:49:27
Message-ID: CA+TgmoZZwU-tyEOLgY+cPVNd6kC52gY0Z96cJA=icZsm=BH_aw@mail.gmail.com
Lists: pgsql-hackers

On Sat, Aug 6, 2011 at 3:30 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Jeff Janes <jeff(dot)janes(at)gmail(dot)com> writes:
>> My experiments have shown that the freelist proper is not
>> substantially faster than the freelist clocksweep--and that is even
>> under the assumption that putting things back into the freelist is
>> absolutely free.
>
> The freelist isn't there to make buffer allocation faster, though;
> it's there for allocation efficiency.  The point is that when some
> buffers have become completely useless (eg, because we dropped the table
> they were for), they'll be recycled in preference to reclaiming buffers
> that contain still-possibly-useful data.  It would certainly be simple
> to get rid of the freelist and only recycle dead buffers when the clock
> sweep reaches them, but I think we'd be paying for that in extra,
> unnecessary I/O.

This is an interesting point, because I think it really gets at the
heart of the trade-off we're facing here: quality of buffer allocation
vs. concurrency.

Assuming no free buffers are present, we'd ideally like to evict the
buffer that won't be used until the latest possible future time. LRU
is an approximation of this. Our clock-sweep algorithm is a
less-accurate approximation of this that pays for itself with less
locking. If we really wanted true LRU, we'd have to move things
around in the list every time a buffer was pinned. That would be
really expensive. Our clock sweep algorithm only requires fighting
over a piece of shared state when we want to kick a buffer out, rather
than every time we want to do anything to a buffer. And it's probably
only slightly worse in terms of buffer allocation efficiency than true
LRU, so on balance it appears to be a good trade-off.
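
For reference, the approximation works roughly like this (a simplified
sketch of a usage-count clock sweep, not the real StrategyGetBuffer):

#define NBUFFERS 1024

static int usage_count[NBUFFERS];   /* bumped whenever a buffer is pinned */
static int clock_hand;

static int
clock_sweep_victim(void)
{
    for (;;)
    {
        int buf = clock_hand;

        clock_hand = (clock_hand + 1) % NBUFFERS;
        if (usage_count[buf] == 0)
            return buf;             /* evict this one */
        usage_count[buf]--;         /* give it another lap to prove itself */
    }
}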

Even so, I think it's pretty trivial to construct a workload where
performance is throttled by BufFreelistLock. I seriously doubt we're
going to be able to solve that problem by optimizing the code that
runs while holding that lock. We might buy ourselves a few more
percentage points here or there, but if we really want to bust this
bottleneck we're going to need to move to some algorithm that is an
even-less-perfect approximation of LRU but which doesn't involve a
single cluster-wide lock that throttles all buffer allocation. No
matter how fast you make the critical section protected by that lock,
given enough CPUs, it will eventually not be fast enough.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Jesper Krogh <jesper(at)krogh(dot)cc>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: mosbench revisited
Date: 2011-08-08 15:11:12
Message-ID: 4E3FFC90.5050708@krogh.cc
Lists: pgsql-hackers

On 2011-08-08 15:29, Robert Haas wrote:
> On Sat, Aug 6, 2011 at 2:16 PM, Dimitri Fontaine<dimitri(at)2ndquadrant(dot)fr> wrote:
>> Robert Haas<robertmhaas(at)gmail(dot)com> writes:
>>> It would be nice if the Linux guys would fix this problem for us, but
>>> I'm not sure whether they will. For those who may be curious, the
>>> problem is in generic_file_llseek() in fs/read-write.c. On a platform
>>> with 8-byte atomic reads, it seems like it ought to be very possible
>>> to read inode->i_size without taking a spinlock. A little Googling
>>> around suggests that some patches along these lines have been proposed
>>> and - for reasons that I don't fully understand - rejected. That now
>>> seems unfortunate. Barring a kernel-level fix, we could try to
>>> implement our own cache to work around this problem. However, any
>>> such cache would need to be darn cheap to check and update (since we
>>> can't assume that relation extension is an infrequent event) and must
>>> somehow avoid the same sort of mutex contention that's killing the
>>> kernel in this workload.
>> What about making the relation extension much less frequent? It's been
>> talked about before here, that instead of extending 8kB at a time we
>> could (should) extend by much larger chunks. I would go as far as
>> preallocating the whole next segment (1GB) (in the background) as soon
>> as the current is more than half full, or such a policy.
>>
>> Then you have the problem that you can't really use lseek() anymore to
>> guess'timate a relation size, but Tom said in this thread that the
>> planner certainly doesn't need something that accurate. Maybe the
>> reltuples would do? If not, it could be that some adapting of its
>> accuracy could be done?
> I think that pre-extending relations or extending them in larger
> increments is probably a good idea, although I think the AMOUNT of
> preallocation you just proposed would be severe overkill. If we
> extended the relation in 1MB chunks, we'd reduce the number of
> relation extensions by more than 99%, and with far less space wastage
> than the approach you are proposing.
Pre-extending in bigger chunks has other benefits as well, since it
helps the filesystem (if it supports extents) to get the data in the
relation laid out in sequential order on disk.

On a well-filled relation, running filefrag on an ext4 filesystem reveals
that data loaded during initial creation gives 10-11 extents per 1GB
file, whereas a relation filled over time gives as many as 128 extents.

I would suggest 5% of the current relation size or 25-100MB, whichever is
smaller. That would still keep the size down on small relations.

--
Jesper


From: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: mosbench revisited
Date: 2011-08-10 22:01:55
Message-ID: 81hb5pkjsc.fsf@hi-media.com
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> However, it doesn't really do anything to solve this problem. The
> problem here is not that the size of the relation is changing too
> frequently - indeed, it's not changing at all in this test case. The
> problem is rather that testing whether or not the size has in fact
> changed is costing too much.

You were complaining about the cost of the cache maintenance, that in
the current scheme of things would have to be called far too often.
Reducing the relation extension traffic would allow, I guess, to have
something more expensive to reset the cache -- it would not happen much.

Now, it could be that the idea is only worth “the electrons it's written
on” if all the relation extensions are taken care of by a background
process...

> The reason why we are testing the size of the relation here rather
> than just using reltuples is because the relation might have been
> extended since it was last analyzed. We can't count on analyze to run
> often enough to avoid looking at the actual file size. If the file's
> grown, we have to scale the number of tuples up proportional to the
> growth in relpages.

Could we send the same message to the stat collector as autoanalyze is
doing each time we extend a relation?

--
dim


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
Cc: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: mosbench revisited
Date: 2011-08-11 01:27:08
Message-ID: CA+TgmoYLTNrgxopDyRB1C9czO0XDQ+n+4Po5Wf7Omb8uP6vZeQ@mail.gmail.com
Lists: pgsql-hackers

2011/8/10 Dimitri Fontaine <dfontaine(at)hi-media(dot)com>:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> However, it doesn't really do anything to solve this problem. The
>> problem here is not that the size of the relation is changing too
>> frequently - indeed, it's not changing at all in this test case. The
>> problem is rather that testing whether or not the size has in fact
>> changed is costing too much.
>
> You were complaining about the cost of the cache maintenance, which in
> the current scheme of things would have to be done far too often.
> Reducing the relation extension traffic would, I guess, allow something
> more expensive to reset the cache -- it would not happen much.

That's an interesting point. I confess to having no idea how frequent
relation extension is now, or how much overhead we could add before it
started to get noticeable.

It seems likely to me that we can design something that is
sufficiently lightweight that we don't need to worry about reducing
the relation extension traffic first, but I don't know that for
certain.

>> The reason why we are testing the size of the relation here rather
>> than just using reltuples is because the relation might have been
>> extended since it was last analyzed. We can't count on analyze to run
>> often enough to avoid looking at the actual file size. If the file's
>> grown, we have to scale the number of tuples up proportional to the
>> growth in relpages.
>
> Could we send the same message to the stat collector as autoanalyze is
> doing each time we extend a relation?

Mmm, maybe. I don't really like the idea of getting the stats
collector involved in this; that seems like it will likely add
complexity without any corresponding benefit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: mosbench revisited
Date: 2011-08-11 01:46:19
Message-ID: 1313026966-sup-5698@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Excerpts from Robert Haas's message of Wed Aug 10 21:27:08 -0400 2011:
> 2011/8/10 Dimitri Fontaine <dfontaine(at)hi-media(dot)com>:
> > Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> >> However, it doesn't really do anything to solve this problem. The
> >> problem here is not that the size of the relation is changing too
> >> frequently - indeed, it's not changing at all in this test case. The
> >> problem is rather that testing whether or not the size has in fact
> >> changed is costing too much.
> >
> > You were complaining about the cost of the cache maintenance, which in
> > the current scheme of things would have to be done far too often.
> > Reducing the relation extension traffic would, I guess, allow something
> > more expensive to reset the cache -- it would not happen much.
>
> That's an interesting point. I confess to having no idea how frequent
> relation extension is now, or how much overhead we could add before it
> started to get noticeable.

I have seen several cases in which it (rel extension) is really
troublesome. Think insertion into large bytea fields -- the toast table's
extension lock was heavily contended. When this was in 8.1 we had quite
some trouble because of the locking involving a shared buffer pool
lwlock, which is fortunately now partitioned; even without that, it is a
major contention point. These were insert-only tables, so pages are
always full except the last one.

> >> The reason why we are testing the size of the relation here rather
> >> than just using reltuples is because the relation might have been
> >> extended since it was last analyzed. We can't count on analyze to run
> >> often enough to avoid looking at the actual file size. If the file's
> >> grown, we have to scale the number of tuples up proportional to the
> >> growth in relpages.
> >
> > Could we send the same message to the stat collector as autoanalyze is
> > doing each time we extend a relation?
>
> Mmm, maybe. I don't really like the idea of getting the stats
> collector involved in this; that seems like it will likely add
> complexity without any corresponding benefit.

Yeah, it adds delay and uncertainty (UDP being lossy).

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: mosbench revisited
Date: 2011-08-11 04:07:03
Message-ID: CA+TgmoahqGsg5Y8NxfC9zv7m8uumD-PDUAf-+=2JJeLErdRg6Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Aug 10, 2011 at 9:46 PM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:
> Excerpts from Robert Haas's message of Wed Aug 10 21:27:08 -0400 2011:
>> 2011/8/10 Dimitri Fontaine <dfontaine(at)hi-media(dot)com>:
>> > Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> >> However, it doesn't really do anything to solve this problem.  The
>> >> problem here is not that the size of the relation is changing too
>> >> frequently - indeed, it's not changing at all in this test case.  The
>> >> problem is rather that testing whether or not the size has in fact
>> >> changed is costing too much.
>> >
>> > You were complaining about the cost of the cache maintenance, which in
>> > the current scheme of things would have to be done far too often.
>> > Reducing the relation extension traffic would, I guess, allow something
>> > more expensive to reset the cache -- it would not happen much.
>>
>> That's an interesting point.  I confess to having no idea how frequent
>> relation extension is now, or how much overhead we could add before it
>> started to get noticeable.
>
> I have seen several cases in which it (rel extension) is really
> troublesome.  Think insertion into large bytea fields -- the toast table's
> extension lock was heavily contended.  When this was in 8.1 we had quite
> some trouble because of the locking involving a shared buffer pool
> lwlock, which is fortunately now partitioned; even without that, it is a
> major contention point.  These were insert-only tables, so pages are
> always full except the last one.

Yeah. I think there's a good argument to be made that we should
extend relations in larger chunks, both for this reason and to avoid
fragmenting the underlying file.

But in this case, the fact that relation extension is such a
heavyweight operation almost works to our advantage. I mean, if we're
already acquiring a heavyweight lock (and they're not called
heavyweight for nothing), making at least one system call which
updates filesystem metadata, and then releasing our heavyweight lock,
the additional overhead of setting a flag in shared memory or somesuch
might well be unnoticeable.

Or it might not - hard to know without testing.
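
Purely as a sketch of the "flag in shared memory" idea -- none of these
structures or functions exist in the backend, they are invented for the
example -- it could be as little as:

#include <stdbool.h>

/* Hypothetical entry in a shared-memory relation-size cache. */
typedef struct RelSizeCacheEntry
{
    /* ... hash key, cached block count, etc. ... */
    volatile bool size_changed;   /* set whenever the relation grows */
} RelSizeCacheEntry;

/*
 * Hypothetical hook, called while the relation extension lock is still
 * held.  Next to the heavyweight lock and the metadata-updating system
 * call that extension already pays for, this single store should be
 * noise -- but, as noted above, that needs testing.
 */
static void
mark_relation_extended(RelSizeCacheEntry *entry)
{
    entry->size_changed = true;
}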

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Greg Stark <stark(at)mit(dot)edu>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: mosbench revisited
Date: 2011-08-15 22:22:00
Message-ID: CAM-w4HM5y0tx+-e4U+OkpQRPKgtobnhx37OUgi93anK1WvVSUQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Aug 3, 2011 at 7:21 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>  I'm kind of interested by the
> result, actually, as I had feared that the spinlock protecting
> ProcArrayLock was going to be a bigger problem sooner.

I think this depends on how many connections you have. If you try to
scale up your benchmark by having hundreds of connections, then you get
an O(n^2) increase in the time spent with the procarray locked. It
sounds like they pinned the number of connections at the number of
cores they had. That makes sense if they're intentionally driving a
CPU-bound benchmark, but it means they won't run into this problem.

--
greg


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: mosbench revisited
Date: 2011-08-16 02:40:15
Message-ID: CA+TgmoZvWAxowoD-teU3x7Vo_V_-m4NGhX8_k2uH+BzCfKnoUA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Aug 15, 2011 at 6:22 PM, Greg Stark <stark(at)mit(dot)edu> wrote:
> On Wed, Aug 3, 2011 at 7:21 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>  I'm kind of interested by the
>> result, actually, as I had feared that the spinlock protecting
>> ProcArrayLock was going to be a bigger problem sooner.
>
> I think this depends on how many connections you have. If you try to
> scale up your benchmark by having hundreds of connections, then you get
> an O(n^2) increase in the time spent with the procarray locked. It
> sounds like they pinned the number of connections at the number of
> cores they had. That makes sense if they're intentionally driving a
> CPU-bound benchmark, but it means they won't run into this problem.

The experiments that I've done so far seem to indicate that there are
actually a couple of different ways for ProcArrayLock to become a
problem.

First, as you say, if you have a busy read-write workload, you just
get plain old LWLock contention. Everyone who wants a snapshot needs
the lock in shared mode; everyone who wants to commit needs it in
exclusive mode. So, the people who are committing block both
snapshot-taking and other commits, while the snapshot-takers block
commits.

If you switch to a read-only workload, ProcArrayLock becomes
uncontended, since a transaction without an XID does not need to
acquire ProcArrayLock to commit. But if you have enough CPUs (64 will
do it), the spinlock protecting ProcArrayLock becomes a bottleneck.
However, you can only observe this problem if the working set fits in
shared buffers; otherwise, BufFreelistLock contention slows the whole
system to a crawl, and the spinlock contention never materializes.
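
As a toy model of that pattern -- using a plain pthread rwlock rather
than the real LWLock machinery, purely to illustrate the shared vs.
exclusive behavior described above:

#include <pthread.h>

/* Toy stand-in for ProcArrayLock; not the actual implementation. */
static pthread_rwlock_t proc_array_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Taking a snapshot: shared mode, so snapshot-takers don't block each
 * other, but they do block (and are blocked by) committers. */
static void
toy_get_snapshot(void)
{
    pthread_rwlock_rdlock(&proc_array_lock);
    /* ... scan every proc entry to build the snapshot ... */
    pthread_rwlock_unlock(&proc_array_lock);
}

/* Committing a transaction that has an XID: exclusive mode, blocking
 * both other committers and all snapshot-takers.  A transaction with
 * no XID skips this, which is why the LWLock goes quiet on a read-only
 * workload and the spinlock underneath it is the next thing you hit. */
static void
toy_commit_with_xid(void)
{
    pthread_rwlock_wrlock(&proc_array_lock);
    /* ... clear our XID from the proc array ... */
    pthread_rwlock_unlock(&proc_array_lock);
}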

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company